441474 – vgdb might eat all memory while waiting for sigstop

Status:	RESOLVED FIXED

Alias:	None

Product:	valgrind
Classification:	Developer tools
Component:	general
Version:	unspecified
Platform:	Other Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	Julian Seward

URL:
Keywords:

Depends on:
Blocks:	444481

Reported:	2021-08-24 11:20 UTC by Mark Wielaard
Modified:	2021-10-27 12:47 UTC
CC List:	1 user

See Also:	444481
Latest Commit:
Version Fixed In:
Sentry Crash Report:

Description Mark Wielaard 2021-08-24 11:20:15 UTC

In coregrind/vgdb-invoker-ptrace.c we have this code:

      /* pid received a signal which is not the signal we are waiting for.
         If we have not (yet) changed the registers of the inferior
         or we have (already) reset them, we can transmit the signal.

         If we have already set the registers of the inferior, we cannot
         transmit the signal, as this signal would arrive when the
         gdbserver code runs. And valgrind only expects signals to
         arrive in a small code portion around
         client syscall logic, where signal are unmasked (see e.g.
         m_syswrap/syscall-x86-linux.S ML_(do_syscall_for_client_WRK).

         As ptrace is forcing a call to gdbserver by jumping
         'out of this region', signals are not masked, but
         will arrive outside of the allowed/expected code region.
         So, if we have changed the registers of the inferior, we
         rather queue the signal to transmit them when detaching,
         after having restored the registers to the initial values. */
      if (pid_of_save_regs) {
         siginfo_t *newsiginfo;

         // realloc a bigger queue, and store new signal at the end.
         // This is not very efficient but we assume not many sigs are queued.
         if (signal_queue_sz >= 64) {
            DEBUG(0, "too many queued signals while waiting for SIGSTOP\n");
            return False;
         }
         signal_queue_sz++;
         signal_queue = vrealloc(signal_queue,
                                 sizeof(siginfo_t) * signal_queue_sz);
         newsiginfo = signal_queue + (signal_queue_sz - 1);

         res = ptrace (PTRACE_GETSIGINFO, pid, NULL, newsiginfo);

This is inside a while (1) loop and could run indefinitely when valgrind itself is crashing (getting a SIGSEGV over and over again). I haven't identified precisely why valgrind is failing (it is only on s390x during the gdbserver_tests/nlvgdbsigqueue testcase), but I propose to limit this loop and bail out after having seen 64 non-SIGSTOP signals, so that vgdb isn't stuck inside this loop slowly eating all memory:

diff --git a/coregrind/vgdb-invoker-ptrace.c b/coregrind/vgdb-invoker-ptrace.c
index 389748960..07f3400f9 100644
--- a/coregrind/vgdb-invoker-ptrace.c
+++ b/coregrind/vgdb-invoker-ptrace.c
@@ -300,6 +300,10 @@ Bool waitstopped (pid_t pid, int signal_expected, const char *msg)
 
          // realloc a bigger queue, and store new signal at the end.
          // This is not very efficient but we assume not many sigs are queued.
+         if (signal_queue_sz >= 64) {
+            DEBUG(0, "too many queued signals while waiting for SIGSTOP\n");
+            return False;
+         }
          signal_queue_sz++;
          signal_queue = vrealloc(signal_queue,
                                  sizeof(siginfo_t) * signal_queue_sz);

Note that this is different from bug #434035, since that one involved a fatal signal. In this case the signal (SIGSEGV) isn't fatal, since valgrind tries to handle it (but fails).
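
For illustration only, the effect of the cap can be sketched outside of vgdb. The code below is not the vgdb source: sig_queue, sig_queue_push, simulate_signal_storm and MAX_QUEUED_SIGNALS are hypothetical names, plain realloc stands in for vrealloc, and no ptrace calls are made. It only shows that a queue which refuses to grow past 64 entries stops allocating even if the same signal keeps arriving.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for siginfo_t under strict -std= modes */
#endif
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical constant; 64 mirrors the limit used in the patch. */
#define MAX_QUEUED_SIGNALS 64

/* Hypothetical stand-in for vgdb's signal_queue / signal_queue_sz pair. */
typedef struct {
    siginfo_t *items;
    int size;
} sig_queue;

/* Append one entry, growing the array by one element.
   Returns false (and leaves the queue untouched) once the cap is reached
   or if the allocation fails. */
static bool sig_queue_push(sig_queue *q, const siginfo_t *info)
{
    assert(q != NULL && info != NULL);
    if (q->size >= MAX_QUEUED_SIGNALS) {
        fprintf(stderr, "too many queued signals\n");
        return false;
    }
    siginfo_t *grown = realloc(q->items, sizeof(siginfo_t) * (q->size + 1));
    if (grown == NULL)
        return false;
    q->items = grown;
    q->items[q->size] = *info;
    q->size++;
    return true;
}

/* Simulate a tracee that raises SIGSEGV `attempts` times in a row and
   return how many of those signals ended up in the queue. */
static int simulate_signal_storm(int attempts)
{
    sig_queue q = { NULL, 0 };
    siginfo_t info;
    memset(&info, 0, sizeof info);
    info.si_signo = SIGSEGV;

    for (int i = 0; i < attempts; i++) {
        if (!sig_queue_push(&q, &info))
            break;               /* bail out instead of looping forever */
    }

    int queued = q.size;
    free(q.items);
    return queued;
}
```

Calling simulate_signal_storm with a large count stops at the cap, while a small count is queued in full, so memory use stays bounded at 64 siginfo_t entries in this sketch.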

Comment 1 Andreas Arnez 2021-10-05 18:29:25 UTC

(In reply to Mark Wielaard from comment #0)
> [...] I haven't
> identified precisely why valgrind is failing (it is only on s390x during the
> gdbserver_tests/nlvgdbsigqueue testcase), [...]
Does this mean the test case is failing for you?  It isn't for me.  If you have more information, I'd look into that.

Comment 2 Mark Wielaard 2021-10-06 15:01:33 UTC

(In reply to Andreas Arnez from comment #1)
> (In reply to Mark Wielaard from comment #0)
> > [...] I haven't
> > identified precisely why valgrind is failing (it is only on s390x during the
> > gdbserver_tests/nlvgdbsigqueue testcase), [...]
> Does this mean the test case is failing for you?  It isn't for me.  If you
> have more information, I'd look into that.

Unfortunately, the issue occurs on a remote test machine that checks against the latest gcc and glibc. Before this workaround it blows up the machine because vgdb eats up all memory, and afterwards the nlvgdbsigqueue test does indeed FAIL.

So I would at least like to get this workaround in to not break the testing setup.

I am trying to get access to an s390x setup where this happens. It might be similar to the other issue I have seen with the latest glibc, where, if we get a fatal signal, try to terminate and call __libc_freeres, we get a SIGSEGV.

Comment 3 Mark Wielaard 2021-10-12 21:28:18 UTC

I pushed the workaround:

commit 970820852e542506dd7a4c722fecd73e34363fde
Author: Mark Wielaard <mark@klomp.org>
Date:   Tue Oct 12 23:25:32 2021 +0200

    vgdb: only queue up to 64 pending signals when waiting for SIGSTOP
    
    We should not queue infinite pending signals so we won't run out of
    memory when the SIGSTOP never arrives.

But keep this bug open because the root cause isn't known yet.

Comment 4 Mark Wielaard 2021-10-27 12:47:39 UTC

(In reply to Mark Wielaard from comment #3)
> I pushed the workaround:
> 
> commit 970820852e542506dd7a4c722fecd73e34363fde
> Author: Mark Wielaard <mark@klomp.org>
> Date:   Tue Oct 12 23:25:32 2021 +0200
> 
>     vgdb: only queue up to 64 pending signals when waiting for SIGSTOP
>     
>     We should not queue infinite pending signals so we won't run out of
>     memory when the SIGSTOP never arrives.
> 
> But keep this bug open because the root cause isn't known yet.

Here is a bug with more logs about failing gdb_server tests on s390x:
https://bugs.kde.org/show_bug.cgi?id=444481

Let's use that one to track s390x gdb_server issues and close this one, since the specific workaround for the "eat all memory" issue has been pushed.