We test our software (Knot DNS) heavily with Valgrind (on Linux/x86_64). Very rarely, we encounter weird bugs that only appear when running under Valgrind. I tracked them down far enough to see what sometimes happens: when the application calls a syscall (e.g. futex), the syscall number (202 in this case) is returned instead of the proper return code (e.g. -110 == -ETIMEDOUT). In this case it leads to a failed assert inside glibc.

Similar behaviour has been observed in the past. For example, we once created a workaround for poll() occasionally returning the value 7 under valgrind, which we did not understand at the time; indeed, 7 is the syscall number of poll. We also have another workaround for epoll_wait() occasionally returning 232 under valgrind (again, its syscall number).

In any case, this happens extremely rarely: the program under valgrind must be busy for minutes to trigger it, and it can usually be reproduced only on a specific machine. Running the same thing on a different machine with the same system usually does not reproduce it, or does so far more rarely. It must somehow be related to thread timing.

Looking at the Valgrind source code led me to the file coregrind/m_syswrap/syscall-amd64-linux.S, but I am not able to see any obvious bug in this assembly. What intrigues me is that both the syscall number and the return value appear in the RAX register at some point. I also examined strace, but it did not give me any hint: from the kernel/strace perspective, the syscalls return proper codes. The issue must happen somewhere between the kernel and the application.

Thank you for any help and hints. I'd be happy to add more information as requested, or run some experimental code.
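For illustration, the historical poll() workaround mentioned above was of roughly this shape (a hypothetical minimal sketch, not the actual Knot DNS code; the function name is made up):

```c
#include <poll.h>

/* Hypothetical sketch of the kind of workaround described above (not the
 * actual Knot DNS code): under Valgrind, poll() very rarely appeared to
 * return its own syscall number (7 on Linux/x86_64) instead of the number
 * of ready descriptors, so such a value is treated as spurious and the
 * call is simply retried. */
static int poll_with_workaround(struct pollfd *fds, nfds_t nfds, int timeout)
{
    for (;;) {
        int ret = poll(fds, nfds, timeout);
        if (ret == 7 && nfds < 7) {
            continue;   /* 7 cannot be a real count of ready fds here */
        }
        return ret;
    }
}
```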
Do DRD or Helgrind ever report any errors? Does the patch in https://bugs.kde.org/show_bug.cgi?id=445743 (plus the change in comment 14) change anything?
Does strace show any signals being processed close (in time) to the system call that wrongly returns its syscall number? If a signal arrives while syscall-amd64-linux.S is being executed, then VG_(fixup_guest_state_after_syscall_interrupted) has some complex logic that interacts with the partially executed asm code.

Alternatively, more valgrind tracing might give some hints. You could try

  valgrind -v -v -v -d -d -d --trace-syscalls=yes --trace-signals=yes your_app

and if your application is multi-threaded (I guess it is), you might also add --trace-sched=yes.

With regards to "What intrigues me is that both the syscall number and the return value appear in the RAX register at some point": if you mean the physical RAX register, then I think this is normal. To execute a syscall, RAX must be set to the syscall number before the syscall instruction, and on return from the syscall instruction RAX contains the syscall return value. Once the syscall instruction has finished, the return code (stored by the kernel in the physical RAX) must be moved to the guest RAX.
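To illustrate the point about RAX, here is a minimal sketch of the plain Linux/amd64 syscall convention (nothing Valgrind-specific): the same register carries the number in and the result out.

```c
/* Minimal sketch of the amd64 Linux syscall convention: the syscall number
 * goes into RAX before the `syscall` instruction, and the kernel's return
 * value comes back in the same register.  Here the result of getpid(2)
 * (syscall number 39 on amd64) overwrites the 39 loaded a moment earlier. */
#include <stdio.h>

static long raw_getpid(void)
{
    long rax = 39;                      /* __NR_getpid on Linux/amd64    */
    __asm__ volatile ("syscall"
                      : "+a" (rax)      /* RAX: number in, result out    */
                      :
                      : "rcx", "r11", "memory");
    return rax;                         /* now the pid, no longer 39     */
}

int main(void)
{
    printf("pid = %ld\n", raw_getpid());
    return 0;
}
```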
@Paul Helgrind reports so many errors that I can hardly find any clue in them; I guess this tool is not intended for debugging (mostly) lockless multi-threaded programs (where concurrency is controlled by libRCU or other tricks). Anyway, when I run `valgrind --tool=helgrind`, the described bug appears as well, with roughly the same frequency as under memcheck. DRD also reports a ton of errors and is so slow that it does not even finish the test scenario within the time limit, so I am not drawing any conclusions from it. Shall I walk through those errors and look for something specific? I tried applying the patch called "Patch for FreeBSD" from bug 445743, as well as the hint from its comment 14. This did not change anything; the bug persists.

@Philippe Unless I have overlooked something, strace does not report any signal received shortly before the crash, besides the SIGABRT resulting from it. It also seems to me that the actual futex syscall is not in fact executed, but I am not 100% sure. I am going to attach the valgrind output from a crashing run of the test, with the parameters `-v -v -v -d -d -d --trace-syscalls=yes --trace-signals=yes --trace-sched=yes` as you suggested. It is pretty long; please grep for the word "abort". My speculation around RAX is based on the idea that if, due to some bug, the syscall is not actually invoked by valgrind but is expected to have been, the syscall number left in RAX could easily be misinterpreted as the return value. But this is of course too simple a view.
Created attachment 152081 [details] The log of valgrind crashing glibc by mishandled futex syscall.
The pair of entry/exit traces is

SYSCALL[260041,11](202) sys_futex ( 0x1ffefff9f4, 393, 0, 0x2633d890, 0x0 ) --> [async] ...
SYSCALL[260041,11](202) ... [async] --> Success(0xca)

The futex op is 393, i.e. 0x189: that is FUTEX_WAIT_BITSET (command 9) with FUTEX_PRIVATE_FLAG and FUTEX_CLOCK_REALTIME set, which is what glibc uses for a wait with an absolute CLOCK_REALTIME timeout.
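For reference, the op value can be decoded with the constants from <linux/futex.h>; a small standalone sketch (not taken from the logs or from glibc):

```c
#include <linux/futex.h>
#include <stdio.h>

int main(void)
{
    int op  = 393;                      /* 0x189 from the trace above        */
    int cmd = op & FUTEX_CMD_MASK;      /* strips the two flag bits          */
    printf("cmd=%d (FUTEX_WAIT_BITSET=%d), private=%d, clock_realtime=%d\n",
           cmd, FUTEX_WAIT_BITSET,
           !!(op & FUTEX_PRIVATE_FLAG),
           !!(op & FUTEX_CLOCK_REALTIME));
    return 0;
}
```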
I frankly don't understand the flags of the futex syscall; this is simply what glibc does. To possibly shed more light on the issue, I am going to upload a log from a completely different reproducer. This time the affected syscall is epoll_wait. Again, the bug causes it to return the number 232 instead of the proper return value (usually zero, one, or a negative value on error). This actually reproduces quite easily and often; I just had to remove the workaround from our software. I also let the affected thread wait for one second before crashing on the assert, so that the logging settles down (see the sketch below).
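The instrumentation described above might look roughly like this (a hypothetical sketch, not the actual Knot DNS code; the function name is made up):

```c
/* Hypothetical sketch: when epoll_wait() appears to return its own syscall
 * number (232 on Linux/x86_64), give the tracing a second to settle and
 * then abort so the spot is easy to find in the logs. */
#include <sys/epoll.h>
#include <assert.h>
#include <unistd.h>

static int checked_epoll_wait(int epfd, struct epoll_event *events,
                              int maxevents, int timeout)
{
    int ret = epoll_wait(epfd, events, maxevents, timeout);
    if (ret == 232 && maxevents < 232) {   /* cannot be a real event count */
        sleep(1);                          /* let -d -d -d logging settle  */
        assert(0);                         /* crash here, visibly          */
    }
    return ret;
}
```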
Created attachment 152095 [details] The log of valgrind crashing knotd after mishandled epoll_wait syscall.
I took a look at both logs. First the epoll log (tid is the thread id number used internally in valgrind).

What we see is that tid 14 first gets the result of a previous epoll syscall and then starts a new one:

....
--111235-- SCHED[14]: acquired lock (VG_(client_syscall)[async])   <<<<< tid 14 acquires the valgrind lock as the epoll syscall ends
SYSCALL[111235,14](232) ... [async] --> Success(0x0)               <<<<< and tid 14 reports the success of this syscall
--111235-- SCHED[14]: TRC: SYSCALL                                 <<<<< it launches a new epoll syscall
SYSCALL[111235,14](232) sys_epoll_wait ( 18, 0x2610f740, 1, 1000 ) --> [async] ...
--111235-- SCHED[14]: releasing lock (VG_(client_syscall)[async]) -> VgTs_WaitSys   <<<<< and releases the lock, waiting for the result
--111235-- SCHED[35]: acquired lock (VG_(scheduler):timeslice)     <<<<< and tid 35 acquires the lock
.....

Then, later on, tid 18 calls tgkill, sending a (fatal) signal to itself (I believe it is to itself; valgrind's tracing of the link between the tid and the linux thread id is not very clear). As this signal is fatal, all threads are killed by valgrind. We also see that a little bit before the tgkill, tid 18 does a write on fd 2; possibly that is an indication of an error/problem being reported.

The futex problem has a similar pattern: tid 6 starts a futex syscall and releases the valgrind lock. Some time later, tid 11 does an mmap and slightly after that calls tgkill. Similarly, tid 11 does a write on fd 2 a little bit before.

The processing of a fatal signal in valgrind is quite tricky: complex code with race conditions, see e.g. bug 409367. The fatal signal has to get all the threads out of their syscalls. For this, a kind of internal "vgkill" signal is sent by the valgrind scheduler to all threads. When such a signal is received, valgrind detects that the thread was in a syscall and that the syscall has to be "interrupted". For this, valgrind calls VG_(post_syscall). But post_syscall assumes that the guest state has been correctly "fixed up", and I do not see where this is done.

So, a hypothesis about what happens:
* the application encounters an error condition (in tid 18 in the epoll case, in tid 11 in the futex case)
* this application thread decides to call abort, generating a fatal signal
* valgrind's handling of a fatal signal is certainly complex, might still be racy, and might not properly reset the guest state when the fatal signal has to be handled by a thread that is in e.g. an epoll or futex syscall

As the guest state is not properly restored, when such a thread "resumes" and/or due to race conditions, instead of just dying it continues, and then itself reports a strange state, because the guest thread state was not properly restored by a call to VG_(fixup_guest_state_after_syscall_interrupted).

To validate this hypothesis, maybe the following could be done (see also the sketch after this list):
* check what this "write on fd 2" is doing (maybe with strace?)
* when the application encounters a problem, instead of calling abort (which sends a fatal signal), rather do e.g. sleep(100000). If the hypothesis is correct, the threads doing epoll or futex should just stay blocked in their syscalls, and the thread detecting the problem will sleep in this state. It might then be possible to attach with gdb+vgdb and investigate the state of the application and/or the valgrind threads.

There is a way to tell valgrind to launch its gdbserver in case of an abnormal valgrind exit, using the option:

  --vgdb-stop-at=event1,event2,...  invoke gdbserver for given events [none]
         where event is one of:
           startup exit valgrindabexit all none

Looks like we might add abexit to ask valgrind to call gdbserver when a client thread/process does an abnormal exit.
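A minimal sketch of the "sleep instead of abort" suggestion above (the function name and the DEBUG_FREEZE_INSTEAD_OF_ABORT macro are hypothetical, not existing Knot DNS code):

```c
/* Sketch of the suggested diagnostic: when the application detects its
 * internal error condition, park the thread instead of abort()ing, so no
 * fatal signal is generated and the other threads stay blocked in their
 * epoll/futex syscalls.  The frozen process can then be inspected with
 * gdb + vgdb. */
#include <stdlib.h>
#include <unistd.h>

static void report_fatal_error(void)
{
#ifdef DEBUG_FREEZE_INSTEAD_OF_ABORT
    for (;;)
        sleep(100000);    /* keep the whole process alive for inspection */
#else
    abort();              /* normal behaviour: raise SIGABRT */
#endif
}
```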
(In reply to Philippe Waroquiers from comment #8)
> So, a hypothesis about what happens:
> * the application encounters an error condition (in tid 18 in the epoll case, in tid 11 in the futex case)
> * this application thread decides to call abort, generating a fatal signal
> * valgrind's handling of a fatal signal is certainly complex, might still be racy, and might not properly reset the guest state when the fatal signal has to be handled by a thread that is in e.g. an epoll or futex syscall
> As the guest state is not properly restored, when such a thread "resumes" and/or due to race conditions, instead of just dying it continues, and then itself reports a strange state, because the guest thread state was not properly restored by a call to VG_(fixup_guest_state_after_syscall_interrupted).

I was thinking along similar lines to https://bugs.kde.org/show_bug.cgi?id=445743, where the syscall is supposed to never return when interrupted. Under Valgrind it does get interrupted and Valgrind doesn't resume it. It would be interesting to see if strace shows any ERESTARTNOINTR return status values.
@Philippe Thanks a lot for your deep analysis! I would not have been able to see such things in context. However, I think you are not entirely correct: the observed bug does not seem to be caused by another thread aborting. The crash on the assert looks like a consequence, not the cause, of the mess. I prepared another scenario where the program does not crash at all; it simply ignores the wrong 0xe8 value returned from epoll and continues as if it were zero. In this run it happened five times, the program did not crash, and it operated smoothly up to the end of the test case. By the way, the writing to fd 2 (stderr) is done only by Valgrind, due to the `-d -d -d` setting. I am going to attach those logs, this time including stderr. I hope this helps to find the definitive cause :)
Created attachment 152221 [details] The log of valgrind returning wrong syscall value five times.
And do you see any ERESTARTNOINTR with strace?
(In reply to Paul Floyd from comment #12)
> And do you see any ERESTARTNOINTR with strace?

No. The only errors I see returned from failed syscalls are:

EAGAIN ECONNREFUSED EEXIST EINPROGRESS EINTR EINVAL ENOENT ERESTARTSYS

I am going to attach the strace. Please note that this strace is not from the same run as the logs I attached earlier today, but from a similar scenario, and the bug appeared in the same way (just possibly a different number of times).
Created attachment 152228 [details] Strace of failing syscalls from a reproducing scenario.
OK, so it looks like a problem with VG_(fixup_guest_state_after_syscall_interrupted), as described by Philippe.
In one of the logs I see the trace below. It looks like a SIGALRM signal is delivered to the thread that encounters the futex 202 result.

--24048-- async signal handler: signal=14, vgtid=24051, tid=4, si_code=-6, exitreason VgSrc_None
--24048-- interrupted_syscall: tid=4, ip=0x580e687e, restart=False, sres.isErr=False, sres.val=202
--24048--   at syscall instr: returning EINTR
--24048-- delivering signal 14 (SIGALRM):-6 to thread 4
--24048-- push_signal_frame (thread 4): signal 14
==24048==    at 0x4C2C340: futex_wait (futex-internal.h:146)
==24048==    by 0x4C2C340: __lll_lock_wait (lowlevellock.c:49)
==24048==    by 0x4C32322: __pthread_mutex_cond_lock (pthread_mutex_lock.c:93)
==24048==    by 0x4C2E9B3: __pthread_cond_wait_common (pthread_cond_wait.c:616)
==24048==    by 0x4C2E9B3: pthread_cond_wait@@GLIBC_2.3.2 (pthread_cond_wait.c:627)
==24048==    by 0x14184D: worker_main (pool.c:70)
==24048==    by 0x1395B2: thread_ep (dthreads.c:146)
==24048==    by 0x4C2FB42: start_thread (pthread_create.c:442)
==24048==    by 0x4CC0BB3: clone (clone.S:100)

So this is another indication that the problem is likely linked to VG_(fixup_guest_state_after_syscall_interrupted). But it is not very clear what is special about your application.

Can you also reproduce the problem with the --tool=none tool, or does it happen only with memcheck?

Can you check whether the problem goes away with --vex-iropt-register-updates=allregs-at-each-insn? If it cannot be reproduced with this setting, can you see if it reproduces with --vex-iropt-register-updates=allregs-at-mem-access?
Thank you for looking at this issue so persistently! I confirm that the issue reproduces with --tool=none as well; this is also true for --tool=helgrind, as I said earlier. Even with --vex-iropt-register-updates=allregs-at-each-insn the issue still reproduces with the same frequency.

I took a much closer look at the signals in our application. SIGALRM is sometimes sent by the program to itself in order to notify all threads about something. However, the issue reproduces even in cases where neither SIGALRM nor any other signal arrives around the time it happens, so it must (also) be caused by something else.

I am going to attach logs from two more test cases. In the first, both epoll and futex return the invalid values (0xe8 and 0xca, respectively), which results in the glibc assert and the subsequent SIGABRT; that is, however, the consequence, not the cause, of the issue. In the second, only epoll returns the wrong 0xe8 value and the program keeps running to the end, with SIGTERM received only at the very end. The valgrind parameters were "--tool=none -v -v -v -d -d -d --vex-iropt-register-updates=allregs-at-each-insn --trace-syscalls=yes --trace-signals=yes --trace-sched=yes --num-callers=20 --trace-children=yes --trace-children-skip=/usr/*sh,/bin/*sh" in both cases.

It seems that different threads suffer the same problem at the same moment. Sometimes the issue happens only once for one thread, sometimes multiple threads receive the wrong syscall return codes in close succession, but mostly without any signal observed, unless I am overlooking something terribly. Finally, I confirm that the issue still reproduces even with --vex-iropt-register-updates=allregs-at-mem-access.
Created attachment 152437 [details] Logs of wrong syscall retvals with no signals observed.
I took a look at the attached logs. A first observation:
* We have 2 groups of 3 threads that get the 0xe8 syscall return.
* For each of these 2 groups, a little bit before these 0xe8 returns, there is a connection to the embedded gdbserver of Valgrind.

Here are the line numbers and occurrences of the 0xe8 syscall return:

6 matches for "0xe8" in buffer: valgrind
26734:SYSCALL[61639,20](232) ... [async] --> Success(0xe8)
26774:SYSCALL[61639,17](232) ... [async] --> Success(0xe8)
26789:SYSCALL[61639,15](232) ... [async] --> Success(0xe8)
31141:SYSCALL[61639,14](232) ... [async] --> Success(0xe8)
31176:SYSCALL[61639,16](232) ... [async] --> Success(0xe8)
31206:SYSCALL[61639,13](232) ... [async] --> Success(0xe8)

And here are the 3 matches for the gdbserver:

3 matches for "TO DEBUG" in buffer: valgrind
286:==61639== TO DEBUG THIS PROCESS USING GDB: start GDB like this
26668:==61639== TO DEBUG THIS PROCESS USING GDB: start GDB like this
31121:==61639== TO DEBUG THIS PROCESS USING GDB: start GDB like this

where the first one is the message produced at startup.

Maybe this is a modified executable that triggers a call to vgdb/gdb when it encounters this syscall problem? Or is there something that attaches to the valgrind gdbserver or sends a command to it? In that case, we could have an interaction between vgdb and the many threads blocked in syscalls.

We see the following in the stderr trace:

--61639:2: gdbsrv stored register 0 size 8 name rax value 0000000000000007 tid 1 status VgTs_WaitSys
--61639:2: gdbsrv stored register 0 size 8 name rax value 00000000000000e8 tid 15 status VgTs_WaitSys
--61639:2: gdbsrv stored register 0 size 8 name rax value 00000000000000e8 tid 17 status VgTs_WaitSys
--61639:2: gdbsrv stored register 0 size 8 name rax value 00000000000000e8 tid 20 status VgTs_WaitSys
....
--61639:1: gdbsrv stop_pc 0x4CAC04E changed to be resume_pc 0x4C9CD7F: poll (poll.c:29)
--61639:2: gdbsrv stored register 0 size 8 name rax value 0000000000000007 tid 1 status VgTs_WaitSys
--61639:2: gdbsrv stored register 0 size 8 name rax value 00000000000000e8 tid 13 status VgTs_WaitSys
--61639:2: gdbsrv stored register 0 size 8 name rax value 00000000000000e8 tid 14 status VgTs_WaitSys
--61639:2: gdbsrv stored register 0 size 8 name rax value 00000000000000e8 tid 16 status VgTs_WaitSys
--61639:1: gdbsrv VG core calling VG_(gdbserver_report_signal) vki_nr 15 SIGTERM gdb_nr 15 SIGTERM tid 1

So, for the 2 groups of 3 threads that got the 0xe8 syscall return, we see that the valgrind gdbserver was instructed to put 0xe8 into the rax register. It is, however, difficult to relate the stderr output to the valgrind output.

If you could redo the trace with the none tool but keep the stderr and the valgrind output together (i.e. let valgrind write its output to stderr together with its debug log), and add --time-stamp=yes, that might help to see what happens in which order.

I have to say that at this point I do not have many ideas about what else to look at. To further investigate and possibly find the bug, we will likely need an (easy to run) reproducer.
Thank you for your observations! Based on them, we found out that the issue happens exactly (though only sometimes!) when we attach vgdb to the running process, like this:

```
/usr/bin/gdb -ex "set confirm off" -ex "target remote | /usr/bin/vgdb --pid=5944" -ex "info threads" -ex "thread apply all bt full" -ex q /home/peltan/master_knot/src/knotd
```

I apologize that we overlooked this important fact earlier. (Our test environment performs this automatically when a routine error occurs.)

We will continue working on minimizing the reproducer in the following days.
(In reply to Libor Peltan from comment #20)
> Thank you for your observations! Based on them, we found out that the issue happens exactly (though only sometimes!) when we attach vgdb to the running process, like this:
>
> ```
> /usr/bin/gdb -ex "set confirm off" -ex "target remote | /usr/bin/vgdb --pid=5944" -ex "info threads" -ex "thread apply all bt full" -ex q /home/peltan/master_knot/src/knotd
> ```
>
> I apologize that we overlooked this important fact earlier. (Our test environment performs this automatically when a routine error occurs.)
>
> We will continue working on minimizing the reproducer in the following days.

That starts to clarify where the problem could originate. Valgrind should stop by itself when it finds an error (when using the --vgdb-error argument) and invoke its gdbserver, waiting for gdb/vgdb to connect. Do you mean that the above command is launched (somewhat asynchronously) when an error is detected by other means?

If so, as a way to bypass this bug, you might try to have valgrind invoke its gdbserver and then launch gdb/vgdb, rather than having gdb/vgdb 'interrupt' a (still) running process.

But in any case, what you are doing should not cause a problem. When I have a little bit of time, I will dig into the vgdb logic again and see if/where it could create such a wrong interaction.
(In reply to Philippe Waroquiers from comment #21)
> Valgrind should stop by itself when it finds an error (when using the --vgdb-error argument)

The error I mentioned is an error in the application logic. Valgrind has no way to detect it, no reason to stop in that case, and it does not.

> If so, as a way to bypass this bug, you might try to have valgrind invoke its gdbserver and then launch gdb/vgdb, rather than having gdb/vgdb 'interrupt' a (still) running process.

Yes, we can use the workaround of not invoking vgdb at all in such cases. But in any case, it would be nice to fix what is going wrong in Valgrind.

> But in any case, what you are doing should not cause a problem. When I have a little bit of time, I will dig into the vgdb logic again and see if/where it could create such a wrong interaction.

Thank you for your continued effort on this issue! I tried to create a tiny program that only listens for incoming UDP packets using the poll or epoll_wait syscalls, and I frequently attached vgdb to it (roughly along the lines of the sketch below). Unfortunately, the issue did not reproduce in this simple scenario. The only (at least fairly reliable) reproducer we have so far is the whole Knot DNS and its test case. If you wish, we can give you instructions on how to build it and launch the test case, but it will take several steps.
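For context, the simple reproducer attempt was roughly of this shape (a hypothetical minimal sketch, not the exact program used; error checking omitted), with the gdb/vgdb command from comment 20 being attached repeatedly from another shell while it runs:

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>

int main(void)
{
    /* A single UDP socket polled with epoll_wait() in a loop, asserting
     * that the return value is one epoll_wait() can legitimately produce
     * for maxevents == 1 (a bogus 232 would trip the assert). */
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(12345),
                                .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(sock, (struct sockaddr *)&addr, sizeof(addr));

    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = sock };
    epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev);

    for (;;) {
        struct epoll_event out;
        int n = epoll_wait(epfd, &out, 1, 1000);
        assert(n == -1 || n == 0 || n == 1);
        if (n == 1) {
            char buf[512];
            recv(sock, buf, sizeof(buf), 0);   /* drain the packet */
        }
    }
}
```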
(In reply to Libor Peltan from comment #22)
> I tried to create a tiny program that only listens for incoming UDP packets using the poll or epoll_wait syscalls, and I frequently attached vgdb to it. Unfortunately, the issue did not reproduce in this simple scenario.

You might need a mixture of threads executing cpu instructions, threads blocked in syscalls, and some signals such as SIGALRM.

> The only (at least fairly reliable) reproducer we have so far is the whole Knot DNS and its test case. If you wish, we can give you instructions on how to build it and launch the test case, but it will take several steps.

If the instructions are not too long to describe, it will not hurt to have them. If they are simple enough and building Knot DNS does not need too many dependencies etc., I can give it a try.
Hi Philippe, if you are in the mood to try it out with Knot DNS, these are the instructions to reproduce:

1) Download the Knot DNS sources from https://gitlab.nic.cz/knot/knot-dns . It's OK to stay on the `master` branch.
2) Install the build prerequisites. They are listed in the README.
3) Build it: `autoreconf -if && ./configure && make`. It's not necessary to `make install`.
4) Go to the `tests-extra` subdirectory and install the testing prerequisites with `pip3 install -r requirements.txt`.
5) In the file `tools/dnstest/params.py`, edit the variable `valgrind_flags` to your liking. At least add `--trace-syscalls=yes`.
6) Run `./runtests.py ixfr/redundant_ctl_deny`. Find the valgrind output in `~/knottest-last/ixfr/redundant_ctl_deny/knot1/valgrind` (where `knottest-last` is always a symlink somewhere into `/tmp`). Grep for `0xe8` to see if the issue reproduced.

You will probably need to run the test several times until it reproduces. It may also happen that on some machines it never reproduces. For me, it easily reproduces on my laptop, but hardly ever on a powerful server.
(In reply to Libor Peltan from comment #24)
> You will probably need to run the test several times until it reproduces. It may also happen that on some machines it never reproduces. For me, it easily reproduces on my laptop, but hardly ever on a powerful server.

Thanks for the detailed instructions. I was able to reproduce the bug; it happens in roughly 1 out of 5 trials. I will keep working on this, time permitting (valgrind is a weekend activity, so only a few hours from time to time).

One question: the test writes the valgrind messages into the file 'valgrind' and the valgrind debug output to stderr. This setup is unusual, and I would like to have both outputs in the same file. I have not found the piece of code that actually launches valgrind and produces these 2 separate files (I do not quite understand how this is done, as valgrind messages and debug output are normally both written to the same fd). Is there an easy change to the test framework that would ensure all the output goes to either stderr or the valgrind file?

Thanks
Philippe
(In reply to Philippe Waroquiers from comment #25)
> Is there an easy change to the test framework that would ensure all the output goes to either stderr or the valgrind file?

Hello Philippe, I am working on this issue together with Libor. If you check out this branch: https://gitlab.nic.cz/knot/knot-dns/-/commits/valgrind_vgdb_bug , it includes a few changes that should help you with the debugging.

1. All output from valgrind is written to knot1/stderr (the split output you saw before was apparently produced by valgrind because the `--log-file` option was used).
2. The valgrind options you suggested are already set in the framework. If you need to change them, it's obvious from the last commit where they are.
3. There is a new "test", `ctl/valgrind`, that is simple and triggers the bug very quickly. You can run it as `./runtests.py ctl/valgrind` from knot-dns/tests-extra, the same way as the previous test.
4. When knotd encounters the bug, it now calls abort. That makes it easy to find the place in the valgrind output (knot1/stderr) by searching for 'abort'. If you don't need this, you can remove the two asserts added in the last commit.
(In reply to David Vasek from comment #26)
> I am working on this issue together with Libor. If you check out this branch: https://gitlab.nic.cz/knot/knot-dns/-/commits/valgrind_vgdb_bug , it includes a few changes that should help you with the debugging.

Thanks for the above. With this, I was able to identify the likely culprit for the bug and have pushed a fix as commit 348775f34.

The valgrind regression tests were run on debian/amd64, centos/ppc64 and ubuntu/arm64. Also, the ctl/valgrind knot test was run 80 times without encountering the abort (while before the fix, every run of ctl/valgrind triggered the bug).

It would be nice if you could check with the latest state of the git repository whether everything now works as expected. Thanks for the clear and detailed instructions to reproduce the bug, which were critically needed to reproduce this race condition.
I confirm that the issue is fixed in Valgrind built from current Git master branch. Thank you very much for your cooperation and patience.