Like so:

--- hg05_race2.stderr.exp	2015-03-13 14:29:40.635448659 +0100
+++ hg05_race2.stderr.out	2015-03-13 15:36:25.899281148 +0100
@@ -26,8 +26,7 @@
     at 0x........: th (hg05_race2.c:17)
     by 0x........: mythread_wrapper (hg_intercepts.c:...)
   ...
 - Location 0x........ is 0 bytes inside foo.poot[5].plop[11],
 -  declared at hg05_race2.c:24, in frame #x of thread x
 + Address 0x........ is on thread #x's stack
 
 ----------------------------------------------------------------
 
@@ -42,8 +41,7 @@
     at 0x........: th (hg05_race2.c:17)
     by 0x........: mythread_wrapper (hg_intercepts.c:...)
   ...
 - Location 0x........ is 0 bytes inside foo.poot[5].plop[11],
 -  declared at hg05_race2.c:24, in frame #x of thread x
 + Address 0x........ is on thread #x's stack

Observed on x86-64 with r15009 and VEX 3103. Frequency of failure: about 1 in 6.
I ran an outer memcheck on an inner helgrind running said test. Like so:

/tmp/outer/bin/valgrind --trace-children=yes --smc-check=all-non-file \
  --run-libc-freeres=no /tmp/inner/bin/valgrind --tool=helgrind \
  --vgdb-prefix=whatever --read-var-info=yes helgrind/tests/hg05_race2

Nothing interesting (other than a warning about the client switching stacks). It takes a couple of minutes to run. I did 3 runs, and in all of them the output contains:

  Address 0x..... is on thread #1's stack

instead of the expected:

  Location 0x........ is 0 bytes inside foo.poot[5].plop[11],
   declared at hg05_race2.c:24, in frame #x of thread x
Once this testcase is fixed, check whether memcheck/tests/varinfo6 on ppc64 still needs the stderr.exp-ppc64 file.
On Ubuntu 14.04 amd64, the test sometimes fails for the following reason: a description for an address is searched for in at most 8 frames of each thread's stack. Sometimes, when helgrind detects the error, the main stack has not yet left the clone syscall (or is just after the syscall?). In such a case gdb+vgdb shows the following stack trace:

#0  clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:81
#1  0x0000000004e4319a in do_clone (pd=pd@entry=0x6820700, attr=attr@entry=0xffefff5c0, stackaddr=<optimized out>, stopped=<optimized out>, fct=0x4e440c0 <start_thread>, clone_flags=4001536) at ../nptl/sysdeps/pthread/createthread.c:75
#2  0x0000000004e448bb in create_thread (stackaddr=<optimized out>, attr=0xffefff5c0, pd=0x6820700) at ../nptl/sysdeps/pthread/createthread.c:245
#3  __pthread_create_2_1 (newthread=<optimized out>, attr=<optimized out>, start_routine=<optimized out>, arg=<optimized out>) at pthread_create.c:611
#4  0x0000000004c306e8 in pthread_create_WRK (thread=0x3d0f00, attr=0x681ffb0, start=<optimized out>, arg=0x5154441 <clone+49>) at hg_intercepts.c:270
#5  0x0000000004c3162a in _vgw00000ZZ_libpthreadZdsoZd0_pthreadZucreateZAZa (thread=<optimized out>, attr=<optimized out>, start=<optimized out>, arg=<optimized out>) at hg_intercepts.c:301
#6  0x0000000000400726 in main () at hg05_race2.c:29
#7  0x000000000507bec5 in __libc_start_main (main=0x4006cb <main>, argc=1, argv=0xffefffd08, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0xffefffcf8) at libc-start.c:287
#8  0x00000000004005d9 in _start ()

The Valgrind scheduler state however shows the following:

(gdb) mo v.info sch
host stacktrace:
==11365==    at 0x38047A48: show_sched_status_wrk (m_libcassert.c:319)
==11365==    by 0x38091BE3: handle_gdb_valgrind_command (server.c:401)
==11365==    by 0x38091FC3: handle_gdb_monitor_command (server.c:557)
==11365==    by 0x38092BDF: handle_query (server.c:704)
==11365==    by 0x38092BDF: server_main (server.c:1045)
==11365==    by 0x3808D4EF: call_gdbserver (m_gdbserver.c:724)
==11365==    by 0x3808E099: vgPlain_gdbserver (m_gdbserver.c:791)
==11365==    by 0x380782BB: vgPlain_get_data_description (debuginfo.c:3761)
==11365==    by 0x38040EBF: vgPlain_describe_addr (m_addrinfo.c:101)
==11365==    by 0x3802B31A: vgHelgrind_describe_addr (hg_addrdescr.c:84)
==11365==    by 0x3802B3D6: vgHelgrind_get_and_pp_addrdescr (hg_addrdescr.c:110)
==11365==    by 0x38091741: handle_gdb_valgrind_command (server.c:434)
==11365==    by 0x38091FC3: handle_gdb_monitor_command (server.c:557)
==11365==    by 0x38092BDF: handle_query (server.c:704)
==11365==    by 0x38092BDF: server_main (server.c:1045)
==11365==    by 0x3808D4EF: call_gdbserver (m_gdbserver.c:724)
==11365==    by 0x3808E099: vgPlain_gdbserver (m_gdbserver.c:791)
==11365==    by 0x380782BB: vgPlain_get_data_description (debuginfo.c:3761)
==11365==    by 0x38040EBF: vgPlain_describe_addr (m_addrinfo.c:101)
==11365==    by 0x3802B31A: vgHelgrind_describe_addr (hg_addrdescr.c:84)
==11365==    by 0x3802BF98: vgHelgrind_update_extra (hg_errors.c:423)
==11365==    by 0x38043B2B: vgPlain_maybe_record_error (m_errormgr.c:829)
==11365==    by 0x3802C2D9: vgHelgrind_record_error_Race (hg_errors.c:525)
==11365==    by 0x3800170D: record_race_info.isra.23 (libhb_core.c:4911)
==11365==    by 0x380023E3: msmcread (libhb_core.c:4956)
==11365==    by 0x380026BC: zsm_sapply32__msmcread (libhb_core.c:5213)
==11365==    by 0x807533F0D: ???
==11365==    by 0x809747F1F: ???
==11365==    by 0x80668782F: ???
==11365==    by 0x681FEBF: ???
==11365==    by 0x80668782F: ???
==11365==    by 0x37FF: ???
==11365==    by 0x7: ???
sched status:
  running_tid=2

Thread 1: status = VgTs_Yielding
==11365==    at 0x5154441: clone (clone.S:81)
client stack range: [0xFFEFFE000 0xFFF000FFF] client SP: 0xFFEFFF538
valgrind stack top usage: 13176 of 1048576

Thread 2: status = VgTs_Runnable
==11365==    at 0x4006B1: th (hg05_race2.c:17)
==11365==    by 0x4C3086D: mythread_wrapper (hg_intercepts.c:234)
==11365==    by 0x4E44181: start_thread (pthread_create.c:312)
==11365==    by 0x515447C: clone (clone.S:111)
client stack range: [0x6021000 0x681FFFF] client SP: 0x681FEA0
valgrind stack top usage: 14824 of 1048576
(gdb)

So, the Valgrind unwinder cannot properly unwind the stack when positioned at address 0x5154441, which is just after the syscall instruction:

   0x000000000515443a <+42>:	mov    $0x38,%eax
   0x000000000515443f <+47>:	syscall
=> 0x0000000005154441 <+49>:	test   %rax,%rax
   0x0000000005154444 <+52>:	jl     0x5154485 <clone+117>

Without unwinding past this clone call, the variable cannot be described via debug info, and it is later described as simply being on the stack of the thread.
I'm sure there's some godawful hack in the amd64-linux unwinder (or maybe it's in Helgrind?) to move the PC along by 3 bytes so as to get it into an area covered by CFI, so the unwinding can work. Not sure where to start looking for it though.
(In reply to Julian Seward from comment #4)
> I'm sure there's some godawful hack in the amd64-linux unwinder (or maybe
> it's in Helgrind?) to move the PC along by 3 bytes so as to get it into an
> area covered by CFI, so the unwinding can work. Not sure where to start
> looking for it though.

There is indeed a hack in helgrind, at hg_main.c:1579. But this hack only applies to the unwind done for a post-thread-create in the parent, so a 'normal' unwind (as done when recording an error) fails.
Hello, I wanted to update this thread, as we are currently experiencing the failure of hg05_race2 as well. The test fails in a similar manner; the diff is the same as the one posted earlier.

--- hg05_race2.stderr.exp	2021-05-28 14:34:34.633721415 -0400
+++ hg05_race2.stderr.out	2021-05-29 18:00:20.561265855 -0400
@@ -26,8 +26,7 @@
     at 0x........: th (hg05_race2.c:17)
     by 0x........: mythread_wrapper (hg_intercepts.c:...)
   ...
 - Location 0x........ is 0 bytes inside foo.poot[5].plop[11],
 -  declared at hg05_race2.c:24, in frame #x of thread x
 + Address 0x........ is on thread #x's stack
 
 ----------------------------------------------------------------
 
@@ -42,8 +41,7 @@
     at 0x........: th (hg05_race2.c:17)
     by 0x........: mythread_wrapper (hg_intercepts.c:...)
   ...
 - Location 0x........ is 0 bytes inside foo.poot[5].plop[11],
 -  declared at hg05_race2.c:24, in frame #x of thread x
 + Address 0x........ is on thread #x's stack
 
 ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)

We first experienced this when the tests were run in qemu, but I have now reproduced it on multiple systems, including an Intel 4770 with Ubuntu 20.04 and a Ryzen 3500U with Fedora 34.

Running it with the vg_regtest script causes it to fail within a few runs, particularly if the system is under high load or in a power-efficient mode, though it is rather nondeterministic:

perl tests/vg_regtest --keep-unfiltered --loop-till-fail helgrind/tests/hg05_race2

We found that running the test under strace -f causes it to always produce the incorrect result. The incorrect result after running with strace has the same diff against the expected output as described above.
This was tested with the following command line:

strace -f -o hg05.log -ttt perl tests/vg_regtest --keep-unfiltered helgrind/tests/tc09_bad_unlock

tc09_bad_unlock: valgrind --fair-sched=try ./tc09_bad_unlock
*** tc09_bad_unlock failed (stderr) ***

== 1 test, 1 stderr failure, 0 stdout failures, 0 stderrB failures, 0 stdoutB failures, 0 post failures ==
helgrind/tests/tc09_bad_unlock (stderr)

We first noticed this issue a few weeks ago, when Yocto project builds started to fail occasionally due to the Valgrind tests. We believe we only started seeing it after SMP was enabled in the qemu runs.

Finally, we have seen a similar issue with the tc09_bad_unlock test, where we occasionally get an invalid result that differs from the expected result by one line. I can add the diff to this thread if it is appropriate. It also fails randomly when run with perl, but consistently if we strace the output.

Thanks!