Bug 345121 - helgrind/tests/hg05_race2 fails intermittently
Summary: helgrind/tests/hg05_race2 fails intermittently
Status: REPORTED
Alias: None
Product: valgrind
Classification: Developer tools
Component: helgrind
Version: 3.10 SVN
Platform: Other Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Julian Seward
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-03-13 14:41 UTC by Florian Krohm
Modified: 2021-05-29 22:23 UTC
3 users

See Also:
Latest Commit:
Version Fixed In:


Attachments

Description Florian Krohm 2015-03-13 14:41:29 UTC
Like so:

--- hg05_race2.stderr.exp	2015-03-13 14:29:40.635448659 +0100
+++ hg05_race2.stderr.out	2015-03-13 15:36:25.899281148 +0100
@@ -26,8 +26,7 @@
    at 0x........: th (hg05_race2.c:17)
    by 0x........: mythread_wrapper (hg_intercepts.c:...)
    ...
- Location 0x........ is 0 bytes inside foo.poot[5].plop[11],
- declared at hg05_race2.c:24, in frame #x of thread x
+ Address 0x........ is on thread #x's stack
 
 ----------------------------------------------------------------
 
@@ -42,8 +41,7 @@
    at 0x........: th (hg05_race2.c:17)
    by 0x........: mythread_wrapper (hg_intercepts.c:...)
    ...
- Location 0x........ is 0 bytes inside foo.poot[5].plop[11],
- declared at hg05_race2.c:24, in frame #x of thread x
+ Address 0x........ is on thread #x's stack


Observed on x86-64 with r15009 and VEX 3103.
Frequency of failure: about 1 in 6.
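
For reference, the pattern the test provokes can be sketched like this (hypothetical code: the struct and field names merely echo the error message's foo.poot[5].plop[11]; this is not the actual hg05_race2.c source):

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical types mirroring the "foo.poot[5].plop[11]" shape in the
   error message above. */
struct spam { int plop[20]; };
struct bar  { struct spam poot[10]; };

static void *th(void *arg) {
    struct bar *foo = arg;
    foo->poot[5].plop[11] = 1;      /* unsynchronised write, child thread */
    return NULL;
}

/* Returns the final value of the racy field.  Both threads write the
   same value, so the result is deterministic (1) even though helgrind
   still flags the unordered accesses as a race. */
int run_race(void) {
    struct bar foo;                 /* declared on this frame's stack */
    pthread_t tid;
    if (pthread_create(&tid, NULL, th, &foo) != 0)
        return -1;
    foo.poot[5].plop[11] = 1;       /* unsynchronised write, parent */
    pthread_join(tid, NULL);
    return foo.poot[5].plop[11];
}
```

Whether helgrind can name the racy field (rather than just "on thread #x's stack") depends on --read-var-info=yes and on the stack holding the variable being unwindable at the moment the error is reported.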
Comment 1 Florian Krohm 2015-03-13 20:42:56 UTC
I ran an outer memcheck on an inner helgrind running said test. Like so:

/tmp/outer/bin/valgrind --trace-children=yes --smc-check=all-non-file --run-libc-freeres=no /tmp/inner/bin/valgrind --tool=helgrind --vgdb-prefix=whatever --read-var-info=yes helgrind/tests/hg05_race2

Nothing interesting (other than a warning about the client switching stacks).
It takes a couple of minutes to run.

I did 3 runs and in all of them the output contains:

Address 0x..... is on thread #1's stack

instead of the expected
Location 0x........ is 0 bytes inside foo.poot[5].plop[11], 
declared at hg05_race2.c:24, in frame #x of thread x
Comment 2 Florian Krohm 2015-03-14 09:45:33 UTC
Once this testcase is fixed, check memcheck/tests/varinfo6 on ppc64 to see whether the stderr.exp-ppc64 file is still needed.
Comment 3 Philippe Waroquiers 2015-03-14 13:49:42 UTC
On Ubuntu 14.04 amd64, the test sometimes fails for the following reason:
when describing an address, at most 8 frames of each thread's stack are searched.
Sometimes, when helgrind detects the error, the main stack is not yet out of the clone
syscall (or is just after the syscall?).
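
The depth-limited search just described can be sketched as follows (hypothetical types and names, not valgrind's actual data structures; only the 8-frame limit comes from the description above):

```c
#include <stdbool.h>
#include <stddef.h>

#define N_FRAMES_TO_SEARCH 8   /* per-stack limit mentioned above */

/* Hypothetical frame record: true if debug info for this frame can
   describe the address being looked up. */
typedef struct { bool describes_addr; } Frame;

/* Returns the index of the first frame (within the limit) whose debug
   info describes the address, or -1.  On -1 the report falls back to
   the generic "Address 0x... is on thread #x's stack". */
int find_describing_frame(const Frame *frames, size_t n_frames) {
    size_t limit = n_frames < N_FRAMES_TO_SEARCH ? n_frames
                                                 : N_FRAMES_TO_SEARCH;
    for (size_t i = 0; i < limit; i++)
        if (frames[i].describes_addr)
            return (int)i;
    return -1;
}
```

In this bug the situation is even worse than exhausting the limit: the unwind of the main stack stops inside clone after a single frame, so the frame that declares the variable never appears among the candidates at all.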

In such a case gdb+vgdb shows the following stack trace:
#0  clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:81
#1  0x0000000004e4319a in do_clone (pd=pd@entry=0x6820700, attr=attr@entry=0xffefff5c0, 
    stackaddr=<optimized out>, stopped=<optimized out>, fct=0x4e440c0 <start_thread>, 
    clone_flags=4001536) at ../nptl/sysdeps/pthread/createthread.c:75
#2  0x0000000004e448bb in create_thread (stackaddr=<optimized out>, attr=0xffefff5c0, pd=0x6820700)
    at ../nptl/sysdeps/pthread/createthread.c:245
#3  __pthread_create_2_1 (newthread=<optimized out>, attr=<optimized out>, 
    start_routine=<optimized out>, arg=<optimized out>) at pthread_create.c:611
#4  0x0000000004c306e8 in pthread_create_WRK (thread=0x3d0f00, attr=0x681ffb0, start=<optimized out>, 
    arg=0x5154441 <clone+49>) at hg_intercepts.c:270
#5  0x0000000004c3162a in _vgw00000ZZ_libpthreadZdsoZd0_pthreadZucreateZAZa (thread=<optimized out>, 
    attr=<optimized out>, start=<optimized out>, arg=<optimized out>) at hg_intercepts.c:301
#6  0x0000000000400726 in main () at hg05_race2.c:29
#7  0x000000000507bec5 in __libc_start_main (main=0x4006cb <main>, argc=1, argv=0xffefffd08, 
    init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0xffefffcf8)
    at libc-start.c:287
#8  0x00000000004005d9 in _start ()


Valgrind scheduler state however shows the following:
(gdb) mo v.info sch

host stacktrace:
==11365==    at 0x38047A48: show_sched_status_wrk (m_libcassert.c:319)
==11365==    by 0x38091BE3: handle_gdb_valgrind_command (server.c:401)
==11365==    by 0x38091FC3: handle_gdb_monitor_command (server.c:557)
==11365==    by 0x38092BDF: handle_query (server.c:704)
==11365==    by 0x38092BDF: server_main (server.c:1045)
==11365==    by 0x3808D4EF: call_gdbserver (m_gdbserver.c:724)
==11365==    by 0x3808E099: vgPlain_gdbserver (m_gdbserver.c:791)
==11365==    by 0x380782BB: vgPlain_get_data_description (debuginfo.c:3761)
==11365==    by 0x38040EBF: vgPlain_describe_addr (m_addrinfo.c:101)
==11365==    by 0x3802B31A: vgHelgrind_describe_addr (hg_addrdescr.c:84)
==11365==    by 0x3802B3D6: vgHelgrind_get_and_pp_addrdescr (hg_addrdescr.c:110)
==11365==    by 0x38091741: handle_gdb_valgrind_command (server.c:434)
==11365==    by 0x38091FC3: handle_gdb_monitor_command (server.c:557)
==11365==    by 0x38092BDF: handle_query (server.c:704)
==11365==    by 0x38092BDF: server_main (server.c:1045)
==11365==    by 0x3808D4EF: call_gdbserver (m_gdbserver.c:724)
==11365==    by 0x3808E099: vgPlain_gdbserver (m_gdbserver.c:791)
==11365==    by 0x380782BB: vgPlain_get_data_description (debuginfo.c:3761)
==11365==    by 0x38040EBF: vgPlain_describe_addr (m_addrinfo.c:101)
==11365==    by 0x3802B31A: vgHelgrind_describe_addr (hg_addrdescr.c:84)
==11365==    by 0x3802BF98: vgHelgrind_update_extra (hg_errors.c:423)
==11365==    by 0x38043B2B: vgPlain_maybe_record_error (m_errormgr.c:829)
==11365==    by 0x3802C2D9: vgHelgrind_record_error_Race (hg_errors.c:525)
==11365==    by 0x3800170D: record_race_info.isra.23 (libhb_core.c:4911)
==11365==    by 0x380023E3: msmcread (libhb_core.c:4956)
==11365==    by 0x380026BC: zsm_sapply32__msmcread (libhb_core.c:5213)
==11365==    by 0x807533F0D: ???
==11365==    by 0x809747F1F: ???
==11365==    by 0x80668782F: ???
==11365==    by 0x681FEBF: ???
==11365==    by 0x80668782F: ???
==11365==    by 0x37FF: ???
==11365==    by 0x7: ???

sched status:
  running_tid=2

Thread 1: status = VgTs_Yielding
==11365==    at 0x5154441: clone (clone.S:81)
client stack range: [0xFFEFFE000 0xFFF000FFF] client SP: 0xFFEFFF538
valgrind stack top usage: 13176 of 1048576

Thread 2: status = VgTs_Runnable
==11365==    at 0x4006B1: th (hg05_race2.c:17)
==11365==    by 0x4C3086D: mythread_wrapper (hg_intercepts.c:234)
==11365==    by 0x4E44181: start_thread (pthread_create.c:312)
==11365==    by 0x515447C: clone (clone.S:111)
client stack range: [0x6021000 0x681FFFF] client SP: 0x681FEA0
valgrind stack top usage: 14824 of 1048576

(gdb) 

So, the Valgrind unwinder cannot properly unwind the stack when positioned
at address 0x5154441, which is just after the syscall instruction:
   0x000000000515443a <+42>:	mov    $0x38,%eax
   0x000000000515443f <+47>:	syscall 
=> 0x0000000005154441 <+49>:	test   %rax,%rax
   0x0000000005154444 <+52>:	jl     0x5154485 <clone+117>

Without unwinding past this clone call, the variable cannot be described via debug
info, and it is later described as simply being on the thread's stack.
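
This failure mode can be illustrated with a toy CFI lookup (a sketch only: the covered range below is invented, and real unwinders consult DWARF CFI tables; only the PC value 0x5154441 comes from the disassembly above):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical CFI coverage record: covers [lo, hi). */
typedef struct { uintptr_t lo, hi; } CfiRange;

bool cfi_covers(CfiRange r, uintptr_t addr) {
    return addr >= r.lo && addr < r.hi;
}

/* Unwinders conventionally look up CFI for a return address at
   (addr - 1), since a return address points just past the call.  For
   the innermost frame the raw PC is used, so a PC parked right after
   a syscall can fall just outside the covered range. */
uintptr_t cfi_lookup_addr(uintptr_t addr, bool innermost) {
    return innermost ? addr : addr - 1;
}
```

Nudging the innermost PC back by a byte or more (as the hack below in hg_main.c does for the post-thread-create unwind) brings the lookup back inside covered CFI and lets the walk continue past clone.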
Comment 4 Julian Seward 2015-03-14 14:30:45 UTC
I'm sure there's some godawful hack in the amd64-linux unwinder (or maybe it's
in Helgrind?) to move the PC along by 3 bytes so as to get it into an area covered
by CFI, so the unwinding can work.   Not sure where to start looking for it though.
Comment 5 Philippe Waroquiers 2015-03-14 15:29:17 UTC
(In reply to Julian Seward from comment #4)
> I'm sure there's some godawful hack in the amd64-linux unwinder (or maybe
> it's
> in Helgrind?) to move the PC along by 3 bytes so as to get it into an area
> covered
> by CFI, so the unwinding can work.   Not sure where to start looking for it
> though.

There is indeed a hack in helgrind, at hg_main.c:1579.
But this hack only applies to the unwind done for a post-thread-create in the parent.
So, a 'normal' unwind (as done for an error) fails.
Comment 6 Tony Tascioglu 2021-05-29 22:23:29 UTC
Hello,

I wanted to update this thread, as we are currently experiencing the hg05_race2 failure as well.

The test fails in a similar manner; the diff is the same as the one posted earlier.


--- hg05_race2.stderr.exp       2021-05-28 14:34:34.633721415 -0400
+++ hg05_race2.stderr.out       2021-05-29 18:00:20.561265855 -0400
@@ -26,8 +26,7 @@
    at 0x........: th (hg05_race2.c:17)
    by 0x........: mythread_wrapper (hg_intercepts.c:...)
    ...
- Location 0x........ is 0 bytes inside foo.poot[5].plop[11],
- declared at hg05_race2.c:24, in frame #x of thread x
+ Address 0x........ is on thread #x's stack
 
 ----------------------------------------------------------------
 
@@ -42,8 +41,7 @@
    at 0x........: th (hg05_race2.c:17)
    by 0x........: mythread_wrapper (hg_intercepts.c:...)
    ...
- Location 0x........ is 0 bytes inside foo.poot[5].plop[11],
- declared at hg05_race2.c:24, in frame #x of thread x
+ Address 0x........ is on thread #x's stack
 
 
 ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)


We first experienced this when the tests were run in qemu, but I have now reproduced it on multiple systems, including an Intel 4770 with Ubuntu 20.04 and a Ryzen 3500U with Fedora 34.

Running it with the vg_regtest script causes it to fail within a few runs of the script, particularly if the system is under high load or in a power-saving mode, though the failure is rather nondeterministic.

perl tests/vg_regtest --keep-unfiltered --loop-till-fail helgrind/tests/hg05_race2


We found that running the test under strace -f always causes it to produce the incorrect result. The incorrect result after running under strace shows the same diff against the expected output as described above.

This was tested with the following command line:

strace -f -o hg05.log -ttt perl tests/vg_regtest --keep-unfiltered helgrind/tests/tc09_bad_unlock
tc09_bad_unlock: valgrind   --fair-sched=try ./tc09_bad_unlock 
*** tc09_bad_unlock failed (stderr) ***

== 1 test, 1 stderr failure, 0 stdout failures, 0 stderrB failures, 0 stdoutB failures, 0 post failures ==
helgrind/tests/tc09_bad_unlock           (stderr)


We first noticed this issue a few weeks ago when Yocto Project builds started to fail occasionally due to the Valgrind tests. We believe we only started seeing it after SMP was enabled in the qemu runs.

Finally, we have seen a similar issue with the tc09_bad_unlock test, where we occasionally get an invalid result that differs from the expected output by one line. I can add the diff to this thread if that is appropriate. It also fails randomly when run with perl, but consistently when we strace it.

Thanks!