| Summary: | I would really like helgrind to work again... | ||
|---|---|---|---|
| Product: | [Developer tools] valgrind | Reporter: | François PELLEGRINI <pelegrin> |
| Component: | helgrind | Assignee: | Julian Seward <jseward> |
| Status: | RESOLVED FIXED | ||
| Severity: | wishlist | CC: | dooglus, trapni |
| Priority: | NOR | ||
| Version First Reported In: | 3.2.1 | ||
| Target Milestone: | --- | ||
| Platform: | Compiled Sources | ||
| OS: | Linux | ||
| Latest Commit: | | Version Fixed/Implemented In: | |
| Sentry Crash Report: | |||
| Attachments: | thrcheck output calling g_thread_create_full() | ||
| | suppressions I had to add to stop complaints about libraries | ||
I second that. -Brian

*** This bug has been confirmed by popular vote. ***

A replacement for Helgrind is under active development and nearing beta status. Try this:

    svn co svn://svn.valgrind.org/valgrind/branches/THRCHECK
    cd THRCHECK
    ./autogen.sh && ./configure --prefix=`pwd`/Inst
    make && make install
    ./Inst/bin/valgrind --tool=thrcheck my_threaded_app

Feedback welcomed.

I just tried it, and it seems to work. Here are some comments I found after fooling around with it for only a bit:

a) Is there documentation available? I only glanced cursorily at the sources, but nothing seemed to catch my eye.

b) It might need some default suppressions. I seem to be getting possible data races for unlocking a mutex, for example (for some internal variable (data.lock or something) of the mutex itself). I doubt this is actually an error on my part.

c) The output is good; the "reason" line is very helpful. What might be nice would be "last access by <stacktrace> at <time> (read/write)".

d) Since I have no documentation: does it do deadlock detection? It would be nice to have.

e) I don't know if it is possible, but it would be nice to automatically disregard cases where thread A initializes some data, then starts a worker thread B to work on it, rather than flagging them as a "possible data race". Accesses before thread B actually came alive can never really be a data race and therefore do not need protection. The same of course applies to accesses after thread B dies.

But good work, I am sure this will help me a lot with my work. Thanks a lot.

Thanks for the feedback.

> a) Is there documentation available? I only glanced cursorily at the
> sources, but nothing seemed to catch my eye.

thrcheck/docs/tc-manual.xml. Try this:

    (cd docs && make html-docs)
    konqueror docs/html/tc-manual.html

> b) It might need some default suppressions.

Yes. It already has a lot, but probably not enough. What Linux distro are you using (and which glibc version and CPU)?

> c) The output is good; the "reason" line is very helpful.
> What might be nice would be "last access by <stacktrace> at <time> (read/write)".

Problem is, capturing a stacktrace is expensive, and doing that for every memory reference would slow it down enormously. It would be possible to take a stacktrace every time a lock changes state (is locked/unlocked), but I'm not sure how that would be useful.

> d) Since I have no documentation: does it do deadlock detection?

It does potential deadlock detection; do 'make check' and then run thrcheck/tests/{tc13_laog,tc14_laog_dinphils,tc15_laog_lockdel}.

> e) I don't know if it is possible, but it would be nice to automatically
> disregard cases where thread A initializes some data, then starts a
> worker thread B to work on it, rather than flagging them as a "possible
> data race".

Both of these cases should be handled correctly. See

    thrcheck/tests/tc02_simple_tls.c
    thrcheck/tests/tc03_re_excl.c

If they are not handled correctly, it is a bug in thrcheck which should be fixed if possible. Can you look a bit more at this? Maybe produce a small test case?

Julian Seward wrote:
[bugs.kde.org quoted mail]

Thx.

>> b) It might need some default suppressions.
>
> Yes. It already has a lot, but probably not enough. What Linux
> distro are you using (and which glibc version and CPU)?

This system was a P4 3.2GHz HT (one of the old Prescott HT ones) running Gentoo, with glibc-2.5:

    GNU C Library stable release version 2.5, by Roland McGrath et al.
    Copyright (C) 2006 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.
    There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
    PARTICULAR PURPOSE.
    Compiled by GNU CC version 3.4.4 (Gentoo 3.4.4-r1, ssp-3.4.4-1.0, pie-8.7.8).
    Compiled on a Linux >>2.6.14-gentoo-r4<< system on 2007-09-17.
    Available extensions:
        C stubs add-on version 2.1.2
        crypt add-on version 2.1 by Michael Glad and others
        Gentoo patchset 1.8
        GNU Libidn by Simon Josefsson
        GNU libio by Per Bothner
        NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk
        Native POSIX Threads Library by Ulrich Drepper et al
        Support for some architectures added on, not maintained in glibc core.
        BIND-8.2.3-T5B
        Thread-local storage support included.
    For bug reporting instructions, please see:
    <http://www.gnu.org/software/libc/bugs.html>.

>> c) The output is good; the "reason" line is very helpful. What might be nice
>> would be "last access by <stacktrace> at <time> (read/write)".
>
> Problem is, capturing a stacktrace is expensive, and doing that for
> every memory reference would slow it down enormously. It would be
> possible to take a stacktrace every time a lock changes state
> (is locked/unlocked); but not sure how that would be useful.

Maybe it is possible to make a macro similar to CALLGRIND_TOGGLE_COLLECT that toggles collection of these stacktraces for small segments of code? Then it would be possible to only turn it on for the interesting parts.

>> d) Since I have no documentation: does it do deadlock detection?
>
> It does potential deadlock detection; do 'make check' and then run
> thrcheck/tests/{tc13_laog,tc14_laog_dinphils,tc15_laog_lockdel}.

Hmm, I do get "Lock acquisition order is inconsistent" errors. I haven't checked the testcase sources; are those programs actually deadlocking? (They seem to terminate when called normally.)

>> e) I don't know if it is possible, but it would be nice to automatically
>> disregard cases where thread A initializes some data, then starts a
>> worker thread B to work on it, rather than flagging them as a "possible
>> data race".
>
> Both of these cases should be handled correctly.
> See
>     thrcheck/tests/tc02_simple_tls.c
>     thrcheck/tests/tc03_re_excl.c
>
> If they are not handled correctly it is a bug in thrcheck which
> should be fixed if possible. Can you look a bit more at this?
> Maybe produce a small test case?

Both testcases run through valgrind without error. I will have to analyze the thing I saw in my program in more detail. I know that it is impossible for it to be an actual race, but the program is quite complex and I am not sure whether this is actually possible for valgrind to tell, or just obvious from what the program does.

> This system was a P4 3.2GHz HT (one of the old Prescott HT ones)
> running Gentoo, with glibc-2.5:

32-bit mode or 64-bit mode?

> Hmm, I do get "Lock acquisition order is inconsistent" errors. I haven't
> checked the testcase sources; are those programs actually deadlocking?
> (They seem to terminate when called normally.)

What this detects is cases where a deadlock *might* happen, depending on how the threads are scheduled. In tc14_laog_dinphils.c, if all 5 philosopher threads do

    pthread_mutex_lock(&chop[left]);

exactly at the same time, and then do the next line

    pthread_mutex_lock(&chop[right]);

exactly at the same time, then the program will deadlock. That's what the tool is telling you. I will improve the error message -- this is very new functionality.

More generally, if all threads must acquire two locks L1 and L2 before accessing a shared resource R, then all threads must acquire L1 and L2 in the same order (L1 then L2, or L2 then L1). Doing otherwise risks deadlock, since one thread could acquire L1 and another L2, and then they deadlock when trying to acquire L2 and L1 respectively.

> Both testcases run through valgrind without error. I will have to
> analyze the thing I saw in my program in more detail. I know that it is
> impossible for it to be an actual race, but the program is quite complex
> and I am not sure whether this is actually possible for valgrind to tell,
> or just obvious from what the program does.
Let me know when you have more details.

Julian Seward wrote:
[bugs.kde.org quoted mail]

32-bit. This CPU doesn't do 64-bit at all. If you want me to test some stuff, I also have various flavors of Gentoo on a Core2 Quad, a Dual Opteron, and two AthlonXPs around here, plus some other distros in Xen on the Core2 Quad.

>> Hmm, I do get "Lock acquisition order is inconsistent" errors. I haven't
>> checked the testcase sources; are those programs actually deadlocking?
>> (They seem to terminate when called normally.)
>
> What this detects is cases where a deadlock *might* happen, depending on
> how the threads are scheduled. In tc14_laog_dinphils.c, if all 5
> philosopher threads do
>
>     pthread_mutex_lock(&chop[left]);
>
> exactly at the same time, and then do the next line
>
>     pthread_mutex_lock(&chop[right]);
>
> exactly at the same time, then the program will deadlock. That's what
> the tool is telling you. I will improve the error message -- this is
> very new functionality.
>
> More generally, if all threads must acquire two locks L1 and L2 before
> accessing a shared resource R, then all threads must acquire L1 and L2
> in the same order (L1 then L2, or L2 then L1). Doing otherwise risks
> deadlock, since one thread could acquire L1 and another L2, and then
> they deadlock when trying to acquire L2 and L1 respectively.

I see. I found one such potential deadlock in my app, but since it is in the logging class the number of potential threads and places where it is called is huge. Could you perhaps output which threads tried the potential deadlock and where? Something like: "Threads X and Y now locked in order L1 <stacktrace>, L2 <stacktrace>, and it was previously locked in order L2 <stacktrace>, L1 <stacktrace> by Threads A and B."

>> Both testcases run through valgrind without error. I will have to
>> analyze the thing I saw in my program in more detail.
>> I know that it is impossible for it to be an actual race, but the program
>> is quite complex and I am not sure whether this is actually possible for
>> valgrind to tell, or just obvious from what the program does.
>
> Let me know when you have more details.

Ok, I figured out what happens:

a) Thread #1 creates a variable and initializes it.
b) Thread #2 is created and reads the variable.
c) Thread #1 later overwrites the variable. <- this triggers the potential race

In my case this cannot race, because there is network traffic involved between (b) and (c), making it impossible for them to happen at the same time ((c) is in response to a request sent at (b)).

Which leads me to the question: what kinds of synchronization points are handled correctly? Take the following example:

a) Create Threads #1 and #2.
b) Thread #2 waits on synchronization primitive X.
c) Thread #1 writes shared variable A.
d) Thread #1 triggers X to release Thread #2.
e) Thread #2 writes shared variable A.

This sequence could never race, but for which kinds of X is this detected? It obviously works for mutexes, but would it work, for example, for a condition variable, a semaphore or a memory barrier? I use constructs like these quite a lot, so it would be nice if that worked ;-) Of course, some of them synchronize via network traffic, pipes, or other things which will probably never work ...

mucki

> I see. I found one such potential deadlock in my app, but since it is in
> the logging class the number of potential threads and places where it is
> called is huge. Could you perhaps output which threads tried the
> potential deadlock and where? Something like: "Threads X and Y now
> locked in order L1 <stacktrace>, L2 <stacktrace>, and it was previously
> locked in order L2 <stacktrace>, L1 <stacktrace> by Threads A and B."

I'll look into doing that; won't be this week though. Maybe next week.

> Which leads me to the question: what kinds of synchronization points are
> handled correctly. [...] detected?
> It obviously works for mutexes, but would it work, for example,
> for a condition variable, a semaphore or a memory barrier?

Mutexes and condition variables. However, it might be possible to extend it. One way you could help is to send some small example programs using semaphores and memory barriers that demonstrate the problem. Then I can consider whether it is possible to fix the tool to handle them.

Julian Seward wrote:
[bugs.kde.org quoted mail]
No worries :) I am already happy to see that something is working and
that valgrind is not really dead (the news section looks very quiet ;-) )
>> Which leads me to the question: what kinds of synchronization points are
>> handled correctly. [...]
>> detected? It obviously works for mutexes, but would it work, for example,
>> for a condition variable, a semaphore or a memory barrier?
>
>
> Mutexes and condition variables. However it might be possible to extend it.
> One way you could help is to send some small example programs using semaphores
> and memory barriers, that demonstrate the problem. Then I can consider if
> it is possible to fix the tool to handle them.
>
I'm pretty swamped with work atm, and all my synchronization and
threading stuff is buried in utility libs, so it might take a while, but
I will see what I can do.
Created attachment 21813 [details]
thrcheck output calling g_thread_create_full()
I'm seeing the attached output when I create a thread. This is on Ubuntu Feisty:
libc6-dev 2.5-0ubuntu14
libgtk2.0-dev 2.10.11-0ubuntu3
libgtkmm-2.4-dev 2.10.8-0ubuntu
Is this a case of a missing suppression? Or am I misunderstanding the output?
I also see a data race while obtaining a read lock:

    ==6031== Possible data race during write to 0x172B8DC4
    ==6031==    at 0x46C4DB0: pthread_rwlock_rdlock (in /lib/tls/i686/cmov/libpthread-2.5.so)
    ==6031==    by 0x4380C60: synfig::RWLock::reader_lock() (mutex.cpp:180)

libpthread-2.5.so is in this package:

    libc6-i686 2.5-0ubuntu14 GNU C Library: Shared libraries [i686 optimized]

> > I see. I found one such potential deadlock in my app, but since it is in
> > the logging class the number of potential threads and places where it is
> > called is huge. Could you perhaps output which threads tried the
> > potential deadlock and where ? Something like: Threads X and Y now
> > locked in order L1 <stacktrace>, L2 <stacktrace> and it was previously
> > locked in order L2 <stacktrace>, L1 <stacktrace> by Threads A and B.
svn up to r7005 and try again. I improved the messages. They now look
like this:
Thread #1: lock order "0x7FEFFFAA0 before 0x7FEFFFA70" violated
at 0x4C23A55: pthread_mutex_lock (tc_intercepts.c:379)
by 0x40081F: main (tc13_laog1.c:24)
Required order was established by acquisition of lock at 0x7FEFFFAA0
at 0x4C23A55: pthread_mutex_lock (tc_intercepts.c:379)
by 0x400748: main (tc13_laog1.c:17)
followed by a later acquisition of lock at 0x7FEFFFA70
at 0x4C23A55: pthread_mutex_lock (tc_intercepts.c:379)
by 0x400773: main (tc13_laog1.c:18)
Kind-of verbose, and maybe not exactly what you wanted; but doable, and
should help you track down the root cause of the problem.
Created attachment 21834 [details]
suppressions I had to add to stop complaints about libraries
I've attached a list of suppressions I've added to enable me to use thrcheck in
ubuntu feisty without warnings about libraries I use.
I'm not really understanding the output from thrcheck however, so it's possible
that some of these shouldn't be included in the default install.
On Tuesday 16 October 2007 23:36, Chris Moore wrote:
> I've attached a list of suppressions I've added to enable me to use
> thrcheck in ubuntu feisty without warnings about libraries I use.

Thanks for the suppressions.

> I'm not really understanding the output from thrcheck however, so it's
> possible that some of these shouldn't be included in the default install.

There is some documentation. See comment #5 for details on how to build it.

I suggest you svn up (again), to at least r7008. That reduces the flood of lock-order messages somewhat.

Overall, though, I am concerned to present the information in a way which is understandable - that's a big concern here, as it is with all the tools. So it would be useful to know what it is you found difficult to understand - can you summarise?

It's not the format of the output that's the problem, it's what it really means. I have a class which uses a single mutex around every read and every write of a certain member variable, and yet thrcheck still complains of a possible race. I don't see how there can be a possible race when everything is protected by the same mutex, but thrcheck says there is. That's what makes me think I don't understand what thrcheck is telling me.

I've not been able to run thrcheck at all for the last couple of days.
Since I started using Boehm's garbage collector, thrcheck crashes like this:

    ==16444== Process terminating with default action of signal 11 (SIGSEGV): dumping core
    ==16444==  Access not within mapped region at address 0xBEC85190
    ==16444==    at 0x549843D: GC_mark_from (in /usr/lib/libgc.so.1.0.2)
    ==16444==    by 0x5498C2F: GC_mark_some (in /usr/lib/libgc.so.1.0.2)
    ==16444==    by 0x549075A: GC_stopped_mark (in /usr/lib/libgc.so.1.0.2)
    ==16444==    by 0x5490B0B: GC_try_to_collect_inner (in /usr/lib/libgc.so.1.0.2)
    ==16444==    by 0x549AC3D: GC_init_inner (in /usr/lib/libgc.so.1.0.2)
    ==16444==    by 0x5495EE5: GC_generic_malloc_inner (in /usr/lib/libgc.so.1.0.2)
    ==16444==    by 0x5495FE4: GC_generic_malloc (in /usr/lib/libgc.so.1.0.2)
    ==16444==    by 0x5496246: GC_malloc (in /usr/lib/libgc.so.1.0.2)
    ==16444==    by 0x82DB2CC: gc::operator new(unsigned) (gc_cpp.h:281)
    ==16444==    by 0x8389BCB: studio::App::App(int*, char***) (app.cpp:1104)
    ==16444==    by 0x83671D0: main (main.cpp:85)
    ==16444==
    ==16444== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
    Segmentation fault (core dumped)

I don't think this is a thrcheck bug specifically. Valgrind crashes similarly. Boehm's garbage collector is here: http://www.hpl.hp.com/personal/Hans_Boehm/gc/

> I have a class which uses a single mutex around every read and every write
> of a certain member variable, and yet thrcheck still complains of a
> possible race. I don't see how there can be a possible race when
> everything is protected by the same mutex, but thrcheck says there is.
> That's what makes me think I don't understand what thrcheck is telling me.
The tool certainly has its fair share of false positives and limitations.
So it could be wrong. I'd like to know more. Are you able to show any
more details of the problem (ideally a test case) ?
Sorry for the late reply, but I was out of the office for a week.

This looks exactly like the kind of output I had in mind. I was in fact able to determine the source of one of the deadlock error messages by this. It seems libstdc++ locks a mutex when creating a static variable for the first time. Certainly something I would never have guessed to be remotely a problem.

Julian Seward wrote:
[bugs.kde.org quoted mail]

Some seem to be quite specific, but the first suppression (__lll_mutex_unlock_wake) is one that I was seeing a lot also. For some reason this seems to be detected as a race many times when someone waits on a mutex. Interestingly enough, the data that causes the race is the "lock" member of the pthread_mutex_t variable (which is internal to the library and therefore not documented). I doubt that the pthread library is actually faulty, so maybe this is just thrcheck not realizing that the access is safe? (Maybe done via cmpxchg assembly or something?)

Chris Moore wrote:
[bugs.kde.org quoted mail]

> On Monday 22 October 2007 10:21, Michael Lacher wrote:
> This looks exactly like the kind of output I had in mind. I was in fact able
> to determine the source of one of the deadlock error messages by this.
Good.
On Monday 22 October 2007 10:25, Michael Lacher wrote:
> Some seem to be quite specific, but the first suppression
> (__lll_mutex_unlock_wake) is one that I was seeing a lot also. For some
> reason this seems to be detected as a race many times when someone waits
> on a mutex. Interestingly enough, the data that causes the race is the
> "lock" member of the pthread_mutex_t variable (which is internal to the
> library and therefore not documented). I doubt that the pthread library
> is actually faulty, so maybe this is just thrcheck not realizing that
> the access is safe? (Maybe done via cmpxchg assembly or something?)
Because of the way thrcheck is constructed, it cannot properly make sense
of what is going on inside libpthread itself. So there will be a bunch of
races falsely reported inside libpthread, and I simply have to suppress them.
So you should ignore this. btw, it should handle LOCK cmpxchg etc correctly.
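Such internal libpthread reports are silenced with entries in a valgrind suppressions file. A hypothetical sketch of what such an entry might look like (the tool name, error-kind string, and frame here are assumptions based on the messages discussed above, not copied from the actual default suppressions):

```
{
   ignore-race-in-lll_mutex_unlock_wake
   Thrcheck:Race
   fun:__lll_mutex_unlock_wake
}
```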
> Because of the way thrcheck is constructed, it cannot properly make sense
> of what is going on inside libpthread itself. So there will be a bunch of
> races falsely reported inside libpthread, and I simply have to suppress
> them. So you should ignore this. Btw, it should handle LOCK cmpxchg etc
> correctly.

I see, thanks for the information.

> ------- Additional Comments From dooglus gmail com 2007-10-17 14:33
> I have a class which uses a single mutex around every read and every write
> of a certain member variable, and yet thrcheck still complains of a
> possible race. I don't see how there can be a possible race when
> everything is protected by the same mutex, but thrcheck says there is.

I've discovered that one possible cause of this is gcc's thread-unsafe code generation. See the enormously long discussions starting at

    http://gcc.gnu.org/ml/gcc/2007-10/msg00266.html
    http://lkml.org/lkml/2007/10/24/673

In short, under some conditions gcc generates code which actually has a race (which is then detected by Thrcheck) when the original source program has no race (!). This most often happens for conditional stores:

    if (cond) x = ...

where x lives in memory. See for example thrcheck/tests/tc17_sembar.c. And so I wonder if your class is also affected by this problem. Can you provide any further details?

Fixed. Will be in 3.3.0.
I am trying to debug multi-threaded MPI programs, and being able to use helgrind would be a big plus. Maybe a Christmas gift? 9^) Anyway, thanks a lot to all of the Valgrind team for the tremendous job you have done and are doing! f.p.