| Summary: | [coregrind] Thread scheduling regression? | | |
|---|---|---|---|
| Product: | [Developer tools] valgrind | Reporter: | jeweljar |
| Component: | general | Assignee: | Julian Seward <jseward> |
| Status: | REPORTED | | |
| Severity: | normal | CC: | bart.vanassche+kde, jeff.wegher, kde4, ntvdm, pjfloyd |
| Priority: | NOR | | |
| Version First Reported In: | 3.8.0 | | |
| Target Milestone: | --- | | |
| Platform: | unspecified | | |
| OS: | Linux | | |
| Latest Commit: | | Version Fixed/Implemented In: | |
| Sentry Crash Report: | | | |
| Attachments: | spinlock test code | | |
Your test program tests whether "val" is a multiple of one hundred thousand before printing the value of that variable. That looks like a race condition to me. Shouldn't it test the value of the variable "i" instead? Note: I do not know whether that is related to the issue reported, but after I made that change, and after changing "100000" into "100", the test program runs fine here both with --fair-sched=no and with --fair-sched=yes:

[ ... ]
thread 0x67f5700: val = 2256500
thread 0x67f5700: val = 2256600
thread 0x67f5700: val = 2256700
thread 0x67f5700: val = 2256800
thread 0x67f5700: val = 2256900
thread 0x67f5700: val = 2257000
thread 0x67f5700: val = 2257100
thread 0x67f5700: val = 2257200
thread 0x67f5700: val = 2257300
thread 0x5bf4700: val = 2236100
thread 0x5bf4700: val = 2257400
thread 0x5bf4700: val = 2257500
thread 0x5bf4700: val = 2257600
thread 0x5bf4700: val = 2257700
thread 0x5bf4700: val = 2257800
thread 0x5bf4700: val = 2257900
[ ... ]

Yes, it was my mistake: I meant "i" instead of "val". But both results are the same; the thread in pthread_spin_unlock() is starving. It is very interesting that changing the printf() condition from 100000 to 100 makes the code run. I guess that some instruction in pthread_spin_lock() extends its own CPU time slice (the x86 PAUSE instruction, maybe?).

Did you try --fair-sched=yes? It can make a big difference in spinlock code.

I ran into this same problem using spinlocks; the OP's example program distills it down nicely (after changing "val" to "i" in the conditional).
Running the example program without the fair scheduler shows output from only a single thread on my dual-quad Nehalem Fedora 18 box with valgrind 3.8.1.
Enabling --fair-sched=yes will quickly lock up the example program with one thread spinning trying to acquire the lock while another thread is hung in the unlock function:
(gdb) info threads
[New Thread 22941]
[New Thread 22942]
[New Thread 22943]
Id Target Id Frame
4 Thread 22943 (tid 4 VgTs_Runnable) pthread_spin_lock ()
at ../nptl/sysdeps/x86_64/pthread_spin_lock.S:33
3 Thread 22942 (tid 3 VgTs_Yielding) pthread_spin_lock ()
at ../nptl/sysdeps/x86_64/pthread_spin_lock.S:24
2 Thread 22941 (tid 2 VgTs_Yielding) pthread_spin_unlock ()
at ../nptl/sysdeps/x86_64/pthread_spin_unlock.S:23
* 1 Thread 22940 (tid 1 VgTs_WaitSys) 0x0000003e29e08e60 in pthread_join (threadid=88241920,
thread_return=0x0) at pthread_join.c:92
Hi, I use Ubuntu 12.04 32-bit. I almost always hit the same scheduling regression (one thread hangs) in my application when using 3.8.0, 3.8.1, or trunk (2013-10-2). With 3.7.0, or without valgrind, it never hangs. I have tried to produce a simple testcase, but I have not managed to reproduce it without including a lot of code. I am not sure that in my case it is due to pthread_spin_lock, since I cannot see it when running with ltrace. I only know that I can reproduce the problem only when I call system() or popen(). Hope this helps. Thanks.

After some digging, I found out the hang was in an ACE ThreadMutex which should never deadlock, since it is protected with a Guard and it is only used in one place. Replacing it with RecursiveThreadMutex or RWThreadMutex did not help. Replacing it with ProcessMutex, I get "errno 4: Interrupted system call" instead of the hang. So maybe the problem is that valgrind sometimes (after popen or system) mixes up a thread by mistake, and the mutex code then runs in the wrong thread?

This is still happening with valgrind 3.9.0. Either (1) output ceases after a few seconds with memcheck-amd64 pegged at 100% CPU and not responding to signals (this happens on a two-core VM and occasionally on a 48-core Xeon E5-2697), or (2) I only get output from one thread (this happens the rest of the time on the 48-core Xeon E5-2697). This is as described by the reporter. The status of this bug is still "UNCONFIRMED" several years after being reported, and I find it hard to believe that ANYONE would be unable to reproduce this within seconds with the provided test program.

I can't reproduce the problem on RHEL 7.9 and Valgrind git head. With no options it does seem to get "stuck" on one thread for extended periods, but not indefinitely. Reducing --scheduling-quantum= obviously increases the chance of a switch. --fair-sched gives me regular but slow thread switches. Obviously this is still likely to be a problem on FreeBSD/Solaris/Darwin, as they don't have a fair scheduler.
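For reference, the scheduling options discussed above are valgrind core flags, so the runs being compared look roughly like the following (the binary name spin_test is only a placeholder for the attached test program, and the quantum value is an arbitrary example chosen smaller than the default so that thread switches happen more often):

valgrind --tool=memcheck ./spin_test
valgrind --tool=memcheck --fair-sched=yes ./spin_test
valgrind --tool=memcheck --scheduling-quantum=10000 ./spin_test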
Created attachment 75484 [details]: spinlock test code

The attached test program hangs if you run it with valgrind-3.8.1 (tool=memcheck), but runs fine with valgrind-3.7.0 or older versions. With v3.8.1, one thread spins forever in pthread_spin_lock() while the other thread yields the CPU in pthread_spin_unlock() without releasing the spin lock. I guess there is a thread scheduling regression in version 3.8.1.
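The attachment itself is not inlined in this report. For context, here is a minimal sketch of roughly what such a spinlock test looks like, with the fix suggested in the comments above applied (testing the loop counter "i" rather than the shared counter "val"); the thread count, iteration count, and all names are illustrative assumptions, not the reporter's actual code:

/* spin_test.c -- minimal sketch, NOT the actual attachment 75484.
 * Several threads bump a shared counter under a pthread spinlock and
 * print progress occasionally; the print condition tests the loop
 * counter i, as suggested in the comments above. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 3          /* assumed; the real test may differ */
#define ITERS    10000000L  /* assumed iteration count */

static pthread_spinlock_t lock;
static long val;            /* shared counter, protected by the spinlock */

static void *worker(void *arg)
{
    (void)arg;
    for (long i = 0; i < ITERS; i++) {
        pthread_spin_lock(&lock);
        long snapshot = ++val;          /* critical section */
        pthread_spin_unlock(&lock);

        /* Test i, not val, so the condition itself is race-free. */
        if (i % 100000 == 0)
            printf("thread %#lx: val = %ld\n",
                   (unsigned long)pthread_self(),   /* glibc-specific cast, for display only */
                   snapshot);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
    for (int n = 0; n < NTHREADS; n++)
        pthread_create(&tid[n], NULL, worker, NULL);
    for (int n = 0; n < NTHREADS; n++)
        pthread_join(tid[n], NULL);
    pthread_spin_destroy(&lock);
    return 0;
}

Built with something like gcc -O2 -pthread spin_test.c -o spin_test, this is the sort of program the comments above run under the various valgrind scheduling options.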