Created attachment 58543 [details]
sleepers.c reproducing (un-)fairness

A gdbserver regression test was failing on some Linux systems and working
on others. After investigation, I believe the cause is poor fairness of
the Valgrind scheduler. The unfairness disappears when restricting
Valgrind's affinity to one single CPU.

The below was tested on the GCC compile farm machine gcc10 (a Debian
amd64 machine with 24 cores).

Running the attached program natively gives:

./sleepers 1 0 1000000000 B-B-B-B-
loops/sleep ms/burn/threads_spec: 1 0 1000000000 B-B-B-B-
Brussels ready to sleep and/or burn
London ready to sleep and/or burn
Petaouchnok ready to sleep and/or burn
main ready to sleep and/or burn
main finished to sleep and/or burn
Brussels finished to sleep and/or burn
Petaouchnok finished to sleep and/or burn
London finished to sleep and/or burn

Running it under Valgrind (tested with the standard installed 3.3.1 and
3.7.0SVN) gives the below. You can see that some threads run to the end
before letting other threads go on.

vg-in-place --tool=none ./sleepers 1 0 1000000000 B-B-B-B-
==16378== Nulgrind, the minimal Valgrind tool
==16378== Copyright (C) 2002-2010, and GNU GPL'd, by Nicholas Nethercote.
==16378== Using Valgrind-3.7.0.SVN and LibVEX; rerun with -h for copyright info
==16378== Command: ./sleepers 1 0 1000000000 B-B-B-B-
==16378==
loops/sleep ms/burn/threads_spec: 1 0 1000000000 B-B-B-B-
Brussels ready to sleep and/or burn
Brussels finished to sleep and/or burn
London ready to sleep and/or burn
London finished to sleep and/or burn
Petaouchnok ready to sleep and/or burn
main ready to sleep and/or burn
main finished to sleep and/or burn
Petaouchnok finished to sleep and/or burn
==16378==

Running with affinity on one single CPU re-introduces fairness:

taskset 1 vg-in-place --tool=none ./sleepers 1 0 1000000000 B-B-B-B-
==29679== Nulgrind, the minimal Valgrind tool
==29679== Copyright (C) 2002-2010, and GNU GPL'd, by Nicholas Nethercote.
==29679== Using Valgrind-3.7.0.SVN and LibVEX; rerun with -h for copyright info
==29679== Command: ./sleepers 1 0 1000000000 B-B-B-B-
==29679==
loops/sleep ms/burn/threads_spec: 1 0 1000000000 B-B-B-B-
Brussels ready to sleep and/or burn
London ready to sleep and/or burn
Petaouchnok ready to sleep and/or burn
main ready to sleep and/or burn
London finished to sleep and/or burn
Petaouchnok finished to sleep and/or burn
Brussels finished to sleep and/or burn
main finished to sleep and/or burn
==29679==

From various trials, it seems that the more CPUs there are, the less
fair the Valgrind scheduler is. I suppose this is because sched_yield
and/or the read on the pipe do not guarantee any fairness on a multi-CPU
system with plenty of CPUs ready to run. Effectively, why would the
kernel not directly re-run the thread which has done sched_yield: if
this is the only thread in the waiting queue, it will run directly,
"re-acquiring the lock" on the pipe, and proceed. By setting the
affinity of the process to one single CPU, there is more than one thread
in the queue, and the yielding thread goes to the end of the queue,
letting the other threads run.
On 03.04.2011 16:07, Philippe Waroquiers wrote:
> From various trials, it seems the more cpu there is, the less Valgrind
> scheduler is fair.
> I suppose that this is because the sched_yield and/or read on pipe
> do not guarantee any fairness on a multi_cpu system with plenty of
> cpu ready to run (effectively, why would the kernel not re-run directly
> the thread which has done sched_yield: if this is the only thread in the
> waiting queue, it will run directly, "re-reading the lock" on the pipe
> and proceed.
> By setting the affinity of the process to one single cpu, there is more than
> one thread in the queue, and the yielding thread goes at the end, letting
> other threads running.

For completeness, I have also seen scheduler unfairness on a big s390
system. My understanding was that an interprocessor signal is often
slower than a pipe read, giving an extra bonus to the old thread. The
problem was not that critical, but a real round-robin scheduler might be
necessary in the future. (We could also change Valgrind to be
multi-threaded, but this looks like a herculean task.)

Christian
I'm not surprised to hear of these reports. Just to be clear about this,
Valgrind doesn't schedule the threads at all. (It used to in version
1.x, which is why the code is called m_scheduler/scheduler.c, but not
any more -- definitely not in the last 6 years.) Valgrind now merely
implements locking, using a pipe, as you both have seen. The idea is to
make the kernel decide which thread to run next.

This is critical, because only the kernel knows for sure which threads
are blocked in system calls and which are runnable. If Valgrind made
that choice, we would have no 100% reliable way to know which threads
are blocked in syscalls, and so we would risk deadlock if we chose one
of them to run. It also makes the handling of signals simpler.

--------

That said ... Philippe, thanks for the demo about fairness. I have
often thought that it might be unfair, but never investigated. For
various reasons it would be nice to implement something fairer, maybe a
round robin. For one thing it would make Helgrind (and, I presume, DRD)
behave more repeatably, since they are both scheduling-sensitive.

The difficulty is that I never figured out a way to do this that
preserves the property listed above (that the kernel makes the final
decision). Any suggestions, I would be happy to hear. Really, we want
to be able to say, when a thread comes to the end of its timeslice (or
more generally when we drop the BigLock), that this thread should now go
to the back of the scheduling queue for this process.

As you've probably seen in the sources, I did some experimentation with
calling sched_yield, but I never investigated enough to find out if that
helps, or whether it helps reliably. Maybe it is worth chasing?

Sorry if this is long and unfocussed. /me needs more coffee.
(In reply to comment #2)
> Any suggestions, I would be happy to hear. Really, we want to be able
> to say, when a thread comes to the end of its timeslice (or more
> generally when we drop the BigLock), that this thread should now go to
> the back of the scheduling queue for this process. As you've probably

I believe the problem here is that the scheduling queue maintained by
the Linux kernel scheduler is not a queue per process, but rather a
queue per CPU (with some logic to "move" processes from one CPU to
another). If, on an SMP machine, there are many CPUs available, then
each thread will be on its own CPU. Calling sched_yield will have no
effect: the thread will directly re-run.

The only current bypass I have is to ensure the process being run is
"locked" on one single CPU. Then all the threads of the process are on
one single CPU, and fairness re-appears (I believe because sched_yield
will put the calling thread at the end of the Linux scheduler queue on
this CPU, behind the other threads). The other threads then have a
chance to acquire the lock and run.

Whenever I have some more time, I will dig into the details of the
effect of sched_yield on multiple CPUs, but I am not optimistic. There
are for sure ways to ensure a certain scheduling order, but that implies
going to real-time scheduling priorities on Linux, which does not look
appropriate for a process running under Valgrind.
(In reply to comment #3)
> The only current bypass I have is to ensure the process being
> run is "locked" on one single cpu. Then all the threads of the process
> are on one single cpu, and then fairness re-appears (I believe because
> sched_yield will put the calling thread at the end of the linux scheduler
> queue on this cpu, behind the other threads).
> The other threads have then a chance to acquire the lock and run.

OK. So it sounds like a combination of

* at startup, we do a sched_setaffinity request to the kernel (I assume
  there is a system call to do this) saying that we want to run on only
  one CPU

* use sched_yield whenever we drop the BigLock

might work?
(In reply to comment #4)
> OK. So it sounds like a combination of
>
> * at startup, we do a schedsetaffinity request to the kernel
>   (I assume there is a system call to do this) that we want to
>   run on only one CPU
>
> * use sched_yield whenever we drop the BigLock
>
> might work?

I first tried with the shell command line

  taskset 1 vg-in-place ...

which re-introduced fairness. Then I put the following piece of code
inside the test program:

// We will lock ourselves on one single cpu.
// This bypasses the unfairness of the Valgrind scheduler
// when a multi-cpu machine has enough cpu to run all the
// threads wanting to burn cpu.
static void setaffinity(void)
{
   cpu_set_t single_cpu;
   CPU_ZERO(&single_cpu);
   CPU_SET(1, &single_cpu);
   (void) sched_setaffinity(0, sizeof(single_cpu), &single_cpu);
}

Both tricks work (in effect, they do the same). However, this is
somewhat ugly: if you have a big computer with plenty of CPUs, all the
Valgrind processes will run on CPU nr 1. To apply this trick as part of
Valgrind, we might randomly choose a CPU from the list of available CPUs
so as to somewhat distribute the load of the Valgrind processes. (At
work, we are starting a system with many processes under Valgrind.
Hardcoding Valgrind to run on CPU nr 1 does not look appropriate.)
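To avoid hardcoding CPU nr 1, the random choice suggested above could look roughly like the sketch below. This is only an illustration, assuming glibc's CPU_* macros and sched_getaffinity; the helper names are hypothetical and not part of any patch.

```c
/* Sketch: pick one CPU at random from the process's current affinity
   mask and bind all threads to it, instead of hardcoding CPU 1. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

static int pick_random_allowed_cpu(void)
{
   cpu_set_t mask;
   int n, chosen = -1, seen = 0;
   if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
      return -1;
   srandom((unsigned)time(NULL) ^ (unsigned)getpid());
   /* Reservoir-sample one CPU uniformly from the allowed set. */
   for (n = 0; n < CPU_SETSIZE; n++) {
      if (CPU_ISSET(n, &mask)) {
         seen++;
         if (random() % seen == 0)
            chosen = n;
      }
   }
   return chosen;
}

static void set_affinity_to_random_cpu(void)
{
   cpu_set_t single_cpu;
   int cpu = pick_random_allowed_cpu();
   if (cpu < 0)
      return; /* leave affinity unchanged on failure */
   CPU_ZERO(&single_cpu);
   CPU_SET(cpu, &single_cpu);
   (void) sched_setaffinity(0, sizeof(single_cpu), &single_cpu);
}
```

Seeding from the PID means different Valgrind processes tend to land on different CPUs, which addresses the "all processes on CPU nr 1" objection.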
The trouble with setting the affinity is that it binds all the tasks to a specific cpu, which could get overloaded with valgrind processes if they all bind to the same one. What you really want is a constraint to make all the tasks run (mostly) on the same cpu, but without really constraining which CPU is used. Unfortunately, I don't think there's a useful mechanism for doing that. The use of the pipe could be confusing the scheduler, because it tries to run the two ends of the pipe in parallel. It would be better to use a futex where available, though I don't know what effect that would have on the scheduling decisions. sched_yield() is very deprecated on Linux, and I wouldn't recommend using it.
(In reply to comment #6) Greetings. > The trouble with setting the affinity is that it binds all the tasks to a > specific cpu, which could get overloaded with valgrind processes if they all > bind to the same one. Is it possible to change the affinity repeatedly as the program runs? I am thinking of some nasty kludge where, mostly, affinity is enabled, but is occasionally disabled, in the hope of allowing long-running valgrind processes to migrate away from very busy CPUs. Or .. it could be command line controlled? I don't care at all about fairness when running Memcheck, but for Helgrind I do. > the two ends of the pipe in parallel. It would be better to use a futex where > available, though I don't know what effect that would have on the scheduling > decisions. I'd be willing to give that a try. If you happen to know off the top of your head, any chance you can sketch the calls to sys_futex needed to declare/initialise/acquire/release/destroy such a lock?
On 04/04/2011 04:48 PM, Julian Seward wrote:
> Is it possible to change the affinity repeatedly as the program runs?
> I am thinking of some nasty kludge where, mostly, affinity is enabled,
> but is occasionally disabled, in the hope of allowing long-running
> valgrind processes to migrate away from very busy CPUs.

You could disable it to allow migration, but how would you arrange for
all the threads to migrate to the same other CPU?

> Or .. it could be command line controlled? I don't care at all about
> fairness when running Memcheck, but for Helgrind I do.

"taskset" lets you start a process with a given CPU affinity, so that
would be useful for benchmarking what effect it actually has.

>> the two ends of the pipe in parallel. It would be better to use a futex where
>> available, though I don't know what effect that would have on the scheduling
>> decisions.

> I'd be willing to give that a try. If you happen to know off the top
> of your head, any chance you can sketch the calls to sys_futex needed
> to declare/initialise/acquire/release/destroy such a lock?

http://people.redhat.com/drepper/futex.pdf is a pretty good
introduction. You could use a futex to implement something like a
ticket lock to get guaranteed FIFO scheduling of the threads.

J
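A futex-based ticket lock along these lines could be sketched as follows. This is a stand-alone illustration, not Valgrind's actual implementation: the type and function names are hypothetical, and the use of the raw futex syscall and GCC __sync builtins are my own assumptions.

```c
/* Minimal futex-based ticket lock sketch (Linux only). */
#define _GNU_SOURCE
#include <limits.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef struct {
   volatile unsigned head;  /* ticket currently allowed to run */
   volatile unsigned tail;  /* next ticket to hand out */
} TicketLock;

static long sys_futex(volatile unsigned *uaddr, int op, unsigned val)
{
   return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void ticket_acquire(TicketLock *t)
{
   /* Atomically take the next ticket number. */
   unsigned ticket = __sync_fetch_and_add(&t->tail, 1);
   /* Wait until it is our turn. FUTEX_WAIT returns immediately if
      head no longer equals the value we observed, so the loop
      re-checks after every wakeup. */
   while (t->head != ticket)
      sys_futex(&t->head, FUTEX_WAIT, t->head);
}

static void ticket_release(TicketLock *t)
{
   __sync_fetch_and_add(&t->head, 1);
   /* Wake all waiters; only the one whose ticket matches the new
      head proceeds, the rest go back to sleep. */
   sys_futex(&t->head, FUTEX_WAKE, INT_MAX);
}
```

FIFO ordering falls out of the ticket numbers: threads acquire the lock in exactly the order in which they executed the fetch-and-add on tail. The wake-all in release is the "thundering herd" discussed later in this report.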
(In reply to comment #8)
> You could disable it to allow migration, but how would arrange for all
> the threads to migrate to the same other CPU?

sched_setaffinity(0, ...) sets the affinity for all the threads of the
process, so the threads will all move to a single CPU if a singleton
cpuset is specified. The nasty kludge suggested by Julian can therefore
be done by, from time to time, calling sched_setaffinity with the
complete set of available CPUs (to let the Linux scheduler move the
threads to a non-busy CPU) and then calling sched_setaffinity again with
the one single CPU.

But for sure, explicit fairness implemented e.g. with a futex looks
saner/more predictable than a set of (random?) sched_setaffinity calls.
(In reply to comment #8) > introduction. You could use a futex to implement something like a > ticket lock to get guaranteed FIFO scheduling of the threads. That's interesting -- I never heard of a ticket lock before. The ability to enforce FIFO is something I'd really like to have. If anyone can come up with a patch that works, I'd be interested to see it.
Created attachment 61782 [details]
Patch that makes Valgrind scheduling fair

Can anyone who has access to a multi-core Linux system test the attached
patch for me? That patch should make Valgrind scheduling entirely fair.
Review comments are welcome too.

Notes:
- The attached patch has been developed on a Linux system but has not
  yet been ported to Darwin.
- On my test setup, regression test output was unaffected.
I'm happy to test it, but I need a test case first... I tried the
program from the first comment but I get the same behaviour with and
without (an unmodified) valgrind. Namely this:

bristol [~/vgtest] % ./sleepers 1 0 1000000000 B-B-B-B-
loops/sleep_ms/burn/threads_spec: 1 0 1000000000 B-B-B-B-
Brussels ready to sleep and/or burn
Brussels finished to sleep and/or burn
London ready to sleep and/or burn
London finished to sleep and/or burn
Petaouchnok ready to sleep and/or burn
Petaouchnok finished to sleep and/or burn
main ready to sleep and/or burn
main finished to sleep and/or burn

and this:

bristol [~/vgtest] % /tmp/valgrind/bin/valgrind --tool=none --quiet ./sleepers 1 0 1000000000 B-B-B-B-
loops/sleep_ms/burn/threads_spec: 1 0 1000000000 B-B-B-B-
Brussels ready to sleep and/or burn
Brussels finished to sleep and/or burn
London ready to sleep and/or burn
London finished to sleep and/or burn
Petaouchnok ready to sleep and/or burn
Petaouchnok finished to sleep and/or burn
main ready to sleep and/or burn
main finished to sleep and/or burn

This is on a six-core machine (with HT, so twelve threads in all).
Created attachment 61795 [details]
sleepers.c reproducing (un-)fairness

The second version of sleepers.c should align the "burn" loop start
times better. If this change by itself is not yet sufficient, please
try increasing the loop count on the command line.
Well, this is what I get with the new version.

Without valgrind:

loops/sleep_ms/burn/threads_spec: 1 0 1000000000 B-B-B-B-
[1310462109.584349] main ready to sleep and/or burn
[1310462109.584358] main finished to sleep and/or burn
[1310462109.584351] London ready to sleep and/or burn
[1310462109.584352] Petaouchnok ready to sleep and/or burn
[1310462109.584371] Brussels ready to sleep and/or burn
[1310462109.584399] Brussels finished to sleep and/or burn
[1310462109.584376] London finished to sleep and/or burn
[1310462109.584387] Petaouchnok finished to sleep and/or burn

With unpatched valgrind:

loops/sleep_ms/burn/threads_spec: 1 0 1000000000 B-B-B-B-
[1310462112.285960] main ready to sleep and/or burn
[1310462112.289425] main finished to sleep and/or burn
[1310462112.290703] Petaouchnok ready to sleep and/or burn
[1310462112.290750] Petaouchnok finished to sleep and/or burn
[1310462112.288432] Brussels ready to sleep and/or burn
[1310462112.293177] Brussels finished to sleep and/or burn
[1310462112.288122] London ready to sleep and/or burn
[1310462112.293283] London finished to sleep and/or burn

With patched valgrind:

loops/sleep_ms/burn/threads_spec: 1 0 1000000000 B-B-B-B-
[1310462390.421093] Petaouchnok ready to sleep and/or burn
[1310462390.424389] Petaouchnok finished to sleep and/or burn
[1310462390.423341] London ready to sleep and/or burn
[1310462390.427027] London finished to sleep and/or burn
[1310462390.423415] main ready to sleep and/or burn
[1310462390.427160] main finished to sleep and/or burn
[1310462390.423200] Brussels ready to sleep and/or burn
[1310462390.428220] Brussels finished to sleep and/or burn

Is that good?
(In reply to comment #11) > Created an attachment (id=61782) [details] > Patch that makes Valgrind scheduling fair I read though the core implementation (ML_(acquire_ticketlock), ML_(release_ticketlock)). A question: how does it preserve the property that the kernel makes the final decision about which thread to run, and at the same time encourage the kernel to be more fair in choosing which one to run? Can you give a summary of how you're using sys_futex and the head/tail values in the_BigLock to achieve that?
(In reply to comment #15) > A question: how does it preserve the > property that the kernel makes the final decision about which thread > to run Ah, no, that's the wrong question. The right question is: how does it preserve the property that only the kernel knows for sure which threads are blocked in syscalls and which are runnable?
(In reply to comment #14) > Is that good? Although I'm not sure that the results of a single run are statistically relevant, what I see is: - With unpatched Valgrind, the difference between lowest and highest start time is 4.743 ms. - With patched Valgrind, the difference between lowest and highest thread start time is 2.322 ms. So the thread start time delta has been reduced significantly, which is good and which is what we are aiming for.
(In reply to comment #16)
> The right question is: how does
> it preserve the property that only the kernel knows for sure which
> threads are blocked in syscalls and which are runnable?

The attached patch shouldn't change anything in this area. The Valgrind
scheduler still holds the "big lock" as long as it is running client
code, still releases the big lock before invoking a system call, and
still reacquires the big lock after returning from a system call. What
the attached patch changes, though, is that if there is contention on
the big lock, e.g. because multiple threads are running computations,
thread scheduling happens in a fair way.
(In reply to comment #18)
> the big lock after returning from a system call. What the attached patch
> changes though is that if there is contention on the big lock, e.g. because
> multiple threads are running computations, that thread scheduling happens in
> fair way.

Ok, good, but how does it ensure fairness? I see that t->head and
t->tail are used to make a queue of threads wanting to acquire the lock.
And it is implied that a thread releasing the lock can use
sys_futex(VKI_FUTEX_WAKE, ...) to cause the thread at the head of the
queue (with the lowest ticket number) to proceed. What I can't quite
see, from reading the sys_futex man page, is how you can use futex to
signal to one specific thread.
(In reply to comment #19) > Ok, good, but how does it ensure fairness? I see that t->head and > t->tail are used to make a queue of threads wanting to acquire the > lock. And it is implied that a thread releasing the lock can use > sys_futex(VKI_FUTEX_WAIT, ...) to cause the thread at the head of the > queue (with the lowest ticket number) to proceed. What I can't quite > see, from reading the sys_futex man page, is how you can use futex to > signal to one specific thread. The FUTEX_WAKE call used in the ticket lock implementation wakes up all threads waiting on the futex. Fairness is ensured by granting the lock to the thread that started waiting first on the ticket lock. All threads that were woken up needlessly will continue waiting.
> waiting on the futex. Fairness is ensured by granting the lock to the thread > that started waiting first on the ticket lock. All threads that were woken up > needlessly will continue waiting. Ah, so that's what the while loop in ML_(acquire_ticketlock) is for, right? I guess this means it suffers from the "thundering herd" problem.
(In reply to comment #21)
> Ah, so that's what the while loop in ML_(acquire_ticketlock) is for, right?
>
> I guess this means it suffers from the "thundering herd" problem.

Isn't that already the case with the current semaphore implementation?
Doesn't writing to a pipe wake up all threads waiting to read from that
pipe?

Anyway, I'll have a look at addressing this.
(In reply to comment #22)
> > I guess this means it suffers from the "thundering herd" problem.
> Isn't that already the case with the current semaphore implementation ?
> Doesn't writing to a pipe wake up all threads waiting to read from that pipe ?
>
> Anyway, I'll have a look at addressing this.

Some quick feedback:

* On a few runs on a 4-core system, the patch seems to increase the
  fairness, but the worst fairness was encountered when the computer has
  more cores than the number of threads to run, i.e. when there are
  enough free cores to always run each thread wanting to run. Otherwise,
  I think the Linux scheduler ensures some fairness itself.

* On RHEL4, the patch compiles and links in 64 bits, but fails with
  unknown symbols when compiled in 32 bits:
    ticketlock-linux.c:61: undefined reference to `__sync_fetch_and_add_4'

* When running the sleepers with --tool=none, the patch seems
  significantly slower (about 50% slower). Maybe the thundering herd?
  Or maybe futex is slower than pipe write + read? (Or, of course,
  maybe I took a wrong measurement?)

* Some small details about the patch:
  - priv_ticketlock.h declares head and tail as unsigned, but the
    implementation uses UInt in the .c file.
  - The assert verifying that the sizes of tail and head are 4 could be
    done in the init phase, rather than each time the lock is taken.
(In reply to comment #23)
> Some quick feedback:
>
> * on a few runs on a 4-core systems, the patch seems to increase the fairness

That's good!

> * on RHEL4, the patch compiles and links in 64 bits, but fails with unknown
>   symbols when compiled in 32 bits:
>   ticketlock-linux.c:61: undefined reference to `__sync_fetch_and_add_4'

Not good. We might be hit by this
(http://gcc.gnu.org/onlinedocs/gcc-4.6.1/gcc/Atomic-Builtins.html):

  Not all operations are supported by all target processors. If a
  particular operation cannot be implemented on the target processor, a
  warning will be generated and a call to an external function will be
  generated. The external function will carry the same name as the
  builtin, with an additional suffix `_n' where n is the size of the
  data type.

> * when running the sleepers with tool=none, the patch seems significantly
>   slower (about 50% slower). Maybe the thundering herd ? Or maybe futex
>   is slower than pipe write + read ?

Can you please verify whether that result is reproducible? A futex
system call should take less time than a pipe write + read. Even more,
with the ticket lock implementation, in the uncontended case no system
call is issued at all by the waiter.

> * some small details about the patch:
>   - priv_ticketlock.h defines head and tail as unsigned, but the
>     implementation uses UInt in the c file.
>   - the assert verifying that the size of tail and head are 4 could be
>     done in the init phase, rather than each time the lock is taken.

Thanks, will take this into account in v2.
(In reply to comment #24)
> Can you please verify whether that result is reproducible ? A futex system call
> should take less time than a pipe write + read. Even more, with the ticket lock
> implementation in the uncontended case no system call is issued at all by the
> waiter.

I have re-done the measurement (running 3 times each). Without the
patch, running the sleepers takes about 35 seconds. With the patch, the
same command takes about 45 to 48 seconds. (There is a bigger variation
with the futex patch: twice 48 seconds, once 45 seconds. There is almost
no variation with the pipe version.)
(In reply to comment #25) > I have re-done the measurement (run 3 times each) > Without the patch, running the sleepers takes about 35 seconds. > With the patch, the same command takes about 45 seconds to 48 seconds. I wonder if this could (partly?) be the result of poorer spatial and temporal cache locality that is caused (unavoidably) by the strict round-robin scheduling. Philippe, if you re-run the experiment but use cachegrind instead of none, can you see any interesting variants in cache miss numbers?
(In reply to comment #22)
> > I guess this means it suffers from the "thundering herd" problem.
>
> Isn't that already the case with the current semaphore implementation ? Doesn't
> writing to a pipe wake up all threads waiting to read from that pipe?

I don't know how the kernel handles this case, but I can imagine that
the kernel can simply decide which thread is going to win the race to
read from the pipe, and resume only that thread. At least on the
user-space (Valgrind) side, each thread tries to read one byte from the
pipe, and if it succeeds then it owns the lock. There is no loop around
multiple attempts to read from the pipe.

Anyway, for the ticket lock, we might be able to work around it by using
a small set of (eg, 17) futexes instead of just one. When a thread
waits for ticket number N, it waits for futex_array[N % 17], and when a
thread signals for ticket number M, it signals on futex_array[M % 17].
This doesn't completely remove the problem, but it does reduce it by a
factor of (eg) 17.
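The futex-array idea just described might be sketched like this. Again, hypothetical stand-alone code under my own naming assumptions, not the attached patch; the releaser bumps and wakes only the bucket of the next ticket holder, so at most the waiters hashing to that slot are woken needlessly.

```c
/* Sketch: spread waiters over a small array of futex words, indexed
   by ticket number modulo the array size (Linux only). */
#define _GNU_SOURCE
#include <limits.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

#define N_FUTEXES 17   /* small prime, as suggested above */

typedef struct {
   volatile unsigned head;               /* ticket now allowed to run */
   volatile unsigned tail;               /* next ticket to hand out */
   volatile unsigned futex[N_FUTEXES];   /* one wait word per bucket */
} FairLock;

static long sys_futex(volatile unsigned *uaddr, int op, unsigned val)
{
   return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void fair_acquire(FairLock *l)
{
   unsigned ticket = __sync_fetch_and_add(&l->tail, 1);
   unsigned slot = ticket % N_FUTEXES;
   /* Sleep on our own bucket; re-check head after every wakeup. */
   while (l->head != ticket)
      sys_futex(&l->futex[slot], FUTEX_WAIT, l->futex[slot]);
}

static void fair_release(FairLock *l)
{
   unsigned next = __sync_add_and_fetch(&l->head, 1);
   unsigned slot = next % N_FUTEXES;
   /* Bump the wait word so sleepers in this bucket re-check, then
      wake only that bucket instead of the whole herd. */
   __sync_fetch_and_add(&l->futex[slot], 1);
   sys_futex(&l->futex[slot], FUTEX_WAKE, INT_MAX);
}
```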
(In reply to comment #26)
> I wonder if this could (partly?) be the result of poorer spatial and
> temporal cache locality that is caused (unavoidably) by the strict
> round-robin scheduling. Philippe, if you re-run the experiment but
> use cachegrind instead of none, can you see any interesting variants
> in cache miss numbers?

With cachegrind, there are only irrelevant (i.e. < 10) differences in
the cache miss numbers between the "futex" and the "pipe" scheduler. It
is also confirmed that futex is slower (depending on the tool, between
30 and 50% more CPU is needed to run sleepers). But the fairness is for
sure much better; compare the spread of termination times with the
"pipe" version to the spread with the "futex" version:

pipe:
[1310548450.699138] Brussels finished to sleep and/or burn
[1310548474.973137] Petaouchnok finished to sleep and/or burn
[1310548484.868193] London finished to sleep and/or burn
[1310548489.757102] main finished to sleep and/or burn

(I am proud Brussels finished first :).

futex:
[1310549573.337527] main finished to sleep and/or burn
[1310549573.595657] London finished to sleep and/or burn
[1310549573.772526] Brussels finished to sleep and/or burn
[1310549573.798663] Petaouchnok finished to sleep and/or burn

but when introducing fairness, Brussels finishes just before
Petaouchnok :(.
Created attachment 61851 [details]
Patch that makes Valgrind scheduling fair

Attached is version two of the patch that makes Valgrind scheduling
fair. The performance issue should be resolved: make -s regtest
completes now in the same time with and without the patch on my setup.
Expanded the number of futexes from 1 to 16. Addressed Philippe's
review comments about unsigned <> UInt and vg_assert().
(In reply to comment #29)

(Note that the link problem in 32 bits/x86 is still there.)

> Attached version two of the patch that makes Valgrind scheduling fair. The
> performance issue should be resolved: make -s regtest completes now in the same
> time with and without the patch on my setup. Expanded the number of futexes
> from 1 to 16. Addressed Philippe's review comments about unsigned <> UInt and
> vg_assert().

It looks like the thundering herd and/or performance problem is not
solved. What I see with the new version of the patch is that there are
big variations in the CPU time needed to finish: the fastest was about
32 seconds of CPU, the slowest around 50 seconds (always using around
99% of a CPU). This is on a RHEL5.5, 4-core Q9650 @ 3.00GHz.

From what I can see, the more idle the computer is (i.e. no other
activities), the more CPU the test needs / the longer it runs. When I
run two tests in parallel (so, 2 Valgrinds, each running 4 threads),
both tests finish earlier and with less CPU than if they are run one
after the other. I tried multiple times to run

  for i in 1 2 3; do; time ../vg-in-place --tool=none ./sleepers 1 0 2222222222222 B-B-B-B- & ; done

If I keep the &, they all run in parallel, each in about 32 seconds. If
I remove the &, they each run in sequence, each taking 43 to 48 seconds.

I have some difficulty interpreting this; the best explanation I have
is: when there is always a core available to run each of the 4 threads,
there is a high level of contention, and this contention causes threads
to be rescheduled without doing much interesting work. The paper by
Ulrich Drepper (see comment #8) mentions some bad cache-line effects
when there are 4 cores or more, and explains how to handle them. Maybe
a similar problem, and so a similar solution, might apply?
The thundering herd is going to be a big performance problem in a system with a large number of threads. As an alternative to using a ticket lock, you could just have per-thread futexes, and link them onto an explicit FIFO list structure. Then the release operation could explicitly WAKE the next thread's futex. It'll be a bit fiddly managing the list, but it can probably be all done with compare-and-swap.
> but when introducing fairness, Brussels finishes just before Petaouchnok :(. Where is Petaouchnok? Somewhere in Siberia maybe?
(In reply to comment #29) > make -s regtest completes now in the same time with and without the patch > on my setup. make regtest isn't a good performance test for this, alas. Almost all of the programs do very little real work, and what you're really measuring is how long it takes for JIT to translate the ld.so and libc startup and exit code, 570 times (or however many tests there are now).
(In reply to comment #31) > The thundering herd is going to be a big performance problem in a system with a > large number of threads. Even if we buy it off by a factor of N, by using N futexes instead of 1, as per comment #27 ? Also, the size of the thundering herd is the number of runnable threads; non-runnables aren't in it because they are blocked in syscalls and haven't arrived yet at the futex-wait point. And if we have zillions of runnable threads then we have a severe performance problem anyway due to the single-threading. (not that that's a good reason to make performance even worse, but still ..)
(In reply to comment #30)
> It looks like the thundering herd and/or performance problem is not
> solved.

Can we get some profiling data on this? At the moment we are more or
less guessing what the problem might be. Either oprofile, or profile
with cachegrind or callgrind; details below.

Also .. is ./sleepers deterministic? iow, do you expect that the number
of instructions taken to run it depends on the scheduling, or not?

To profile w/ cachegrind or callgrind (I think callgrind doesn't work so
well when running V itself): configure the V to be profiled with
--enable-inner and put it in a directory "inner" (let's say). Then run:

  ./trunk/Inst/bin/valgrind \
     --smc-check=all-non-file \
     -v --trace-children=yes --sim-hints=enable-outer \
     --tool=cachegrind [args for cachegrind] ./inner/Inst/bin/valgrind -v --tool=none \
     ./sleepers [args for sleepers]

and wait a long long time :-) Results can be interesting though.
(In reply to comment #35)
> Also .. is ./sleepers deterministic? iow, do you expect that the
> number of instructions taken to run it depends on the scheduling, or
> not?

I think it is deterministic: when run under cachegrind, there were only
ultra-small differences (e.g. 10 on xxxxxxxxxxxxx instructions) over 2
or 3 runs.

> and wait a long long time :-) Results can be interesting though.

I will take a look at this.
Not sure what cachegrind/callgrind can show here. They only simulate one cache hierarchy which is accessed by all threads. This cannot show what happens in reality. Even if we had a multi-core cache simulation, we would need to simulate the scheduling behavior of the Linux kernel. (from comment #30) > From what I can see, the more the computer is idle (i.e. no other > activities), the more cpu/the longer the test runs. If a core gets idle, it will be clocked down by the on-demand governor (or, on modern Intel processors with TurboBoost, active cores automatically have their clock frequency increased, and decreased when getting idle). So, if the serialized Valgrind threads run on different cores, they perhaps always start with reduced clock frequency, in contrast to a more utilized system?
(In reply to comment #30) > What I see with the new version of the patch is that there are big > variation in the cpu needed to finish: the fastest was about 32 seconds > CPU, the slowest was around 50 seconds (always using around 99% of cpu). > This is on a RHEL5.5, 4 cores Q9650 @ 3.00GHz. > > From what I can see, the more the computer is idle (i.e. no other activities), > the more cpu/the longer the test runs. > When I run two tests in parallel (so, 2 Valgrinds, running each 4 threads), > both tests finish earlier and with less CPU than if they are run one after > each other. > I tried multiple times to run > for i in 1 2 3; do; time ../vg-in-place --tool=none ./sleepers 1 0 > 2222222222222 B-B-B-B- & ; done > > If I keep the &, they all run in parallel, each in about 32seconds. > If I remove the &, they each run in sequence, each taking 43 to 48 seconds. There is an aspect of the ticket lock implementation in the attached patch that might explain the above. With two or more idle cores, each Valgrind context switch triggers two futex syscalls: one FUTEX_WAIT and one FUTEX_WAKE call. If, however, the Linux scheduler only runs the thread that wants to acquire Valgrind's big lock after that lock has already been released, the acquiring thread observes the lock as free and neither the releasing thread nor the acquiring thread invokes the futex syscall.
Created attachment 61962 [details] Patch that makes Valgrind scheduling fair The following test allowed me to reproduce the slowdown caused by the fair scheduling patch: for ((i=0;i<3;i++)); do time ./vg-in-place --tool=none ~/test/sleepers 1 0 1000000000 B-B-B-B-; done With the third version of the attached patch the slowdown is gone on my test setup (dual core). Does that third version also help on a system with more than two cores?
Does this third version avoid the slowdowns even if you set SCHEDULING_QUANTUM back to its original value (100000) ? I don't want to have a large increase in this value, since I think it could cause very poor behaviour for interactive programs in the worst case. The new value (10 million) means each thread could run for several seconds before another thread gets a turn.
(In reply to comment #40) > Does this third version avoid the slowdowns even if you set > SCHEDULING_QUANTUM back to its original value (100000) ? I don't want I retested the patch on an Ubuntu 11.04, 4-core Q6600, 2.4GHz. On this setup, there is no performance difference between the pipe version, the futex(100000) version and the futex(10000000) version. The best fairness is ensured by the futex(100000). The increase in quantum causes differences of a few seconds in thread termination time, so it indeed looks bad for interactive applications. There is a significant performance difference caused by cpu freq scaling (see Josef's comment #37). When doing sudo cpufreq-set -r -g performance (instead of the default -g ondemand), the time taken for the test changes from 40 seconds to 26 seconds (matching the clock speed increase). So, at least on Ubuntu 11.04+Q6600, the futex without QUANTUM change gives fairness without performance degradation. Next week, I will retest on RHEL 5.5 Q9650 3GHz and see if the perf difference can be explained by cpu freq scaling (e.g. the "pipe" version might keep Valgrind on a single "fast" core, while the "futex" version might switch Valgrind to a slower cpu very often). It would be worth verifying whether the slowdown of comment #39 is also influenced by cpu freq scaling.
I have done additional measurements on a RHEL5.5, 4-core Q9650 3 GHz (the measurements cover a bunch of combinations of "pipe" lock, "futex" lock, with/without QUANTUM changes, ...). I will attach the raw measurements/programs. The conclusions I drew from analyzing the raw data are:
* the "pipe" scheduler's bad fairness is confirmed.
* the "futex" scheduler's good fairness is confirmed.
* the increase of QUANTUM decreases the fairness of the "futex" scheduler (=> having a big QUANTUM is not good for interactive applications).
* cpu frequency scaling is not very friendly with the serialization done by Valgrind: cpu frequency scaling can decrease the performance of the "pipe" scheduler by 10 to 20%, and that of the "futex" scheduler by more (up to 50%).
* without cpu freq scaling (i.e. cpu always forced at max clock), the futex scheduler seems slightly faster, but the difference is small (around 1..2%), so there are not enough measurements to be sure.
It would be nice to have a system call to ask the kernel: "please have all my threads run on the same cpu, but without fixing which cpu, so that if needed all the threads can migrate to another CPU". To my knowledge, this does not exist. So, it might be simulated by fixing the affinity of all threads to the current cpu, then running for some QUANTUM * NR_OF_QUANTUM_ON_SAME_FIXED_CPU, then unsetting the affinity and running for some QUANTUM * NR_OF_QUANTUM_ON_FREE_CPU (this should allow the linux scheduler to move Valgrind to a less busy cpu if needed), then again fixing to the current cpu, etc. With the fair scheduler + this "same cpu preference", we might obtain fairness and good performance, even with cpu freq scaling. There is of course always a risk of decreasing performance (e.g. if a lot of threads are accessing different data, running them on different cpus will give them more L1 or so cache than if they all run on the same cpu).
There is also a risk that the above is very dependent on the linux scheduler (I think there are new scheduling algorithms in very recent linux kernels). => so, these NR_OF_QUANTUM_ON_* parameters might be clo options for real power users. If both are given as 0, it might even mean "use the good old pipe scheduler". There are some other things still to look at: 1. at least on RHEL5.5, the fair scheduler does not link in x86/32-bit mode (I guess "__sync_fetch_and_add" might have to be coded in assembler rather than using the gcc builtin). 2. something must be done for MacOS (is there a futex equivalent on this beast?) 3. it might be that the perf. behaviour is very different on other platforms (ppc? arm?). Philippe
Created attachment 62255 [details] measurement raw data (read the next attachment for some explanation)
Created attachment 62256 [details] shell script run on RHEL5.5 to obtain raw measurements
Created attachment 62257 [details] slightly modified sleepers.c test program
(In reply to comment #40) > Does this third version avoid the slowdowns even if you set > SCHEDULING_QUANTUM back to its original value (100000) ? I don't want > to have a large increase in this value, since I think it could cause > very poor behaviour for interactive programs in the worst case. The > new value (10 million) means each thread could run for several seconds > before another thread gets a turn. I have restored the third version that preserves the value of SCHEDULING_QUANTUM.
(In reply to comment #40) > It would be nice to have a system call to ask to the kernel: > "please have all my threads running on the same cpu but without fixing the > cpu so that if needed > all the threads can migrate to another CPU". > To my knowledge, this does not exist. > So, it might be simulated by fixing affinity of all threads > to the current cpu > then run for some QUANTUM * NR_OF_QUANTUM_ON_SAME_FIXED_CPU > then unset the affinity > then run for some QUANTUM * NR_OF_QUANTUM_ON_FREE_CPU > (this should allow the linux scheduler to move Valgrind to a less busy cpu > if needed) > then again fixing to the current cpu, etc. My own preference is to modify the Linux kernel instead of implementing a workaround in Valgrind. There are other applications that need to have two threads run on the same CPU in order to avoid an excessive number of cache misses. But I'm afraid it might take a while before such a system call is backported to RHEL.
(In reply to comment #46) > (In reply to comment #40) > I have restored the third version that preserves the value of > SCHEDULING_QUANTUM. The 3rd version of the patch had the QUANTUM change (better rolled back) but also had other things that I believe should be kept (e.g. it was properly implementing the ownership assertions). So, it is probably easier/better to make a version 4, which is version 3 minus the QUANTUM single-line change. (Or I misunderstood the differences between v2 and v3 of the patch.) Philippe
(In reply to comment #47) > My own preference is to modify the Linux kernel instead of implementing a > workaround in Valgrind. There are other applications that need having two > threads to run on the same CPU in order to avoid an excessive number of cache > misses. But I'm afraid it might take a while before such a system call is > backported to RHEL. Having the correct system call indeed looks like the nice/correct approach, but I suspect it can take a significant time to have the syscall implemented, then merged in the linux kernel, then shipped in the various distributions, then to have all Valgrind users upgrade their kernels: all the valgrind users that upgrade valgrind faster than their distribution might suffer from a 10 to 50% degradation (or they have to bypass this by either forcing a single CPU or disabling cpu frequency scaling). Do you already have an idea about what to do for the x86 link and MacOS? Philippe
Created attachment 62272 [details] Patch that makes Valgrind scheduling fair Remove SCHEDULING_QUANTUM change from the fairness improvement patch.
(In reply to comment #49) > Do you already have an idea about what to do for the x86 link and MacOS ? The suggestion to implement atomic increment on x86 via inline assembly makes sense to me. With regards to OS/X support I see two options: - Either look up how pthread_cond_wait() and pthread_cond_signal() are implemented in the online Darwin C library source code and use that knowledge to port the ticket lock patch to Darwin. - Or use the existing big lock implementation (anonymous pipe) on OS/X.
Created attachment 65446 [details] Patch that makes Valgrind scheduling fair Patch that implements the described proposal, v5. Changes compared to v4: - Builds now on 32-bit Linux systems (at least RHEL 4). Should build on Darwin too (not tested yet). - valgrind -v ... prints now the name of the ticket lock implementation selected at configure time. - Added more comments.
> With regards to OS/X support I see two options: > - Either look up how pthread_cond_wait() and pthread_cond_signal() are > implemented in the online Darwin C library source code and use that knowledge > to port the ticket lock patch to Darwin. > - Or use the existing big lock implementation (anonymous pipe) on OS/X. How about the following proposal. I make it because, on the one hand, I see that we need fair scheduling, at least some times. But on the other hand the potential performance losses make me nervous. Proposal is: add this feature, but make it disabled by default, at least for the time being, whilst we gather more experience with it. Add a flag to enable it. So, then the existing pipe based lock is the default, and works everywhere; and for platforms that can support the fair scheduler (Linux), it can be selected with a flag. For platforms that can't or don't yet support it (OSX), we can simply refuse to start if the flag is given.
Created attachment 65493 [details] Patch that makes Valgrind scheduling fair (In reply to comment #53) > Proposal is: add this feature, but make it disabled by default, at least > for the time being, whilst we gather more experience with it. Add a flag > to enable it. So, then the existing pipe based lock is the default, and > works everywhere; and for platforms that can support the fair scheduler > (Linux), it can be selected with a flag. For platforms that can't or > don't yet support it (OSX), we can simply refuse to start if the flag is > given. Added flag --fair-sched=[no|yes] in version 6 of the patch.
Created attachment 65525 [details] Patch that makes Valgrind scheduling fair Add --fair-sched=try and use this in the regression tests such that these pass again on systems where the fair scheduling support doesn't build.
Sorry that it took so long but I finally found the time to test this patch on Darwin, and it seems to work fine there (i.e. doesn't cause any regressions). Any objections against applying this patch on the trunk ?
(In reply to comment #56) > Sorry that it took so long but I finally found the time to test this patch on > Darwin, and it seems to work fine there (i.e. doesn't cause any regressions). > Any objections against applying this patch on the trunk ? Committing this looks like a good idea. A few additional remarks/questions: * It would be worth adding a user manual entry for the new option and explaining which problem it solves. * The new files contain emacs settings, while other files do not. * On RHEL 5.5, the ticket lock is ok in 64 bits, but does not link in 32 bits. => will the ticket lock be fully disabled then? or enabled only for 64 bits?
(In reply to comment #57) > * On RHEL 5.5, ticket lock is ok in 64 bits, but does not link in 32 bits. > => will the ticket lock be fully disabled then ? > or enabled only for 64 bits ? Strange. Valgrind + the latest version of the fair scheduling patch builds fine here on 32-bit RHEL 4.9. Are you sure that you were using the latest version of the patch and that you have rerun configure before starting compilation ?
(In reply to comment #58) > (In reply to comment #57) > > * On RHEL 5.5, ticket lock is ok in 64 bits, but does not link in 32 bits. > > => will the ticket lock be fully disabled then ? > > or enabled only for 64 bits ? > > Strange. Valgrind + the latest version of the fair scheduling patch builds fine > here on 32-bit RHEL 4.9. Are you sure that you were using the latest version of > the patch and that you have rerun configure before starting compilation ? Sorry, I was completely unclear. I will try to rephrase the question: will Valgrind on RHEL5.5 be configured with ticket lock in 64 bits and with pipe lock in 32 bits ? Or will pipe lock be configured both in 32 and 64 bits in case ticket lock is not ok in 32 bits ?
(In reply to comment #59) > will Valgrind on RHEL5.5 be configured with ticket lock in 64 bits > and with pipe lock in 32 bits ? > > Or will pipe lock be configured both in 32 and 64 bits in case ticket lock > is not ok in 32 bits ? The (fair) ticket lock code is compiled if and only if: * The header <linux/futex.h> is found by the configure script and it has a definition for FUTEX_WAIT. * Using the gcc builtin functions for atomic memory access does not result in a linker error. These configure tests should make sure the ticket lock code is compiled only on 64-bit Linux systems but neither on 32-bit Linux systems nor on Darwin systems.
(In reply to comment #60) > The (fair) ticket lock code is compiled if and only if: > * The header <linux/futex.h> is found by the configure script and it has a > definition for FUTEX_WAIT. > * Using the gcc builtin functions for atomic memory access does not result in a > linker error. On RHEL5.5 64 bits, the primary build arch (amd64) will pass these tests, but the secondary build arch (x86) will fail these checks (no gcc builtin). => in such a case, is the primary arch using ticket lock and the secondary the pipe lock ?
(In reply to comment #61) > On RHEL5.5 64 bits, the primary build arch (amd64) will pass these > tests, but the secondary build arch (x86) will fail these checks (no gcc > builtin). > => in such a case, is the primary arch using ticket lock and the secondary the > pipe lock ? In that case the ticket lock will be used for both architectures - at least if enabled via the proper --fair-sched=... option.
(In reply to comment #62) > (In reply to comment #61) > > On RHEL5.5 64 bits, the primary build arch (amd64) will pass these > > tests, but the secondary build arch (x86) will fail these checks (no gcc > > builtin). > > => in such a case, is the primary arch using ticket lock and the secondary the > > pipe lock ? > > In that case the ticket lock will be used for both architectures - at least if > enabled via the proper --fair-sched=... option. I do not understand this. If configure verifies only for the primary arch that the ticket lock is usable, then it will fail to link the 32 bit.
(In reply to comment #63) > If configure verifies only for the primary arch > that the ticket lock is usable, then it will fail to link the 32 bit. The above statement is wrong. If anyone is interested, some output obtained from an amd64 system:

$ objdump -d coregrind/libcoregrind_amd64_linux_a-ticket-lock-linux.o
[ ... ]
0000000000000020 <release_sched_lock>:
[ ... ]
  32: b8 01 00 00 00     mov    $0x1,%eax
  37: f0 0f c1 07        lock xadd %eax,(%rdi)
[ ... ]

$ objdump -d coregrind/libcoregrind_x86_linux_a-ticket-lock-linux.o
[ ... ]
00000020 <release_sched_lock>:
[ ... ]
  3b: ba 01 00 00 00     mov    $0x1,%edx
  40: f0 0f c1 10        lock xadd %edx,(%eax)
[ ... ]
(In reply to comment #64) > (In reply to comment #63) > > If configure verifies only for the primary arch > > that the ticket lock is usable, then it will fail to link the 32 bit. > > The above statement is wrong. Well, the above statement is wrong on your system, but is correct on mine :). I think that what is happening is: on a RHEL5.5 64 bits, gcc compiling in 64 bits has the needed builtin but gcc compiling in 32 bits does not. So, the configure check (done only for the primary arch, 64 bits) succeeds, the 64-bit arch of valgrind is compiled and linked correctly, but the 32-bit one then fails to link. Below is the failing link (and a grep showing that TICKET lock is activated). The below is a 3.7.0 with the last version of the fair patch, autogen.sh and configure re-run. From what I can see, the configure test should be done for both the primary arch and the secondary arch, as, among others, the support for builtins can differ. Hope this clarifies the mystery Philippe

...
make[3]: Entering directory `/cm/local_build/weekly/fair/memcheck'
../coregrind/link_tool_exe_linux 0x38000000 gcc -Wno-long-long -Wno-pointer-sign -fno-stack-protector -o memcheck-x86-linux -m32 -O2 -g -Wall -Wmissing-prototypes -Wshadow -Wpointer-arith -Wstrict-prototypes -Wmissing-declarations -Wno-format-zero-length -fno-strict-aliasing -fno-builtin -O2 -static -nodefaultlibs -nostartfiles -u _start -m32 -static -nodefaultlibs -nostartfiles -u _start -m32 memcheck_x86_linux-mc_leakcheck.o memcheck_x86_linux-mc_malloc_wrappers.o memcheck_x86_linux-mc_main.o memcheck_x86_linux-mc_translate.o memcheck_x86_linux-mc_machine.o memcheck_x86_linux-mc_errors.o ../coregrind/libcoregrind-x86-linux.a ../VEX/libvex-x86-linux.a -lgcc
../coregrind/libcoregrind-x86-linux.a(libcoregrind_x86_linux_a-ticket-lock-linux.o): In function `release_sched_lock':
/cm/local_build/weekly/fair/coregrind/m_scheduler/ticket-lock-linux.c:159: undefined reference to `__sync_fetch_and_add_4'
/cm/local_build/weekly/fair/coregrind/m_scheduler/ticket-lock-linux.c:162: undefined reference to `__sync_fetch_and_add_4'
../coregrind/libcoregrind-x86-linux.a(libcoregrind_x86_linux_a-ticket-lock-linux.o): In function `acquire_sched_lock':
/cm/local_build/weekly/fair/coregrind/m_scheduler/ticket-lock-linux.c:116: undefined reference to `__sync_fetch_and_add_4'
collect2: ld returned 1 exit status
make[3]: *** [memcheck-x86-linux] Error 1
make[3]: Leaving directory `/cm/local_build/weekly/fair/memcheck'
make[2]: *** [check-recursive] Error 1
make[2]: Leaving directory `/cm/local_build/weekly/fair/memcheck'
make[1]: *** [check-recursive] Error 1
make[1]: Leaving directory `/cm/local_build/weekly/fair'
make: *** [check] Error 2

wao@dhws018: uname -a
Linux dhws018 2.6.18-194.26.1.el5 #1 SMP Fri Oct 29 14:21:16 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

wao@dhws018: gcc -v
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-libgcj-multifile --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --enable-plugin --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic --host=x86_64-redhat-linux
Thread model: posix
gcc version 4.1.2 20080704 (Red Hat 4.1.2-48)

wao@dhws018: grep TICKET config.h
#define ENABLE_LINUX_TICKET_LOCK 1
wao@dhws018:
Created attachment 66411 [details] Patch that makes Valgrind scheduling fair > On a RHEL5.5 64bits, gcc compiling in 64 bits has the needed builtin but gcc > compiling in 32 bits does not have the needed builtin. Thanks for reporting that. Does the new version of the fair scheduling patch help ?
(In reply to comment #66) > Created an attachment (id=66411) [details] > Patch that makes Valgrind scheduling fair > > > On a RHEL5.5 64bits, gcc compiling in 64 bits has the needed builtin but gcc > > compiling in 32 bits does not have the needed builtin. > > Thanks for reporting that. Does the new version of the fair scheduling patch > help ? The new version properly compiles and links in 32 and 64 bits on RHEL5.5. The resulting 64-bit valgrind has fair scheduling. The resulting 32-bit valgrind has no fair scheduling (which is what is expected). So, this is all good. One small remark: the resulting config.h contains a confusing line:

/* Define to 1 if gcc supports __sync_add_and_fetch() on the secondary
   architecture */
#define HAVE_BUILTIN_ATOMIC 1

The config.h does not mention the primary architecture. I see that the compilation commands for 64 bits use -DENABLE_LINUX_TICKET_LOCK, while the compilation commands for 32 bits do not define this symbol. So, it is not clear why the config.h contains the HAVE_BUILTIN_ATOMIC for the secondary arch. Thanks
Created attachment 66417 [details] Patch that makes Valgrind scheduling fair > One small remark: the resulting config.h contains a confusing line: > /* Define to 1 if gcc supports __sync_add_and_fetch() on the secondary > architecture */ > #define HAVE_BUILTIN_ATOMIC 1 There was one line too much in configure.in - should be fixed now.
Excellent. btw I used this the other day to debug some concurrent code w/ Helgrind and DRD. +1 to land whenever you think it's ready. btw, one thing in the patch:

+   if (VG_(strcmp)(tmp_str, "yes") == 0)
+      VG_(clo_fair_sched) = enable_fair_sched;
+   else if (VG_(strcmp)(tmp_str, "try") == 0)
+      VG_(clo_fair_sched) = disable_fair_sched;
+   else if (VG_(strcmp)(tmp_str, "no") == 0)
+      VG_(clo_fair_sched) = try_fair_sched;

looks a bit strange. "try" produces disable_ and "no" produces try_. Is that what you intended? Maybe I misunderstand something.
(In reply to comment #67) > I see that the compilation commands for 64 bits are using > -DENABLE_LINUX_TICKET_LOCK, > while the compilation commands for 32 bits are not defining this symbol. > So, not clear why the config.h contains the HAVE_BUILTIN_ATOMIC for the > secondary arch. You might have missed these changes in coregrind/Makefile.am. These enable -DENABLE_LINUX_TICKET_LOCK depending on the build architecture and the values of ENABLE_LINUX_TICKET_LOCK_PRIMARY and ENABLE_LINUX_TICKET_LOCK_SECONDARY: @@ -371,6 +375,13 @@ libcoregrind_@VGCONF_ARCH_PRI@_@VGCONF_OS@_a_CFLAGS = \ $(AM_CFLAGS_@VGCONF_PLATFORM_PRI_CAPS@) libcoregrind_@VGCONF_ARCH_PRI@_@VGCONF_OS@_a_CCASFLAGS = \ $(AM_CCASFLAGS_@VGCONF_PLATFORM_PRI_CAPS@) +if ENABLE_LINUX_TICKET_LOCK_PRIMARY +libcoregrind_@VGCONF_ARCH_PRI@_@VGCONF_OS@_a_SOURCES += \ + m_scheduler/ticket-lock-linux.c +libcoregrind_@VGCONF_ARCH_PRI@_@VGCONF_OS@_a_CFLAGS += \ + -DENABLE_LINUX_TICKET_LOCK +endif + if VGCONF_HAVE_PLATFORM_SEC libcoregrind_@VGCONF_ARCH_SEC@_@VGCONF_OS@_a_SOURCES = \ $(COREGRIND_SOURCES_COMMON) @@ -382,6 +393,12 @@ libcoregrind_@VGCONF_ARCH_SEC@_@VGCONF_OS@_a_CFLAGS = \ $(AM_CFLAGS_@VGCONF_PLATFORM_SEC_CAPS@) libcoregrind_@VGCONF_ARCH_SEC@_@VGCONF_OS@_a_CCASFLAGS = \ $(AM_CCASFLAGS_@VGCONF_PLATFORM_SEC_CAPS@) +if ENABLE_LINUX_TICKET_LOCK_SECONDARY +libcoregrind_@VGCONF_ARCH_SEC@_@VGCONF_OS@_a_SOURCES += \ + m_scheduler/ticket-lock-linux.c +libcoregrind_@VGCONF_ARCH_SEC@_@VGCONF_OS@_a_CFLAGS += \ + -DENABLE_LINUX_TICKET_LOCK +endif endif #----------------------------------------------------------------------------
Created attachment 66448 [details] Patch that makes Valgrind scheduling fair Fix semantics of --fair-sched=try and --fair-sched=no.
Applied the attached patch as r12280.