Created attachment 90685 [details]
Fix coregrind semaphore starvation when running threads-per-CPU.

Hello. I recently had to debug valgrind jobs with massive starvation issues in m_scheduler. In the end it looked like a deadlock, but it is strictly speaking just a massive fairness issue, and --fair-sched=yes turned out to be a workaround (which I had not heard of). Filing a bug and patch anyway, because it took a lot of tracing and reasoning about the application's behavior to understand it.

The issue arises primarily when debugging apps based on Intel's DPDK or, more generally, event-driven frameworks which run multithreaded, but with strictly one thread per CPU core and pinned. Citing the patch description:

--snip--
Fix coregrind semaphore starvation when running threads-per-CPU.

Some event-driven frameworks (such as DPDK) run with only a select number of system threads, each pinned to a unique system CPU core. Between loop iterations, some of these (such as DPDK) won't yield the CPU to wait for kernel events, but spin in a synchronous loop to poll resources mapped to userspace.

In such a scenario, a running thread releases the semaphore just to immediately re-acquire it again. At least on Linux, the running thread is by far the most likely to succeed at an immediate sema_down after a sema_up. The result is almost complete starvation of any other runnable task.

Fixed by setting the pipe FD to non-blocking, so all tasks remain runnable while competing in sema_down(). Failing with EAGAIN inserts a sched_yield(), to reduce some of the resulting overhead on UP systems.

This is admittedly not an elegant or lightweight solution. A better one will presumably have to employ a scheduling mechanism which is fairer than pipes, or emulate one.
--snip--

Yes, --fair-sched=yes would resolve the issue. But in the case at hand, the problem manifests not just as degraded performance, but as a complete stall of progress, unless the user manually induces some scheduling noise (typically I can remote-control progress on/off just by running a grep on a different console). I believe that if pipes are the default, they should just work.
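For illustration, the attached change amounts to something like the following standalone sketch. This is not the actual coregrind code (which cannot call into libc); the type and function names are only meant to show the non-blocking read plus sched_yield() shape of the fix:

/* Standalone sketch of the patch's idea (not the actual coregrind code):
 * the pipe's read end is non-blocking, so a contender that loses the race
 * gets EAGAIN and yields instead of sleeping inside read(), which keeps
 * every thread runnable and lets the kernel rotate between them. */
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>

typedef struct { int fd[2]; } sema_t;        /* fd[0] read end, fd[1] write end */

static void sema_init(sema_t *s)
{
    pipe(s->fd);
    fcntl(s->fd[0], F_SETFL, O_NONBLOCK);    /* the core of the patch */
    char token = 'T';
    write(s->fd[1], &token, 1);              /* semaphore starts "up" */
}

static void sema_down(sema_t *s)
{
    char token;
    for (;;) {
        ssize_t n = read(s->fd[0], &token, 1);
        if (n == 1)
            return;                          /* got the token */
        if (n == -1 && errno == EAGAIN)
            sched_yield();                   /* lost the race: yield, then retry */
        /* EINTR or other transient errors: just retry */
    }
}

static void sema_up(sema_t *s)
{
    char token = 'T';
    write(s->fd[1], &token, 1);
}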
I'm a bit confused. Would making fair-sched the default be an acceptable alternative solution, or does this fix achieve something beyond that?
Also, what is the cost of this patch? The description hints at the fact that it has undesirable consequences and that the yield is an attempt to mitigate the worst of them.

Isn't the whole point of --fair-sched being a switch that is off by default that we didn't have a solution that would fix the rare problem cases without having a price that we didn't want to pay when we didn't need to? Hence putting it behind a switch so the people that do need it can turn it on.
(In reply to Tom Hughes from comment #2)
> Isn't the whole point of --fair-sched being a switch that is off by default
> that we didn't have a solution that would fix the rare problem cases without
> having a price that we didn't want to pay when we didn't need to?

In practice, for all serious stuff (Firefox etc., which often has > 50 threads) I use --fair-sched all the time. I wouldn't be averse to setting it on by default on targets that support it (Linux, Android).
> Also, what is the cost of this patch?

It's busy-polling. All threads stay awake and consume cycles; the sched_yield() just avoids wasting more cycles than necessary under contention. But the result is higher scheduling overhead and higher CPU utilization.

This ticket mainly exists because of the starvation that otherwise arises, which is frustrating to users who are unaware of the default mechanism and its consequences.

Suggestions:
- This patch.
- Starvation detection. Is the expected time spent with the semaphore held roughly bounded? Then poll() the pipe and send a warning to the console when timing out; the warning suggests --fair-sched=yes. (A rough sketch of this follows below.)
- Default to --fair-sched=yes. Not sure about the negative implications.
- Default to --fair-sched=yes when running multithreaded on SMP platforms.
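To make the second suggestion concrete: a hypothetical poll()-with-timeout wrapper could look like the sketch below. It is not coregrind code (coregrind cannot call into libc and would print through its own message routines rather than fprintf); the 5-second bound and all names are made up for illustration.

/* Hypothetical sketch of "poll, then warn on timeout": keep the existing
 * blocking semantics, but bound the wait with poll() and print a one-shot
 * hint at --fair-sched=yes if the big lock could not be taken in time. */
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

static void sema_down_warn(int read_fd)
{
    static int warned = 0;
    char token;

    for (;;) {
        struct pollfd pfd = { .fd = read_fd, .events = POLLIN };
        int ready = poll(&pfd, 1, 5000 /* ms */);

        if (ready == 0) {                     /* timed out: likely starvation */
            if (!warned) {
                fprintf(stderr, "scheduler: waited >5s for the big lock; "
                                "consider running with --fair-sched=yes\n");
                warned = 1;
            }
            continue;                         /* keep waiting for the token */
        }

        if (ready > 0) {
            /* If another thread steals the byte between poll() and read(),
               this blocks, just as the stock implementation would. */
            if (read(read_fd, &token, 1) == 1)
                return;
        }
        /* ready < 0 (e.g. EINTR): retry */
    }
}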
(In reply to Tom Hughes from comment #2)
> Also, what is the cost of this patch? The description hints at the fact that
> it has undesirable consequences and that the yield is an attempt to mitigate
> the worst of them.
>
> Isn't the whole point of --fair-sched being a switch that is off by default
> that we didn't have a solution that would fix the rare problem cases without
> having a price that we didn't want to pay when we didn't need to? Hence
> putting it behind a switch so the people that do need it can turn it on.

When fair-sched was added, it was decided not to activate it by default, because it can decrease the performance of multi-threaded programs by up to 50%, as fair-sched interacts badly with cpu frequency scaling. More info about that is in http://valgrind.org/docs/manual/manual-core.html, 2.7.1 "Scheduling and Multi-Thread Performance".

Having a busy-wait loop in the Valgrind scheduler does not seem like a good idea.

It is not very clear how using poll then read will solve the problem: I suspect all threads will be woken up when the busy thread releases the lock, and only one of the woken threads will succeed in reading; the others will block in read, and we are back to square one: either starvation, or we have to change the read to a non-blocking read. Such a poll + non-blocking read will also probably suffer from the 'thundering herd' syndrome.

Note that the initial implementation of fair-sched suffered from this thundering-herd problem. It was solved by having an array of futexes; the ticket number was used to decide which futex to wait on, and which futex to wake (roughly as sketched below).
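For readers unfamiliar with that trick, here is a rough user-space approximation of a ticket lock over a small futex array. It is not the actual fair-sched code (which lives in coregrind and issues raw syscalls rather than going through libc); the slot count, the names, and the wake-all per slot are simplifications.

/* Rough user-space approximation of a ticket lock over a futex array:
 * a waiter sleeps on futex[ticket % N_FUTEX], and the releaser bumps and
 * wakes only that slot, so most waiters are left undisturbed instead of
 * stampeding on one word.  Not the actual coregrind implementation. */
#include <limits.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

#define N_FUTEX 8                       /* arbitrary slot count for the sketch */

typedef struct {
    atomic_uint next_ticket;            /* ticket dispenser */
    atomic_uint now_serving;            /* ticket currently allowed to proceed */
    atomic_uint futex[N_FUTEX];         /* one wait word per slot */
} ticket_lock;

static void tl_acquire(ticket_lock *l)
{
    unsigned t = atomic_fetch_add(&l->next_ticket, 1);
    unsigned *slot = (unsigned *)&l->futex[t % N_FUTEX];

    for (;;) {
        unsigned seen = atomic_load(&l->futex[t % N_FUTEX]);
        if (atomic_load(&l->now_serving) == t)
            return;                     /* our turn */
        /* Sleep only while the slot word is still 'seen'; the releaser bumps
           it before waking, so a concurrent release cannot be missed. */
        syscall(SYS_futex, slot, FUTEX_WAIT, seen, NULL, NULL, 0);
    }
}

static void tl_release(ticket_lock *l)
{
    unsigned next = atomic_fetch_add(&l->now_serving, 1) + 1;
    unsigned *slot = (unsigned *)&l->futex[next % N_FUTEX];

    atomic_fetch_add(&l->futex[next % N_FUTEX], 1);
    /* Wake everyone on that slot: with more threads than slots, two tickets
       can share one word, and waking all keeps the sketch correct. */
    syscall(SYS_futex, slot, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}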
(In reply to Daniel Stodden from comment #4)
> > Also, what is the cost of this patch?
>
> It's busy-polling. All threads stay awake and consume cycles; the
> sched_yield() just avoids wasting more cycles than necessary under
> contention. But the result is higher scheduling overhead and higher CPU
> utilization.
>
> This ticket mainly exists because of the starvation that otherwise arises,
> which is frustrating to users who are unaware of the default mechanism and
> its consequences.
>
> Suggestions:
> - This patch.
> - Starvation detection. Is the expected time spent with the semaphore held
>   roughly bounded? Then poll() the pipe and send a warning to the console
>   when timing out; the warning suggests --fair-sched=yes.
> - Default to --fair-sched=yes. Not sure about the negative implications.
> - Default to --fair-sched=yes when running multithreaded on SMP platforms.

Maybe we could implement the starvation detection in the Valgrind scheduler, in the thread that gets most/all of the CPU: every XX major scheduling events, the running thread could look at the other threads' counters and see whether they could progress (see the sketch below).

The best solution is probably to have Valgrind running more than one thread in parallel. This should give both efficiency and (I hope) reasonable fairness. However, a 'really multi-threaded Valgrind' is not that easy. See bug 301830 (and, if you go to the FOSDEM Valgrind devroom, the talk about this mtV prototype).
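To make that concrete, a sketch of such a check might look as follows. The thread_info bookkeeping, CHECK_INTERVAL, and the yield-on-detection response are all invented for illustration; this is not existing scheduler code.

/* Hypothetical sketch of the scheduler-side starvation check: every thread
 * counts how often it wins the big lock; every CHECK_INTERVAL wins, the
 * current winner looks for runnable threads whose counter has not moved
 * since the last check and, if it finds any, gives up its turn. */
#include <sched.h>
#include <stdbool.h>

#define MAX_THREADS    64
#define CHECK_INTERVAL 1000

typedef struct {
    unsigned long acquisitions;   /* bumped whenever this thread wins the lock */
    unsigned long last_seen;      /* snapshot from the previous check */
    bool          runnable;
} thread_info;

static thread_info threads[MAX_THREADS];
static int n_threads;

/* Called by thread 'self' right after it acquires the big lock. */
static void starvation_check(int self)
{
    thread_info *me = &threads[self];

    if (++me->acquisitions % CHECK_INTERVAL != 0)
        return;                               /* fast path: no bookkeeping */

    bool someone_starved = false;
    for (int i = 0; i < n_threads; i++) {
        if (i == self || !threads[i].runnable)
            continue;
        if (threads[i].acquisitions == threads[i].last_seen)
            someone_starved = true;           /* runnable, but made no progress */
        threads[i].last_seen = threads[i].acquisitions;
    }

    if (someone_starved)
        sched_yield();  /* or drop and re-take the lock, or print a hint */
}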
> When fair-sched was added, it was decided to not activate fair-sched by default,
> because this can decrease the performance of multi-threaded programs by up to 50%,
> as fair-sched interacts badly with cpu frequency scaling.

Hi again. I just looked at the fair-sched implementation and got to rethink things.

So I understand fair-sched is completely fair indeed, a fully-featured ticket lock, and that is heavier indeed. I'm not surprised it's not turned on by default. Just to restate the well-known: a regular lock may hold a populated waitqueue, but a thread which happens to acquire simply by virtue of being the one scheduled will preempt that queue. For regular locks and mutexes, less fairness contributes a lot to better throughput. Pretty much every preemptively scheduled thread package does that.

Regarding the semaphore's scope: is my understanding correct that it effectively serializes *all* userspace execution under Valgrind? (Meaning: the only notable exception being system calls? I think I saw those bits releasing the lock while browsing some code.) Meaning: that one-(polling-)thread-per-CPU case I'm after is likely the most malicious workload, turning lock fairness upside down under these conditions. In fact, I'm not even sure a regular NPTL mutex would produce something much different from what I got with pipes. Unfortunately it's a pattern of rising popularity for I/O-bound workloads. :P

I think I understand this better now. So here's my thesis: don't blame the pipes. I used to suspect an NPTL lock would behave better. Now I think it won't. It's that mix of pinning and serialization; there are no tickets in futex() either, so it won't gratuitously schedule for the sake of other threads. And multithreaded Valgrind is not coming anytime soon, agreed.

Conclusion: this patch is indeed not a good fix. The much saner solution is what Philippe suggests: starvation detection / prevention for runnable tasks.

I think that means Linux builds should default to a different lock, but not primarily a ticket lock. Instead, I'd suggest accepting that serialization is a) a necessary evil but b) promotes starvation. So: augment a dedicated lock with time slices to prevent it (a rough sketch follows below):

- Lock eagerly, preempting other waiters, for a number of rounds. That's the fast path, different from fair-sched=yes.
- But sample whether there are waiting tasks, taking time stamps.
- Too many successes make the winning thread move to the end of the run queue, in favour of the next runnable one in line. That's the slow but necessary ticketing case.

And I guess that could be the default algorithm.

The portable way could just use a POSIX mutex/cond pair, and Valgrind could default to that on any platform unless it lacks pthreads. Any reason to prefer raw futex()es?

If I wrote either one, any chance of getting it in? Any better suggestions certainly welcome.
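As a concrete (if simplified) reading of the proposal above, here is a pthread-based sketch of a bounded-unfairness lock. It is purely illustrative (and, as the next comment points out, coregrind cannot actually link against libpthread); the global win streak instead of a per-thread one, the MAX_WINS bound, and the omission of time-stamp sampling are all simplifications.

/* Illustrative bounded-unfairness lock built from a POSIX mutex/cond pair.
 * Acquirers may barge in ahead of waiters (fast path) until MAX_WINS
 * consecutive barge-ins have happened; after that, new acquirers take a
 * ticket and queue FIFO behind the existing waiters. */
#include <pthread.h>
#include <stdbool.h>

#define MAX_WINS 64            /* arbitrary bound on consecutive eager wins */

typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t  cv;
    bool            held;
    unsigned        waiters;                  /* threads parked on the slow path */
    unsigned        next_ticket, now_serving; /* FIFO order for the slow path */
    unsigned        streak;                   /* consecutive fast-path wins */
} hybrid_lock;

static hybrid_lock big_lock = {
    .mu = PTHREAD_MUTEX_INITIALIZER,
    .cv = PTHREAD_COND_INITIALIZER,
};

static void hl_acquire(hybrid_lock *l)
{
    pthread_mutex_lock(&l->mu);

    /* Fast path: take the lock eagerly, ahead of any waiters, as long as
       the streak of such wins stays below the bound. */
    if (!l->held && (l->waiters == 0 || l->streak < MAX_WINS)) {
        l->held = true;
        l->streak++;
        pthread_mutex_unlock(&l->mu);
        return;
    }

    /* Slow path: queue behind everyone else, strictly in ticket order. */
    unsigned ticket = l->next_ticket++;
    l->waiters++;
    while (l->held || ticket != l->now_serving)
        pthread_cond_wait(&l->cv, &l->mu);
    l->waiters--;
    l->now_serving++;
    l->held = true;
    l->streak = 0;             /* a queued hand-off resets the barge-in streak */
    pthread_mutex_unlock(&l->mu);
}

static void hl_release(hybrid_lock *l)
{
    pthread_mutex_lock(&l->mu);
    l->held = false;
    if (l->waiters > 0)
        pthread_cond_broadcast(&l->cv);   /* wake the queue; the right ticket wins */
    pthread_mutex_unlock(&l->mu);
}

In this sketch, hl_acquire()/hl_release() on the static big_lock would stand in for the current sema_down()/sema_up() pair; the streak bound is what keeps the eager fast path from starving the ticket queue.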
(In reply to Daniel Stodden from comment #7)
> I think that means Linux builds should default to a different lock, but not
> primarily a ticket lock. Instead, I'd suggest accepting that serialization is
> a) a necessary evil but b) promotes starvation. So: augment a dedicated lock
> with time slices to prevent it:
>
> - Lock eagerly, preempting other waiters, for a number of rounds. That's the
>   fast path, different from fair-sched=yes.
> - But sample whether there are waiting tasks, taking time stamps.
> - Too many successes make the winning thread move to the end of the run
>   queue, in favour of the next runnable one in line. That's the slow but
>   necessary ticketing case.
>
> And I guess that could be the default algorithm.
>
> The portable way could just use a POSIX mutex/cond pair, and Valgrind could
> default to that on any platform unless it lacks pthreads. Any reason to
> prefer raw futex()es?
>
> If I wrote either one, any chance of getting it in? Any better suggestions
> certainly welcome.

Certainly I would be interested in looking at an implementation which has most of the anti-starvation properties of the existing fair-sched but avoids most of the overhead. If you can implement your suggestion within the constraints of the code base, I'd be interested to look at it.

It sounds complex, though, and there's one important fact that complicates it: use of libc and libpthread is not allowed. The tool executables are completely standalone, statically linked, and rely on no other libraries at all. You have to do raw syscalls yourself. Mostly these are wrapped up in a nano-libc, m_libc*.c. But there's no pthread_ anything, for example.

Also it needs to run on all targets -- hence any magic bits of atomic assembly code etc. will need to be abstracted out. And we need to retain the pipe implementation as a fallback on MacOS, since MacOS doesn't appear to have futexes. That said, if it looks good on (e.g.) x86_64, I'm sure we can muster support to do the inline assembly bits on other targets. It's about time we had an m_libcatomic.c.
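For what it's worth, such an abstraction might start out looking something like the sketch below: a single compare-and-swap helper with a hand-written x86_64/x86 path and a per-target placeholder. The name and the compiler-builtin fallback are assumptions for illustration; this is not existing Valgrind code.

/* Hypothetical sketch of an abstracted atomic helper, in the spirit of the
 * "m_libcatomic.c" mentioned above (not existing Valgrind code).  One CAS
 * entry point; each target supplies its own assembly behind the #if. */
#include <stdbool.h>

typedef unsigned int UInt;

/* Atomically: if (*addr == expected) { *addr = newval; return true; } */
static bool atomic_cas_UInt(volatile UInt *addr, UInt expected, UInt newval)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned char ok;
    __asm__ __volatile__(
        "lock; cmpxchgl %3, %1\n\t"   /* compare *addr with EAX (= expected) */
        "sete %0"                      /* ok = (they were equal, swap done)  */
        : "=q"(ok), "+m"(*addr), "+a"(expected)
        : "r"(newval)
        : "memory", "cc");
    return ok;
#else
    /* Other targets would get their own assembly here; for the sketch, a
       compiler builtin stands in. */
    return __sync_bool_compare_and_swap(addr, expected, newval);
#endif
}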
(In reply to Julian Seward from comment #8)
> Certainly I would be interested in looking at an implementation which
> has most of the anti-starvation properties of the existing fair-sched but
> avoids most of the overhead.

Note that to my knowledge, --fair-sched is only slower when it interacts badly with cpu frequency scaling. If you set your system to a fixed maximum frequency, I think fair-sched is not slower than the pipe-based lock. According to bug 270006 comment 42, when cpu frequency scaling is disabled, fair-sched is even slightly more efficient than the pipe-based lock (at least in the measurements I did on RHEL5 at that time).

IIRC, the decision not to activate fair-sched by default was due to this very bad interaction. Note that cpu frequency scaling also interacts (somewhat) with the pipe-based lock, but not as badly as with fair-sched.