Bug 343357 - Please fix semaphore starvation when running threads-per-CPU
Status: REPORTED
Alias: None
Product: valgrind
Classification: Developer tools
Component: general
Version: 3.8.0
Platform: Other Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Julian Seward
 
Reported: 2015-01-27 02:24 UTC by Daniel Stodden
Modified: 2016-04-09 19:40 UTC



Attachments
Fix coregrind semaphore starvation when running threads-per-CPU. (2.25 KB, patch)
2015-01-27 02:24 UTC, Daniel Stodden

Description Daniel Stodden 2015-01-27 02:24:46 UTC
Created attachment 90685
Fix coregrind semaphore starvation when running threads-per-CPU.

Hello.

I recently had to debug valgrind jobs with massive starvation issues in m_scheduler.

In the end, it looked like a deadlock but is, strictly speaking, just a massive fairness issue; --fair-sched=yes (which I had not heard of) turned out to be a workaround.

Filing a bug and patch anyway, because it took a lot of tracing and reasoning about
the application's behavior to understand it.

The issue arises primarily when debugging apps based on Intel's DPDK or, more generally, event-driven frameworks which run multithreaded but with strictly one thread per CPU core, each pinned to it. Citing the patch description:

--snip--
Fix coregrind semaphore starvation when running threads-per-CPU.

Some event-driven frameworks (such as DPDK) run with only a select
number of system threads, each pinned to a unique system CPU core.

Between loop iterations, some of these (such as DPDK) won't yield the
CPU to wait for kernel events, but spin in a synchronous loop
to poll resources mapped to userspace.

In such a scenario, a running thread releases the semaphore just to
immediately re-acquire it. At least on Linux, the running thread is by
far the most likely to succeed at an immediate sema_down after
sema_up. The result is almost complete starvation of any other
runnable task.

Fixed by setting the pipe FD to non-blocking, so all tasks remain
runnable while competing in sema_down(). Failing with EAGAIN inserts a
sched_yield(), to reduce some of the resulting overhead on UP systems.

This is admittedly not an elegant or lightweight solution. A better
one will presumably have to employ a scheduling mechanism which is
fairer than pipes, or emulate one.
--snip--
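
For illustration, here is a minimal standalone sketch of the mechanism the
patch uses. This is not the attached patch itself (which modifies coregrind's
sema code); the function name and structure are made up for the example:

--snip--
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>

/* Take one token from the semaphore pipe without ever blocking in read(),
   so every contender stays runnable. */
static void sema_down_nonblocking(int readfd)
{
   char buf;
   /* In practice this would be done once, when the pipe is created. */
   fcntl(readfd, F_SETFL, fcntl(readfd, F_GETFL) | O_NONBLOCK);
   for (;;) {
      ssize_t n = read(readfd, &buf, 1);
      if (n == 1)
         return;                 /* got the token */
      if (n == -1 && errno == EAGAIN)
         sched_yield();          /* lost the race: stay runnable but polite */
      /* on EINTR, just retry */
   }
}
--snip--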

Yes, --fair-sched=yes would resolve the issue. But in the case at hand, the problem
manifests not just as degraded performance but as a complete stall of progress, unless
the user manually induces some scheduling noise (typically I can remote-control progress on/off
just by running a grep on a different console).

I believe that if pipes are the default, they should just work.
Comment 1 Julian Seward 2015-01-27 09:45:11 UTC
I'm a bit confused.  Would making fair-sched the default be an
acceptable alternative solution, or does this fix achieve something
beyond that?
Comment 2 Tom Hughes 2015-01-27 10:41:32 UTC
Also, what is the cost of this patch? The description hints at the fact that it has undesirable consequences and that the yield is an attempt to mitigate the worst of them.

Isn't the whole point of --fair-sched being a switch that is off by default that we didn't have a solution that would fix the rare problem cases without having a price that we didn't want to pay when we didn't need to? Hence putting it behind a switch so the people that do need it can turn it on.
Comment 3 Julian Seward 2015-01-27 12:32:12 UTC
(In reply to Tom Hughes from comment #2)
> Isn't the whole point of --fair-sched being a switch that is off by default
> that we didn't have a solution that would fix the rare problem cases without
> having a price that we didn't want to pay when we didn't need to?

In practice, for all serious stuff (firefox etc, which often has > 50 threads)
I use --fair-sched all the time.  I wouldn't be averse to setting it on by
default on targets that support it (Linux, Android).
Comment 4 Daniel Stodden 2015-01-27 18:02:01 UTC
> Also, what is the cost of this patch?

It's busy-polling. All threads stay awake and consume cycles; the sched_yield() just avoids wasting more cycles than necessary under contention. But the result is higher scheduling overhead and higher CPU utilization.

The ticket mainly exists because of the starvation that otherwise arises, which is frustrating
to users who are unaware of the default mechanism and its consequences.

Suggestions:
- This patch.
- Starvation detection. Is the expected time spent with the semaphore held roughly bounded?
  Then poll() the pipe and send a warning to the console when timing out; the warning
  suggests --fair-sched=yes (see the sketch after this list).
- Default to --fair-sched=yes. Not sure about negative implications.
- Default to --fair-sched=yes when running multithreaded on SMP platforms.
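
To make the detection suggestion concrete, a hypothetical sketch (the 5-second
bound and the wording of the warning are invented, and note the blocking read
after poll() can still lose the race to another thread):

--snip--
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

static void sema_down_with_warning(int readfd)
{
   struct pollfd pfd = { .fd = readfd, .events = POLLIN };
   char buf;
   for (;;) {
      /* Assume the semaphore is never legitimately held this long. */
      if (poll(&pfd, 1, 5000 /* ms */) == 0) {
         fprintf(stderr, "scheduler: possible starvation detected; "
                         "consider --fair-sched=yes\n");
         continue;
      }
      if (read(readfd, &buf, 1) == 1)
         return;   /* got the token */
   }
}
--snip--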
Comment 5 Philippe Waroquiers 2015-01-27 20:38:17 UTC
(In reply to Tom Hughes from comment #2)
> Also, what is the cost of this patch? The description hints at the fact that
> it has undesirable consequences and that the yield is an attempt to mitigate
> the worst of them.
> 
> Isn't the whole point of --fair-sched being a switch that is off by default
> that we didn't have a solution that would fix the rare problem cases without
> having a price that we didn't want to pay when we didn't need to? Hence
> putting it behind a switch so the people that do need it can turn it on.

When fair-sched was added, it was decided not to activate fair-sched by default,
because it can decrease the performance of a multi-threaded program by up to 50%,
as fair-sched interacts badly with cpu frequency scaling.

More info about that in http://valgrind.org/docs/manual/manual-core.html,
2.7.1. Scheduling and Multi-Thread Performance

Having a busy-wait loop in the Valgrind scheduler does not seem like a good idea.

It is not very clear how using poll then read will solve the problem:
I suspect all threads will be woken up when the busy thread releases the lock,
and only one of the woken threads will succeed in reading;
the others will block in read, and so we are back to square one:
either starvation, or we have to change the read to a non-blocking read.
Such a poll + non-blocking read will also probably suffer from the 'thundering herd' syndrome.

Note that the initial implementation of fair-sched suffered from this thundering herd
problem. It was solved by having an array of futexes: the ticket number was used
to decide which futex to wait on, and which futex to wake.
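
For reference, a rough standalone sketch of that futex-array scheme (this is
not the actual coregrind code, which cannot use libc and issues raw syscalls
through its own wrappers; names and the array size are illustrative):

--snip--
#include <limits.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

#define N_SLOTS 8   /* illustrative */

typedef struct {
   atomic_uint next_ticket;     /* next ticket to hand out */
   atomic_uint slot[N_SLOTS];   /* slot[t % N_SLOTS] == t grants ticket t */
} ticket_lock;

static unsigned tl_acquire(ticket_lock *l)
{
   unsigned t = atomic_fetch_add(&l->next_ticket, 1);
   atomic_uint *w = &l->slot[t % N_SLOTS];
   unsigned v;
   /* Each waiter sleeps on its own slot, so a wake-up only disturbs the
      threads parked there: no thundering herd. */
   while ((v = atomic_load(w)) != t)
      syscall(SYS_futex, w, FUTEX_WAIT, v, NULL, NULL, 0);
   return t;   /* caller passes the ticket back to tl_release */
}

static void tl_release(ticket_lock *l, unsigned t)
{
   atomic_uint *w = &l->slot[(t + 1) % N_SLOTS];
   atomic_store(w, t + 1);   /* grant the next ticket */
   /* Wake everyone on this slot; unless more than N_SLOTS threads contend,
      that is exactly the one thread holding the next ticket. */
   syscall(SYS_futex, w, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}
--snip--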
Comment 6 Philippe Waroquiers 2015-01-27 21:49:40 UTC
(In reply to Daniel Stodden from comment #4)
> > Also, what is the cost of this patch?
> 
> It's busy-polling. All threads stay awake and consume cycles, the
> sched_yield() just avoid wasting more cycles than necessary when under
> contention. But the result is higher scheduling overhead and higher CPU
> utilization.
> 
> The ticket mainly exists because of the starvation otherwise arising, which
> is frustrating to users 
> who are unaware of the default mechanism and its consequences.
> 
> Suggestions:
> - This patch.
> - A starvation detection. Is expected time spent with the semaphore held
> roughly bounded?
>   Then poll() the pipe, and send a warning to the console when timing out. 
>   Warning suggests --fair-sched=yes.
> - Default to --fair-sched=yes. Not sure about negative implications.
> - Default to --fair-sched=yes when running multithreaded on SMP platforms.

Maybe we could implement the starvation detection in the Valgrind scheduler,
in the thread that gets most/all of the CPU: every XX major scheduling events,
the running thread could look at the other threads' counters and see whether
they are making progress.
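
Something like this, perhaps (entirely hypothetical names and numbers, just
to pin the idea down):

--snip--
#include <stdio.h>

#define N_THREADS   64        /* illustrative */
#define CHECK_EVERY 100000UL  /* major scheduling events between checks */

static unsigned long progress[N_THREADS];  /* bumped whenever a thread runs */
static unsigned long snapshot[N_THREADS];  /* values at the previous check */

static void check_starvation(int me, const int runnable[])
{
   static unsigned long events;
   if (++events % CHECK_EVERY != 0)
      return;
   for (int t = 0; t < N_THREADS; t++) {
      if (t == me || !runnable[t])
         continue;
      if (progress[t] == snapshot[t])   /* runnable, yet made no progress */
         fprintf(stderr, "thread %d looks starved; "
                         "consider --fair-sched=yes\n", t);
      snapshot[t] = progress[t];
   }
}
--snip--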

The best solution is probably to have Valgrind running more than one thread
in parallel. This should give both efficiency and (I hope) reasonable fairness.

However, a 'really multi-threaded valgrind' is not that easy.
See bug 301830 (and, if you go to the FOSDEM valgrind devroom, the talk about this
mtV prototype).
Comment 7 Daniel Stodden 2015-01-28 03:54:35 UTC
> When fair-sched was added, it was decided to not activate fair-sched by default,
> because this can decrease the performance of multi-threaded program by up to 50%,
> as fair-sched interacts badly with cpu frequency scaling.

Hi again.

I just looked at the fair-sched implementation and got to re-think things.

So I understand fair-sched is indeed completely fair, a fully-featured ticket lock, and that's heavier. I'm not surprised it's not turned on by default.

Just to re-state the well-known: a regular lock may hold a populated waitqueue, but threads which happen to acquire it simply on the basis of being the one scheduled will preempt this queue. For regular locks and mutexes, less fairness contributes a lot to better throughput. Pretty much every preemptively scheduled thread package does that.

Regarding the semaphore's scope: Is my understanding correct that it effectively serializes *all* userspace execution under Valgrind?
(Meaning: the only notable exception being system calls? I think I saw those bits releasing the lock while browsing some code.)

Meaning: that one-(polling-)thread-per-CPU case I'm after is likely the most malicious workload there is, turning lock fairness under these conditions upside down. In fact, I'm not even sure a regular NPTL mutex would result in something much different from what I got with pipes.

Unfortunately it's a pattern of rising popularity for I/O-bound workloads. :P

I think I understand this better now. So here's my thesis: don't blame the pipes. I used to suspect an NPTL lock would behave better; now I think it won't. It's that mix of pinning and serialization: there are no tickets in futex() either, and no reason to schedule gratuitously for the sake of other threads.

And multithreaded Valgrind is not coming anytime soon, agreed.

Conclusion: this patch is indeed not a good fix. The much saner solution is what Philippe suggests: starvation detection / prevention for runnable tasks.

I think this means Linux builds should default to a different lock, but not primarily ticket locks. Instead, I'd suggest accepting that serialization is a) a necessary evil but b) promotes starvation. So: augment a dedicated lock with time slices to prevent it (a sketch follows below):

- Lock eagerly, preempting other waiters, for a number of rounds. That's the fast path, different from fair-sched=yes.
- But sample whether there are waiting tasks, taking time stamps.
- Too many successes make the winning thread move to the end of the run queue, in favour of the next runnable one in line.
  That's the slow but necessary ticketing case.

And I guess that could be the default algo. 

The portable way could just use a posix mutex/cond pair, and Valgrind could default to that on any platform unless it has no pthreads. Any reason to prefer raw futex()es? 
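
To pin that down, a hypothetical sketch of the slice-based lock using a posix
mutex/cond pair (all names and the threshold are invented, initialization is
omitted, and as a simplification the streak counts eager wins globally rather
than per winning thread):

--snip--
#include <pthread.h>

#define MAX_EAGER_WINS 64   /* illustrative slice length */

typedef struct {
   pthread_mutex_t mu;      /* init with PTHREAD_MUTEX_INITIALIZER */
   pthread_cond_t  cv;      /* init with PTHREAD_COND_INITIALIZER */
   int      held;
   unsigned waiters;
   unsigned streak;         /* consecutive wins over queued threads */
   int      handoff;        /* lock reserved for a queued thread */
} hybrid_lock;

void hl_acquire(hybrid_lock *l)
{
   int queued = 0;
   pthread_mutex_lock(&l->mu);
   /* Fast path: barge in whenever the lock is free, unless it has just
      been reserved for a thread that was already queued. */
   while (l->held || (l->handoff && !queued)) {
      l->waiters++;
      pthread_cond_wait(&l->cv, &l->mu);
      l->waiters--;
      queued = 1;
   }
   if (l->handoff)
      l->handoff = 0;              /* we are the queued thread being served */
   else if (!queued && l->waiters > 0)
      l->streak++;                 /* we barged in ahead of queued threads */
   l->held = 1;
   pthread_mutex_unlock(&l->mu);
}

void hl_release(hybrid_lock *l)
{
   pthread_mutex_lock(&l->mu);
   l->held = 0;
   /* Slow path: too many eager wins while others waited, so reserve the
      lock for whoever is next in the queue. */
   if (l->waiters > 0 && l->streak >= MAX_EAGER_WINS) {
      l->streak  = 0;
      l->handoff = 1;
   }
   if (l->waiters > 0)
      pthread_cond_signal(&l->cv);
   pthread_mutex_unlock(&l->mu);
}
--snip--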

If I write either one, is there any chance of getting it in?  Any better suggestions certainly welcome.
Comment 8 Julian Seward 2015-01-28 09:58:44 UTC
(In reply to Daniel Stodden from comment #7)
> I think it means Linux builds should default to a different lock, but not
> ticket locks primarily. Instead, I'd suggest to accept that serialization is
> a) necessary evil but b) promotes starvation. So -> augment a dedicated lock
> with time slices to prevent it:
> 
> - Lock eagerly, preempting other waiters, for a number of rounds. That's the
> fast path, different from fair-sched=yes
> - But sample whether there are waiting tasks, taking time stamps.
> - Too many successes make the winning thread move to the end of the run
> queue, in favour of the next runnable one in line.
>   That's the slow but necessary ticket'ing case. 
> 
> And I guess that could be the default algo. 
> 
> The portable way could just use a posix mutex/cond pair, and Valgrind could
> default to that on any platform unless it has no pthreads. Any reason to
> prefer raw futex()es? 
> 
> If I'd write either one, any chance to get it in?  Any better suggestions
> certainly welcome.

Certainly I would be interested in looking at an implementation which
has most of the anti-starvation properties of the existing fair-sched but
avoids most of the overhead.

If you can implement your suggestion within the constraints of the
code base, I'd be interested to look at it.  It sounds complex though,
and there's one important fact that complicates it: use of libc and
libpthread is not allowed.  The tool executables are completely
standalone, statically linked, and rely on no other libraries at all.

You have to do raw syscalls yourself.  Mostly these are wrapped up in
a nano-libc, m_libc*.c.  But there's no pthread_ anything, for example.

Also it needs to run on all targets -- hence any magic bits of atomic
assembly code etc will need to be abstracted out.  And we need to
retain the pipe implementation as a fallback on MacOS since MacOS
doesn't appear to have futexes.  That said, if it looks good on (eg)
x86_64, I'm sure we can muster support to do the inline assembly bits
on other targets.  It's about time we had a m_libcatomic.c.
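
(For illustration, a raw-syscall yield in coregrind style; VG_(do_syscall0)
and the vki headers are real, but treat the exact includes and usage here as
approximate rather than authoritative:)

--snip--
#include "pub_core_basics.h"
#include "pub_core_syscall.h"      /* VG_(do_syscall0) */
#include "pub_core_vkiscnums.h"    /* __NR_sched_yield */

/* No libc, so even a yield is a raw syscall through Valgrind's wrappers. */
static void yield_cpu(void)
{
   (void) VG_(do_syscall0)(__NR_sched_yield);
}
--snip--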
Comment 9 Philippe Waroquiers 2015-01-28 22:01:56 UTC
(In reply to Julian Seward from comment #8)
> Certainly I would be interested in looking at an implementation which
> has most of the anti-starvation properties of the existing fair-sched but
> avoids most of the overhead.
Note that, to my knowledge, --fair-sched is only slower when interacting
badly with cpu frequency scaling.
If you set your system to a fixed maximum frequency, I think fair-sched is not
slower than pipe. According to bug 270006 comment 42, when cpu freq
scaling is disabled, fair-sched is even slightly more efficient than
the pipe-based lock (at least in the measurements I did on RHEL5 at the time).

IIRC, the decision not to activate fair-sched by default was due to this
very bad interaction.

Note that cpu freq scaling also interacts (somewhat) with the pipe-based lock,
but not as badly as with fair-sched.