Bug 270006 - Valgrind scheduler unfair
Summary: Valgrind scheduler unfair
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: general
Version: 3.7 SVN
Platform: Unlisted Binaries Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Bart Van Assche
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-04-03 16:07 UTC by Philippe Waroquiers
Modified: 2011-12-08 17:00 UTC (History)
6 users

See Also:
Latest Commit:
Version Fixed In:


Attachments
sleepers.c reproducing (un-)fairness (3.88 KB, text/plain)
2011-04-03 16:07 UTC, Philippe Waroquiers
Patch that makes Valgrind scheduling fair (9.00 KB, patch)
2011-07-11 18:09 UTC, Bart Van Assche
sleepers.c reproducing (un-)fairness (3.24 KB, text/plain)
2011-07-12 08:51 UTC, Bart Van Assche
Patch that makes Valgrind scheduling fair (10.84 KB, patch)
2011-07-13 18:54 UTC, Bart Van Assche
Patch that makes Valgrind scheduling fair (11.55 KB, patch)
2011-07-18 16:56 UTC, Bart Van Assche
measurement raw data (read the next attachment for some explanation) (194.71 KB, application/octet-stream)
2011-07-28 00:03 UTC, Philippe Waroquiers
shell script run on RHEL5.5 to obtain raw measurements (1.03 KB, application/octet-stream)
2011-07-28 00:04 UTC, Philippe Waroquiers
slightly modified sleepers.c test program (3.26 KB, text/plain)
2011-07-28 00:04 UTC, Philippe Waroquiers
Patch that makes Valgrind scheduling fair (11.15 KB, patch)
2011-07-28 12:55 UTC, Bart Van Assche
Patch that makes Valgrind scheduling fair (16.35 KB, patch)
2011-11-09 17:51 UTC, Bart Van Assche
Patch that makes Valgrind scheduling fair (29.78 KB, patch)
2011-11-10 19:15 UTC, Bart Van Assche
Patch that makes Valgrind scheduling fair (30.39 KB, patch)
2011-11-11 15:04 UTC, Bart Van Assche
Patch that makes Valgrind scheduling fair (33.03 KB, patch)
2011-12-05 18:41 UTC, Bart Van Assche
Patch that makes Valgrind scheduling fair (32.92 KB, patch)
2011-12-05 21:06 UTC, Bart Van Assche
Patch that makes Valgrind scheduling fair (32.92 KB, patch)
2011-12-06 20:05 UTC, Bart Van Assche

Description Philippe Waroquiers 2011-04-03 16:07:09 UTC
Created attachment 58543 [details]
sleepers.c reproducing (un-)fairness

A gdbserver regression test was failing on some Linux systems and working on
others. After investigation, I believe the cause is poor fairness of the
Valgrind scheduler. The unfairness disappears when setting the Valgrind
affinity to one single CPU.

The below was tested on the GCC compile farm machine gcc10 (a 24-core Debian amd64 box).

Running the attached program natively gives:
./sleepers 1 0 1000000000 B-B-B-B-
loops/sleep ms/burn/threads_spec:  1 0 1000000000 B-B-B-B-
Brussels ready to sleep and/or burn
London ready to sleep and/or burn
Petaouchnok ready to sleep and/or burn
main ready to sleep and/or burn
main finished to sleep and/or burn
Brussels finished to sleep and/or burn
Petaouchnok finished to sleep and/or burn
London finished to sleep and/or burn

Running it under Valgrind (tested with the standard installed 3.3.1 and 3.7.0.SVN) gives the output below. Some threads run to completion before the
other threads get a chance to run.
vg-in-place --tool=none ./sleepers 1 0 1000000000 B-B-B-B-
==16378== Nulgrind, the minimal Valgrind tool
==16378== Copyright (C) 2002-2010, and GNU GPL'd, by Nicholas Nethercote.
==16378== Using Valgrind-3.7.0.SVN and LibVEX; rerun with -h for copyright info
==16378== Command: ./sleepers 1 0 1000000000 B-B-B-B-
==16378== 
loops/sleep ms/burn/threads_spec:  1 0 1000000000 B-B-B-B-
Brussels ready to sleep and/or burn
Brussels finished to sleep and/or burn
London ready to sleep and/or burn
London finished to sleep and/or burn
Petaouchnok ready to sleep and/or burn
main ready to sleep and/or burn
main finished to sleep and/or burn
Petaouchnok finished to sleep and/or burn
==16378== 

Running with affinity on one single cpu re-introduces fairness:
taskset 1 vg-in-place --tool=none ./sleepers 1 0 1000000000 B-B-B-B-
==29679== Nulgrind, the minimal Valgrind tool
==29679== Copyright (C) 2002-2010, and GNU GPL'd, by Nicholas Nethercote.
==29679== Using Valgrind-3.7.0.SVN and LibVEX; rerun with -h for copyright info
==29679== Command: ./sleepers 1 0 1000000000 B-B-B-B-
==29679== 
loops/sleep ms/burn/threads_spec:  1 0 1000000000 B-B-B-B-
Brussels ready to sleep and/or burn
London ready to sleep and/or burn
Petaouchnok ready to sleep and/or burn
main ready to sleep and/or burn
London finished to sleep and/or burn
Petaouchnok finished to sleep and/or burn
Brussels finished to sleep and/or burn
main finished to sleep and/or burn
==29679== 

From various trials, it seems the more CPUs there are, the less fair the
Valgrind scheduler is.
I suppose this is because sched_yield and/or a read on a pipe do not
guarantee any fairness on a multi-CPU system with plenty of CPUs ready
to run: why would the kernel not directly re-run the thread that has
just called sched_yield?  If it is the only thread in that CPU's run
queue, it will run again directly, "re-take the lock" on the pipe, and
proceed.
By setting the affinity of the process to one single CPU, there is more
than one thread in the queue, and the yielding thread goes to the back,
letting the other threads run.
Comment 1 Christian Borntraeger 2011-04-04 09:44:09 UTC
On 03.04.2011 16:07, Philippe Waroquiers wrote:
> From various trials, it seems the more cpu there is, the less Valgrind
> scheduler
> is fair.
> I suppose that this is because the sched_yield and/or read on pipe
> do not guarantee any fairness on a multi_cpu system with plenty of
> cpu ready to run (effectively, why would the kernel not re-run directly
> the thread which has done sched_yield: if this is the only thread in the
> waiting queue, it will run directly, "re-reading the lock" on the pipe
> and proceed.
> By setting the affinity of the process to one single cpu, there is more than
> one thread in the queue, and the yielding thread goes at the end, letting
> other threads running.
> 

For completeness, I have also seen scheduler unfairness on a big s390 system.
My understanding was that an interprocessor signal is often slower than a pipe
read, giving an extra bonus to the old thread.
The problem was not that critical, but a real round robin scheduler might be
necessary in the future. (We could also change valgrind to be multi-threaded,
but this looks like a herculean task)

Christian
Comment 2 Julian Seward 2011-04-04 10:38:30 UTC
I'm not surprised to hear of these reports.

Just to be clear about this, Valgrind doesn't schedule the threads
at all.  (It used to in version 1.x, which is why the code is
called m_scheduler/scheduler.c, but not any more -- definitely not
in the last 6 years.)

Now, Valgrind merely implements locking, using a pipe, as you both
have seen.  The idea is to let the kernel decide which thread runs
next.  This is critical, because only the kernel knows for sure which
threads are blocked in system calls and which are runnable.  If
Valgrind made that choice, we would have no 100% reliable way to know
which threads are blocked in syscalls, and so we would risk deadlock
if we chose one of them to run.  It also makes handling of signals
simpler.
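The pipe-based locking described here can be sketched roughly as follows (a hypothetical simplification, not the actual coregrind code; the BigLock name and the one-byte protocol are assumptions based on this comment):

```c
/* Minimal sketch of a pipe-based "big lock" (hypothetical simplification,
 * not the actual coregrind code). Acquire = read one byte from the pipe;
 * release = write one byte back. Whichever blocked reader the kernel
 * resumes gets the lock, so the kernel makes the final decision. */
#include <assert.h>
#include <unistd.h>

typedef struct { int fd[2]; } BigLock;

static void biglock_init(BigLock *l) {
    char c = 'L';
    int rc = pipe(l->fd);
    assert(rc == 0);
    ssize_t n = write(l->fd[1], &c, 1);   /* lock starts out free */
    assert(n == 1);
}

static void biglock_acquire(BigLock *l) {
    char c;
    ssize_t n = read(l->fd[0], &c, 1);    /* blocks until the byte arrives */
    assert(n == 1);
}

static void biglock_release(BigLock *l) {
    char c = 'L';
    ssize_t n = write(l->fd[1], &c, 1);   /* kernel picks the next reader */
    assert(n == 1);
}
```

The key property: a blocked acquirer is a thread blocked in a read() syscall, so only the kernel ever has to decide which thread to resume.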

--------

That said ...

Philippe, thanks for the demo about fairness.  I have often thought
that it might be unfair, but never investigated.

For various reasons it would be nice to implement something fairer,
maybe a round robin.  For one thing it would make Helgrind (and, I
presume, DRD) behave more repeatably, since they are both
scheduling-sensitive.

The difficulty is that I never figured out a way to do this that
preserves the property listed above (the kernel makes the final
decision).

Any suggestions, I would be happy to hear.  Really, we want to be able
to say, when a thread comes to the end of its timeslice (or more
generally when we drop the BigLock), that this thread should now go to
the back of the scheduling queue for this process.  As you've probably
seen in the sources, I did some experimentation with calling
sched_yield, but I never investigated enough to find out if that
helps, or whether it helps reliably.  Maybe it is worth chasing?

Sorry if this is long and unfocussed.  /me needs more coffee.
Comment 3 Philippe Waroquiers 2011-04-05 00:31:11 UTC
(In reply to comment #2)
> Any suggestions, I would be happy to hear.  Really, we want to be able
> to say, when a thread comes to the end of its timeslice (or more
> generally when we drop the BigLock), that this thread should now go to
> the back of the scheduling queue for this process.  As you've probably

I believe the problem here is that the scheduling queue maintained by
the Linux kernel scheduler is not a queue per process, but rather a
queue per CPU (with some logic to "move" processes from one CPU
to another).
If, on an SMP machine, many CPUs are available, then each thread
will run on its own CPU.  Calling sched_yield will have no effect: the
thread will simply re-run directly.

The only workaround I currently have is to ensure that the process
being run is "locked" onto one single CPU. Then all the threads of the
process are on a single CPU, and fairness re-appears (I believe because
sched_yield puts the calling thread at the end of the Linux scheduler
queue for that CPU, behind the other threads).
The other threads then have a chance to acquire the lock and run.

Whenever I have some more time, I will dig into the details of the
effect of sched_yield on multiple CPUs, but I am not optimistic.

There are certainly ways to enforce a specific scheduling order, but
that implies using real-time scheduling priorities on Linux, which
does not look appropriate for a process running under Valgrind.
Comment 4 Julian Seward 2011-04-05 00:37:45 UTC
(In reply to comment #3)
> The only current bypass I have is to ensure the process being
> run is "locked" on one single cpu. Then all the threads of the process
> are on one single cpu, and then fairness re-appears (I believe because
> sched_yield will put the calling thread at the end of the linux scheduler
> queue on this cpu, behind the other threads).
> The other threads have then a chance to acquire the lock and run.

OK.  So it sounds like a combination of

* at startup, we make a sched_setaffinity request to the kernel
  (I assume there is a system call to do this) saying that we want to
  run on only one CPU

* use sched_yield whenever we drop the BigLock

might work?
Comment 5 Philippe Waroquiers 2011-04-05 00:52:57 UTC
(In reply to comment #4)

> OK.  So it sounds like a combination of
> 
> * at startup, we do a schedsetaffinity request to the kernel
>   (I assume there is a system call to do this) that we want to
>   run on only one CPU
> 
> * use sched_yield whenever we drop the BigLock
> 
> might work?
I first tried with the shell command line taskset 1 vg-in-place ...
which has re-introduced fairness.
Then I have put inside the test program the following piece of code:
// We will lock ourselves onto one single CPU.
// This bypasses the unfairness of the Valgrind scheduler
// when a multi-CPU machine has enough CPUs to run all the
// threads wanting to burn CPU.
#define _GNU_SOURCE   /* must precede the includes, for CPU_SET etc. */
#include <sched.h>

static void setaffinity(void)
{
   cpu_set_t single_cpu;
   CPU_ZERO(&single_cpu);
   CPU_SET(1, &single_cpu);
   (void) sched_setaffinity(0, sizeof(single_cpu), &single_cpu);
}

Both tricks work (in effect, they do the same thing).
However, this is somewhat ugly: e.g. if you have a big computer
with plenty of CPUs, all the Valgrind processes will run on CPU nr 1.

To apply this trick as part of Valgrind, we could randomly choose
a CPU from the list of available CPUs so as to distribute the load of
the Valgrind processes somewhat.

(At work, we are starting a system with many processes under Valgrind.
Hardcoding Valgrind to run on CPU nr 1 does not look appropriate.)
Comment 6 Jeremy Fitzhardinge 2011-04-05 01:15:04 UTC
The trouble with setting the affinity is that it binds all the tasks to a specific cpu, which could get overloaded with valgrind processes if they all bind to the same one.

What you really want is a constraint to make all the tasks run (mostly) on the same cpu, but without really constraining which CPU is used.  Unfortunately, I don't think there's a useful mechanism for doing that.

The use of the pipe could be confusing the scheduler, because it tries to run the two ends of the pipe in parallel.  It would be better to use a futex where available, though I don't know what effect that would have on the scheduling decisions.

sched_yield() is very deprecated on Linux, and I wouldn't recommend using it.
Comment 7 Julian Seward 2011-04-05 01:47:27 UTC
(In reply to comment #6)

Greetings.

> The trouble with setting the affinity is that it binds all the tasks to a
> specific cpu, which could get overloaded with valgrind processes if they all
> bind to the same one.

Is it possible to change the affinity repeatedly as the program runs?
I am thinking of some nasty kludge where, mostly, affinity is enabled,
but is occasionally disabled, in the hope of allowing long-running
valgrind processes to migrate away from very busy CPUs.

Or .. it could be command line controlled?  I don't care at all about
fairness when running Memcheck, but for Helgrind I do.

> the two ends of the pipe in parallel.  It would be better to use a futex where
> available, though I don't know what effect that would have on the scheduling
> decisions.

I'd be willing to give that a try.  If you happen to know off the top
of your head, any chance you can sketch the calls to sys_futex needed
to declare/initialise/acquire/release/destroy such a lock?
Comment 8 Jeremy Fitzhardinge 2011-04-05 02:45:14 UTC
On 04/04/2011 04:48 PM, Julian Seward wrote:
> Is it possible to change the affinity repeatedly as the program runs?
> I am thinking of some nasty kludge where, mostly, affinity is enabled,
> but is occasionally disabled, in the hope of allowing long-running
> valgrind processes to migrate away from very busy CPUs.

You could disable it to allow migration, but how would you arrange for all
the threads to migrate to the same other CPU?

> Or .. it could be command line controlled?  I don't care at all about
> fairness when running Memcheck, but for Helgrind I do.

"taskset" lets you start a process with a given CPU affinity, so that
would be useful for benchmarking what effect it actually has.

>> the two ends of the pipe in parallel.  It would be better to use a futex where
>> available, though I don't know what effect that would have on the scheduling
>> decisions.
> I'd be willing to give that a try.  If you happen to know off the top
> of your head, any chance you can sketch the calls to sys_futex needed
> to declare/initialise/acquire/release/destroy such a lock?

http://people.redhat.com/drepper/futex.pdf is a pretty good
introduction.  You could use a futex to implement something like a
ticket lock to get guaranteed FIFO scheduling of the threads.

    J
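A futex-based ticket lock of the kind suggested here might look roughly like this (a minimal sketch assuming Linux, GCC atomic builtins, and the raw futex syscall; it is not the patch attached later in this report):

```c
/* Hypothetical sketch of a futex-based ticket lock giving FIFO ordering
 * (not the actual Valgrind patch). Each acquirer takes a ticket; tickets
 * are served in order, so the lock is granted first-come, first-served. */
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <limits.h>

typedef struct {
    volatile unsigned head;   /* ticket currently being served */
    volatile unsigned tail;   /* next ticket to hand out */
} TicketLock;

static void futex_wait(volatile unsigned *addr, unsigned val) {
    /* Sleeps only if *addr still equals val (returns EAGAIN otherwise). */
    syscall(SYS_futex, addr, FUTEX_WAIT, val, NULL, NULL, 0);
}

static void futex_wake_all(volatile unsigned *addr) {
    syscall(SYS_futex, addr, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}

static void ticketlock_acquire(TicketLock *l) {
    unsigned ticket = __sync_fetch_and_add(&l->tail, 1);
    while (l->head != ticket)
        futex_wait(&l->head, l->head);  /* sleep until it may be our turn */
}

static void ticketlock_release(TicketLock *l) {
    __sync_fetch_and_add(&l->head, 1);
    futex_wake_all(&l->head);   /* wakes every waiter; only one proceeds */
}
```

Fairness comes from serving tickets strictly in order; the obvious cost is that each release wakes every waiter at once, and all but one go back to sleep.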
Comment 9 Philippe Waroquiers 2011-04-05 07:31:45 UTC
(In reply to comment #8)
> You could disable it to allow migration, but how would arrange for all
> the threads to migrate to the same other CPU?
sched_setaffinity(0, ...)
will set the affinity for all the threads of the process, and so the
threads will all move to a single CPU if a singleton cpuset is specified.
So the nasty kludge suggested by Julian can be done by calling
sched_setaffinity from time to time with the complete set of available
CPUs (to let the Linux scheduler move the threads to a non-busy CPU),
and then calling sched_setaffinity again with the one single CPU.

But for sure, explicit fairness implemented e.g. with a futex looks
saner and more predictable than a set of (random?) sched_setaffinity
calls.
Comment 10 Julian Seward 2011-04-05 17:25:03 UTC
(In reply to comment #8)
> introduction.  You could use a futex to implement something like a
> ticket lock to get guaranteed FIFO scheduling of the threads.

That's interesting -- I never heard of a ticket lock before.

The ability to enforce FIFO is something I'd really like to have.  If
anyone can come up with a patch that works, I'd be interested to see
it.
Comment 11 Bart Van Assche 2011-07-11 18:09:14 UTC
Created attachment 61782 [details]
Patch that makes Valgrind scheduling fair

Can anyone who has access to a multi-core Linux system test the attached patch for me? The patch should make Valgrind scheduling entirely fair. Review comments are welcome too.

Notes:
- The attached patch has been developed on a Linux system but has not yet been ported to Darwin.
- On my test setup regression test output was unaffected.
Comment 12 Tom Hughes 2011-07-12 07:55:35 UTC
I'm happy to test it, but I need a test case first...

I tried the program from the first comment but I get the same behaviour with and without (an unmodified) valgrind. Namely this:

bristol [~/vgtest] % ./sleepers 1 0 1000000000 B-B-B-B-
loops/sleep_ms/burn/threads_spec:  1 0 1000000000 B-B-B-B-
Brussels ready to sleep and/or burn
Brussels finished to sleep and/or burn
London ready to sleep and/or burn
London finished to sleep and/or burn
Petaouchnok ready to sleep and/or burn
Petaouchnok finished to sleep and/or burn
main ready to sleep and/or burn
main finished to sleep and/or burn

and this:

bristol [~/vgtest] % /tmp/valgrind/bin/valgrind --tool=none --quiet ./sleepers 1 0 1000000000 B-B-B-B-
loops/sleep_ms/burn/threads_spec:  1 0 1000000000 B-B-B-B-
Brussels ready to sleep and/or burn
Brussels finished to sleep and/or burn
London ready to sleep and/or burn
London finished to sleep and/or burn
Petaouchnok ready to sleep and/or burn
Petaouchnok finished to sleep and/or burn
main ready to sleep and/or burn
main finished to sleep and/or burn

This is on a six core machine (with HT so twelve threads in all).
Comment 13 Bart Van Assche 2011-07-12 08:51:56 UTC
Created attachment 61795 [details]
sleepers.c reproducing (un-)fairness

The second version of sleepers.c should align the "burn" loop start times better. If this change by itself is not yet sufficient, please try to increase the loop count on the command line.
Comment 14 Tom Hughes 2011-07-12 09:21:34 UTC
Well this is what I get with the new version. Without valgrind:

loops/sleep_ms/burn/threads_spec:  1 0 1000000000 B-B-B-B-
[1310462109.584349] main ready to sleep and/or burn
[1310462109.584358] main finished to sleep and/or burn
[1310462109.584351] London ready to sleep and/or burn
[1310462109.584352] Petaouchnok ready to sleep and/or burn
[1310462109.584371] Brussels ready to sleep and/or burn
[1310462109.584399] Brussels finished to sleep and/or burn
[1310462109.584376] London finished to sleep and/or burn
[1310462109.584387] Petaouchnok finished to sleep and/or burn

With unpatched valgrind:

loops/sleep_ms/burn/threads_spec:  1 0 1000000000 B-B-B-B-
[1310462112.285960] main ready to sleep and/or burn
[1310462112.289425] main finished to sleep and/or burn
[1310462112.290703] Petaouchnok ready to sleep and/or burn
[1310462112.290750] Petaouchnok finished to sleep and/or burn
[1310462112.288432] Brussels ready to sleep and/or burn
[1310462112.293177] Brussels finished to sleep and/or burn
[1310462112.288122] London ready to sleep and/or burn
[1310462112.293283] London finished to sleep and/or burn

With patched valgrind:

loops/sleep_ms/burn/threads_spec:  1 0 1000000000 B-B-B-B-
[1310462390.421093] Petaouchnok ready to sleep and/or burn
[1310462390.424389] Petaouchnok finished to sleep and/or burn
[1310462390.423341] London ready to sleep and/or burn
[1310462390.427027] London finished to sleep and/or burn
[1310462390.423415] main ready to sleep and/or burn
[1310462390.427160] main finished to sleep and/or burn
[1310462390.423200] Brussels ready to sleep and/or burn
[1310462390.428220] Brussels finished to sleep and/or burn

Is that good?
Comment 15 Julian Seward 2011-07-12 09:29:54 UTC
(In reply to comment #11)
> Created an attachment (id=61782) [details]
> Patch that makes Valgrind scheduling fair

I read though the core implementation (ML_(acquire_ticketlock),
ML_(release_ticketlock)).  A question: how does it preserve the
property that the kernel makes the final decision about which thread
to run, and at the same time encourage the kernel to be more fair in
choosing which one to run?  Can you give a summary of how you're using
sys_futex and the head/tail values in the_BigLock to achieve that?
Comment 16 Julian Seward 2011-07-12 09:33:13 UTC
(In reply to comment #15)
> A question: how does it preserve the
> property that the kernel makes the final decision about which thread
> to run

Ah, no, that's the wrong question.  The right question is: how does
it preserve the property that only the kernel knows for sure which
threads are blocked in syscalls and which are runnable?
Comment 17 Bart Van Assche 2011-07-12 09:38:42 UTC
(In reply to comment #14)
> Is that good?

Although I'm not sure that the results of a single run are statistically significant, what I see is:
- With unpatched Valgrind, the difference between the lowest and highest thread start times is 4.743 ms.
- With patched Valgrind, the difference between the lowest and highest thread start times is 2.322 ms.

So the thread start time delta has been reduced significantly, which is good and is what we are aiming for.
Comment 18 Bart Van Assche 2011-07-12 09:44:26 UTC
(In reply to comment #16)
> The right question is: how does
> it preserve the property that only the kernel knows for sure which
> threads are blocked in syscalls and which are runnable?

The attached patch shouldn't change anything in this area. The Valgrind
scheduler still holds the "big lock" as long as it is running client code,
still releases the big lock before invoking a system call, and still
reacquires the big lock after returning from a system call. What the
attached patch changes, though, is that when there is contention on the
big lock, e.g. because multiple threads are running computations, thread
scheduling happens in a fair way.
Comment 19 Julian Seward 2011-07-12 10:19:51 UTC
(In reply to comment #18)
> the big lock after returning from a system call. What the attached patch
> changes though is that if there is contention on the big lock, e.g. because
> multiple threads are running computations, that thread scheduling happens in
> fair way.

Ok, good, but how does it ensure fairness?  I see that t->head and
t->tail are used to make a queue of threads wanting to acquire the
lock.  And it is implied that a thread releasing the lock can use
sys_futex(VKI_FUTEX_WAIT, ...) to cause the thread at the head of the
queue (with the lowest ticket number) to proceed.  What I can't quite
see, from reading the sys_futex man page, is how you can use futex to
signal to one specific thread.
Comment 20 Bart Van Assche 2011-07-12 15:55:47 UTC
(In reply to comment #19)
> Ok, good, but how does it ensure fairness?  I see that t->head and
> t->tail are used to make a queue of threads wanting to acquire the
> lock.  And it is implied that a thread releasing the lock can use
> sys_futex(VKI_FUTEX_WAIT, ...) to cause the thread at the head of the
> queue (with the lowest ticket number) to proceed.  What I can't quite
> see, from reading the sys_futex man page, is how you can use futex to
> signal to one specific thread.

The FUTEX_WAKE call used in the ticket lock implementation wakes up all threads waiting on the futex. Fairness is ensured by granting the lock to the thread that started waiting first on the ticket lock. All threads that were woken up needlessly will continue waiting.
Comment 21 Julian Seward 2011-07-12 17:55:32 UTC
> waiting on the futex. Fairness is ensured by granting the lock to the thread
> that started waiting first on the ticket lock. All threads that were woken up
> needlessly will continue waiting.

Ah, so that's what the while loop in ML_(acquire_ticketlock) is for, right?

I guess this means it suffers from the "thundering herd" problem.
Comment 22 Bart Van Assche 2011-07-12 18:00:31 UTC
(In reply to comment #21)
> Ah, so that's what the while loop in ML_(acquire_ticketlock) is for, right?
> 
> I guess this means it suffers from the "thundering herd" problem.

Isn't that already the case with the current semaphore implementation? Doesn't writing to a pipe wake up all threads waiting to read from that pipe?

Anyway, I'll have a look at addressing this.
Comment 23 Philippe Waroquiers 2011-07-12 19:24:09 UTC
(In reply to comment #22)
> > I guess this means it suffers from the "thundering herd" problem.
> Isn't that already the case with the current semaphore implementation ? 
> Doesn't writing to a pipe wake up all threads waiting to read from that pipe ?
> 
> Anyway, I'll have a look at addressing this.
Some quick feedback:

* On a few runs on a 4-core system, the patch seems to increase fairness,
  but the worst fairness was seen when the computer has more cores than
  the number of threads to run, i.e. when there are always enough free
  cores to run every thread that wants to run. Otherwise, I think the
  Linux scheduler ensures some fairness.

* On RHEL4, the patch compiles and links in 64 bits, but fails with
  undefined symbols when compiled in 32 bits:
    ticketlock-linux.c:61: undefined reference to `__sync_fetch_and_add_4'

* When running the sleepers test with --tool=none, the patch seems
  significantly slower (about 50% slower). Maybe the thundering herd?
  Or maybe a futex is slower than a pipe write + read? (Or, of course,
  maybe I took a wrong measurement?)

* Some small details about the patch:
  - priv_ticketlock.h declares head and tail as unsigned, but the
    implementation uses UInt in the .c file.
  - The assert verifying that the sizes of tail and head are 4 could be
    done in the init phase, rather than each time the lock is taken.
Comment 24 Bart Van Assche 2011-07-12 19:35:21 UTC
(In reply to comment #23)
> Some quick feedback:
> 
> * on a few runs on a 4-core systems, the patch seems to increase the fairness

That's good !
 
> * on RHEL4, the patch compiles and links in 64 bits, but fails with unknown
>   symbols when compiled in 32 bits: 
>     ticketlock-linux.c:61: undefined reference to `__sync_fetch_and_add_4'

Not good. We might be hit by this (http://gcc.gnu.org/onlinedocs/gcc-4.6.1/gcc/Atomic-Builtins.html): "Not all operations are supported by all target processors. If a particular operation cannot be implemented on the target processor, a warning will be generated and a call to an external function will be generated. The external function will carry the same name as the builtin, with an additional suffix `_n' where n is the size of the data type."

> * when running the sleepers with tool=none, the patch seems significantly
>   slower (about 50% slower). Maybe the thundering herd ? Or maybe futex
>   is slower than pipe write + read ? (or of course, maybe me took wrong
>   measurement ?).

Can you please verify whether that result is reproducible? A futex system call should take less time than a pipe write + read. Moreover, with the ticket lock implementation, no system call at all is issued by the waiter in the uncontended case.

> * some small details about the patch:
>   - priv_ticketlock.h defines head and tail as unsigned, but the
>     implementation uses UInt in the c file.
>   - the assert verifying that the size of tail and head are 4 could be
>     done in the init phase, rather than each time the lock is taken.

Thanks, will take this into account in v2.
Comment 25 Philippe Waroquiers 2011-07-12 19:53:19 UTC
(In reply to comment #24)

> Can you please verify whether that result is reproducible ? A futex system call
> should take less time than a pipe write + read. Even more, with the ticket lock
> implementation in the uncontended case no system call is issued at all by the
> waiter.

I have re-done the measurement (each run 3 times).
Without the patch, running the sleepers test takes about 35 seconds.
With the patch, the same command takes about 45 to 48 seconds
(there is more variation with the futex patch: twice 48 seconds, once
45 seconds; there is almost no variation with the pipe version).
Comment 26 Julian Seward 2011-07-12 23:11:39 UTC
(In reply to comment #25)
> I have re-done the measurement (run 3 times each)
> Without the patch, running the sleepers takes about 35 seconds.
> With the patch, the same command takes about 45 seconds to 48 seconds.

I wonder if this could (partly?) be the result of poorer spatial and
temporal cache locality that is caused (unavoidably) by the strict
round-robin scheduling.  Philippe, if you re-run the experiment but
use cachegrind instead of none, can you see any interesting variations
in the cache miss numbers?
Comment 27 Julian Seward 2011-07-12 23:24:44 UTC
(In reply to comment #22)
> > I guess this means it suffers from the "thundering herd" problem.
> 
> Isn't that already the case with the current semaphore implementation ? Doesn't
> writing to a pipe wake up all threads waiting to read from that pipe?

I don't know how the kernel handles this case, but I can imagine that
the kernel can simply decide which thread is going to win the race to
read from the pipe, and resume only that thread.  At least on the 
user-space (Valgrind) side, each thread tries to read one byte from
the pipe, and if it succeeds then it owns the lock.  There is no 
loop around multiple attempts to read from the pipe.

Anyway, for the ticketlock, we might be able to work around it by using a
small set (eg, 17) futexes instead of just one.  When a thread waits
for ticket number N, it waits for futex_array[N % 17], and when a
thread signals for ticket number M, it signals on futex_array[M % 17].
This doesn't completely remove the problem, but it does reduce it by a
factor of (eg) 17.
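That multi-futex idea could be sketched as follows (hypothetical; the struct layout and the "bump the word, then wake" protocol are assumptions, not code from any attached patch):

```c
/* Sketch of hashing tickets onto a small array of futexes to reduce the
 * thundering herd (hypothetical, not from the patch). A waiter for ticket
 * N sleeps on slot N % N_FUTEXES; a release bumps and wakes only the slot
 * of the next ticket, so only ~1/N_FUTEXES of the waiters wake up. */
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <limits.h>

#define N_FUTEXES 17

typedef struct {
    volatile unsigned head;              /* ticket currently being served */
    volatile unsigned tail;              /* next ticket to hand out */
    volatile unsigned slot[N_FUTEXES];   /* one futex word per hash slot */
} HashedTicketLock;

static long sys_futex(volatile unsigned *addr, int op, unsigned val) {
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

static void htl_acquire(HashedTicketLock *l) {
    unsigned ticket = __sync_fetch_and_add(&l->tail, 1);
    volatile unsigned *w = &l->slot[ticket % N_FUTEXES];
    while (l->head != ticket)
        sys_futex(w, FUTEX_WAIT, *w);    /* sleep until our slot is bumped */
}

static void htl_release(HashedTicketLock *l) {
    unsigned next = __sync_add_and_fetch(&l->head, 1);
    volatile unsigned *w = &l->slot[next % N_FUTEXES];
    __sync_fetch_and_add(w, 1);          /* change the word: no lost wakeups */
    sys_futex(w, FUTEX_WAKE, INT_MAX);   /* wake only this slot's waiters */
}
```

Bumping the slot word before waking matters: a waiter that checked head just before the release will see the changed word and FUTEX_WAIT returns immediately instead of sleeping past the wakeup.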
Comment 28 Philippe Waroquiers 2011-07-13 09:35:38 UTC
(In reply to comment #26)

> I wonder if this could (partly?) be the result of poorer spatial and
> temporal cache locality that is caused (unavoidably) by the strict
> round-robin scheduling.  Philippe, if you re-run the experiment but
> use cachegrind instead of none, can you see any interesting variants
> in cache miss numbers?
With cachegrind, there are only irrelevant (i.e. < 10) differences in the
cache miss numbers between the "futex" and the "pipe" scheduler.
It is also confirmed that futex is slower (depending on the tool, between
30 and 50% more CPU is needed to run sleepers).

But the fairness is for sure much better, see below the difference in
the termination time with the "pipe" version, compared to the difference
in termination time with the "futex" version:

pipe:
[1310548450.699138] Brussels finished to sleep and/or burn
[1310548474.973137] Petaouchnok finished to sleep and/or burn
[1310548484.868193] London finished to sleep and/or burn
[1310548489.757102] main finished to sleep and/or burn
(I am proud Brussels finished first :).

futex:
[1310549573.337527] main finished to sleep and/or burn
[1310549573.595657] London finished to sleep and/or burn
[1310549573.772526] Brussels finished to sleep and/or burn
[1310549573.798663] Petaouchnok finished to sleep and/or burn
but when introducing fairness, Brussels finishes just before Petaouchnok :(.
Comment 29 Bart Van Assche 2011-07-13 18:54:54 UTC
Created attachment 61851 [details]
Patch that makes Valgrind scheduling fair

Attached is version two of the patch that makes Valgrind scheduling fair. The performance issue should be resolved: make -s regtest now completes in the same time with and without the patch on my setup. Expanded the number of futexes from 1 to 16. Addressed Philippe's review comments about unsigned vs. UInt and vg_assert().
Comment 30 Philippe Waroquiers 2011-07-13 20:35:38 UTC
(In reply to comment #29)

(note that the link problem in 32 bits/x86 is still there).

> Attached version two of the patch that makes Valgrind scheduling fair. The
> performance issue should be resolved: make -s regtest completes now in the same
> time with and without the patch on my setup. Expanded the number of futexes
> from 1 to 16. Addressed Philippe's review comments about unsigned <> UInt and
> vg_assert().

It looks like the thundering herd and/or performance problem is not solved.

What I see with the new version of the patch is that there are big
variations in the CPU time needed to finish: the fastest run took about
32 seconds of CPU, the slowest around 50 seconds (always using around 99% of a CPU).
This is on a RHEL5.5, 4 cores Q9650  @ 3.00GHz.

From what I can see, the more idle the computer is (i.e. no other activity),
the more CPU the test needs and the longer it runs.
When I run two tests in parallel (i.e. 2 Valgrinds, each running 4 threads),
both tests finish earlier and with less CPU than if they are run one after
the other.
I tried multiple times to run
 for i in 1 2 3; do time ../vg-in-place --tool=none ./sleepers 1 0 2222222222222 B-B-B-B- & done

If I keep the &, they all run in parallel, each in about 32seconds.
If I remove the &, they each run in sequence, each taking 43 to 48 seconds.

I have some difficulty interpreting this; the best explanation I have is:
when there is always a core available to run each of the 4 threads, there is
a high level of contention, and this contention causes threads to be
rescheduled without doing much useful work.

The paper by Ulrich Drepper (see comment #8) mentions some bad cache
line effects when there are 4 cores or more, and explains how to handle
that. Maybe a similar problem, and therefore a similar solution, might
apply here?
Comment 31 Jeremy Fitzhardinge 2011-07-13 21:16:50 UTC
The thundering herd is going to be a big performance problem in a system with a large number of threads.

As an alternative to using a ticket lock, you could just have per-thread futexes, and link them onto an explicit FIFO list structure.  Then the release operation could explicitly WAKE the next thread's futex.

It'll be a bit fiddly managing the list, but it can probably be all done with compare-and-swap.
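The scheme described here (one futex per waiter, an explicit FIFO maintained with compare-and-swap, and the releaser waking only its successor) is essentially an MCS-style queue lock. A hedged sketch, assuming Linux and the GCC builtins, with illustrative names only:

```c
/* Sketch of the per-thread-futex FIFO idea: an MCS-style queue lock in
   which each waiter sleeps on its own futex word and the releaser wakes
   exactly one successor. Illustrative only. */
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stddef.h>
#include <assert.h>

struct qnode {                       /* one node per thread */
    struct qnode *volatile next;
    volatile unsigned sleeping;      /* 1 while this waiter must sleep */
};

struct qlock { struct qnode *volatile tail; };

static long qfutex(volatile unsigned *uaddr, int op, unsigned val)
{
    return syscall(__NR_futex, uaddr, op, val, NULL, NULL, 0);
}

static void q_acquire(struct qlock *lk, struct qnode *me)
{
    me->next = NULL;
    me->sleeping = 1;
    /* Atomically append ourselves to the queue (atomic exchange). */
    struct qnode *pred = __sync_lock_test_and_set(&lk->tail, me);
    if (pred) {                      /* lock held: link in and sleep */
        pred->next = me;
        while (me->sleeping)
            qfutex(&me->sleeping, FUTEX_WAIT, 1);
    }
}

static void q_release(struct qlock *lk, struct qnode *me)
{
    if (!me->next) {
        /* No visible successor: try to mark the lock free. */
        if (__sync_bool_compare_and_swap(&lk->tail, me, NULL))
            return;
        while (!me->next)            /* successor is still linking in */
            ;
    }
    me->next->sleeping = 0;          /* hand over, then wake exactly one */
    qfutex(&me->next->sleeping, FUTEX_WAKE, 1);
}
```

Because the releaser wakes only one specific futex word, there is no herd at all; the cost is the per-thread node management that Jeremy mentions.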
Comment 32 Julian Seward 2011-07-13 21:35:51 UTC
> but when introducing fairness, Brussels finishes just before Petaouchnok :(.

Where is Petaouchnok?  Somewhere in Siberia maybe?
Comment 33 Julian Seward 2011-07-13 21:38:19 UTC
(In reply to comment #29)
> make -s regtest completes now in the same time with and without the patch
> on my setup.

make regtest isn't a good performance test for this, alas.  Almost all of
the programs do very little real work, and what you're really measuring is
how long it takes the JIT to translate the ld.so and libc startup and
exit code, 570 times (or however many tests there are now).
Comment 34 Julian Seward 2011-07-13 21:43:11 UTC
(In reply to comment #31)
> The thundering herd is going to be a big performance problem in a system with a
> large number of threads.

Even if we buy it off by a factor of N, by using N futexes instead of 1,
as per comment #27 ?

Also, the size of the thundering herd is the number of runnable
threads; non-runnables aren't in it because they are blocked in
syscalls and haven't arrived yet at the futex-wait point.  And if we
have zillions of runnable threads then we have a severe performance
problem anyway due to the single-threading.  (not that that's a good
reason to make performance even worse, but still ..)
Comment 35 Julian Seward 2011-07-13 21:51:57 UTC
(In reply to comment #30)
> It looks like the thundering herd and/or performance problem is not
> solved.

Can we get some profiling data on this?  At the moment we are
more or less guessing what the problem might be.

Either oprofile, or profile with cachegrind or callgrind; details
below.

Also .. is ./sleepers deterministic?  iow, do you expect that the
number of instructions taken to run it depends on the scheduling, or
not?

To profile w/ cachegrind or callgrind (I think callgrind doesn't
work so well when running V itself):

Configure the V to be profiled with --enable-inner and put it in a 
directory "inner" (let's say).

Then run:

./trunk/Inst/bin/valgrind \
   --smc-check=all-non-file \
   -v --trace-children=yes --sim-hints=enable-outer \
   --tool=cachegrind [args for cachegrind]
   ./inner/Inst/bin/valgrind -v --tool=none \
   ./sleepers [args for sleepers]

and wait a long long time :-)  Results can be interesting though.
Comment 36 Philippe Waroquiers 2011-07-13 22:04:10 UTC
(In reply to comment #35)

> Also .. is ./sleepers deterministic?  iow, do you expect that the
> number of instructions taken to run it depends on the scheduling, or
> not?
I think it is deterministic: when run under cachegrind, there were only
tiny differences (e.g. 10 out of xxxxxxxxxxxxx instructions) over 2
or 3 runs.



> and wait a long long time :-)  Results can be interesting though.
I will take a look at this.
Comment 37 Josef Weidendorfer 2011-07-14 16:04:17 UTC
Not sure what cachegrind/callgrind can show here. They only simulate
one cache hierarchy which is accessed by all threads, which cannot show what
happens in reality. Even if we had a multi-core cache simulation,
we would also need to simulate the scheduling behavior of the Linux kernel.

(from comment #30)
> From what I can see, the more the computer is idle (i.e. no other
> activities), the more cpu/the longer the test runs.

If a core gets idle, it will be clocked down by the on-demand governor
(or on modern Intel processors with TurboBoost, active cores will
automatically have clock frequency increased, and decreased if getting
idle).

So, if the serialized Valgrind threads run on different cores, they
perhaps always start with reduced clock frequency, in contrast to a
more utilized system?
Comment 38 Bart Van Assche 2011-07-14 18:25:12 UTC
(In reply to comment #30)
> What I see with the new version of the patch is that there are big
> variation in the cpu needed to finish: the fastest was about 32 seconds
> CPU, the slowest was around 50 seconds (always using around 99% of cpu).
> This is on a RHEL5.5, 4 cores Q9650  @ 3.00GHz.
> 
> From what I can see, the more the computer is idle (i.e. no other activities),
> the more cpu/the longer the test runs.
> When I run two tests in parallel (so, 2 Valgrinds, running each 4 threads),
> both tests finish earlier and with less CPU than if they are run one after
> each other.
> I tried multiple times to run
>  for i in 1 2 3; do;  time ../vg-in-place --tool=none ./sleepers 1 0
> 2222222222222 B-B-B-B- & ; done
> 
> If I keep the &, they all run in parallel, each in about 32seconds.
> If I remove the &, they each run in sequence, each taking 43 to 48 seconds.

There is an aspect of the ticket lock implementation in the attached patch that might explain the above. With two or more idle cores, each Valgrind context switch triggers two futex syscalls - one FUTEX_WAIT and one FUTEX_WAKE call. If the Linux scheduler wakes up the thread that wants to acquire Valgrind's big lock only after the lock has been released, neither the releasing thread nor the acquiring thread invokes the futex syscall.
Comment 39 Bart Van Assche 2011-07-18 16:56:34 UTC
Created attachment 61962 [details]
Patch that makes Valgrind scheduling fair

The following test allowed me to reproduce the slowdown caused by the fair scheduling patch:
   for ((i=0;i<3;i++)); do time ./vg-in-place --tool=none ~/test/sleepers 1 0 1000000000 B-B-B-B-; done

With the third version of the attached patch the slowdown is gone on my test setup (dual core). Does that third version also help on a system with more than two cores?
Comment 40 Julian Seward 2011-07-19 06:58:47 UTC
Does this third version avoid the slowdowns even if you set
SCHEDULING_QUANTUM back to its original value (100000) ?  I don't want
to have a large increase in this value, since I think it could cause
very poor behaviour for interactive programs in the worst case.  The
new value (10 million) means each thread could run for several seconds
before another thread gets a turn.
Comment 41 Philippe Waroquiers 2011-07-20 11:40:21 UTC
(In reply to comment #40)
> Does this third version avoid the slowdowns even if you set
> SCHEDULING_QUANTUM back to its original value (100000) ?  I don't want

I retested the patch on an Ubuntu 11.04, 4 core, Q6600, 2.4GHz.

On this setup, there is no performance difference between the pipe version,
the futex(100000) version and the futex(10000000) version.
The best fairness is obtained with futex(100000). The increase in quantum causes
differences of a few seconds in thread termination times, so it indeed looks
bad for interactive applications.

There is a significant performance difference caused by CPU frequency scaling
(see Josef's comment #37).

When doing
   sudo cpufreq-set -r -g performance 
(instead of the default -g ondemand),
the time taken for the test changes from 40 seconds to 26 seconds (matching
the clock speed increase).

So, at least on Ubuntu 11.04 + Q6600, the futex version without the QUANTUM change gives
fairness without performance degradation.

Next week, I will retest on the RHEL 5.5 Q9650 3GHz and see if the performance difference
can be explained by CPU frequency scaling (e.g. the "pipe" version might keep Valgrind
on a single "fast" core, while the "futex" version might often move Valgrind to a slower
CPU).

It would be worth verifying if the slowdown of comment #39 is also influenced by
cpu freq scaling.
Comment 42 Philippe Waroquiers 2011-07-28 00:00:01 UTC
I have done additional measurements on a RHEL5.5, 4-core Q9650 3 GHz.
The measurements cover a bunch of combinations: "pipe" lock,
"futex" lock, with/without QUANTUM changes, etc. I will attach
the raw measurements and programs.

The conclusions I drew from analyzing the raw data are:
* the "pipe" scheduler's bad fairness is confirmed;
* the "futex" scheduler's good fairness is confirmed;
* increasing QUANTUM decreases the fairness of the "futex" scheduler
  (=> a big QUANTUM is not good for interactive applications);
* CPU frequency scaling is not very friendly to the serialization
  done by Valgrind:
    it can decrease the performance
       of the "pipe" scheduler by 10 to 20%,
       and of the "futex" scheduler by more (up to 50%);
* without CPU frequency scaling (i.e. the CPU always forced to its max clock),
  the futex scheduler seems slightly faster,
  but the difference is small (around 1..2%), so there are not enough measurements to be sure.


It would be nice to have a system call to ask the kernel:
   "please keep all my threads running on the same cpu, but without fixing the cpu, so that if needed
    all the threads can migrate to another CPU".
To my knowledge, this does not exist.
So, it might be simulated by fixing the affinity of all threads
to the current cpu,
then running for some QUANTUM * NR_OF_QUANTUM_ON_SAME_FIXED_CPU,
then unsetting the affinity,
then running for some QUANTUM * NR_OF_QUANTUM_ON_FREE_CPU
   (this should allow the linux scheduler to move Valgrind to a less busy cpu if needed),
then fixing to the current cpu again, etc.

With the fair scheduler + this "same cpu preference", we might obtain
fairness and good performance, even with cpu freq scaling.

There is of course always a risk of decreasing performance (e.g. if a lot
of threads access different data, running them on different cpus
gives them more L1 (and similar) cache than if they all run on the same cpu).
There is also a risk that the above is very dependent on the linux scheduler
(I think there are new scheduling algorithms in very recent linux kernels).
=> so, these NR_OF_QUANTUM_ON_* parameters might be clo options for real power users.
If both are given as 0, it might even mean "use the good old pipe scheduler".
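The two halves of the simulated "same-cpu preference" could be sketched like this. This is only an illustration assuming Linux/glibc; the function names and the alternation policy are made up here, not taken from any patch:

```c
/* Sketch: pin the calling thread to its current CPU for a while, then
   release the affinity so the kernel may migrate it again. */
#define _GNU_SOURCE
#include <sched.h>
#include <assert.h>

/* Phase 1: fix the affinity to the CPU we currently run on. */
static int pin_to_current_cpu(void)
{
    int cpu = sched_getcpu();
    cpu_set_t set;
    if (cpu < 0)
        return -1;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set);
}

/* Phase 2: let the kernel migrate the thread to any online CPU again. */
static int unpin(void)
{
    cpu_set_t set;
    int i;
    CPU_ZERO(&set);
    for (i = 0; i < CPU_SETSIZE; i++)
        CPU_SET(i, &set);   /* bits for CPUs that do not exist are ignored */
    return sched_setaffinity(0, sizeof set, &set);
}
```

The scheduler loop would then call pin_to_current_cpu() for NR_OF_QUANTUM_ON_SAME_FIXED_CPU quanta and unpin() for NR_OF_QUANTUM_ON_FREE_CPU quanta, in alternation.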

There are some other things still to look at:
1. at least on RHEL5.5, the fair scheduler does not link in x86/32-bit mode
   (I guess the __sync_fetch_and_add might have to be coded in assembler
    rather than using the gcc builtin);
2. something must be done for MacOS
    (is there a futex equivalent on this beast?);
3. the performance behaviour might be very different on other platforms
   (ppc? arm?).

Philippe
Comment 43 Philippe Waroquiers 2011-07-28 00:03:23 UTC
Created attachment 62255 [details]
measurement raw data (read the next attachment for some explanation)
Comment 44 Philippe Waroquiers 2011-07-28 00:04:09 UTC
Created attachment 62256 [details]
shell script run on RHEL5.5 to obtain raw measurements
Comment 45 Philippe Waroquiers 2011-07-28 00:04:41 UTC
Created attachment 62257 [details]
slightly modified sleepers.c test program
Comment 46 Bart Van Assche 2011-07-28 11:53:51 UTC
(In reply to comment #40)
> Does this third version avoid the slowdowns even if you set
> SCHEDULING_QUANTUM back to its original value (100000) ?  I don't want
> to have a large increase in this value, since I think it could cause
> very poor behaviour for interactive programs in the worst case.  The
> new value (10 million) means each thread could run for several seconds
> before another thread gets a turn.

I have restored the third version that preserves the value of SCHEDULING_QUANTUM.
Comment 47 Bart Van Assche 2011-07-28 11:57:12 UTC
(In reply to comment #42)
> It would be nice to have a system call to ask to the kernel:
>    "please have all my threads running on the same cpu but without fixing the
> cpu so that if needed
>     all the threads can migrate to another CPU".
> To my knowledge, this does not exist.
> So, it might be simulated by fixing affinity of all threads 
> to the current cpu
> then run for some QUANTUM * NR_OF_QUANTUM_ON_SAME_FIXED_CPU
> then unset the affinity
> then run for some QUANTUM * NR_OF_QUANTUM_ON_FREE_CPU
>    (this should allow the linux scheduler to move Valgrind to a less busy cpu
> if needed)
> then again fixing to the current cpu, etc.

My own preference is to modify the Linux kernel instead of implementing a workaround in Valgrind. There are other applications that need two threads to run on the same CPU in order to avoid an excessive number of cache misses. But I'm afraid it might take a while before such a system call gets backported to RHEL.
Comment 48 Philippe Waroquiers 2011-07-28 12:42:50 UTC
(In reply to comment #46)
> (In reply to comment #40)

> I have restored the third version that preserves the value of
> SCHEDULING_QUANTUM.
The 3rd version of the patch had the QUANTUM change (better rolled back)
but also had other things that I believe should be kept (e.g. it properly
implemented the ownership assertions).
So, it is probably easier/better to make a version 4, which is version 3 minus
the single-line QUANTUM change.
(Or I misunderstood the differences between v2 and v3 of the patch.)

Philippe
Comment 49 Philippe Waroquiers 2011-07-28 12:46:44 UTC
(In reply to comment #47)

> My own preference is to modify the Linux kernel instead of implementing a
> workaround in Valgrind. There are other applications that need having two
> threads to run on the same CPU in order to avoid an excessive number of cache
> misses. But I'm afraid it might take a while before such a system call is
> backported to RHEL.

Having the correct system call indeed looks like the nice/correct approach,
but I suspect it can take a significant time to have the syscall implemented,
then merged into the linux kernel, then into the various distributions, and then
to have all Valgrind users upgrade their kernels: all the valgrind users
who upgrade valgrind faster than their distribution might suffer
a 10 to 50% degradation
(or they have to work around it by either forcing a single CPU or disabling
cpu frequency scaling).

Do you already have an idea about what to do for the x86 link problem and MacOS?

Philippe
Comment 50 Bart Van Assche 2011-07-28 12:55:22 UTC
Created attachment 62272 [details]
Patch that makes Valgrind scheduling fair

Remove SCHEDULING_QUANTUM change from the fairness improvement patch.
Comment 51 Bart Van Assche 2011-07-28 13:00:57 UTC
(In reply to comment #49)
> Do you already have an idea about what to do for the x86 link and MacOS ?

The suggestion to implement atomic increment on x86 via inline assembly makes sense to me.
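For illustration, an atomic fetch-and-add via `lock xadd` might look like the following. This is a sketch for gcc on x86/amd64 with extended inline assembly, not the actual patch code:

```c
/* Sketch of a __sync_fetch_and_add() replacement via inline assembly,
   for gcc on x86/amd64. Illustrative only. */
#include <assert.h>

static unsigned atomic_fetch_add(volatile unsigned *p, unsigned val)
{
    /* xadd atomically exchanges %0 with *%1 and stores the sum in *%1,
       so on exit val holds the old value of *p. */
    __asm__ __volatile__("lock; xaddl %0, %1"
                         : "+r"(val), "+m"(*p)
                         :
                         : "memory", "cc");
    return val;
}
```

This avoids relying on __sync_fetch_and_add_4 being provided by the compiler, which is exactly the symbol that fails to link with the RHEL 5 32-bit gcc.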

With regards to OS/X support I see two options:
- Either look up how pthread_cond_wait() and pthread_cond_signal() are implemented in the online Darwin C library source code and use that knowledge to port the ticket lock patch to Darwin.
- Or use the existing big lock implementation (anonymous pipe) on OS/X.
Comment 52 Bart Van Assche 2011-11-09 17:51:05 UTC
Created attachment 65446 [details]
Patch that makes Valgrind scheduling fair

Patch that implements the described proposal, v5.

Changes compared to v4:
- Now builds on 32-bit Linux systems (at least RHEL 4). Should build on Darwin too (not tested yet).
- valgrind -v ... now prints the name of the ticket lock implementation selected at configure time.
- Added more comments.
Comment 53 Julian Seward 2011-11-10 00:10:05 UTC
> With regards to OS/X support I see two options:
> - Either look up how pthread_cond_wait() and pthread_cond_signal() are
> implemented in the online Darwin C library source code and use that knowledge
> to port the ticket lock patch to Darwin.
> - Or use the existing big lock implementation (anonymous pipe) on OS/X.

How about the following proposal.  I make it because, on the one hand, I see
that we need fair scheduling, at least some times.  But on the other hand the
potential performance losses make me nervous.

Proposal is: add this feature, but make it disabled by default, at least
for the time being, whilst we gather more experience with it.  Add a flag
to enable it.  So, then the existing pipe based lock is the default, and
works everywhere; and for platforms that can support the fair scheduler
(Linux), it can be selected with a flag.  For platforms that can't or
don't yet support it (OSX), we can simply refuse to start if the flag is
given.
Comment 54 Bart Van Assche 2011-11-10 19:15:07 UTC
Created attachment 65493 [details]
Patch that makes Valgrind scheduling fair

(In reply to comment #53)
> Proposal is: add this feature, but make it disabled by default, at least
> for the time being, whilst we gather more experience with it.  Add a flag
> to enable it.  So, then the existing pipe based lock is the default, and
> works everywhere; and for platforms that can support the fair scheduler
> (Linux), it can be selected with a flag.  For platforms that can't or
> don't yet support it (OSX), we can simply refuse to start if the flag is
> given.

Added flag --fair-sched=[no|yes] in version 6 of the patch.
Comment 55 Bart Van Assche 2011-11-11 15:04:30 UTC
Created attachment 65525 [details]
Patch that makes Valgrind scheduling fair

Add --fair-sched=try and use this in the regression tests such that these pass again on systems where the fair scheduling support doesn't build.
Comment 56 Bart Van Assche 2011-12-03 13:57:42 UTC
Sorry that it took so long, but I finally found the time to test this patch on Darwin, and it seems to work fine there (i.e. it doesn't cause any regressions). Any objections against applying this patch to the trunk?
Comment 57 Philippe Waroquiers 2011-12-04 12:59:02 UTC
(In reply to comment #56)
> Sorry that it took so long but I finally found the time to test this patch on
> Darwin, and it seems to work fine there (i.e. doesn't cause any regressions).
> Any objections against applying this patch on the trunk ?
Committing this looks like a good idea.
A few additional remarks/questions:
* It would be worth adding a user manual entry for the new option
  + explaining which problem it solves.
* The new files contain emacs settings, while the other files do not.
* On RHEL 5.5, the ticket lock is ok in 64 bits, but does not link in 32 bits.
  => will the ticket lock be fully disabled then?
     Or enabled only for 64 bits?
Comment 58 Bart Van Assche 2011-12-04 13:43:00 UTC
(In reply to comment #57)
> * On RHEL 5.5, ticket lock is ok in 64 bits, but does not link in 32 bits.
>   => will the ticket lock be fully disabled then ?
>      or enabled only for 64 bits ?

Strange. Valgrind + the latest version of the fair scheduling patch builds fine here on 32-bit RHEL 4.9. Are you sure that you were using the latest version of the patch and that you reran configure before starting the compilation?
Comment 59 Philippe Waroquiers 2011-12-04 15:35:11 UTC
(In reply to comment #58)
> (In reply to comment #57)
> > * On RHEL 5.5, ticket lock is ok in 64 bits, but does not link in 32 bits.
> >   => will the ticket lock be fully disabled then ?
> >      or enabled only for 64 bits ?
> 
> Strange. Valgrind + the latest version of the fair scheduling patch builds fine
> here on 32-bit RHEL 4.9. Are you sure that you were using the latest version of
> the patch and that you have rerun configure before starting compilation ?

Sorry, I was completely unclear.
I will try to rephrase the question:
will Valgrind on RHEL5.5 be configured with the ticket lock in 64 bits
and with the pipe lock in 32 bits?

Or will the pipe lock be configured in both 32 and 64 bits in case the ticket lock
is not ok in 32 bits?
Comment 60 Bart Van Assche 2011-12-04 15:45:18 UTC
(In reply to comment #59)
> will Valgrind on RHEL5.5 be configured with ticket lock in 64 bits
> and with pipe lock in 32 bits ?
> 
> Or will pipe lock be configured both in 32 and 64 bits in case ticket lock
> is not ok in 32 bits ?

The (fair) ticket lock code is compiled if and only if:
* The header <linux/futex.h> is found by the configure script and it has a definition for FUTEX_WAIT.
* Using the gcc builtin functions for atomic memory access does not result in a linker error.

These configure tests should make sure the ticket lock code is compiled only on 64-bit Linux systems, and not on 32-bit Linux systems or on Darwin systems.
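The two checks could be expressed in configure.in along these lines. This is a sketch using standard autoconf macros; the variable names are illustrative and may differ from the actual patch:

```
# Check 1: is <linux/futex.h> present and does it define FUTEX_WAIT?
AC_LINK_IFELSE(
  [AC_LANG_PROGRAM([[#include <linux/futex.h>]],
                   [[return FUTEX_WAIT;]])],
  [have_futex=yes], [have_futex=no])

# Check 2: do the gcc atomic builtins link without error?
AC_LINK_IFELSE(
  [AC_LANG_PROGRAM([[]],
                   [[int i = 0; return __sync_add_and_fetch(&i, 1);]])],
  [have_builtin_atomic=yes], [have_builtin_atomic=no])
```

Note that AC_LINK_IFELSE runs with the flags of the configure-time compiler invocation, which matters for the primary-versus-secondary architecture issue discussed below.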
Comment 61 Philippe Waroquiers 2011-12-04 16:05:46 UTC
(In reply to comment #60)
> The (fair) ticket lock code is compiled if and only if:
> * The header <linux/futex.h> is found by the configure script and it has a
> definition for FUTEX_WAIT.
> * Using the gcc builtin functions for atomic memory access does not result in a
> linker error.

On 64-bit RHEL5.5, the primary build arch (amd64) will pass these
tests, but the secondary build arch (x86) will fail them (no gcc builtin).
=> in such a case, will the primary arch use the ticket lock and the secondary the pipe lock?
Comment 62 Bart Van Assche 2011-12-04 16:25:07 UTC
(In reply to comment #61)
> On RHEL5.5 64 bits, the primary build arch (amd64) will pass these
> tests, but the secondary build arch (x86) will fail these checks (no gcc
> builtin).
> => in such a case, is the primary arch using ticket lock and the secondary the
> pipe lock ?

In that case the ticket lock will be used for both architectures - at least if enabled via the proper --fair-sched=... option.
Comment 63 Philippe Waroquiers 2011-12-04 16:37:28 UTC
(In reply to comment #62)
> (In reply to comment #61)
> > On RHEL5.5 64 bits, the primary build arch (amd64) will pass these
> > tests, but the secondary build arch (x86) will fail these checks (no gcc
> > builtin).
> > => in such a case, is the primary arch using ticket lock and the secondary the
> > pipe lock ?
> 
> In that case the ticket lock will be used for both architectures - at least if
> enabled via the proper --fair-sched=... option.

I do not understand this. If configure verifies only for the primary arch
that the ticket lock is usable, then the 32-bit build will fail to link.
Comment 64 Bart Van Assche 2011-12-04 16:49:21 UTC
(In reply to comment #63)
> If configure verifies only for the primary arch
> that the ticket lock is usable, then it will fail to link the 32 bit.

The above statement is wrong.

If anyone is interested, some output obtained from an amd64 system:

$ objdump -d coregrind/libcoregrind_amd64_linux_a-ticket-lock-linux.o
[ ... ]
0000000000000020 <release_sched_lock>:
[ ... ]
  32:   b8 01 00 00 00          mov    $0x1,%eax
  37:   f0 0f c1 07             lock xadd %eax,(%rdi)
[ ... ]

$ objdump -d coregrind/libcoregrind_x86_linux_a-ticket-lock-linux.o
[ ... ]
00000020 <release_sched_lock>:
[ ... ]
  3b:   ba 01 00 00 00          mov    $0x1,%edx
  40:   f0 0f c1 10             lock xadd %edx,(%eax)
[ ... ]
Comment 65 Philippe Waroquiers 2011-12-04 17:41:41 UTC
(In reply to comment #64)
> (In reply to comment #63)
> > If configure verifies only for the primary arch
> > that the ticket lock is usable, then it will fail to link the 32 bit.
> 
> The above statement is wrong.

Well, the above statement is wrong on your system, but is correct on mine :).

I think what is happening is:
on 64-bit RHEL5.5, gcc compiling in 64-bit mode has the needed builtin, but gcc
compiling in 32-bit mode does not.
So, the configure check (done only for the primary arch, 64 bits) succeeds;
the 64-bit arch of valgrind compiles and links correctly, but the 32-bit arch then
fails to link.

Below is the failing link (and a grep showing that TICKET lock is activated).
The below is a 3.7.0 with the last version of the fair patch,
autogen.sh and configure re-run.

From what I can see, the configure test should be done for both
the primary arch and the secondary arch, as, among other things, the builtin
support can differ.

Hope this clarifies the mystery

Philippe

...
make[3]: Entering directory `/cm/local_build/weekly/fair/memcheck'
../coregrind/link_tool_exe_linux 0x38000000 gcc  -Wno-long-long  -Wno-pointer-sign -fno-stack-protector   -o memcheck-x86-linux -m32 -O2 -g -Wall -Wmissing-prototypes -Wshadow -Wpointer-arith -Wstrict-prototypes -Wmissing-declarations -Wno-format-zero-length -fno-strict-aliasing -fno-builtin -O2 -static -nodefaultlibs -nostartfiles -u _start  -m32 -static -nodefaultlibs -nostartfiles -u _start  -m32 memcheck_x86_linux-mc_leakcheck.o memcheck_x86_linux-mc_malloc_wrappers.o memcheck_x86_linux-mc_main.o memcheck_x86_linux-mc_translate.o memcheck_x86_linux-mc_machine.o memcheck_x86_linux-mc_errors.o ../coregrind/libcoregrind-x86-linux.a ../VEX/libvex-x86-linux.a -lgcc 
../coregrind/libcoregrind-x86-linux.a(libcoregrind_x86_linux_a-ticket-lock-linux.o): In function `release_sched_lock':
/cm/local_build/weekly/fair/coregrind/m_scheduler/ticket-lock-linux.c:159: undefined reference to `__sync_fetch_and_add_4'
/cm/local_build/weekly/fair/coregrind/m_scheduler/ticket-lock-linux.c:162: undefined reference to `__sync_fetch_and_add_4'
../coregrind/libcoregrind-x86-linux.a(libcoregrind_x86_linux_a-ticket-lock-linux.o): In function `acquire_sched_lock':
/cm/local_build/weekly/fair/coregrind/m_scheduler/ticket-lock-linux.c:116: undefined reference to `__sync_fetch_and_add_4'
collect2: ld returned 1 exit status
make[3]: *** [memcheck-x86-linux] Error 1
make[3]: Leaving directory `/cm/local_build/weekly/fair/memcheck'
make[2]: *** [check-recursive] Error 1
make[2]: Leaving directory `/cm/local_build/weekly/fair/memcheck'
make[1]: *** [check-recursive] Error 1
make[1]: Leaving directory `/cm/local_build/weekly/fair'
make: *** [check] Error 2
wao@dhws018: uname -a
Linux dhws018 2.6.18-194.26.1.el5 #1 SMP Fri Oct 29 14:21:16 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
wao@dhws018: gcc -v
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-libgcj-multifile --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --enable-plugin --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic --host=x86_64-redhat-linux
Thread model: posix
gcc version 4.1.2 20080704 (Red Hat 4.1.2-48)
wao@dhws018: grep TICKET config.h
#define ENABLE_LINUX_TICKET_LOCK 1
wao@dhws018:
Comment 66 Bart Van Assche 2011-12-05 18:41:04 UTC
Created attachment 66411 [details]
Patch that makes Valgrind scheduling fair

> On a RHEL5.5 64bits, gcc compiling in 64 bits has the needed builtin but gcc
> compiling in 32 bits does not have the needed builtin.

Thanks for reporting that. Does the new version of the fair scheduling patch help ?
Comment 67 Philippe Waroquiers 2011-12-05 20:44:34 UTC
(In reply to comment #66)
> Created an attachment (id=66411) [details]
> Patch that makes Valgrind scheduling fair
> 
> > On a RHEL5.5 64bits, gcc compiling in 64 bits has the needed builtin but gcc
> > compiling in 32 bits does not have the needed builtin.
> 
> Thanks for reporting that. Does the new version of the fair scheduling patch
> help ?
The new version compiles and links properly in 32 and 64 bits on RHEL5.5.
The resulting 64-bit valgrind has fair scheduling.
The resulting 32-bit valgrind has no fair scheduling (which is what is expected).
So, this is all good.

One small remark: the resulting config.h contains a confusing line:
/* Define to 1 if gcc supports __sync_add_and_fetch() on the secondary
   architecture */
#define HAVE_BUILTIN_ATOMIC 1
config.h does not mention the primary architecture.
I see that the compilation commands for 64 bits use -DENABLE_LINUX_TICKET_LOCK,
while the compilation commands for 32 bits do not define this symbol.
So, it is not clear why config.h contains the HAVE_BUILTIN_ATOMIC for the
secondary arch.

Thanks
Comment 68 Bart Van Assche 2011-12-05 21:06:47 UTC
Created attachment 66417 [details]
Patch that makes Valgrind scheduling fair

> One small remark: the resulting config.h contains a confusing line:
> /* Define to 1 if gcc supports __sync_add_and_fetch() on the secondary
>   architecture */
> #define HAVE_BUILTIN_ATOMIC 1

There was one line too many in configure.in - it should be fixed now.
Comment 69 Julian Seward 2011-12-05 22:26:08 UTC
Excellent.  btw I used this the other day to debug some concurrent code w/ 
Helgrind and DRD.  +1 to land whenever you think it's ready.

btw, one thing in the patch:

+         if (VG_(strcmp)(tmp_str, "yes") == 0)
+            VG_(clo_fair_sched) = enable_fair_sched;
+         else if (VG_(strcmp)(tmp_str, "try") == 0)
+            VG_(clo_fair_sched) = disable_fair_sched;
+         else if (VG_(strcmp)(tmp_str, "no") == 0)
+            VG_(clo_fair_sched) = try_fair_sched;

looks a bit strange.  "try" produces disable_ and "no" produces try_.
Is that what you intended?  Maybe I misunderstand something.
Comment 70 Bart Van Assche 2011-12-06 19:52:32 UTC
(In reply to comment #67)
> I see that the compilation commands for 64 bits are using
> -DENABLE_LINUX_TICKET_LOCK,
> while the compilation commands for 32 bits are not defining this symbol.
> So, not clear why the config.h contains the HAVE_BUILTIN_ATOMIC for the
> secondary arch.

You might have missed these changes in coregrind/Makefile.am. These enable -DENABLE_LINUX_TICKET_LOCK depending on the build architecture and the values of ENABLE_LINUX_TICKET_LOCK_PRIMARY and ENABLE_LINUX_TICKET_LOCK_SECONDARY:

@@ -371,6 +375,13 @@ libcoregrind_@VGCONF_ARCH_PRI@_@VGCONF_OS@_a_CFLAGS = \
     $(AM_CFLAGS_@VGCONF_PLATFORM_PRI_CAPS@)
 libcoregrind_@VGCONF_ARCH_PRI@_@VGCONF_OS@_a_CCASFLAGS = \
     $(AM_CCASFLAGS_@VGCONF_PLATFORM_PRI_CAPS@)
+if ENABLE_LINUX_TICKET_LOCK_PRIMARY
+libcoregrind_@VGCONF_ARCH_PRI@_@VGCONF_OS@_a_SOURCES += \
+    m_scheduler/ticket-lock-linux.c
+libcoregrind_@VGCONF_ARCH_PRI@_@VGCONF_OS@_a_CFLAGS += \
+    -DENABLE_LINUX_TICKET_LOCK
+endif
+
 if VGCONF_HAVE_PLATFORM_SEC
 libcoregrind_@VGCONF_ARCH_SEC@_@VGCONF_OS@_a_SOURCES = \
     $(COREGRIND_SOURCES_COMMON)
@@ -382,6 +393,12 @@ libcoregrind_@VGCONF_ARCH_SEC@_@VGCONF_OS@_a_CFLAGS = \
     $(AM_CFLAGS_@VGCONF_PLATFORM_SEC_CAPS@)
 libcoregrind_@VGCONF_ARCH_SEC@_@VGCONF_OS@_a_CCASFLAGS = \
     $(AM_CCASFLAGS_@VGCONF_PLATFORM_SEC_CAPS@)
+if ENABLE_LINUX_TICKET_LOCK_SECONDARY
+libcoregrind_@VGCONF_ARCH_SEC@_@VGCONF_OS@_a_SOURCES += \
+    m_scheduler/ticket-lock-linux.c
+libcoregrind_@VGCONF_ARCH_SEC@_@VGCONF_OS@_a_CFLAGS += \
+    -DENABLE_LINUX_TICKET_LOCK
+endif
 endif
 
 #----------------------------------------------------------------------------
Comment 71 Bart Van Assche 2011-12-06 20:05:11 UTC
Created attachment 66448 [details]
Patch that makes Valgrind scheduling fair

Fix semantics of --fair-sched=try and --fair-sched=no.
Comment 72 Bart Van Assche 2011-12-08 17:00:45 UTC
Applied the attached patch as r12280.