Bug 490651 - Stop using -flto-partition=one
Summary: Stop using -flto-partition=one
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: general (other bugs)
Version First Reported In: 3.24 GIT
Platform: Other Linux
Priority: NOR
Severity: normal
Target Milestone: ---
Assignee: Mark Wielaard
 
Reported: 2024-07-22 14:42 UTC by Sam James
Modified: 2024-07-28 18:45 UTC
CC: 2 users



Attachments
0001-configure-drop-flto-partition-one.patch (1.56 KB, patch)
2024-07-22 14:42 UTC, Sam James

Description Sam James 2024-07-22 14:42:14 UTC
Created attachment 171893 [details]
0001-configure-drop-flto-partition-one.patch

As discussed with mjw a few weeks ago, our LTO builds are unnecessarily slow at the moment.

Attached a patch with explanation to fix that.
Comment 1 Sam James 2024-07-22 14:42:26 UTC
Subject: [PATCH] configure: drop -flto-partition=one

For me, -flto-partition=one takes ~35m to build + test, while the default
(which is 'balanced') takes ~5m.

The reason that -flto-partition=one is slower is because it disables all
of gcc's LTO parallelisation. This can produce better code, at the cost
of (far) more expensive build times. If users want that, they can still
pass it in their *FLAGS, but I don't think it's a suitable default.

This was originally added in ab773096df7aaaf46e8883af5ed4690f4d4499af.
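For reference, the change boils down to removing the flag from the LTO flags configure sets up. A minimal sketch, assuming the flags live in a single variable (the name below is illustrative, see the attached patch for the real hunk):

--- a/configure.ac
+++ b/configure.ac
-  LTO_CFLAGS="-flto -flto-partition=one"
+  LTO_CFLAGS="-flto"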
Comment 2 Mark Wielaard 2024-07-23 12:46:32 UTC
I agree with Sam that using the default partitioning algorithm seems better than forcing one.
Philippe, any comments?
Comment 3 Philippe Waroquiers 2024-07-23 20:25:36 UTC
(In reply to Mark Wielaard from comment #2)
> I agree with Sam that using the default partitioning algorithm seems better
> than forcing one.
> Philippe, any comments?
Looking at the comments in configure.ac, clearly at that time I considered the build time a one-shot ("off-line") price to pay (for "builders" and "packagers"), while a faster memcheck helps users every day.

But if people are relatively often building valgrind with lto and waiting for the build to finish, I can understand that 35 minutes is a lot compared to 5 minutes.
Comment 4 Mark Wielaard 2024-07-24 10:11:37 UTC
(In reply to Philippe Waroquiers from comment #3)
> Looking at the comments in configure.ac, clearly at that time I considered
> the build time a one-shot ("off-line") price to pay (for "builders" and
> "packagers"), while a faster memcheck helps users every day.
> 
> But if people are relatively often building valgrind with lto and waiting
> for the build to finish, I can understand that 35 minutes is a lot compared
> to 5 minutes.

Yeah, I think even distro packagers would like to use the default lto settings.
And if they really want, they can still do:
CFLAGS="-flto-partition=one" ./configure --enable-lto
Comment 5 Mark Wielaard 2024-07-24 10:13:47 UTC
commit 41a9896f9a91c0e1e44edf3aeff90b6f42f5c132
Author: Sam James <sam@gentoo.org>
Date:   Mon Jul 22 12:26:39 2024 +0100

    configure: drop -flto-partition=one
    
    For me, -flto-partition=one takes ~35m to build + test, while the default
    (which is 'balanced') takes ~5m.
    
    The reason that -flto-partition=one is slower is because it disables all
    of gcc's LTO parallelisation. This can produce better code, at the cost
    of (far) more expensive build times. If users want that, they can still
    pass it in their *FLAGS, but I don't think it's a suitable default.
    
    This was originally added in ab773096df7aaaf46e8883af5ed4690f4d4499af.
    
    https://bugs.kde.org/show_bug.cgi?id=490651
Comment 6 Philippe Waroquiers 2024-07-27 20:32:26 UTC
Note that at work I did some trials of building with and without lto, with -flto-partition=one and with the default lto partitioning,
with gcc (GCC) 12.3.1
on an AMD Ryzen Threadripper PRO 5975WX 32-Cores,
running Red Hat Enterprise Linux release 8.8 (Ootpa)
(only the 64 bit valgrind version).

time (./autogen.sh; ./configure  --prefix=`pwd`/install --enable-only64bit ; make -j20 install; make regtest) 2>&1 | tee B.out
(run in 3 different directories: lto+one, just lto, no lto).
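For reference, a sketch of the three builds being compared (directory names illustrative; the "one partition" build presumably used a checkout from before the flag was dropped, since -flto-partition=one then came from configure itself rather than from CFLAGS):

# lto+one: tree from before commit 41a9896 (configure still adds -flto-partition=one)
time (./autogen.sh; ./configure --enable-lto=yes --prefix=`pwd`/install --enable-only64bit ; make -j20 install; make regtest) 2>&1 | tee B.out
# just lto: same command on a tree with the flag dropped
# no lto:   same command without --enable-lto=yes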


I have not observed a significant difference in the build of valgrind itself between the 2 lto versions:

Build with --enable-lto=yes (one partition) ("lto"):
( ./autogen.sh; ./configure --enable-lto=yes --prefix=`pwd`/install ; make  ;  760.16s user 113.95s system 63% cpu 23:01.57 total
tee B.out  0.01s user 0.05s system 0% cpu 23:01.57 total
Configuring+compiling+installing valgrind: 5 min 51 seconds (measured as the timestamp difference between config.h.in and the include dir timestamp).

Build with --enable-lto=yes, no partition argument i.e. the default ("ltodef"):
( ./autogen.sh; ./configure --enable-lto=yes --prefix=`pwd`/install ; make  ;  684.63s user 100.39s system 60% cpu 21:42.06 total
tee B.out  0.01s user 0.06s system 0% cpu 21:42.06 total
Configuring+compiling+installing valgrind: 5 min 45 seconds.

(Note that in total, a non-lto build took around 15:53 minutes, with 21 seconds for config/compile/install of valgrind.)
I suspect that some tests are quite sensitive to timing and can sometimes loop and/or block for a (much) longer time.

So, unless I missed something, I do not observe a significant build time difference between the 2 lto builds.
I re-measured with make clean and then only make install. Again both lto builds were similar (5min51s vs 5min38s).


So I guess that if we want to decrease the time to build+test valgrind by a lot, we should parallelise the tests,
i.e. finish https://bugs.kde.org/show_bug.cgi?id=319307.

Note also that I have not observed a huge difference in runtime performance between the "one" lto and the "balanced" (default) lto,
so removing -flto-partition=one is confirmed as not being a problem.
Comment 7 Sam James 2024-07-27 20:59:46 UTC
The reproduction commands you pasted seem mixed: if you compile with plain `make` (implicit -j1), then it's essentially the same either way, as GCC won't parallelise.

If you run with `make -jN`, then GCC will automatically use the jobserver make created and consume slots for LTO parallelisation (https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-flto):
> You can also specify -flto=jobserver to use GNU make’s job server mode to determine the number of parallel jobs. This is useful when the Makefile calling GCC is already executing in parallel. You must prepend a ‘+’ to the command recipe in the parent Makefile for this to work. This option likely only works if MAKE is GNU make. Even without the option value, GCC tries to automatically detect a running GNU make’s job server. 
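To make the jobserver requirement concrete, here is a minimal hypothetical Makefile (not valgrind's) showing that '+' prefix:

# The leading '+' tells GNU make to treat the recipe as a recursive/sub-make
# invocation, so the jobserver file descriptors get passed down and gcc can
# claim free job slots for the parallel LTO link phase.
prog: main.o util.o
	+$(CC) -flto=jobserver main.o util.o -o prog

%.o: %.c
	$(CC) -O2 -flto -c $< -o $@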

I suspect this explains the discrepancy.
Comment 8 Philippe Waroquiers 2024-07-28 11:16:39 UTC
(In reply to Sam James from comment #7)
> The reproduction commands you pasted seem mixed: if you compile with plain
> `make` (implicit -j1), then it's essentially the same either way, as GCC
> won't parallelise.
> 
> If you run with `make -jN`, then GCC will automatically use the jobserver
> make created and consume slots for LTO parallelisation
I always used make -j20 install (the "time" truncated the string, but the full command was given at the beginning of the comment).

But in any case, even if parallelism was not used, I got slightly less than 6 minutes of config/make -j20/install for both builds (not the 35 minutes you observed).


> (https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-flto):
> > You can also specify -flto=jobserver to use GNU make’s job server mode to determine the number of parallel jobs. This is useful when the Makefile calling GCC is already executing in parallel. You must prepend a ‘+’ to the command recipe in the parent Makefile for this to work. This option likely only works if MAKE is GNU make. Even without the option value, GCC tries to automatically detect a running GNU make’s job server. 
> 
> I suspect this explains the discrepancy.
As explained above, make -j20 without tests took slightly less than 6 minutes for both lto builds,
and make with tests around 22 and 23 minutes.
So, I do not observe 35 minutes, nor a significant difference between the 2 lto builds.
This might of course be heavily influenced by the CPU used (I used an AMD Ryzen Threadripper PRO 5975WX 32-Cores).
Comment 9 Mark Wielaard 2024-07-28 13:18:33 UTC
Here are the timings on two of my machines (all on fresh checkouts)

= Intel Core i7-10850H

./autogen.sh && ./configure && time make -j8
real	0m3.100s
user	0m5.273s
sys	0m4.253s

./autogen.sh && ./configure --enable-lto && time make -j8
real	3m37.978s
user	25m41.657s
sys	1m26.144s

./autogen.sh && ./configure --enable-lto CFLAGS="-flto-partition=one" && time make -j8
real	9m0.954s
user	16m13.692s
sys	0m37.233s

= Intel Core i9-9900

./autogen.sh && ./configure && time make -j12
real	0m1.522s
user	0m1.930s
sys	0m1.142s

./autogen.sh && ./configure --enable-lto && time make -j12
real	1m58.869s
user	20m19.670s
sys	1m3.180s

./autogen.sh && ./configure --enable-lto CFLAGS="-flto-partition=one" && time make -j12
real	7m8.945s
user	14m1.536s
sys	0m33.254s

The (default) lto builds are ~70 times slower than non-lto builds.
And the -flto-partition=one builds are in turn ~3 times slower than the default lto builds.
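(Working that out from the numbers above: on the i7-10850H, 3m37.978s / 3.100s ≈ 70x and 9m0.954s / 3m37.978s ≈ 2.5x; on the i9-9900, 1m58.869s / 1.522s ≈ 78x and 7m8.945s / 1m58.869s ≈ 3.6x, all in real time.)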

It is definitely noticeable for me. And I have seen much longer build times on some non-x86_64 arches.
Comment 10 Philippe Waroquiers 2024-07-28 15:47:41 UTC
(In reply to Mark Wielaard from comment #9)
> It is definitely noticeable for me. And I have seen much longer build times
> on some non-x86_64 arches.
Indeed, this shows a bigger difference than my measurements did.

How much does the testing add?
Comment 11 Mark Wielaard 2024-07-28 18:45:52 UTC
(In reply to Philippe Waroquiers from comment #10)
> (In reply to Mark Wielaard from comment #9)
> > It is definitely noticeable for me. And I have seen much longer build times
> > on some non-x86_64 arches.
> Indeed, this shows a bigger difference than my measurements did.

Yeah, I am really surprised you aren't seeing any difference with and without the one partition, especially given you are using many more threads to build.

> How much does the testing add?

make check is ~5 times slower with --enable-lto; adding CFLAGS="-flto-partition=one" makes it another ~3 times slower.

make check -j12
real	0m6.492s
user	0m25.506s
sys	0m10.960s

--enable-lto
make check -j12
real	0m31.975s
user	3m27.578s
sys	0m24.597s

CFLAGS="-flto-partition=one"
make check -j12
real	1m30.105s
user	2m6.856s
sys	0m15.383s
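(From the numbers: 31.975s / 6.492s ≈ 4.9x for --enable-lto over the plain build, and 1m30.105s / 31.975s ≈ 2.8x more for adding -flto-partition=one, in real time.)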

make regtest shows the opposite: it gets faster, but only by a few seconds. All tests pass in all cases:
== 922 tests, 0 stderr failures, 0 stdout failures, 0 stderrB failures, 0 stdoutB failures, 0 post failures ==

make regtest
real	10m42.997s
user	6m57.130s
sys	1m13.850s

--enable-lto
make regtest
real	10m31.466s
user	6m46.754s
sys	1m14.648s

CFLAGS="-flto-partition=one"
make regtest
real	10m27.229s
user	6m42.505s
sys	1m14.506s

So it does indeed look like --enable-lto gives a slightly faster valgrind. Adding -flto-partition=one makes it a little faster still, but not by much.