Created attachment 171893 [details]
0001-configure-drop-flto-partition-one.patch

As discussed with mjw a few weeks ago, our LTO builds are unnecessarily slow at the moment. Attached is a patch, with an explanation, that fixes this.
Subject: [PATCH] configure: drop -flto-partition=one

For me, -flto-partition=one takes ~35m to build + test, while the default (which is 'balanced') takes ~5m.

The reason that -flto-partition=one is slower is because it disables all of gcc's LTO parallelisation. This can produce better code, at the cost of (far) more expensive build times. If users want that, they can still pass it in their *FLAGS, but I don't think it's a suitable default.

This was originally added in ab773096df7aaaf46e8883af5ed4690f4d4499af.
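For anyone unfamiliar with the option, here is a rough sketch of what the partitioning choice means at the gcc level (the file names are made up, not from the patch):

# No LTO: code is generated per file at compile time; the link is cheap.
gcc -O2 -c a.c b.c
gcc -O2 a.o b.o -o prog

# Default LTO: link-time code generation is split into 'balanced'
# partitions, which gcc can compile in parallel (via -flto=N or a
# make jobserver).
gcc -O2 -flto -c a.c b.c
gcc -O2 -flto a.o b.o -o prog

# -flto-partition=one: a single partition, i.e. whole-program
# optimisation across everything, but the link-time code generation
# cannot be parallelised at all.
gcc -O2 -flto -c a.c b.c
gcc -O2 -flto -flto-partition=one a.o b.o -o prog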
I agree with Sam that using the default partitioning algorithm seems better than forcing one. Philippe, any comments?
(In reply to Mark Wielaard from comment #2)
> I agree with Sam that using the default partitioning algorithm seems better
> than forcing one.
> Philippe, any comments?

Looking at the comments in configure.ac, clearly at that time I considered the build time a one-shot ("off-line") price to pay (for "builders" and "packagers"), while a faster memcheck helps users every day.

But if people are relatively often building valgrind with LTO and waiting for the build to finish, I can understand that 35 minutes is a lot compared to 5 minutes.
(In reply to Philippe Waroquiers from comment #3)
> Looking at the comments in configure.ac, clearly at that time I considered
> the build time a one-shot ("off-line") price to pay (for "builders" and
> "packagers"), while a faster memcheck helps users every day.
>
> But if people are relatively often building valgrind with LTO and waiting
> for the build to finish, I can understand that 35 minutes is a lot compared
> to 5 minutes.

Yeah, I think even distro packagers would like to use the default LTO settings. And if they really want it, they can still do:

CFLAGS="-flto-partition=one" ./configure --enable-lto
commit 41a9896f9a91c0e1e44edf3aeff90b6f42f5c132
Author: Sam James <sam@gentoo.org>
Date:   Mon Jul 22 12:26:39 2024 +0100

    configure: drop -flto-partition=one

    For me, -flto-partition=one takes ~35m to build + test, while the
    default (which is 'balanced') takes ~5m.

    The reason that -flto-partition=one is slower is because it disables
    all of gcc's LTO parallelisation. This can produce better code, at
    the cost of (far) more expensive build times. If users want that,
    they can still pass it in their *FLAGS, but I don't think it's a
    suitable default.

    This was originally added in ab773096df7aaaf46e8883af5ed4690f4d4499af.

    https://bugs.kde.org/show_bug.cgi?id=490651
Note that at work I did some trials of building with and without LTO, with lto-partition=one and with the default LTO partitioning, using gcc (GCC) 12.3.1 on an AMD Ryzen Threadripper PRO 5975WX 32-Cores, Red Hat Enterprise Linux release 8.8 (Ootpa) (only the 64-bit valgrind version):

time (./autogen.sh; ./configure --prefix=`pwd`/install --enable-only64bit ; make -j20 install; make regtest) 2>&1 | tee B.out

(in 3 different directories: lto+one, just lto, no lto). I have not observed a significant difference for the build of valgrind itself between the 2 LTO versions.

Build with --enable-lto=yes, one partition ("lto"):
( ./autogen.sh; ./configure --enable-lto=yes --prefix=`pwd`/install ; make ;  760.16s user 113.95s system 63% cpu 23:01.57 total
tee B.out  0.01s user 0.05s system 0% cpu 23:01.57 total
Configuring+compiling+installing valgrind: 5 min 51 seconds (measured as the timestamp difference between config.h.in and the include dir timestamp).

Build with --enable-lto=yes, no partition argument i.e. the default ("ltodef"):
( ./autogen.sh; ./configure --enable-lto=yes --prefix=`pwd`/install ; make ;  684.63s user 100.39s system 60% cpu 21:42.06 total
tee B.out  0.01s user 0.06s system 0% cpu 21:42.06 total
Configuring+compiling+installing valgrind: 5 min 45 seconds.

(Note that in total a non-LTO build took around 15:53 minutes, with 21 seconds for config/compile/install of valgrind.) I suspect in fact that some tests are quite sensitive to timing and can sometimes loop a lot longer and/or block for a long(er) time.

So, unless I missed something, I do not observe a significant build time difference between the 2 LTO builds. I re-measured with make clean and then only make install; again both LTO builds were similar (5min51s vs 5min38s).

So I guess that if we want to decrease (a lot) the time to build+test valgrind, we should parallelise the tests, i.e. finish https://bugs.kde.org/show_bug.cgi?id=319307

Note also that I have not observed a huge difference in performance between the "one" LTO and the "balanced" (default) LTO, so removing lto-partition=one is confirmed as not being a problem.
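For reference, the "timestamp difference" measurement mentioned above can be scripted along these lines (a sketch, assuming GNU coreutils stat and the --prefix=`pwd`/install layout used in the commands above):

# config.h.in is created near the start of the build,
# install/include at the end of 'make install'.
start=$(stat -c %Y config.h.in)
end=$(stat -c %Y install/include)
echo "configure+compile+install: $(( (end - start) / 60 ))min $(( (end - start) % 60 ))s"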
The reproduction instructions you used seem mixed. If you compile with `make` (implicit -j1), then it's essentially the same, as GCC won't parallelise.

If you run with `make -jN`, then GCC will automatically use the jobserver make created and consume slots for LTO parallelisation (https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-flto):

> You can also specify -flto=jobserver to use GNU make’s job server mode to
> determine the number of parallel jobs. This is useful when the Makefile
> calling GCC is already executing in parallel. You must prepend a ‘+’ to the
> command recipe in the parent Makefile for this to work. This option likely
> only works if MAKE is GNU make. Even without the option value, GCC tries to
> automatically detect a running GNU make’s job server.

I suspect this explains the discrepancy.
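To illustrate the '+' requirement from the quoted docs, a hypothetical Makefile fragment (prog, main.o and util.o are made-up names):

# The leading '+' gives this recipe access to the parent make's jobserver,
# so gcc's -flto=jobserver (or its automatic jobserver detection) can
# claim job slots from 'make -jN'.
prog: main.o util.o
	+$(CC) -O2 -flto=jobserver -o prog main.o util.o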
(In reply to Sam James from comment #7)
> The reproduction instructions you used seem mixed. If you compile with
> `make` (implicit -j1), then it's essentially the same, as GCC won't
> parallelise.
>
> If you run with `make -jN`, then GCC will automatically use the jobserver
> make created and consume slots for LTO parallelisation

I always used make -j20 install (the "time" output truncated the string, but the full command was given at the beginning of the comment). But in any case, even if parallelism was not used, I got slightly less than 6 minutes of config/make -j20/install for both builds (not the 35 minutes you observed).

> (https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-flto):
> > You can also specify -flto=jobserver to use GNU make’s job server mode to
> > determine the number of parallel jobs. This is useful when the Makefile
> > calling GCC is already executing in parallel. You must prepend a ‘+’ to the
> > command recipe in the parent Makefile for this to work. This option likely
> > only works if MAKE is GNU make. Even without the option value, GCC tries to
> > automatically detect a running GNU make’s job server.
>
> I suspect this explains the discrepancy.

As explained above, make -j20 without tests took slightly less than 6 minutes for both LTO builds, and make with tests around 22 and 23 minutes. So I do not observe 35 minutes, nor a significant difference between the 2 LTO builds. This might of course be heavily influenced by the CPU used (I used an AMD Ryzen Threadripper PRO 5975WX 32-Cores).
Here are the timings on two of my machines (all on fresh checkouts).

= Intel Core i7-10850H

./autogen.sh && ./configure && time make -j8
real    0m3.100s
user    0m5.273s
sys     0m4.253s

./autogen.sh && ./configure --enable-lto && time make -j8
real    3m37.978s
user    25m41.657s
sys     1m26.144s

./autogen.sh && ./configure --enable-lto CFLAGS="-flto-partition=one" && time make -j8
real    9m0.954s
user    16m13.692s
sys     0m37.233s

= Intel Core i9-9900

./autogen.sh && ./configure && time make -j12
real    0m1.522s
user    0m1.930s
sys     0m1.142s

./autogen.sh && ./configure --enable-lto && time make -j12
real    1m58.869s
user    20m19.670s
sys     1m3.180s

./autogen.sh && ./configure --enable-lto CFLAGS="-flto-partition=one" && time make -j12
real    7m8.945s
user    14m1.536s
sys     0m33.254s

The (default) LTO builds are ~70 times slower than non-LTO builds (e.g. 3m38s vs 3.1s on the i7). And the -flto-partition=one builds are again about ~3 times slower than the default LTO builds.

It is definitely noticeable for me. And I have seen much longer build times on some non-x86_64 arches.
(In reply to Mark Wielaard from comment #9)
> It is definitely noticeable for me. And I have seen much longer build times
> on some non-x86_64 arches.

Indeed, this shows more of a difference than my measurements did. How much does the testing add?
(In reply to Philippe Waroquiers from comment #10)
> (In reply to Mark Wielaard from comment #9)
> > It is definitely noticeable for me. And I have seen much longer build times
> > on some non-x86_64 arches.
> Indeed, this shows more of a difference than my measurements did.

Yeah, I am really surprised you aren't seeing any difference with and without the one partition, especially given you are using many more threads to build.

> How much does the testing add?

make check is ~5 times slower with --enable-lto; adding CFLAGS="-flto-partition=one" makes it another ~3 times slower.

make check -j12
real    0m6.492s
user    0m25.506s
sys     0m10.960s

--enable-lto
make check -j12
real    0m31.975s
user    3m27.578s
sys     0m24.597s

CFLAGS="-flto-partition=one"
real    1m30.105s
user    2m6.856s
sys     0m15.383s

make regtest shows the opposite: it gets faster, but just by a few seconds. All tests pass in all cases:

== 922 tests, 0 stderr failures, 0 stdout failures, 0 stderrB failures, 0 stdoutB failures, 0 post failures ==

make regtest
real    10m42.997s
user    6m57.130s
sys     1m13.850s

--enable-lto
make regtest
real    10m31.466s
user    6m46.754s
sys     1m14.648s

CFLAGS="-flto-partition=one"
make regtest
real    10m27.229s
user    6m42.505s
sys     1m14.506s

So it does indeed look like --enable-lto gives a slightly faster valgrind. Adding -flto-partition=one does make it a little faster, but not that much more.