This is a performance improvement based on Peter Bergner's suggestion: allocate caller-saved registers for short-lived vregs and callee-saved registers for vregs which span across helper calls. The idea is quite simple, but MOV coalescing got in the way. The first patch enhances MOV coalescing so it keeps better track of which registers form the coalescing chain. The second patch actually changes find_free_rreg(). With just a minor increase in complexity, we get more compact generated code. This not only helps performance; it should also reduce i-cache misses.

Measuring instruction counts, running inner Memcheck:

perf/bz2:
  vanilla:       45,112,349,784 total; reg alloc 165,978,807; ratio 15.5 (97,397 -> 1,506,163)
  coalesce only: 45,120,691,777 total; reg alloc 167,035,575; ratio 15.5 (97,610 -> 1,509,396)
  both patches:  44,943,765,809 total; reg alloc 167,403,237; ratio 15.3 (97,607 -> 1,493,722)

/bin/true:
  vanilla:       3,514,906,884 total; reg alloc 60,779,126
  coalesce only: 3,506,029,832 total; reg alloc 61,242,900
  both patches:  3,503,495,437 total; reg alloc 61,395,160
Created attachment 107963 [details] refactor MOV coalescing
Created attachment 107964 [details] register allocation
The only remaining thing is to refactor the isMove callbacks in the remaining architectures so that the information is available directly in HRegUsage.
Created attachment 108134 [details] refactor MOV coalescing
Created attachment 108135 [details] register allocation
I've tested on the amd64, power8le and arm64 architectures. In all cases, the produced code was more compact and overall performance was better when running inner Memcheck on perf/bz2. Numbers are instruction counts; ratio as reported by Memcheck with '--stats=yes'.

amd64:
  vanilla:    45,112,349,784 total; 165,978,807 reg alloc; ratio 15.5
  v3-reorder: 44,943,765,809 total; 167,403,237 reg alloc; ratio 15.3

power8le:
  vanilla:    61,928,020,284 total; 351,285,156 reg alloc; ratio 17.0
  v3-reorder: 61,919,130,481 total; 343,001,581 reg alloc; ratio 17.0

arm64 [callgrind does not work on this arch]:
  vanilla:    ratio 14.7
  v3-reorder: ratio 14.7

Manual inspection of the top 200 profiled SBs showed VexExpansionRatio always a few instructions better than in vanilla.
(In reply to Ivo Raisr from comment #4)
> Created attachment 108134 [details]
> refactor MOV coalescing

This looks OK to me. I must say I didn't check the actual merging logic in host_generic_reg_alloc*.c, because I no longer know how those work, and I assume you checked/tested it fairly carefully.

I did look at the changes for each architecture: the removal of isMove_XXXInstr and its replacement by an extension to getRegUsage_XXXInstr. These are potentially dangerous if we make a mistake there -- if it mistakenly claims that something is a move when it isn't, then the generated code will be incorrect. If the new versions miss some moves that the old versions found, then the generated code will be slower. But it all looks OK to me.

I do have one question though: the isMove_XXXInstr functions appear to ignore the virtual/real distinction, whereas the new getRegUsage_XXXInstr bits always check hregIsVirtual() before setting u->isVregVregMove = True and recording the two operands. I assume that is deliberate. I would ask -- is it necessary to duplicate the 'hregIsVirtual?' check for each architecture? Can host_generic_reg_alloc*.c instead do that itself just once, immediately after calling getRegUsage_XXXInstr?
(In reply to Ivo Raisr from comment #5)
> Created attachment 108135 [details]
> register allocation

Looks plausible to me, although I can't make any meaningful assessment of it.
(In reply to Ivo Raisr from comment #6)
> amd64:
>   vanilla:    45,112,349,784 total; 165,978,807 reg alloc; ratio 15.5
>   v3-reorder: 44,943,765,809 total; 167,403,237 reg alloc; ratio 15.3

Nice.

> power8le:
>   vanilla:    61,928,020,284 total; 351,285,156 reg alloc; ratio 17.0
>   v3-reorder: 61,919,130,481 total; 343,001,581 reg alloc; ratio 17.0

Curious as to why this doesn't give much of an improvement. Is it because power has more registers than amd64, so the opportunity for avoiding spill code is smaller?

On that theory, I would be interested to see the equivalent numbers for x86, to see if the gains are even larger. Can you measure those? Also I think it would be a good safety check.
Created attachment 108138 [details] mips report

Hi, I tested these patches on mips64 and mips32. I couldn't find the numbers you are looking for, so I copied everything; I hope that's ok :)
(In reply to aleksandra from comment #10)
> Created attachment 108138 [details]
> mips report

Aleksandra, thank you for testing on MIPS. Please can you re-attach the report as a plain text file? Opening zip files in Bugzilla isn't so convenient.
Created attachment 108160 [details] mips report

I hope this is better.
(In reply to aleksandra from comment #12)
> Created attachment 108160 [details]
> mips report
> I hope this is better.

Thanks, yes. So, what we can pull from that is:

MIPS64
  both:    transtab: new 1,837 (44,740 -> 868,504; ratio 19.4) [0 scs] avg tce size 472
  neither: transtab: new 1,842 (44,992 -> 868,724; ratio 19.3) [0 scs] avg tce size 471

MIPS32
  both:    transtab: new 1,756 (41,684 -> 826,544; ratio 19.8) [0 scs] avg tce size 470
  neither: transtab: new 1,769 (42,096 -> 830,100; ratio 19.7) [0 scs] avg tce size 469

(both = with both patches, neither = with neither patch)

Which is a bit strange, because it looks from that like the patch gives a small increase in code size rather than the expected small decrease. Maybe I confused the before/after logs?
Created attachment 108170 [details] refactor MOV coalescing
Created attachment 108171 [details] register allocation
I've fixed the problem with the duplicate hregIsVirtual() checks in the host_arch_defs.c files. Here are the performance numbers for amd64 and x86, running inner Memcheck on perf/bz2 (compiled with -O or -O2). Numbers are instruction counts:

             total            register allocator   ratio
amd64 -O:
  vanilla:   45,102,001,869   165,978,807          15.5
  patches:   44,928,390,448   169,280,398          15.3
amd64 -O2:
  vanilla:   44,309,403,899   191,289,068          16.5
  patches:   44,108,763,444   195,193,491          16.3
x86 -O:
  vanilla:   48,711,841,474   233,097,543          14.6
  patches:   48,077,926,058   235,505,230          14.4
x86 -O2:
  vanilla:   50,400,479,522   253,382,282          15.1
  patches:   49,953,764,975   256,000,363          14.9

In all cases, the new implementation is faster and produces more compact code. I will investigate the situation on power and post the results here.
(In reply to Ivo Raisr from comment #16)
> Here are the performance numbers for amd64 and x86:

Those are some nice improvements, especially for x86. Thanks for the measurements. I expect arm(32) would also get a quite big improvement, because it's also pretty low on registers. [This is not a request to measure, just random speculation.]

> In all cases, the new implementation is faster and produces better
> (compact) code.

Assuming power looks sane (the generated code is at least not any bigger with the patch), and that everything is stable, I suggest you just land it whenever you like.
I tested these new patches and here are the results:

MIPS64
  both:    transtab: 1,837 (44,740 -> 868,632; ratio 19.4) [0 scs] avg tce size 472
  neither: transtab: 1,834 (44,664 -> 865,872; ratio 19.4) [0 scs] avg tce size 472

MIPS32
  both:    transtab: 1,756 (41,684 -> 826,544; ratio 19.8) [0 scs] avg tce size 470
  neither: transtab: 1,769 (42,036 -> 820,828; ratio 19.5) [0 scs] avg tce size 464
(In reply to Julian Seward from comment #17)
> I expect arm(32) would also get a quite big improvement, because
> it's also pretty low on registers. [This is not a request to measure,
> just random speculation.]

Unfortunately I do not have access to any arm32 machine.

> Assuming power looks sane (the generated code is at least not any bigger
> with the patch), and that everything is stable, I suggest you just land
> it whenever you like.

Actually, profiling the 200 most used blocks in perf/bz2 shows only the following difference:

-VexExpansionRatio 228 3228 141 :10
+VexExpansionRatio 228 3196 140 :10

So the vast majority of the blocks remained the same size; one block is slightly smaller now.
Pushed as commits:
83cabd32492e6d19d483a63522e4e874fa64b617
074de238d44c0cdaf394489ea69a67b76916fbce

https://sourceware.org/git/?p=valgrind.git;a=commit;h=83cabd32492e6d19d483a63522e4e874fa64b617
https://sourceware.org/git/?p=valgrind.git;a=commit;h=074de238d44c0cdaf394489ea69a67b76916fbce
Thank you all for your responses! They were really helpful.