Philippe did some performance measurements of VEX register allocator v2 and v3 with combinations of chasing on/off. His observations suggest that register allocator v3 generates code faster, but the generated code is less efficient, and relatively quickly (at around 1 minute) the total becomes worse. Here are some proofs of concept for ideas I had on how to speed up v2 and v3.
Created attachment 107674 [details] Fix a typo in function find_vreg_to_spill(). Fixes a typo in function find_vreg_to_spill() for v3. This should produce better/more correct spilling decisions but my measurements did not prove that...
Created attachment 107675 [details] Track vreg usage on per-instruction precision in v3 Tracks vreg usage on a per-instruction basis, using a bitset. Should produce optimal spilling decisions.
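To illustrate the idea (this is a rough sketch, not the code in the attachment): each vreg gets a bitset with one bit per instruction, and the spill candidate is the vreg whose next use is furthest away. The names VRegUse, mark_use, next_use and pick_spill_candidate are made up for this example, the 64-instruction limit is an assumption to keep the bitset to one word, and "optimal" is assumed here to mean "furthest next use".

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: a block has at most 64 instructions,
   so a single 64-bit word per vreg is enough. */
#define MAX_INSTRS 64

/* Bit i of use_bits is set when instruction i reads or writes the vreg. */
typedef struct {
    uint64_t use_bits;
} VRegUse;

static void mark_use(VRegUse *u, unsigned instr_idx)
{
    assert(instr_idx < MAX_INSTRS);
    u->use_bits |= (uint64_t)1 << instr_idx;
}

/* Index of the first instruction >= from that uses the vreg,
   or MAX_INSTRS when there is no such instruction. */
static unsigned next_use(const VRegUse *u, unsigned from)
{
    if (from >= MAX_INSTRS) return MAX_INSTRS;
    uint64_t rest = u->use_bits >> from;
    if (rest == 0) return MAX_INSTRS;
    unsigned idx = from;
    while ((rest & 1) == 0) { rest >>= 1; idx++; }
    return idx;
}

/* Pick the vreg whose next use (at or after cur) is furthest away.
   Returns -1 when n_vregs is 0. */
static int pick_spill_candidate(const VRegUse *uses, int n_vregs, unsigned cur)
{
    int best = -1;
    unsigned best_next = 0;
    for (int v = 0; v < n_vregs; v++) {
        unsigned nu = next_use(&uses[v], cur);
        if (best == -1 || nu > best_next) { best = v; best_next = nu; }
    }
    return best;
}
```

With three vregs next used at instructions 2, 7 and 4, pick_spill_candidate() at instruction 1 returns the second one, because its next use is the most distant.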
Created attachment 107676 [details] Scan 10 instructions ahead instead of just 5 Scans 10 instructions ahead in function find_vreg_to_spill() instead of just 5 in v3. Hard to say if it improves anything...
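A toy model of why the scan window can matter (again just a sketch, not the real find_vreg_to_spill()): any vreg not mentioned inside the window looks equally good, so a short window cannot tell two distant uses apart. The function name find_spill_in_window, the uses[][] table and N_VREGS are hypothetical.

```c
#include <stdbool.h>

#define N_VREGS 3   /* hypothetical tiny example */

/* uses[i][v] is true when instruction i mentions vreg v.
   Scans at most `window` instructions starting at `cur` and returns the
   vreg whose first mention is furthest away. A vreg that is not mentioned
   inside the window gets distance `window`. Ties go to the lowest vreg. */
static int find_spill_in_window(bool uses[][N_VREGS], int n_instrs,
                                int cur, int window)
{
    int best = -1, best_dist = -1;
    for (int v = 0; v < N_VREGS; v++) {
        int dist = window;              /* "not seen within the window" */
        for (int d = 0; d < window && cur + d < n_instrs; d++) {
            if (uses[cur + d][v]) { dist = d; break; }
        }
        if (dist > best_dist) { best = v; best_dist = dist; }
    }
    return best;
}
```

If vreg 0 is next used at instruction 6 and vreg 1 at instruction 9, a window of 5 sees neither and picks vreg 0 by tie-break, while a window of 10 picks vreg 1. Whether that difference shows up in real translations is exactly what is hard to measure.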
Created attachment 107677 [details] Different algorithm for find_vreg_to_spill() in v3. Different algorithm for find_vreg_to_spill() in v3. Also scans 10 instructions instead of just 5.
Created attachment 107678 [details] Small performance enhancement to v2. Also makes it the default for testing. Small performance enhancement to VEX register allocator v2. Iterates only over real registers of the target hreg class. Also makes VEX register allocator v2 the default just for the sake of testing.
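Roughly what "iterates only over real registers of the target hreg class" means, as a standalone sketch. It assumes the real registers are listed grouped by class; the Universe type, its first[]/last[] arrays and the function names are invented for this example and are not the actual VEX data structures.

```c
#include <assert.h>

typedef enum { CLS_INT64 = 0, CLS_VEC128 = 1, N_CLASSES } RegClass;

typedef struct {
    int      n_regs;
    RegClass cls[16];           /* class of each real register, grouped by class */
    int      first[N_CLASSES];  /* index of the first register of each class */
    int      last[N_CLASSES];   /* index of the last register of each class */
} Universe;

/* Precompute the [first, last] index range of every class, once. */
static void index_classes(Universe *u)
{
    assert(u->n_regs <= 16);
    for (int c = 0; c < N_CLASSES; c++) { u->first[c] = 0; u->last[c] = -1; }
    for (int i = 0; i < u->n_regs; i++) {
        RegClass c = u->cls[i];
        if (u->last[c] < u->first[c]) u->first[c] = i;  /* first one seen */
        u->last[c] = i;
    }
}

/* Find a free register of class c, visiting only that class's range.
   *visited reports how many registers were inspected. Returns -1 if none. */
static int find_free(const Universe *u, const int *is_free, RegClass c,
                     int *visited)
{
    *visited = 0;
    for (int i = u->first[c]; i <= u->last[c]; i++) {
        (*visited)++;
        if (is_free[i]) return i;
    }
    return -1;
}
```

With four integer registers followed by two vector registers, a search for a free vector register inspects at most two entries instead of walking all six.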
Philippe, could you please measure these patches with your performance tests? Although I was able to determine the total instruction count on Memcheck+perf/bz2, I was not able to measure real time in a deterministic way using 'taskset 1'. The standard deviation on my laptop with amd64/Linux Ubuntu was over 3%. I think https://bugsfiles.kde.org/attachment.cgi?id=107678 can be integrated right away but I'd like to see in your testing how much it improves things. Also https://bugsfiles.kde.org/attachment.cgi?id=107674 is quite interesting. My measurements did not show any improvement but the code (without the fix) is obviously wrong. I still do not understand what this is trying to tell me...
What I would like to do next is to compare how v2 and v3 allocate registers on some hot paths in perf/bz2. However, I have difficulty locating such SBs. Although Callgrind shows the addresses of such hot translated blocks, I am not able to correlate them to the --flags=10000110 output. Is there any clever way to do that?
One thing you could try is this:

  ./vg-in-place --vex-regalloc-version=2 \
     --profile-flags=00000010 [whatever tool and args you like] ./perf/bz2 x

That shows you both the hot blocks and the regalloc output for them. Then compare against the same for the v3 allocator. Because perf/bz2 is very deterministic, you should be able to find matching block-pairs easily.

It might also be worth trying with perf/fbench and ffbench because they are small, have hot loops and use FP registers a lot. Also, maybe worth comparing on x86 rather than amd64? Given that there are fewer allocatable registers on x86, differences in spilling strategies between v2 and v3 might be more obvious. (Just a guess.)
Created attachment 107711 [details] Reorder allocatable registers for AMD64 and X86 archs Reorder allocatable registers for AMD64 and X86 so that the callee-saved registers are listed first. Helper calls always trash all caller-saved registers. By listing the callee-saved registers first, the register allocator is more likely to pick them and does not need to spill as much before helper calls. If this proves to improve performance (at least somewhat) then the v3 register allocation algorithm needs to be revisited.
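A simplified model of the reordering effect, assuming the allocator hands out registers in the listed order (the real allocators are more involved than that). The four-register lists are an illustrative subset, not the actual VEX allocatable lists, and RReg and spills_before_call are names made up for this sketch. The callee_saved flags follow the System V AMD64 ABI.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    const char *name;
    bool        callee_saved;   /* preserved across a helper call? */
} RReg;

/* Illustrative subset only; not the real VEX register lists. */
static const RReg caller_saved_first[] = {
    { "rsi", false }, { "rdi", false }, { "rbx", true }, { "r12", true }
};
static const RReg callee_saved_first[] = {
    { "rbx", true }, { "r12", true }, { "rsi", false }, { "rdi", false }
};

/* Assume the n_live vregs that live across a helper call were given the
   first n_live registers of `order`. Every one that landed in a
   caller-saved register has to be spilled before the call. */
static int spills_before_call(const RReg *order, int n_regs, int n_live)
{
    assert(n_live <= n_regs);
    int spills = 0;
    for (int i = 0; i < n_live; i++)
        if (!order[i].callee_saved) spills++;
    return spills;
}
```

With two vregs live across the call, the caller-saved-first order needs two spills in this model and the callee-saved-first order needs none.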
(In reply to Ivo Raisr from comment #9)
Here are my findings for running Memcheck on perf/bz2:
  v2 baseline:         45.214 G instructions
  v2 with improvement: 45.183 G instructions
  v3 baseline:         45.132 G instructions
  v3 with reordering:  45.116 G instructions
(In reply to Ivo Raisr from comment #10) Do you have information about the regalloc cost vs the cost of the generated code? In other words, is there a way to tell if the patch actually changes the generated code, or whether it only causes some find-a-register loops inside the allocator to iterate less often?
(In reply to Ivo Raisr from comment #10)
Also included is the total cost of running doRegisterAllocation_v2/3. Here are my findings for running Memcheck on perf/bz2:
  v2 baseline:         45.214 G instructions total; 209 M instructions regalloc
  v2 with improvement: 45.183 G instructions; 208 M instructions regalloc
  v3 baseline:         45.132 G instructions; 161 M instructions regalloc
  v3 with reordering:  45.116 G instructions; 159 M instructions regalloc
Interestingly enough, VEX register allocator v2 also benefits from register reordering.

For Memcheck on perf/bz2:
  ra2-baseline:  45,214 M instructions total; 209 M doRegisterAllocation_v2
  ra2-reordered: 45,189 M instructions total; 205 M doRegisterAllocation_v2
  ra3-baseline:  45,132 M instructions total; 160 M doRegisterAllocation_v3
  ra3-reorder:   45,116 M instructions total; 159 M doRegisterAllocation_v3

For Memcheck on perf/ffbench:
  ra2-baseline: 15,646 M instructions total; 100 M regalloc
  ra2-reorder:  15,615 M instructions total; 99 M regalloc
  ra3-baseline: 15,749 M instructions total; 79 M regalloc
  ra3-reorder:  15,692 M instructions total; 78 M regalloc
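One way to read these figures: subtract the change in allocator cost from the change in total cost to see how many instructions were saved outside the allocator itself. The small helper below does that arithmetic on the numbers above (in millions of instructions). Run and saved_outside_regalloc are names invented for this sketch, and attributing the remainder to better generated code is an interpretation, not something measured directly.

```c
/* All values are in millions of instructions, copied from the figures above. */
typedef struct {
    long total_m;      /* total instructions executed */
    long regalloc_m;   /* instructions spent in doRegisterAllocation_vN */
} Run;

/* Instructions saved by the patched run that are NOT explained by the
   allocator itself getting cheaper. */
static long saved_outside_regalloc(Run base, Run patched)
{
    long total_saved    = base.total_m - patched.total_m;
    long regalloc_saved = base.regalloc_m - patched.regalloc_m;
    return total_saved - regalloc_saved;
}
```

For perf/bz2 this gives 21 M for v2 (25 M saved in total, 4 M of it in the allocator) and 15 M for v3. For perf/ffbench it gives 30 M for v2 and 56 M for v3. So in all four cases most of the saving comes from outside the allocator.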
Created attachment 107958 [details] Refactor tracking of MOV coalescing. VReg<->VReg MOV coalescing status is now a part of the HRegUsage. This allows register allocation to query it twice without incurring a performance penalty. This in turn makes it possible to keep better track of vreg<->vreg MOV coalescing, so that all vregs in the coalesce chain get the effective |dead_before| of the last vreg. A small performance improvement has been observed because this allows coalescing even spilled vregs (previously only assigned ones).
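A sketch of the "effective |dead_before| of the last vreg" idea, separate from the actual patch: follow each vreg's coalesce link to the end of its chain and take that vreg's dead_before. Only the dead_before name comes from the description above. VRegState, coalesced_to, effective_dead_before and propagate_dead_before are hypothetical.

```c
#include <assert.h>

#define NO_VREG (-1)

typedef struct {
    int dead_before;            /* instruction index before which the vreg dies */
    int coalesced_to;           /* next vreg in the MOV chain, or NO_VREG */
    int effective_dead_before;  /* filled in by propagate_dead_before() */
} VRegState;

/* For every vreg, walk its coalesce chain to the last vreg and copy that
   vreg's dead_before. Chains are assumed to be acyclic. */
static void propagate_dead_before(VRegState *v, int n)
{
    for (int i = 0; i < n; i++) {
        int last  = i;
        int steps = 0;
        while (v[last].coalesced_to != NO_VREG) {
            last = v[last].coalesced_to;
            steps++;
            assert(steps < n);  /* guards against a cyclic chain */
        }
        v[i].effective_dead_before = v[last].dead_before;
    }
}
```

For a chain v0 -> v1 -> v2 with dead_before values 3, 6 and 10, all three vregs end up with an effective value of 10, so the register (or spill slot) stays reserved for the whole chain.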
Ivo, can we close this? I assume that all of these improvements have long since landed in the v3 allocator (and also, that we're now shipping v3 by default!) But do let me know if I assume wrongly.
Julian, I have closed this bug now. Register allocator v3 has been the default in Valgrind for many months, and all the improvements from this bug that made sense have already been implemented.