Julian Seward designed a new VEX register allocator last year, and here is its initial implementation. On amd64/Linux it manages to allocate 385 superblocks while running "/bin/true" before it bombs out for some reason. There are still a few missing features to implement. However, compared to the previous register allocator v2, it is believed to be:
- simpler
- faster
- perhaps producing slightly worse spilling decisions
Created attachment 106254 [details] patch 002 - Valgrind part
Created attachment 106255 [details] patch 002 - VEX part
The new VEX register allocator is activated with the command line option --vex-regalloc-version=3, for example:

$ ./vg-in-place --tool=none --vex-regalloc-version=3 --trace-flags=00000110 \
  --trace-notbelow=0 /bin/true
Created attachment 106259 [details] patch 003 - Valgrind part
Created attachment 106260 [details] patch 003 - VEX part
It turns out the current version has a design flaw: it does not track rreg live ranges. This manifests, for example, in this vcode:

82   movq $0x14,%rdi
83   movq %vR12,%rsi
84   movq $0x0,%rdx
85   movq 0xA8(%rbp),%rcx
86   call[4,RLPri_Int] 0x58147900
87   movq %rax,%vR191
88   movq %vR189,%rdi
89   movq %vR28,%rsi
90   movq %vR190,%rdx
91   movq %vR191,%rcx
92   call[4,RLPri_Int] 0x58147900

Imagine that the register allocator processes instruction #89 and that %vR28 is currently spilled. So it will:
1. allocate a "free" register - %rdi
2. generate a reload into %rdi: movq (spill_slot), %rdi
3. register allocate "movq %vR28,%rsi" -> "movq %rdi,%rsi"

This clobbers %rdi, which instruction #88 has just written. I think it is necessary to track rreg live ranges. I will try something simpler than what reg_alloc v2 does: rreg live ranges will be tracked individually per rreg.
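Tracking rreg live ranges individually per rreg could look roughly like this. This is a minimal standalone sketch with hypothetical type and field names, not the actual VEX structures:

```c
#include <stddef.h>

/* One hard live range: the rreg is reserved in [live_after, dead_before).
   An rreg may have several disjoint ranges within a superblock
   (for example, one per helper call that clobbers it). */
typedef struct {
   short live_after;   /* instruction index where the range starts        */
   short dead_before;  /* first instruction index where it is dead again  */
} RRegLiveRange;

/* Per-rreg state: its ranges, kept sorted by live_after. */
typedef struct {
   RRegLiveRange ranges[8];
   size_t        n_ranges;
} RRegLiveness;

/* Is the rreg free (not reserved by any hard live range) at instruction
   ii?  Without this check the allocator could hand out a register such
   as %rdi just before a call that clobbers it (the flaw described
   above). */
int rreg_is_free_at(const RRegLiveness *lr, short ii)
{
   for (size_t i = 0; i < lr->n_ranges; i++) {
      if (lr->ranges[i].live_after <= ii && ii < lr->ranges[i].dead_before)
         return 0;
   }
   return 1;
}
```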
Excellent to see this coming to life! When it can run hello-world I will give it a try. If you have cases where incorrect register allocation is causing subtle bugs (e.g. the program runs but produces different results than with the r2 allocator) then one old trick is to have a command line argument that allows switching from r2 to r3 after some number of blocks processed (or the other direction), and to find a deterministic case that fails. Then binary search for the failing block. (Sorry if this is obvious.)
Created attachment 106323 [details] patch 004 - Valgrind part
Created attachment 106324 [details] patch 004 - VEX part
The attached patches for register allocator v3 are able to drive simple programs (such as /bin/true and /bin/date) successfully through. There is still some functionality to be implemented, though.
Created attachment 106377 [details] patch 005 - Valgrind part
Created attachment 106378 [details] patch 005 - VEX part
Patches 005 implement the function find_vreg_to_spill(). With these patches, register allocator v3 produces correct results for all tests in the Valgrind regression test suite, so the implementation is deemed correct.

As regards performance, v3 is slightly better (speedup reported up to 13.7%) than v2. However there are still a few TODOs in v3 which should help boost the performance even more. So I hope to end up with a reg_alloc v3 that is both faster and simpler than the current v2.

Here are the numbers (% is the speedup of v3 compared to v2):

$ perl perf/vg_perf --tools=none,memcheck,callgrind,helgrind,cachegrind,drd,massif \
  --reps=3 --vg=../valgrind-reg_alloc3 --vg=../valgrind-linux perf

-- Running tests in perf ----------------------------------------------
-- bigcode1 --
bigcode1 valgrind-reg_alloc3:0.09s no: 1.6s (18.0x, -----) me: 3.1s (35.0x, -----) ca:12.7s (140.8x, -----) he: 2.0s (22.8x, -----) ca: 3.7s (41.2x, -----) dr: 2.0s (22.2x, -----) ma: 1.9s (20.7x, -----)
bigcode1 valgrind-linux:0.09s no: 1.6s (17.8x, 1.2%) me: 3.0s (33.3x, 4.8%) ca:12.7s (140.7x, 0.1%) he: 2.0s (22.2x, 2.4%) ca: 3.6s (40.2x, 2.4%) dr: 1.9s (21.7x, 2.5%) ma: 1.9s (20.7x, 0.0%)
-- bigcode2 --
bigcode2 valgrind-reg_alloc3:0.08s no: 3.3s (41.1x, -----) me: 6.8s (85.5x, -----) ca:26.5s (330.9x, -----) he: 4.4s (54.9x, -----) ca: 6.5s (81.2x, -----) dr: 4.1s (51.6x, -----) ma: 3.9s (48.9x, -----)
bigcode2 valgrind-linux:0.08s no: 3.4s (42.2x, -2.7%) me: 6.8s (85.1x, 0.4%) ca:25.6s (320.2x, 3.2%) he: 4.3s (53.9x, 1.8%) ca: 6.1s (76.5x, 5.8%) dr: 4.0s (49.4x, 4.4%) ma: 3.8s (48.0x, 1.8%)
-- bz2 --
bz2 valgrind-reg_alloc3:0.54s no: 2.1s ( 3.9x, -----) me: 7.4s (13.6x, -----) ca:14.9s (27.5x, -----) he:10.9s (20.2x, -----) ca:12.5s (23.1x, -----) dr:11.1s (20.6x, -----) ma: 2.0s ( 3.8x, -----)
bz2 valgrind-linux:0.54s no: 2.0s ( 3.7x, 4.8%) me: 6.3s (11.8x, 13.7%) ca:14.7s (27.1x, 1.3%) he:10.1s (18.6x, 7.8%) ca:11.8s (21.9x, 5.3%) dr:10.7s (19.8x, 4.0%) ma: 2.0s ( 3.8x, -0.5%)
-- fbench --
fbench valgrind-reg_alloc3:0.23s no: 1.1s ( 4.7x, -----) me: 3.7s (16.1x, -----) ca: 5.1s (22.3x, -----) he: 2.7s (11.9x, -----) ca: 3.4s (14.8x, -----) dr: 2.7s (11.6x, -----) ma: 1.1s ( 4.8x, -----)
fbench valgrind-linux:0.23s no: 1.1s ( 4.7x, 0.9%) me: 3.5s (15.3x, 4.6%) ca: 5.2s (22.5x, -0.8%) he: 2.6s (11.5x, 2.9%) ca: 3.1s (13.6x, 7.9%) dr: 2.6s (11.3x, 2.2%) ma: 1.1s ( 4.9x, -0.9%)
-- ffbench --
ffbench valgrind-reg_alloc3:0.25s no: 1.1s ( 4.5x, -----) me: 3.4s (13.7x, -----) ca: 2.0s ( 7.9x, -----) he: 5.9s (23.4x, -----) ca: 4.4s (17.5x, -----) dr: 3.4s (13.4x, -----) ma: 1.1s ( 4.3x, -----)
ffbench valgrind-linux:0.25s no: 1.1s ( 4.5x, 0.0%) me: 3.5s (13.8x, -0.9%) ca: 2.0s ( 7.8x, 0.5%) he: 6.0s (23.8x, -1.7%) ca: 4.2s (16.9x, 3.2%) dr: 3.4s (13.4x, 0.0%) ma: 1.1s ( 4.3x, 0.0%)
-- heap --
heap valgrind-reg_alloc3:0.08s no: 0.8s (10.1x, -----) me: 5.2s (64.9x, -----) ca: 6.6s (83.0x, -----) he: 7.1s (89.0x, -----) ca: 3.9s (49.1x, -----) dr: 4.4s (54.8x, -----) ma: 2.5s (31.9x, -----)
heap valgrind-linux:0.08s no: 0.8s (10.2x, -1.2%) me: 5.1s (63.2x, 2.5%) ca: 6.6s (83.0x, 0.0%) he: 7.0s (86.9x, 2.4%) ca: 3.8s (47.2x, 3.8%) dr: 4.3s (53.2x, 2.7%) ma: 2.5s (31.4x, 1.6%)
-- heap_pdb4 --
heap_pdb4 valgrind-reg_alloc3:0.10s no: 0.9s ( 8.6x, -----) me: 7.9s (79.4x, -----) ca: 7.2s (72.0x, -----) he: 7.9s (79.1x, -----) ca: 4.2s (41.9x, -----) dr: 4.9s (49.0x, -----) ma: 2.7s (27.2x, -----)
heap_pdb4 valgrind-linux:0.10s no: 0.9s ( 8.8x, -2.3%) me: 7.8s (78.4x, 1.3%) ca: 7.2s (71.6x, 0.6%) he: 7.7s (77.3x, 2.3%) ca: 4.2s (41.8x, 0.2%) dr: 4.8s (48.3x, 1.4%) ma: 2.8s (28.3x, -4.0%)
-- many-loss-records --
many-loss-records valgrind-reg_alloc3:0.01s no: 0.3s (32.0x, -----) me: 1.4s (141.0x, -----) ca: 1.1s (107.0x, -----) he: 1.2s (123.0x, -----) ca: 0.8s (75.0x, -----) dr: 1.1s (111.0x, -----) ma: 0.7s (70.0x, -----)
many-loss-records valgrind-linux:0.01s no: 0.3s (29.0x, 9.4%) me: 1.4s (140.0x, 0.7%) ca: 1.1s (106.0x, 0.9%) he: 1.2s (119.0x, 3.3%) ca: 0.7s (72.0x, 4.0%) dr: 1.1s (110.0x, 0.9%) ma: 0.7s (70.0x, 0.0%)
-- many-xpts --
many-xpts valgrind-reg_alloc3:0.04s no: 0.4s ( 9.5x, -----) me: 1.7s (41.8x, -----) ca: 2.6s (64.5x, -----) he: 2.1s (51.7x, -----) ca: 1.0s (26.0x, -----) dr: 1.4s (35.5x, -----) ma: 1.4s (36.2x, -----)
many-xpts valgrind-linux:0.04s no: 0.4s (10.0x, -5.3%) me: 1.5s (38.0x, 9.0%) ca: 2.5s (63.7x, 1.2%) he: 2.0s (50.0x, 3.4%) ca: 1.0s (24.8x, 4.8%) dr: 1.4s (35.2x, 0.7%) ma: 1.5s (37.0x, -2.1%)
-- memrw --
memrw valgrind-reg_alloc3:0.06s no: 0.5s ( 8.5x, -----) me: 1.3s (21.0x, -----) ca: 2.5s (41.0x, -----) he: 4.4s (72.8x, -----) ca: 2.0s (33.8x, -----) dr: 1.3s (21.5x, -----) ma: 0.6s ( 9.3x, -----)
memrw valgrind-linux:0.06s no: 0.5s ( 8.2x, 3.9%) me: 1.2s (20.8x, 0.8%) ca: 2.5s (40.8x, 0.4%) he: 4.3s (71.0x, 2.5%) ca: 2.0s (33.0x, 2.5%) dr: 1.2s (20.3x, 5.4%) ma: 0.6s ( 9.2x, 1.8%)
-- sarp --
sarp valgrind-reg_alloc3:0.02s no: 0.3s (16.5x, -----) me: 2.3s (116.5x, -----) ca: 1.8s (88.0x, -----) he: 5.6s (281.5x, -----) ca: 0.9s (46.0x, -----) dr: 1.0s (51.0x, -----) ma: 0.4s (18.0x, -----)
sarp valgrind-linux:0.02s no: 0.3s (17.0x, -3.0%) me: 2.2s (111.0x, 4.7%) ca: 1.8s (88.0x, 0.0%) he: 5.4s (271.0x, 3.7%) ca: 0.9s (46.0x, 0.0%) dr: 1.0s (48.0x, 5.9%) ma: 0.4s (18.0x, 0.0%)
-- tinycc --
tinycc valgrind-reg_alloc3:0.15s no: 1.2s ( 7.9x, -----) me: 9.4s (62.7x, -----) ca: 9.4s (62.5x, -----) he:11.1s (74.0x, -----) ca: 9.4s (62.9x, -----) dr: 9.1s (60.9x, -----) ma: 1.6s (10.5x, -----)
tinycc valgrind-linux:0.15s no: 1.2s ( 8.0x, -0.8%) me: 8.6s (57.6x, 8.1%) ca: 9.6s (64.1x, -2.6%) he:10.7s (71.3x, 3.6%) ca: 9.1s (60.5x, 3.8%) dr: 8.8s (58.9x, 3.3%) ma: 1.6s (10.6x, -0.6%)
(In reply to Ivo Raisr from comment #13)
> Patches 005 implement function find_vreg_to_spill().

That's excellent. With this in place, I can start and quit Firefox on Memcheck, which requires processing more than 700000 blocks.

So here are some numbers with Memcheck, from --stats=yes. For a small program:

v2 transtab: new 2,793 (62,574 -> 995,820; ratio 15.9) [0 scs] avg tce size 356
v3 transtab: new 2,793 (62,574 -> 1,285,601; ratio 20.5) [0 scs] avg tce size 460

For Firefox (main process), it's similar:

v2 transtab: new 705,840 (18,604,302 -> 319,510,676; ratio 17.2) [73,827 scs] avg tce size 452
v3 transtab: new 722,290 (19,026,251 -> 436,975,315; ratio 23.0) [86,640 scs] avg tce size 604

So there's a significant space loss. I think at least part of it comes from a simple cause. Memcheck-instrumented code contains a lot of helper function calls, and so the v2 allocator tries to use callee-save registers rather than caller-save registers whenever it can, so it doesn't have to save registers around the calls. The v3 allocator doesn't appear to have that optimisation. For example:

v2:
1   movq $0x4A286E5,0xB8(%rbp)
2   movq 0x3D8(%rbp),%r12
3   movq 0x38(%rbp),%r13
4   leaq 0xFFFFFFC0(%r13),%r14
5   cmpq $0x0,%r12
6   callnz[0,RLPri_None] 0x5805B7C0

v3:
1   movq $0x4A286E5,0xB8(%rbp)
2   movq 0x3D8(%rbp),%rsi
3   movq 0x38(%rbp),%rdi
4   leaq 0xFFFFFFC0(%rdi),%r8
5   cmpq $0x0,%rsi
6   movq %rsi,0xAF0(%rbp)
7   movq %rdi,0xAE8(%rbp)
8   movq %r8,0xAE0(%rbp)
9   callnz[0,RLPri_None] 0x5805B7C0

The v2 allocator starts off using r12, r13, r14, so it doesn't have to save them across the call. But the v3 allocator starts off using rsi, rdi, r8 and so has to spill all of them across the call. I remember that doing this (preferring callee-saved to caller-saved regs) in the v2 allocator was very important to performance, but right now I can't figure out from the v2 code how this is implemented.
I think it's done (in a stupid way) using the has_hlrs hint boolean and the loop beginning at host_generic_reg_alloc2.c:1367 (after applying the patch). If the code contains any calls at all then the caller-save registers will be marked with .has_hlrs = True, and so the search loop tries to avoid these (by assigning them to k_suboptimal rather than k) if possible.
(In reply to Julian Seward from comment #15) > loop beginning at host_generic_reg_alloc2.c:1367 (after applying the patch). Sorry, line 1367 in the unpatched sources.
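In outline, the search described above could look something like this. This is a simplified standalone model with hypothetical names, not the actual reg_alloc2 code:

```c
/* Simplified model of the v2 search loop: among the free rregs, prefer
   one without hard live ranges (has_hlrs == 0, i.e. a register not
   clobbered by the calls in this code).  The first caller-save-like
   candidate is remembered only as a fallback. */
typedef struct {
   int is_free;
   int has_hlrs;  /* set on caller-save regs when the code contains calls */
} RRegInfo;

int find_rreg_preferring_callee_save(const RRegInfo *rregs, int n_rregs)
{
   int k_suboptimal = -1;
   for (int k = 0; k < n_rregs; k++) {
      if (!rregs[k].is_free)
         continue;
      if (rregs[k].has_hlrs) {
         if (k_suboptimal == -1)
            k_suboptimal = k;  /* usable, but would need saving at calls */
      } else {
         return k;             /* no hard live ranges: take it */
      }
   }
   return k_suboptimal;        /* -1 if nothing is free at all */
}
```

This is why, in the example above, v2 ends up in r12/r13/r14 while an allocator without the preference ends up in rsi/rdi/r8 and pays for the saves around the call.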
Thank you for the pointers on where to focus next. Indeed, that critical functionality is not implemented yet. Line 873 of host_generic_reg_alloc3.c:

/* Find a free rreg of the correct class.
   TODO-JIT: Find an rreg whose live range (if any) is as far ahead
   as possible. */
Created attachment 106492 [details] Use callee-saved registers in preference to caller-saved registers, if possible

This small change applies on top of "106378: patch 005 - VEX part". On a simple test (/usr/bin/date on memcheck) it reduces the code expansion, relative to the v2 allocator, from 6.9% to 0.64%; in other words, it removes almost all the generated-code performance loss from the new allocator.
Actual numbers:

r2         2,773 (60,618 -> 979,296; ratio 16.2) [0 scs] avg tce size 353
r3-before  2,767 (60,481 -> 1,047,119; ratio 17.3) [0 scs] avg tce size 378
r3-after   2,767 (60,481 -> 985,570; ratio 16.3) [0 scs] avg tce size 356
For the actual cost of doing register allocation, for /usr/bin/date on Memcheck, I have the following profile results from Callgrind. These numbers are for doRegisterAllocation_{v2,v3} and all of their callees:

v2: 193.4 million insns
v3: 156.7 million insns

The v3 allocator beats the v2 allocator on all Callgrind-measured numbers (memory accesses, misses, etc) except that it is about 10% worse in mispredicted conditional branches, for reasons I don't know yet.
I am currently on vacation, so I probably won't make any progress until the end of July...
Memcheck "--stats=yes" results running on Firefox under amd64/Linux (Ubuntu):

v2:               716,447 (19,412,410 -> 331,710,658; ratio 17.1) [213,213 scs] avg tce size 462
v3 (patches 006): 685,657 (18,614,387 -> 318,876,497; ratio 17.1) [179,498 scs] avg tce size 465
Created attachment 106549 [details] patch 006 - Valgrind part
Created attachment 106550 [details] patch 006 - VEX part
Right now the only remaining TODOs from patches 006 are:

/* TODO-JIT: To solve more efficiently questions such as "Find me a free
   register of the given class" it would help if registers in the
   RRegUniverse were grouped by their class. */

/* TODO-JIT: Contrary to what the algorithm overview above says, we
   simplify it for the time being and simply spill the vreg, regardless
   of whether it is used by the instruction or not. This is very
   suboptimal but it can be refined later. */

/* --- TODO-JIT: Direct reload --- */
Created attachment 106595 [details] patch 007 - Valgrind part
Created attachment 106596 [details] patch 007 - VEX part
The attached patches 007 implement groups of real registers in RRegUniverse. They help to efficiently answer questions such as "Find me a free register of the given class".
I tried to do a performance comparison of the reg_alloc v3 implementation as it progresses in terms of patches. However the performance tests are not reliable: I could not get consistent results out of them on my amd64/Linux Ubuntu. Variations of up to +-2% were observed on the same code in different runs, even though no other CPU-intensive process was running in parallel.

Patches description:
005:     the initial register allocator v3 implementation which was able to run the complete regression test suite
005-J:   005 + Julian's enhancement to prefer callee-saved registers
006:     005 + ranking of real registers based on their live ranges
007:     006 + groups of real registers in RRegUniverse
vanilla: baseline Valgrind

Therefore the only conclusions I can draw from these are:
- patches 005 produce worse behaviour than the existing register allocator v2;
- patches 005-J, 006 and 007 perform roughly equivalently to, or slightly better than, the existing register allocator v2;
- it cannot be deduced at this level whether there are any performance improvements among 005-J, 006, and 007 (there should be!).

Here are the results. Only speedups are printed, with a new command line option "--terse".
$ perl ./perf/vg_perf --tools=none,memcheck,callgrind,helgrind,cachegrind,drd,massif \
  --terse --reps=10 --vg=./vanilla --vg=./ra3-005 --vg=./ra3-005-J --vg=./ra3-006 \
  --vg=./ra3-007 vanilla/perf

-- Running tests in vanilla/perf --------------------------------------
-- bigcode1 --
vanilla   : no: ----- me: ----- ca: ----- he: ----- ca: ----- dr: ----- ma: -----
ra3-005   : no:  0.0% me: -4.7% ca:  0.2% he: -1.1% ca: -1.5% dr: -0.6% ma:  0.0%
ra3-005-J : no:  0.0% me:  0.4% ca:  0.3% he: -0.5% ca: -0.9% dr:  0.0% ma:  0.6%
ra3-006   : no:  0.0% me:  0.4% ca:  0.2% he: -0.0% ca: -1.2% dr:  0.0% ma:  0.0%
ra3-007   : no:  0.0% me:  0.7% ca:  0.5% he: -0.5% ca: -0.6% dr:  0.0% ma:  1.2%
-- bigcode2 --
vanilla   : no: ----- me: ----- ca: ----- he: ----- ca: ----- dr: ----- ma: -----
ra3-005   : no: -0.7% me: -2.1% ca: -0.0% he: -0.5% ca: -1.3% dr: -1.1% ma:  0.9%
ra3-005-J : no: -0.3% me:  1.0% ca:  0.1% he: -1.6% ca: -0.2% dr: -0.3% ma: -0.3%
ra3-006   : no: -0.7% me:  0.8% ca: -0.2% he: -0.5% ca: -0.2% dr: -0.5% ma:  0.3%
ra3-007   : no: -0.3% me:  0.2% ca:  0.1% he: -0.5% ca:  0.0% dr:  0.0% ma:  0.9%
-- bz2 --
vanilla   : no: ----- me: ----- ca: ----- he: ----- ca: ----- dr: ----- ma: -----
ra3-005   : no:  0.5% me:-13.2% ca:  0.0% he: -6.0% ca: -4.3% dr: -5.4% ma:  1.1%
ra3-005-J : no:  1.1% me: -0.2% ca:  0.4% he:  0.2% ca: -2.0% dr: -0.2% ma:  2.1%
ra3-006   : no:  0.5% me:  0.0% ca:  0.2% he:  0.5% ca: -0.8% dr:  0.5% ma:  1.6%
ra3-007   : no:  0.5% me:  0.0% ca:  0.2% he:  0.3% ca: -1.0% dr:  0.6% ma:  1.1%
-- fbench --
vanilla   : no: ----- me: ----- ca: ----- he: ----- ca: ----- dr: ----- ma: -----
ra3-005   : no:  0.0% me: -4.0% ca: -0.4% he: -1.9% ca: -1.6% dr: -3.5% ma:  0.9%
ra3-005-J : no:  0.9% me:  0.0% ca: -1.2% he:  1.1% ca: -0.3% dr: -0.8% ma:  0.9%
ra3-006   : no:  0.0% me:  0.3% ca: -0.2% he:  1.1% ca:  0.3% dr:  0.0% ma:  1.8%
ra3-007   : no:  0.0% me: -0.3% ca: -1.0% he:  1.1% ca: -0.3% dr: -1.9% ma:  0.9%
-- ffbench --
vanilla   : no: ----- me: ----- ca: ----- he: ----- ca: ----- dr: ----- ma: -----
ra3-005   : no: -0.9% me: -2.1% ca:  0.5% he: -3.2% ca: -5.0% dr: -2.2% ma:  0.0%
ra3-005-J : no:  0.0% me: -0.6% ca:  0.5% he:  0.2% ca:  0.3% dr: -0.9% ma:  1.9%
ra3-006   : no: -0.9% me:  0.0% ca:  0.0% he: -0.4% ca:  0.8% dr:  0.0% ma:  0.0%
ra3-007   : no: -0.9% me:  0.3% ca:  0.0% he:  0.4% ca:  0.3% dr: -0.6% ma:  0.0%
-- heap --
vanilla   : no: ----- me: ----- ca: ----- he: ----- ca: ----- dr: ----- ma: -----
ra3-005   : no:  0.0% me: -2.2% ca:  0.9% he: -1.9% ca: -2.2% dr: -2.4% ma: -0.8%
ra3-005-J : no: -2.6% me: -0.2% ca:  1.4% he: -0.6% ca:  0.0% dr: -0.5% ma: -1.2%
ra3-006   : no: -2.6% me: -0.6% ca:  0.9% he: -0.6% ca:  0.0% dr:  0.0% ma:  0.4%
ra3-007   : no: -1.3% me: -0.6% ca:  0.2% he:  1.0% ca:  0.0% dr: -2.9% ma:  2.9%
-- heap_pdb4 --
vanilla   : no: ----- me: ----- ca: ----- he: ----- ca: ----- dr: ----- ma: -----
ra3-005   : no:  1.2% me: -2.4% ca:  0.6% he: -1.5% ca: -3.0% dr: -3.5% ma: -0.4%
ra3-005-J : no:  1.2% me: -0.7% ca:  0.0% he: -2.1% ca:  0.0% dr: -0.2% ma: -0.8%
ra3-006   : no:  1.2% me: -0.5% ca:  0.0% he: -2.5% ca: -0.5% dr: -0.2% ma:  0.0%
ra3-007   : no:  1.2% me: -0.4% ca:  0.3% he: -0.4% ca:  0.0% dr: -3.3% ma:  3.1%
-- many-loss-records --
vanilla   : no: ----- me: ----- ca: ----- he: ----- ca: ----- dr: ----- ma: -----
ra3-005   : no: -3.6% me: -1.6% ca: -1.0% he: -0.9% ca: -1.4% dr: -1.9% ma:  0.0%
ra3-005-J : no: -7.1% me: -0.8% ca: -1.0% he:  1.8% ca:  0.0% dr: -1.0% ma:  1.5%
ra3-006   : no: -3.6% me:  0.8% ca: -1.0% he:  0.0% ca:  0.0% dr: -1.0% ma:  0.0%
ra3-007   : no: -7.1% me:  0.8% ca: -1.0% he: -0.9% ca:  0.0% dr: -1.0% ma: -1.5%
-- many-xpts --
vanilla   : no: ----- me: ----- ca: ----- he: ----- ca: ----- dr: ----- ma: -----
ra3-005   : no:  0.0% me: -4.7% ca:  0.4% he: -3.6% ca: -4.2% dr: -0.7% ma:  0.0%
ra3-005-J : no:  0.0% me:  0.0% ca:  0.0% he: -0.5% ca:  0.0% dr:  1.5% ma:  0.7%
ra3-006   : no:  0.0% me:  0.7% ca:  0.4% he:  0.5% ca:  1.0% dr:  2.2% ma:  0.0%
ra3-007   : no: -2.6% me:  0.0% ca:  0.0% he: -0.5% ca:  2.1% dr:  2.9% ma:  0.7%
-- memrw --
vanilla   : no: ----- me: ----- ca: ----- he: ----- ca: ----- dr: ----- ma: -----
ra3-005   : no:  0.0% me: -7.6% ca:  0.8% he: -1.2% ca: -1.6% dr: -5.0% ma: -1.9%
ra3-005-J : no:  0.0% me:  1.7% ca:  0.8% he:  0.5% ca:  0.0% dr: -1.7% ma: -3.8%
ra3-006   : no:  0.0% me:  3.4% ca:  1.3% he:  0.0% ca:  1.0% dr:  0.0% ma: -1.9%
ra3-007   : no:  0.0% me:  0.8% ca:  0.8% he:  0.2% ca:  0.5% dr: -2.5% ma:  0.0%
-- sarp --
vanilla   : no: ----- me: ----- ca: ----- he: ----- ca: ----- dr: ----- ma: -----
ra3-005   : no:  0.0% me: -2.8% ca: -0.6% he: -0.9% ca:  0.0% dr: -6.4% ma:  0.0%
ra3-005-J : no: -3.0% me: -0.5% ca: -1.1% he:  0.4% ca: -1.1% dr:  0.0% ma:  0.0%
ra3-006   : no:  3.0% me: -0.5% ca:  0.0% he:  0.6% ca:  2.2% dr:  1.1% ma: -2.9%
ra3-007   : no:  0.0% me:  6.4% ca:  2.3% he:  1.3% ca:  1.1% dr:  0.0% ma: -2.9%
-- tinycc --
vanilla   : no: ----- me: ----- ca: ----- he: ----- ca: ----- dr: ----- ma: -----
ra3-005   : no: -0.9% me: -4.1% ca:  0.3% he: -5.3% ca: -2.1% dr: -5.6% ma:  0.0%
ra3-005-J : no:  0.9% me:  0.7% ca:  0.5% he: -0.3% ca: -0.8% dr: -0.2% ma:  0.7%
ra3-006   : no:  0.9% me:  0.8% ca:  0.6% he: -0.4% ca: -1.0% dr: -0.5% ma:  0.0%
ra3-007   : no: -0.9% me:  0.6% ca:  0.5% he: -0.2% ca: -0.3% dr:  0.2% ma:  0.7%
-- Finished tests in vanilla/perf --------------------------------------
== 12 programs, 420 timings =================
Created attachment 106687 [details] patch 008 - Valgrind part
Created attachment 106688 [details] patch 008 - VEX part
Patches 008 implement the direct reload optimization. Now the only remaining TODO is the following one:

/* TODO-JIT: Contrary to what the algorithm overview above says, we
   simplify it for the time being and simply spill the vreg, regardless
   of whether it is used by the instruction or not. This is very
   suboptimal but it can be refined later. */

I will address it in the coming days.
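For reference, the condition under which direct reload applies can be modelled roughly like this. This is a toy standalone sketch with hypothetical names; the real allocator delegates the actual instruction rewriting to the backend:

```c
/* Toy description of one instruction with respect to a spilled vreg v. */
typedef struct {
   int reads_v;              /* instruction reads the spilled vreg        */
   int writes_v;             /* instruction also writes it                */
   int accepts_mem_operand;  /* a memory access can be folded into it     */
} Instr;

/* Without direct reload, a spilled vreg costs an extra instruction:
       movq (spill_slot),%tmp ; op %tmp,...
   With direct reload, the op reads the spill slot itself:
       op (spill_slot),...
   which saves the reload instruction and a temporary register.  That is
   only sound if the instruction merely reads the vreg and can take a
   memory operand in its place. */
int can_direct_reload(const Instr *in)
{
   return in->reads_v && !in->writes_v && in->accepts_mem_operand;
}
```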
Created attachment 106801 [details] patch 009 - Valgrind part
Created attachment 106802 [details] patch 009 - VEX part
The final set of patches implementing the new VEX register allocator version 3 for the AMD64 and X86 architectures. All regression tests pass successfully, and Firefox runs OK.
Some performance comparison. Number of instructions executed in either function doRegisterAllocation() [Valgrind baseline, v2] or doRegisterAllocation_v3() [new VEX register allocator, v3]:

--- memcheck on /bin/date ---
v2: 123,878,741
v3:  93,544,940

--- memcheck on perf/bigcode ---
v2: 871,980,968
v3: 697,984,555

These numbers are obtained by running an outer Callgrind on an inner Memcheck as per README_DEVELOPERS, with "--toggle-collect=doRegisterAllocation --toggle-collect=doRegisterAllocation_v3", and then running callgrind_annotate on the Callgrind output files.

Also, the code produced by v3 is comparable to, or only marginally worse than, that produced by v2:

$ ./vg-in-place --stats=yes perf/bigcode
v2: transtab: new 30,033 (318,068 -> 9,133,920; ratio 28.7)
v3: transtab: new 30,028 (317,855 -> 9,133,573; ratio 28.7)

$ ./vg-in-place --stats=yes date
v2: transtab: new 2,850 (61,825 -> 1,004,273; ratio 16.2)
v3: transtab: new 2,845 (61,612 -> 1,006,286; ratio 16.3)
Created attachment 106876 [details] patch 010 - Valgrind part
Created attachment 106877 [details] patch 010 - VEX part
Patches 010 implement the changes necessary to extend register allocator v3 support to all architectures supported by VEX.
Regtested on SLES12 on s390x. The regression test suite sees no change; make perf seems to show small improvements.
With the changes (patch 010 Valgrind/VEX) applied, Memcheck is crashing on MIPS32/MIPS64.

$ ./vg-in-place ls
==27412== Memcheck, a memory error detector
==27412== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==27412== Using Valgrind-3.14.0.SVN and LibVEX; rerun with -h for copyright info
==27412== Command: ls
==27412==

vex: priv/host_generic_reg_alloc3.c:374 (find_free_rreg): Assertion `0' failed.
vex storage: T total 2656776 bytes allocated
vex storage: P total 0 bytes allocated

valgrind: the 'impossible' happened:
   LibVEX called failure_exit().

host stacktrace:
==27412==    at 0x580829C0: show_sched_status_wrk (m_libcassert.c:355)
==27412==    by 0x58082B70: report_and_quit (m_libcassert.c:426)
==27412==    by 0x58082DD4: panic (m_libcassert.c:502)
==27412==    by 0x58082DD4: vgPlain_core_panic_at (m_libcassert.c:507)
==27412==    by 0x58082E18: vgPlain_core_panic (m_libcassert.c:512)
==27412==    by 0x580A5268: failure_exit (m_translate.c:740)
==27412==    by 0x5817C720: vex_assert_fail (main_util.c:247)
==27412==    by 0x580378D0: find_free_rreg.isra.2.constprop.8 (host_generic_reg_alloc3.c:374)
==27412==    by 0x581BF13C: doRegisterAllocation (host_generic_reg_alloc3.c:917)
==27412==    by 0x5817B0A4: libvex_BackEnd (main_main.c:1086)
==27412==    by 0x5817B0A4: LibVEX_Translate (main_main.c:1185)
==27412==    by 0x580A7D04: vgPlain_translate (m_translate.c:1794)
==27412==    by 0x580EB9F4: handle_chain_me (scheduler.c:1084)
==27412==    by 0x580EDBA4: vgPlain_scheduler (scheduler.c:1428)
==27412==    by 0x581026F4: thread_wrapper (syswrap-linux.c:103)
==27412==    by 0x581026F4: run_a_thread_NORETURN (syswrap-linux.c:156)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable (lwpid 27412)
==27412==    at 0x400617C: _dl_start_final (rtld.c:281)
==27412==    by 0x4006528: _dl_start (rtld.c:415)
==27412==    by 0x40023CC: __start (in /lib/mips64el-linux-gnuabi64/ld-2.24.so)

Note: see also the FAQ in the source distribution.
It contains workarounds to several common problems.
In particular, if Valgrind aborted or crashed after
identifying problems in your program, there's a good chance
that fixing those problems will prevent Valgrind aborting or
crashing, especially if it happened in m_mallocfree.c.

If that doesn't help, please report this bug to: www.valgrind.org
In the bug report, send all the above text, the valgrind
version, and what OS and version you are using. Thanks.
Created attachment 107004 [details] patch 010 - VEX part + some debugging for MIPS crashes

This patch adds some debugging statements in find_free_rreg(), where the MIPS arch was hitting the assertion at priv/host_generic_reg_alloc3.c:374.
(In reply to Petar Jovanovic from comment #41) > With the changes (patch 010 Valgrind/VEX) applied, Memcheck is crashing on > MIPS32/MIPS64. Petar, thank you for testing the patches. Please could you test again with the latest patch called "patch 010 - VEX part + some debugging for MIPS crashes". It should print a bunch of debugging statements and then I hope to see what went wrong there.
010 works OK on arm64-linux. On arm(32)-linux, it fails at startup verifying sanity of the register set, due to a small error in the definition of getRRegUniverse_ARM: the first allocable_end[] is indexed at HRcInt64 rather than HRcInt32. After fixing that it works fine.
Created attachment 107008 [details] patch 011 - Valgrind part
Created attachment 107009 [details] patch 011 - VEX part
Patches 011 (re-)introduce the command line option --vex-regalloc-version=2|3 which allows switching between register allocator implementations. Note that version 3 (the new implementation) is now the default. They also fix the license text in host_generic_reg_alloc3.c. Debugging output around host_generic_reg_alloc3.c:374 is still present, to ease root-causing the assertion hit on MIPS.
Created attachment 107010 [details] log of MIPS execution
(In reply to Ivo Raisr from comment #43) > Petar, thank you for testing the patches. > Please could you test again with the latest patch called > "patch 010 - VEX part + some debugging for MIPS crashes". > It should print a bunch of debugging statements and then I hope to see what > went wrong there. Check the log in the comment #48.
Petar, thank you for the debugging output. Unfortunately it is not sufficient to infer what went wrong. Please could you set the following #define's at the beginning of host_generic_reg_alloc3.c to 1:

#define DEBUG_REGALLOC 1
#define SANITY_CHECKS_EVERY_INSTR 1

recompile, and attach the output again (it will be large). Thank you!
Created attachment 107011 [details] log of MIPS execution (large) Here is the log, compressed.
Petar, thank you for the detailed output. Now the root cause is clear and I need to fix the algorithm in find_free_rreg() slightly. I'll prepare a new set of patches...
Created attachment 107036 [details] patch 012 - Valgrind part
Created attachment 107037 [details] patch 012 - VEX part
Patches 012 should fix the assertion hit on the MIPS platform in function find_free_rreg(). Petar, please could you test?
The situation is better, yet not ideal. The majority of programs pass, but some crash. E.g.:

~/radni/valgrind/trunk$ ./vg-in-place -q --track-origins=yes --vex-regalloc-version=3 ./memcheck/tests/atomic_incs

vex: priv/host_generic_reg_alloc3.c:250 (spill_vreg): Assertion `vreg_state[v_idx].dead_before > (Short) current_ii' failed.
vex storage: T total 55230600 bytes allocated
vex storage: P total 0 bytes allocated

valgrind: the 'impossible' happened:
   LibVEX called failure_exit().

host stacktrace:
==19301==    at 0x58082A60: show_sched_status_wrk (m_libcassert.c:355)
==19301==    by 0x58082C10: report_and_quit (m_libcassert.c:426)
==19301==    by 0x58082E74: panic (m_libcassert.c:502)
==19301==    by 0x58082E74: vgPlain_core_panic_at (m_libcassert.c:507)
==19301==    by 0x58082EB8: vgPlain_core_panic (m_libcassert.c:512)
==19301==    by 0x580A53B8: failure_exit (m_translate.c:740)
==19301==    by 0x5817C9C0: vex_assert_fail (main_util.c:247)
==19301==    by 0x58037790: spill_vreg (host_generic_reg_alloc3.c:250)
==19301==    by 0x581C1ACC: doRegisterAllocation_v3 (host_generic_reg_alloc3.c:1048)
==19301==    by 0x5817B310: libvex_BackEnd (main_main.c:1095)
==19301==    by 0x5817B310: LibVEX_Translate (main_main.c:1198)
==19301==    by 0x580A7E54: vgPlain_translate (m_translate.c:1794)
==19301==    by 0x580EBBF4: handle_chain_me (scheduler.c:1084)
==19301==    by 0x580EDDA4: vgPlain_scheduler (scheduler.c:1428)
==19301==    by 0x581028F4: thread_wrapper (syswrap-linux.c:103)
==19301==    by 0x581028F4: run_a_thread_NORETURN (syswrap-linux.c:156)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable (lwpid 19301)
==19301==    at 0x4017228: _dl_determine_tlsoffset (dl-tls.c:239)
==19301==    by 0x4002268: init_tls (rtld.c:615)
==19301==    by 0x400492C: dl_main (rtld.c:1685)
==19301==    by 0x401C240: _dl_sysdep_start (dl-sysdep.c:249)
==19301==    by 0x4006220: _dl_start_final (rtld.c:307)
==19301==    by 0x4006528: _dl_start (rtld.c:415)
==19301==    by 0x40023CC: __start (in /lib/mips64el-linux-gnuabi64/ld-2.24.so)
Petar, thank you again for testing patches 012! Please could you set the following #define's at the beginning of host_generic_reg_alloc3.c to 1:

#define DEBUG_REGALLOC 1
#define SANITY_CHECKS_EVERY_INSTR 1

recompile, and attach the output again (it will be large).
Ivo, here is a new log. https://transfer.pcloud.com/download.html?code=5ZbBWcZMexCPcqqmR4ZFfeRZT410zHsBPVS9h74eABPRb4ByVQIX I could not attach it as it is too big, even after compression.
Thank you, Petar. I was able to download the file and decipher it. The root cause is now known and I have to find an elegant way to fix it.
Created attachment 107068 [details] patch 013 - Valgrind part
Created attachment 107069 [details] patch 013 - VEX part
Patches 013 should fix the assertion hit on the MIPS platform when doing MOV coalescing while the destination vreg is becoming dead. Petar, please could you test?

Also, I think this scenario happened because the MIPS isel backend generated dead code. You can discover more if you run the program with patches 012 and options such as --trace-flags=11111110 --trace-notbelow=0 and let the program crash as before. Then please attach the debug output of only the last SB.
(In reply to Ivo Raisr from comment #62) > Petar, please could you test please? > The latest patch (lucky 13) seems to work fine on one MIPS variant I have tested. Thanks for the extra work. I will test other MIPS variants early next week. > Also I think that this scenario happened because MIPS isel backend generated > dead code. You can discover more if you run the program with patches 012 and > options such as --trace-flags=11111110 --trace-notbelow=0 and let the > program crash as before. Then please attach debug output of only the last SB. Thanks. Will look into it.
Compile and regression suite tested on AMD64/Darwin (macOS 10.11, clang-800.0.42.1). No new regressions seen with v13 of the patches. Looks good for the macOS platform.
(In reply to Petar Jovanovic from comment #63) > The latest patch (lucky 13) seems to work fine on one MIPS variant I > have tested. Thanks for the extra work. > I will test other MIPS variants early next week. Petar, thank you for the testing so far. Have you had a chance to test other MIPS variants?
Provided no other problems arise, I would like to land the new VEX register allocator v3 after the SVN->GIT migration is completed and the dust settles; that is, in 2-3 weeks from now. The new version v3 will be the default; it will be possible to switch to the old version v2 with the command line option:

--vex-regalloc-version=2

Please shout now if you would like me to address your concerns about regalloc v3.
(In reply to Ivo Raisr from comment #65)
> (In reply to Petar Jovanovic from comment #63)
> > The latest patch (lucky 13) seems to work fine on one MIPS variant I
> > have tested. Thanks for the extra work.
> > I will test other MIPS variants early next week.
>
> Petar, thank you for the testing so far. Have you had a chance to test other
> MIPS variants?

Yes, I can confirm now that v13 has been tested on other MIPS variants and it is good to go.

(In reply to Ivo Raisr from comment #62)
> Also I think that this scenario happened because MIPS isel backend generated
> dead code. You can discover more if you run the program with patches 012 and
> options such as --trace-flags=11111110 --trace-notbelow=0 and let the
> program crash as before. Then please attach debug output of only the last SB.

This seems to be related to the way the MIPS port uses and handles DivModU128to64 and some other Iops. The issue comes from tricks done due to the lack of appropriate Iops, which I am considering adding now. These would be:

Iop_DivModU64to64, // :: I64,I64 -> I128
                   // of which lo half is div and hi half is mod
Iop_DivModS32to32, // :: I32,I32 -> I64
                   // of which lo half is div and hi half is mod
Iop_DivModU32to32, // :: I32,I32 -> I64
                   // of which lo half is div and hi half is mod

Does anyone object, or is there a need to create a separate issue for it?
(In reply to Petar Jovanovic from comment #67)
> This seems to be related to the way how MIPS port uses and handles
> DivModU128to64 and some other Iops.
> The issue comes from tricks done due to the lack of appropriate Iops that I
> am considering to add now.
> These would be:
>
> Iop_DivModU64to64, // :: I64,I64 -> I128
>                    // of which lo half is div and hi half is mod
> Iop_DivModS32to32, // :: I32,I32 -> I64
>                    // of which lo half is div and hi half is mod
> Iop_DivModU32to32, // :: I32,I32 -> I64
>                    // of which lo half is div and hi half is mod
>
> Does anyone object or is there a need to create a separate issue for it?

If I understand your comment correctly, you are saying that the MIPS front end misuses DivModU128to64 for something which is not intended. Then please go ahead and fix that, but please do not associate it in any way with this bug, as these are independent.
(In reply to Ivo Raisr from comment #68)
> If I understand your comment correctly, you are saying that MIPS front end
> misuses DivModU128to64 for something which is not intended.

Yes, instead of using the (nonexistent) Iop_DivModU64to64, it uses DivModU128to64, but the backend implements simplified code as if it were Iop_DivModU64to64. It will be fixed.

> Then please go ahead and fix that but please do not associate it in any way
> with this bug
> as these are independent.

Don't worry. :) I just put the explanation of the dead code issue here.
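For what it's worth, the lo-div / hi-mod packing of these DivMod Iops can be illustrated with a plain C model of the proposed Iop_DivModU32to32 (an illustration only, not VEX code):

```c
#include <stdint.h>

/* Model of the proposed Iop_DivModU32to32 :: I32,I32 -> I64,
   of which the lo half is the quotient and the hi half the remainder.
   Caller must ensure b != 0. */
uint64_t divmod_u32_to_32(uint32_t a, uint32_t b)
{
   uint32_t q = a / b;   /* lo half: div */
   uint32_t r = a % b;   /* hi half: mod */
   return ((uint64_t)r << 32) | q;
}
```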
Created attachment 107524 [details] patch 014 (Valgrind + VEX)

Patch 014 combines the Valgrind and VEX parts together, as this fits with the new git SCM. Changes compared to 013:
- trailing whitespace removed
- some fields removed from RRegState and moved to the newly introduced structure RRegLRState; this decoupling helps to differentiate the RRegState and RRegLRState concepts, and also helps when introducing If-Then-Else support later
I am going to integrate patch 014 next week unless I hear otherwise.
Pushed as changeset efa1e5ef8d257e3b20facf6f04350d29578ae9e4.
https://sourceware.org/git/?p=valgrind.git;a=commit;h=efa1e5ef8d257e3b20facf6f04350d29578ae9e4
Ivo, just to say, thank you for doing all this reg-alloc work. I know it was a lot of work. The results are good!
And thank you, Julian, for your guidance. Unfortunately I won't be able to finish the VEX enhancement/rework residing in (I think) the vex-jit-hacks-2 branch. Perhaps some brave soul will be able to continue...