Bug 381553 - VEX register allocator v3
Summary: VEX register allocator v3
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: vex
Version: 3.13 SVN
Platform: Compiled Sources Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Ivo Raisr
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-06-23 00:23 UTC by Ivo Raisr
Modified: 2018-07-24 12:58 UTC
5 users

See Also:
Latest Commit:
Version Fixed In:


Attachments
patch 002 - Valgrind part (1.00 KB, patch) - 2017-06-23 00:24 UTC, Ivo Raisr
patch 002 - VEX part (54.24 KB, patch) - 2017-06-23 00:26 UTC, Ivo Raisr
patch 003 - Valgrind part (1.39 KB, patch) - 2017-06-23 11:22 UTC, Ivo Raisr
patch 003 - VEX part (56.55 KB, patch) - 2017-06-23 11:23 UTC, Ivo Raisr
patch 004 - Valgrind part (1.39 KB, patch) - 2017-06-26 23:56 UTC, Ivo Raisr
patch 004 - VEX part (68.91 KB, patch) - 2017-06-26 23:56 UTC, Ivo Raisr
patch 005 - Valgrind part (1.39 KB, patch) - 2017-06-29 21:24 UTC, Ivo Raisr
patch 005 - VEX part (70.27 KB, patch) - 2017-06-29 21:24 UTC, Ivo Raisr
Use callee-saved registers in preference to caller-saved registers, if possible (3.32 KB, patch) - 2017-07-07 13:01 UTC, Julian Seward
patch 006 - Valgrind part (1.39 KB, patch) - 2017-07-10 21:09 UTC, Ivo Raisr
patch 006 - VEX part (71.87 KB, patch) - 2017-07-10 21:10 UTC, Ivo Raisr
patch 007 - Valgrind part (1.89 KB, patch) - 2017-07-13 22:04 UTC, Ivo Raisr
patch 007 - VEX part (80.12 KB, patch) - 2017-07-13 22:04 UTC, Ivo Raisr
patch 008 - Valgrind part (1.89 KB, patch) - 2017-07-17 21:19 UTC, Ivo Raisr
patch 008 - VEX part (82.93 KB, patch) - 2017-07-17 21:20 UTC, Ivo Raisr
patch 009 - Valgrind part (1.94 KB, patch) - 2017-07-23 17:23 UTC, Ivo Raisr
patch 009 - VEX part (88.94 KB, patch) - 2017-07-23 17:23 UTC, Ivo Raisr
patch 010 - Valgrind part (962 bytes, patch) - 2017-07-27 00:51 UTC, Ivo Raisr
patch 010 - VEX part (153.74 KB, patch) - 2017-07-27 00:51 UTC, Ivo Raisr
patch 010 - VEX part + some debugging for MIPS crashes (154.27 KB, patch) - 2017-08-01 07:39 UTC, Ivo Raisr
patch 011 - Valgrind part (2.48 KB, patch) - 2017-08-01 13:14 UTC, Ivo Raisr
patch 011 - VEX part (113.59 KB, patch) - 2017-08-01 13:15 UTC, Ivo Raisr
log of MIPS execution (10.59 KB, text/plain) - 2017-08-01 13:27 UTC, Petar Jovanovic
log of MIPS execution (large) (262.97 KB, application/x-bzip) - 2017-08-01 15:00 UTC, Petar Jovanovic
patch 012 - Valgrind part (2.48 KB, patch) - 2017-08-02 10:45 UTC, Ivo Raisr
patch 012 - VEX part (113.41 KB, patch) - 2017-08-02 10:46 UTC, Ivo Raisr
patch 013 - Valgrind part (2.48 KB, patch) - 2017-08-04 10:52 UTC, Ivo Raisr
patch 013 - VEX part (114.10 KB, patch) - 2017-08-04 10:52 UTC, Ivo Raisr
patch 014 (Valgrind + VEX) (125.03 KB, patch) - 2017-08-25 22:30 UTC, Ivo Raisr

Description Ivo Raisr 2017-06-23 00:23:22 UTC
Julian Seward designed a new VEX register allocator last year.
Here is its initial implementation.
On amd64/Linux, it manages to allocate 385 superblocks while running "/bin/true" before it bombs out for a so-far-undiagnosed reason.
There are still a few missing features to implement.

However, compared to the previous register allocator v2, it is believed to be:
- simpler
- faster
- perhaps producing slightly worse spilling decisions
Comment 1 Ivo Raisr 2017-06-23 00:24:11 UTC
Created attachment 106254 [details]
patch 002 - Valgrind part
Comment 2 Ivo Raisr 2017-06-23 00:26:06 UTC
Created attachment 106255 [details]
patch 002 - VEX part
Comment 3 Ivo Raisr 2017-06-23 00:26:58 UTC
The new VEX register allocator is activated with the command line option --vex-regalloc-version=3, for example:

$ ./vg-in-place --tool=none --vex-regalloc-version=3 --trace-flags=00000110 \
  --trace-notbelow=0 /bin/true
Comment 4 Ivo Raisr 2017-06-23 11:22:50 UTC
Created attachment 106259 [details]
patch 003 - Valgrind part
Comment 5 Ivo Raisr 2017-06-23 11:23:10 UTC
Created attachment 106260 [details]
patch 003 - VEX part
Comment 6 Ivo Raisr 2017-06-23 11:31:11 UTC
It turns out the current version has a design flaw: it does not track rreg live ranges. This manifests, for example, in this vcode:

 82   movq $0x14,%rdi
 83   movq %vR12,%rsi
 84   movq $0x0,%rdx
 85   movq 0xA8(%rbp),%rcx
 86   call[4,RLPri_Int] 0x58147900
 87   movq %rax,%vR191
 88   movq %vR189,%rdi
 89   movq %vR28,%rsi
 90   movq %vR190,%rdx
 91   movq %vR191,%rcx
 92   call[4,RLPri_Int] 0x58147900

Imagine that the register allocator processes instruction #89 and that %vR28
is currently spilled. So it will:
1. allocate a "free" register - %rdi
2. generate a reload into %rdi: movq (spill_slot), %rdi
3. register allocate "movq %vR28,%rsi" -> "movq %rdi,%rsi"

But %rdi is not actually free here: instruction #88 just filled it for the
call at #92, so the reload clobbers that value.

I think it is necessary to track rreg live ranges.
I will try something simpler than reg_alloc v2 does: rreg live ranges will be tracked individually per rreg.
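To make the flaw concrete, here is a minimal sketch of the kind of per-rreg live range check that would refuse to hand out %rdi at instruction #89. The type and function names are made up for illustration; they are not the actual VEX ones.

```c
/* Each real register keeps its own list of hard live range intervals,
   [live_after, dead_before).  Before using an rreg for a reload, check
   that the current instruction is not inside any of its intervals. */
typedef struct { int live_after; int dead_before; } RRegLR;

static int rreg_is_free_at(const RRegLR* lrs, int n_lrs, int insn_no)
{
   for (int i = 0; i < n_lrs; i++)
      if (lrs[i].live_after <= insn_no && insn_no < lrs[i].dead_before)
         return 0;   /* inside a hard live range: not free */
   return 1;
}
```

In the vcode above, %rdi would carry an interval roughly spanning instructions #88 to #92, so the check would reject it as a reload target at #89.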
Comment 7 Julian Seward 2017-06-26 09:38:01 UTC
Excellent to see this coming to life!  When it can run hello-world I will
give it a try.

If you have cases where incorrect register allocation is causing subtle bugs
(eg, the program runs but produces different results than with the r2 allocator),
then one old trick is to add a command line argument that switches
from r2 to r3 after some number of blocks processed (or in the other direction),
and find a deterministic case that fails.  Then binary search for the failing
block.  (Sorry if this is obvious.)
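The switch-over trick can be sketched roughly like this. The predicate is hypothetical, not actual Valgrind code; in practice `switch_at` would come from the suggested command line argument.

```c
/* A hypothetical sketch of the bisection trick: drive the first
   `switch_at` blocks through one allocator and the rest through the
   other, then binary search on `switch_at` to find the first block
   the new allocator gets wrong. */
static int use_v3_for_block(unsigned long block_no, unsigned long switch_at)
{
   /* Blocks 0 .. switch_at-1 go to the v2 allocator, the rest to v3. */
   return block_no >= switch_at;
}
```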
Comment 8 Ivo Raisr 2017-06-26 23:56:10 UTC
Created attachment 106323 [details]
patch 004 - Valgrind part
Comment 9 Ivo Raisr 2017-06-26 23:56:34 UTC
Created attachment 106324 [details]
patch 004 - VEX part
Comment 10 Ivo Raisr 2017-06-26 23:57:48 UTC
The attached patches for register allocator v3 are able to drive simple programs (such as /bin/true and /bin/date) through to completion.

There is still some functionality to be implemented, though.
Comment 11 Ivo Raisr 2017-06-29 21:24:33 UTC
Created attachment 106377 [details]
patch 005 - Valgrind part
Comment 12 Ivo Raisr 2017-06-29 21:24:54 UTC
Created attachment 106378 [details]
patch 005 - VEX part
Comment 13 Ivo Raisr 2017-06-29 21:55:23 UTC
Patches 005 implement the function find_vreg_to_spill().
With these patches, register allocator v3 produces correct results for all tests in the Valgrind regression test suite.
Therefore the implementation is deemed correct.

As regards performance, v3 is slightly better than v2 (speedups of up to 13.7% reported). However, there are still a few TODOs in v3 which should boost performance even more, so I hope to end up with a reg_alloc v3 that is both faster and simpler than the current v2.

Here are the numbers (% is the speedup of v3 compared to v2):
$ perl perf/vg_perf --tools=none,memcheck,callgrind,helgrind,cachegrind,drd,massif --reps=3 --vg=../valgrind-reg_alloc3 --vg=../valgrind-linux perf
-- Running  tests in perf ----------------------------------------------
bigcode1 valgrind-reg_alloc3:0.09s  no: 1.6s (18.0x, -----)  me: 3.1s (35.0x, -----)  ca:12.7s (140.8x, -----)  he: 2.0s (22.8x, -----)  ca: 3.7s (41.2x, -----)  dr: 2.0s (22.2x, -----)  ma: 1.9s (20.7x, -----)
bigcode1 valgrind-linux:0.09s  no: 1.6s (17.8x,  1.2%)  me: 3.0s (33.3x,  4.8%)  ca:12.7s (140.7x,  0.1%)  he: 2.0s (22.2x,  2.4%)  ca: 3.6s (40.2x,  2.4%)  dr: 1.9s (21.7x,  2.5%)  ma: 1.9s (20.7x,  0.0%)
-- bigcode2 --
bigcode2 valgrind-reg_alloc3:0.08s  no: 3.3s (41.1x, -----)  me: 6.8s (85.5x, -----)  ca:26.5s (330.9x, -----)  he: 4.4s (54.9x, -----)  ca: 6.5s (81.2x, -----)  dr: 4.1s (51.6x, -----)  ma: 3.9s (48.9x, -----)
bigcode2 valgrind-linux:0.08s  no: 3.4s (42.2x, -2.7%)  me: 6.8s (85.1x,  0.4%)  ca:25.6s (320.2x,  3.2%)  he: 4.3s (53.9x,  1.8%)  ca: 6.1s (76.5x,  5.8%)  dr: 4.0s (49.4x,  4.4%)  ma: 3.8s (48.0x,  1.8%)
-- bz2 --
bz2      valgrind-reg_alloc3:0.54s  no: 2.1s ( 3.9x, -----)  me: 7.4s (13.6x, -----)  ca:14.9s (27.5x, -----)  he:10.9s (20.2x, -----)  ca:12.5s (23.1x, -----)  dr:11.1s (20.6x, -----)  ma: 2.0s ( 3.8x, -----)
bz2      valgrind-linux:0.54s  no: 2.0s ( 3.7x,  4.8%)  me: 6.3s (11.8x, 13.7%)  ca:14.7s (27.1x,  1.3%)  he:10.1s (18.6x,  7.8%)  ca:11.8s (21.9x,  5.3%)  dr:10.7s (19.8x,  4.0%)  ma: 2.0s ( 3.8x, -0.5%)
-- fbench --
fbench   valgrind-reg_alloc3:0.23s  no: 1.1s ( 4.7x, -----)  me: 3.7s (16.1x, -----)  ca: 5.1s (22.3x, -----)  he: 2.7s (11.9x, -----)  ca: 3.4s (14.8x, -----)  dr: 2.7s (11.6x, -----)  ma: 1.1s ( 4.8x, -----)
fbench   valgrind-linux:0.23s  no: 1.1s ( 4.7x,  0.9%)  me: 3.5s (15.3x,  4.6%)  ca: 5.2s (22.5x, -0.8%)  he: 2.6s (11.5x,  2.9%)  ca: 3.1s (13.6x,  7.9%)  dr: 2.6s (11.3x,  2.2%)  ma: 1.1s ( 4.9x, -0.9%)
-- ffbench --
ffbench  valgrind-reg_alloc3:0.25s  no: 1.1s ( 4.5x, -----)  me: 3.4s (13.7x, -----)  ca: 2.0s ( 7.9x, -----)  he: 5.9s (23.4x, -----)  ca: 4.4s (17.5x, -----)  dr: 3.4s (13.4x, -----)  ma: 1.1s ( 4.3x, -----)
ffbench  valgrind-linux:0.25s  no: 1.1s ( 4.5x,  0.0%)  me: 3.5s (13.8x, -0.9%)  ca: 2.0s ( 7.8x,  0.5%)  he: 6.0s (23.8x, -1.7%)  ca: 4.2s (16.9x,  3.2%)  dr: 3.4s (13.4x,  0.0%)  ma: 1.1s ( 4.3x,  0.0%)
-- heap --
heap     valgrind-reg_alloc3:0.08s  no: 0.8s (10.1x, -----)  me: 5.2s (64.9x, -----)  ca: 6.6s (83.0x, -----)  he: 7.1s (89.0x, -----)  ca: 3.9s (49.1x, -----)  dr: 4.4s (54.8x, -----)  ma: 2.5s (31.9x, -----)
heap     valgrind-linux:0.08s  no: 0.8s (10.2x, -1.2%)  me: 5.1s (63.2x,  2.5%)  ca: 6.6s (83.0x,  0.0%)  he: 7.0s (86.9x,  2.4%)  ca: 3.8s (47.2x,  3.8%)  dr: 4.3s (53.2x,  2.7%)  ma: 2.5s (31.4x,  1.6%)
-- heap_pdb4 --
heap_pdb4 valgrind-reg_alloc3:0.10s  no: 0.9s ( 8.6x, -----)  me: 7.9s (79.4x, -----)  ca: 7.2s (72.0x, -----)  he: 7.9s (79.1x, -----)  ca: 4.2s (41.9x, -----)  dr: 4.9s (49.0x, -----)  ma: 2.7s (27.2x, -----)
heap_pdb4 valgrind-linux:0.10s  no: 0.9s ( 8.8x, -2.3%)  me: 7.8s (78.4x,  1.3%)  ca: 7.2s (71.6x,  0.6%)  he: 7.7s (77.3x,  2.3%)  ca: 4.2s (41.8x,  0.2%)  dr: 4.8s (48.3x,  1.4%)  ma: 2.8s (28.3x, -4.0%)
-- many-loss-records --
many-loss-records valgrind-reg_alloc3:0.01s  no: 0.3s (32.0x, -----)  me: 1.4s (141.0x, -----)  ca: 1.1s (107.0x, -----)  he: 1.2s (123.0x, -----)  ca: 0.8s (75.0x, -----)  dr: 1.1s (111.0x, -----)  ma: 0.7s (70.0x, -----)
many-loss-records valgrind-linux:0.01s  no: 0.3s (29.0x,  9.4%)  me: 1.4s (140.0x,  0.7%)  ca: 1.1s (106.0x,  0.9%)  he: 1.2s (119.0x,  3.3%)  ca: 0.7s (72.0x,  4.0%)  dr: 1.1s (110.0x,  0.9%)  ma: 0.7s (70.0x,  0.0%)
-- many-xpts --
many-xpts valgrind-reg_alloc3:0.04s  no: 0.4s ( 9.5x, -----)  me: 1.7s (41.8x, -----)  ca: 2.6s (64.5x, -----)  he: 2.1s (51.7x, -----)  ca: 1.0s (26.0x, -----)  dr: 1.4s (35.5x, -----)  ma: 1.4s (36.2x, -----)
many-xpts valgrind-linux:0.04s  no: 0.4s (10.0x, -5.3%)  me: 1.5s (38.0x,  9.0%)  ca: 2.5s (63.7x,  1.2%)  he: 2.0s (50.0x,  3.4%)  ca: 1.0s (24.8x,  4.8%)  dr: 1.4s (35.2x,  0.7%)  ma: 1.5s (37.0x, -2.1%)
-- memrw --
memrw    valgrind-reg_alloc3:0.06s  no: 0.5s ( 8.5x, -----)  me: 1.3s (21.0x, -----)  ca: 2.5s (41.0x, -----)  he: 4.4s (72.8x, -----)  ca: 2.0s (33.8x, -----)  dr: 1.3s (21.5x, -----)  ma: 0.6s ( 9.3x, -----)
memrw    valgrind-linux:0.06s  no: 0.5s ( 8.2x,  3.9%)  me: 1.2s (20.8x,  0.8%)  ca: 2.5s (40.8x,  0.4%)  he: 4.3s (71.0x,  2.5%)  ca: 2.0s (33.0x,  2.5%)  dr: 1.2s (20.3x,  5.4%)  ma: 0.6s ( 9.2x,  1.8%)
-- sarp --
sarp     valgrind-reg_alloc3:0.02s  no: 0.3s (16.5x, -----)  me: 2.3s (116.5x, -----)  ca: 1.8s (88.0x, -----)  he: 5.6s (281.5x, -----)  ca: 0.9s (46.0x, -----)  dr: 1.0s (51.0x, -----)  ma: 0.4s (18.0x, -----)
sarp     valgrind-linux:0.02s  no: 0.3s (17.0x, -3.0%)  me: 2.2s (111.0x,  4.7%)  ca: 1.8s (88.0x,  0.0%)  he: 5.4s (271.0x,  3.7%)  ca: 0.9s (46.0x,  0.0%)  dr: 1.0s (48.0x,  5.9%)  ma: 0.4s (18.0x,  0.0%)
-- tinycc --
tinycc   valgrind-reg_alloc3:0.15s  no: 1.2s ( 7.9x, -----)  me: 9.4s (62.7x, -----)  ca: 9.4s (62.5x, -----)  he:11.1s (74.0x, -----)  ca: 9.4s (62.9x, -----)  dr: 9.1s (60.9x, -----)  ma: 1.6s (10.5x, -----)
tinycc   valgrind-linux:0.15s  no: 1.2s ( 8.0x, -0.8%)  me: 8.6s (57.6x,  8.1%)  ca: 9.6s (64.1x, -2.6%)  he:10.7s (71.3x,  3.6%)  ca: 9.1s (60.5x,  3.8%)  dr: 8.8s (58.9x,  3.3%)  ma: 1.6s (10.6x, -0.6%)
Comment 14 Julian Seward 2017-06-29 23:21:37 UTC
(In reply to Ivo Raisr from comment #13)
> Patches 005 implement function find_vreg_to_spill().

That's excellent.  With this in place, I can start and quit Firefox on
Memcheck, which requires processing more than 700000 blocks.

So here are some numbers with Memcheck, from --stats=yes.  For a small
program:

v2
transtab: new 2,793 (62,574 -> 995,820; ratio 15.9) [0 scs] avg tce size 356

v3
transtab: new 2,793 (62,574 -> 1,285,601; ratio 20.5) [0 scs] avg tce size 460


For Firefox (main process), it's similar:

v2
transtab: new 705,840 (18,604,302 -> 319,510,676; ratio 17.2) [73,827 scs] avg tce size 452

v3
transtab: new 722,290 (19,026,251 -> 436,975,315; ratio 23.0) [86,640 scs] avg tce size 604


So there's a significant space loss.  I think at least part of it comes from
a simple cause.  Memcheck-instrumented code contains a lot of helper function
calls, and so the v2 allocator tries to use callee-saved registers rather than
caller-saved registers whenever it can, so it doesn't have to save registers
around the calls.  The v3 allocator doesn't appear to have that optimisation.
For example:

v2:
  1   movq $0x4A286E5,0xB8(%rbp)
  2   movq 0x3D8(%rbp),%r12
  3   movq 0x38(%rbp),%r13
  4   leaq 0xFFFFFFC0(%r13),%r14
  5   cmpq $0x0,%r12
  6   callnz[0,RLPri_None] 0x5805B7C0

v3:
  1   movq $0x4A286E5,0xB8(%rbp)
  2   movq 0x3D8(%rbp),%rsi
  3   movq 0x38(%rbp),%rdi
  4   leaq 0xFFFFFFC0(%rdi),%r8
  5   cmpq $0x0,%rsi
  6   movq %rsi,0xAF0(%rbp)
  7   movq %rdi,0xAE8(%rbp)
  8   movq %r8,0xAE0(%rbp)
  9   callnz[0,RLPri_None] 0x5805B7C0

The v2 allocator starts off using r12, r13, r14, so it doesn't have to save
them across the call.  But the v3 allocator starts off using rsi, rdi, r8
and so has to spill all of them across the call.

I remember that doing this (preferring callee-saved to caller-saved regs) in
the v2 allocator was very important to performance, but right now I can't
figure out from the v2 code how this is implemented.
Comment 15 Julian Seward 2017-06-29 23:26:54 UTC
I think it's done (in a stupid way) using the has_hlrs hint boolean and the
loop beginning at host_generic_reg_alloc2.c:1367 (after applying the patch).
If the code contains any calls at all then the caller-save registers will be
marked with .has_hlrs = True, and so the search loop tries to avoid these
(by assigning them to k_suboptimal rather than k) if possible.
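A minimal sketch of that search-loop idea, with illustrative names rather than the actual host_generic_reg_alloc2.c code:

```c
/* When picking a free real register, prefer one not marked has_hlrs
   (if the code contains any calls, caller-saved registers get
   has_hlrs = True), falling back to a caller-saved one only if
   nothing better is free. */
typedef struct {
   int free;       /* currently unallocated? */
   int has_hlrs;   /* has hard live ranges, i.e. caller-saved + calls */
} RRegState;

static int find_free_rreg(const RRegState* state, int n_rregs)
{
   int k_suboptimal = -1;
   for (int k = 0; k < n_rregs; k++) {
      if (!state[k].free)
         continue;
      if (state[k].has_hlrs) {
         if (k_suboptimal == -1)
            k_suboptimal = k;     /* remember it, but keep looking */
      } else {
         return k;                /* free and callee-saved: take it */
      }
   }
   return k_suboptimal;           /* may be -1: nothing free at all */
}
```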
Comment 16 Julian Seward 2017-06-29 23:27:51 UTC
(In reply to Julian Seward from comment #15)
> loop beginning at host_generic_reg_alloc2.c:1367 (after applying the patch).

Sorry, line 1367 in the unpatched sources.
Comment 17 Ivo Raisr 2017-06-30 08:02:02 UTC
Thank you for the pointers on where to focus next.
Indeed, that critical functionality is not implemented yet:

line 873 of host_generic_reg_alloc3.c:
            /* Find a free rreg of the correct class. 
               TODO-JIT: Find an rreg whose live range (if any) is as far ahead
               as possible. */
Comment 18 Julian Seward 2017-07-07 13:01:35 UTC
Created attachment 106492 [details]
Use callee-saved registers in preference to caller-saved registers, if possible

This small change applies on top of "106378: patch 005 - VEX part".
On a simple test (/usr/bin/date on memcheck) it reduces the code
expansion, relative to the v2 allocator, from 6.9% to 0.64%; in
other words, it removes almost all of the generated-code performance loss
from the new allocator.
Comment 19 Julian Seward 2017-07-07 13:02:28 UTC
Actual numbers:

r2
2,773 (60,618 -> 979,296; ratio 16.2) [0 scs] avg tce size 353

r3-before
2,767 (60,481 -> 1,047,119; ratio 17.3) [0 scs] avg tce size 378

r3-after
2,767 (60,481 -> 985,570; ratio 16.3) [0 scs] avg tce size 356
Comment 20 Julian Seward 2017-07-07 14:40:46 UTC
For the actual cost of doing register allocation, for /usr/bin/date on Memcheck,
I have the following profile results from Callgrind.  These numbers are for
doRegisterAllocation_{v2,v3} and all of their callees.

v2  193.4 million insns
v3  156.7 million insns

The v3 allocator beats the v2 allocator on all Callgrind-measured numbers
(memory accesses, misses, etc) except that it is about 10% worse in 
mispredicted conditional branches, for reasons I don't know yet.
Comment 21 Ivo Raisr 2017-07-08 07:00:43 UTC
I am currently on vacation, so I probably won't make any progress until the end of July...
Comment 22 Ivo Raisr 2017-07-10 21:08:01 UTC
Memcheck "--stats=yes" results running on Firefox under amd64/Linux (Ubuntu):

v2:
716,447 (19,412,410 -> 331,710,658; ratio 17.1) [213,213 scs] avg tce size 462

v3 (patches 006):
685,657 (18,614,387 -> 318,876,497; ratio 17.1) [179,498 scs] avg tce size 465
Comment 23 Ivo Raisr 2017-07-10 21:09:40 UTC
Created attachment 106549 [details]
patch 006 - Valgrind part
Comment 24 Ivo Raisr 2017-07-10 21:10:03 UTC
Created attachment 106550 [details]
patch 006 - VEX part
Comment 25 Ivo Raisr 2017-07-10 21:12:00 UTC
Right now the only remaining TODOs from patches 006 are:

/* TODO-JIT: To solve more efficiently questions such as "Find me a free
   register of the given class" it would help if registers in the RRegUniverse
   were grouped by their class. */

/* TODO-JIT: Contrary to what the algorithm overview above
   says, we simplify it for the time being and simply spill
   the vreg. Regardless if it is used by the instruction or
   not. This is very suboptimal but it can be refined later.*/

/* --- TODO-JIT: Direct reload --- */
Comment 26 Ivo Raisr 2017-07-13 22:04:26 UTC
Created attachment 106595 [details]
patch 007 - Valgrind part
Comment 27 Ivo Raisr 2017-07-13 22:04:58 UTC
Created attachment 106596 [details]
patch 007 - VEX part
Comment 28 Ivo Raisr 2017-07-13 22:06:19 UTC
The attached patches 007 implement groups of real registers in RRegUniverse.
These groups make it efficient to answer questions such as "Find me a free register of the given class".
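A rough sketch of the grouping idea: the allocable part of the universe is laid out class by class, and per-class index ranges let the search scan only one slice. The allocable_start[]/allocable_end[] arrays mirror the real VEX ones, but everything else here is simplified for illustration.

```c
/* Register classes, in the order their groups appear in the universe. */
enum { HRcInt32, HRcInt64, HRcFlt64, HRcVec128, HRc_N };

typedef struct {
   int allocable_start[HRc_N];   /* first index for each class, or -1 */
   int allocable_end[HRc_N];     /* last index for each class, or -1  */
} Universe;

/* Find a free register of the given class by scanning only that
   class's slice of the universe, instead of all registers. */
static int find_free_of_class(const Universe* u, int cls,
                              const int* is_free)
{
   if (u->allocable_start[cls] == -1)
      return -1;                              /* class not present */
   for (int k = u->allocable_start[cls]; k <= u->allocable_end[cls]; k++)
      if (is_free[k])
         return k;
   return -1;                                 /* class slice exhausted */
}
```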
Comment 29 Ivo Raisr 2017-07-13 22:16:16 UTC
I tried to do a performance comparison of the reg_alloc v3 implementation as it progresses through the patches. However, the performance tests are not reliable: I could not get consistent results out of them on my amd64/Linux Ubuntu box. Variations of up to +-2% were observed for the same code in different runs, with no other CPU-intensive process running in parallel.

Patches description:
005: the initial register allocator v3 implementation which was able to run the complete regression test suite
005-J: 005 + Julian's enhancement to prefer callee saved registers
006: 005 + ranking of real registers based on their live ranges
007: 006 + groups of real registers in RRegUniverse
[vanilla]: baseline Valgrind


Therefore the only conclusions I can draw from these are:
- patches 005 produce worse behaviour than the existing v2;
- patches 005-J, 006 and 007 perform roughly equivalently to, or slightly better than, the existing register allocator v2;
- it cannot be deduced at this level whether there are any performance improvements among 005-J, 006, and 007 (there should be!)

Here are the results. Only speedups are printed with a new command line option "--terse".

$ perl ./perf/vg_perf --tools=none,memcheck,callgrind,helgrind,cachegrind,drd,massif \
    --terse --reps=10 --vg=./vanilla --vg=./ra3-005 --vg=./ra3-005-J --vg=./ra3-006 \
    --vg=./ra3-007 vanilla/perf
-- Running  tests in vanilla/perf --------------------------------------
-- bigcode1 --
vanilla   :  no: -----  me: -----  ca: -----  he: -----  ca: -----  dr: -----  ma: -----
ra3-005   :  no:  0.0%  me: -4.7%  ca:  0.2%  he: -1.1%  ca: -1.5%  dr: -0.6%  ma:  0.0%
ra3-005-J :  no:  0.0%  me:  0.4%  ca:  0.3%  he: -0.5%  ca: -0.9%  dr:  0.0%  ma:  0.6%
ra3-006   :  no:  0.0%  me:  0.4%  ca:  0.2%  he: -0.0%  ca: -1.2%  dr:  0.0%  ma:  0.0%
ra3-007   :  no:  0.0%  me:  0.7%  ca:  0.5%  he: -0.5%  ca: -0.6%  dr:  0.0%  ma:  1.2%
-- bigcode2 --
vanilla   :  no: -----  me: -----  ca: -----  he: -----  ca: -----  dr: -----  ma: -----
ra3-005   :  no: -0.7%  me: -2.1%  ca: -0.0%  he: -0.5%  ca: -1.3%  dr: -1.1%  ma:  0.9%
ra3-005-J :  no: -0.3%  me:  1.0%  ca:  0.1%  he: -1.6%  ca: -0.2%  dr: -0.3%  ma: -0.3%
ra3-006   :  no: -0.7%  me:  0.8%  ca: -0.2%  he: -0.5%  ca: -0.2%  dr: -0.5%  ma:  0.3%
ra3-007   :  no: -0.3%  me:  0.2%  ca:  0.1%  he: -0.5%  ca:  0.0%  dr:  0.0%  ma:  0.9%
-- bz2 --
vanilla   :  no: -----  me: -----  ca: -----  he: -----  ca: -----  dr: -----  ma: -----
ra3-005   :  no:  0.5%  me:-13.2%  ca:  0.0%  he: -6.0%  ca: -4.3%  dr: -5.4%  ma:  1.1%
ra3-005-J :  no:  1.1%  me: -0.2%  ca:  0.4%  he:  0.2%  ca: -2.0%  dr: -0.2%  ma:  2.1%
ra3-006   :  no:  0.5%  me:  0.0%  ca:  0.2%  he:  0.5%  ca: -0.8%  dr:  0.5%  ma:  1.6%
ra3-007   :  no:  0.5%  me:  0.0%  ca:  0.2%  he:  0.3%  ca: -1.0%  dr:  0.6%  ma:  1.1%
-- fbench --
vanilla   :  no: -----  me: -----  ca: -----  he: -----  ca: -----  dr: -----  ma: -----
ra3-005   :  no:  0.0%  me: -4.0%  ca: -0.4%  he: -1.9%  ca: -1.6%  dr: -3.5%  ma:  0.9%
ra3-005-J :  no:  0.9%  me:  0.0%  ca: -1.2%  he:  1.1%  ca: -0.3%  dr: -0.8%  ma:  0.9%
ra3-006   :  no:  0.0%  me:  0.3%  ca: -0.2%  he:  1.1%  ca:  0.3%  dr:  0.0%  ma:  1.8%
ra3-007   :  no:  0.0%  me: -0.3%  ca: -1.0%  he:  1.1%  ca: -0.3%  dr: -1.9%  ma:  0.9%
-- ffbench --
vanilla   :  no: -----  me: -----  ca: -----  he: -----  ca: -----  dr: -----  ma: -----
ra3-005   :  no: -0.9%  me: -2.1%  ca:  0.5%  he: -3.2%  ca: -5.0%  dr: -2.2%  ma:  0.0%
ra3-005-J :  no:  0.0%  me: -0.6%  ca:  0.5%  he:  0.2%  ca:  0.3%  dr: -0.9%  ma:  1.9%
ra3-006   :  no: -0.9%  me:  0.0%  ca:  0.0%  he: -0.4%  ca:  0.8%  dr:  0.0%  ma:  0.0%
ra3-007   :  no: -0.9%  me:  0.3%  ca:  0.0%  he:  0.4%  ca:  0.3%  dr: -0.6%  ma:  0.0%
-- heap --
vanilla   :  no: -----  me: -----  ca: -----  he: -----  ca: -----  dr: -----  ma: -----
ra3-005   :  no:  0.0%  me: -2.2%  ca:  0.9%  he: -1.9%  ca: -2.2%  dr: -2.4%  ma: -0.8%
ra3-005-J :  no: -2.6%  me: -0.2%  ca:  1.4%  he: -0.6%  ca:  0.0%  dr: -0.5%  ma: -1.2%
ra3-006   :  no: -2.6%  me: -0.6%  ca:  0.9%  he: -0.6%  ca:  0.0%  dr:  0.0%  ma:  0.4%
ra3-007   :  no: -1.3%  me: -0.6%  ca:  0.2%  he:  1.0%  ca:  0.0%  dr: -2.9%  ma:  2.9%
-- heap_pdb4 --
vanilla   :  no: -----  me: -----  ca: -----  he: -----  ca: -----  dr: -----  ma: -----
ra3-005   :  no:  1.2%  me: -2.4%  ca:  0.6%  he: -1.5%  ca: -3.0%  dr: -3.5%  ma: -0.4%
ra3-005-J :  no:  1.2%  me: -0.7%  ca:  0.0%  he: -2.1%  ca:  0.0%  dr: -0.2%  ma: -0.8%
ra3-006   :  no:  1.2%  me: -0.5%  ca:  0.0%  he: -2.5%  ca: -0.5%  dr: -0.2%  ma:  0.0%
ra3-007   :  no:  1.2%  me: -0.4%  ca:  0.3%  he: -0.4%  ca:  0.0%  dr: -3.3%  ma:  3.1%
-- many-loss-records --
vanilla   :  no: -----  me: -----  ca: -----  he: -----  ca: -----  dr: -----  ma: -----
ra3-005   :  no: -3.6%  me: -1.6%  ca: -1.0%  he: -0.9%  ca: -1.4%  dr: -1.9%  ma:  0.0%
ra3-005-J :  no: -7.1%  me: -0.8%  ca: -1.0%  he:  1.8%  ca:  0.0%  dr: -1.0%  ma:  1.5%
ra3-006   :  no: -3.6%  me:  0.8%  ca: -1.0%  he:  0.0%  ca:  0.0%  dr: -1.0%  ma:  0.0%
ra3-007   :  no: -7.1%  me:  0.8%  ca: -1.0%  he: -0.9%  ca:  0.0%  dr: -1.0%  ma: -1.5%
-- many-xpts --
vanilla   :  no: -----  me: -----  ca: -----  he: -----  ca: -----  dr: -----  ma: -----
ra3-005   :  no:  0.0%  me: -4.7%  ca:  0.4%  he: -3.6%  ca: -4.2%  dr: -0.7%  ma:  0.0%
ra3-005-J :  no:  0.0%  me:  0.0%  ca:  0.0%  he: -0.5%  ca:  0.0%  dr:  1.5%  ma:  0.7%
ra3-006   :  no:  0.0%  me:  0.7%  ca:  0.4%  he:  0.5%  ca:  1.0%  dr:  2.2%  ma:  0.0%
ra3-007   :  no: -2.6%  me:  0.0%  ca:  0.0%  he: -0.5%  ca:  2.1%  dr:  2.9%  ma:  0.7%
-- memrw --
vanilla   :  no: -----  me: -----  ca: -----  he: -----  ca: -----  dr: -----  ma: -----
ra3-005   :  no:  0.0%  me: -7.6%  ca:  0.8%  he: -1.2%  ca: -1.6%  dr: -5.0%  ma: -1.9%
ra3-005-J :  no:  0.0%  me:  1.7%  ca:  0.8%  he:  0.5%  ca:  0.0%  dr: -1.7%  ma: -3.8%
ra3-006   :  no:  0.0%  me:  3.4%  ca:  1.3%  he:  0.0%  ca:  1.0%  dr:  0.0%  ma: -1.9%
ra3-007   :  no:  0.0%  me:  0.8%  ca:  0.8%  he:  0.2%  ca:  0.5%  dr: -2.5%  ma:  0.0%
-- sarp --
vanilla   :  no: -----  me: -----  ca: -----  he: -----  ca: -----  dr: -----  ma: -----
ra3-005   :  no:  0.0%  me: -2.8%  ca: -0.6%  he: -0.9%  ca:  0.0%  dr: -6.4%  ma:  0.0%
ra3-005-J :  no: -3.0%  me: -0.5%  ca: -1.1%  he:  0.4%  ca: -1.1%  dr:  0.0%  ma:  0.0%
ra3-006   :  no:  3.0%  me: -0.5%  ca:  0.0%  he:  0.6%  ca:  2.2%  dr:  1.1%  ma: -2.9%
ra3-007   :  no:  0.0%  me:  6.4%  ca:  2.3%  he:  1.3%  ca:  1.1%  dr:  0.0%  ma: -2.9%
-- tinycc --
vanilla   :  no: -----  me: -----  ca: -----  he: -----  ca: -----  dr: -----  ma: -----
ra3-005   :  no: -0.9%  me: -4.1%  ca:  0.3%  he: -5.3%  ca: -2.1%  dr: -5.6%  ma:  0.0%
ra3-005-J :  no:  0.9%  me:  0.7%  ca:  0.5%  he: -0.3%  ca: -0.8%  dr: -0.2%  ma:  0.7%
ra3-006   :  no:  0.9%  me:  0.8%  ca:  0.6%  he: -0.4%  ca: -1.0%  dr: -0.5%  ma:  0.0%
ra3-007   :  no: -0.9%  me:  0.6%  ca:  0.5%  he: -0.2%  ca: -0.3%  dr:  0.2%  ma:  0.7%
-- Finished tests in vanilla/perf --------------------------------------

== 12 programs, 420 timings =================
Comment 30 Ivo Raisr 2017-07-17 21:19:42 UTC
Created attachment 106687 [details]
patch 008 - Valgrind part
Comment 31 Ivo Raisr 2017-07-17 21:20:01 UTC
Created attachment 106688 [details]
patch 008 - VEX part
Comment 32 Ivo Raisr 2017-07-17 21:21:03 UTC
Patches 008 implement the direct reload optimization.

Now the only remaining TODO is the following one:
                  /* TODO-JIT: Contrary to what the algorithm overview above
                     says, we simplify it for the time being and simply spill
                     the vreg. Regardless if it is used by the instruction or
                     not. This is very suboptimal but it can be refined later.*/


I will address it in the coming days.
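For context, a minimal sketch of the decision the direct reload optimisation makes: if a spilled vreg is only read by an instruction that can take a memory operand, the instruction is rewritten to read straight from the spill slot instead of emitting a separate reload. The struct and predicate here are illustrative; the real code calls a per-architecture hook (e.g. directReload on amd64).

```c
/* Illustrative summary of when an instruction qualifies for direct
   reload of a spilled vreg it refers to. */
typedef struct {
   int reads_vreg;       /* instruction reads the spilled vreg       */
   int writes_vreg;      /* instruction also writes it               */
   int can_take_memory;  /* host insn has a memory-operand form      */
} InsnInfo;

static int can_direct_reload(const InsnInfo* i)
{
   /* Only a pure read can be folded into a memory operand. */
   return i->reads_vreg && !i->writes_vreg && i->can_take_memory;
}
```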
Comment 33 Ivo Raisr 2017-07-23 17:23:12 UTC
Created attachment 106801 [details]
patch 009 - Valgrind part
Comment 34 Ivo Raisr 2017-07-23 17:23:37 UTC
Created attachment 106802 [details]
patch 009 - VEX part
Comment 35 Ivo Raisr 2017-07-23 17:25:08 UTC
The final set of patches implements the new VEX register allocator version 3
for the AMD64 and X86 architectures.
All regression tests pass successfully, and Firefox runs ok.
Comment 36 Ivo Raisr 2017-07-23 18:35:48 UTC
Some performance comparison...

Number of instructions executed either in function
doRegisterAllocation() [Valgrind baseline, v2] or
doRegisterAllocation_v3 [new VEX register allocator, v3]:

--- memcheck on /bin/date ---
v2: 123,878,741
v3: 93,544,940

--- memcheck on perf/bigcode ---
v2: 871,980,968
v3: 697,984,555

These numbers are obtained by running outer callgrind on inner memcheck as per README_DEVELOPERS, with "--toggle-collect=doRegisterAllocation --toggle-collect=doRegisterAllocation_v3", and then running callgrind_annotate on the callgrind output files.



Also, the code produced by v3 is comparable to, or only marginally worse than,
that produced by v2.
./vg-in-place --stats=yes perf/bigcode
v2: transtab: new 30,033 (318,068 -> 9,133,920; ratio 28.7)
v3: transtab: new 30,028 (317,855 -> 9,133,573; ratio 28.7)

./vg-in-place --stats=yes date
v2: transtab: new 2,850 (61,825 -> 1,004,273; ratio 16.2)
v3: transtab: new 2,845 (61,612 -> 1,006,286; ratio 16.3)
Comment 37 Ivo Raisr 2017-07-27 00:51:09 UTC
Created attachment 106876 [details]
patch 010 - Valgrind part
Comment 38 Ivo Raisr 2017-07-27 00:51:35 UTC
Created attachment 106877 [details]
patch 010 - VEX part
Comment 39 Ivo Raisr 2017-07-27 00:52:42 UTC
Patches 010 implement the changes necessary to support register allocator v3 on all architectures supported by VEX.
Comment 40 Christian Borntraeger 2017-07-27 10:48:25 UTC
Regression-tested on SLES12 on s390x.
The regression test suite sees no change; make perf seems to show small improvements.
Comment 41 Petar Jovanovic 2017-07-31 13:15:20 UTC
With the changes (patch 010 Valgrind/VEX) applied, Memcheck is crashing on MIPS32/MIPS64.

$ ./vg-in-place ls
==27412== Memcheck, a memory error detector
==27412== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==27412== Using Valgrind-3.14.0.SVN and LibVEX; rerun with -h for copyright info
==27412== Command: ls
==27412==

vex: priv/host_generic_reg_alloc3.c:374 (find_free_rreg): Assertion `0' failed.
vex storage: T total 2656776 bytes allocated
vex storage: P total 0 bytes allocated

valgrind: the 'impossible' happened:
   LibVEX called failure_exit().

host stacktrace:
==27412==    at 0x580829C0: show_sched_status_wrk (m_libcassert.c:355)
==27412==    by 0x58082B70: report_and_quit (m_libcassert.c:426)
==27412==    by 0x58082DD4: panic (m_libcassert.c:502)
==27412==    by 0x58082DD4: vgPlain_core_panic_at (m_libcassert.c:507)
==27412==    by 0x58082E18: vgPlain_core_panic (m_libcassert.c:512)
==27412==    by 0x580A5268: failure_exit (m_translate.c:740)
==27412==    by 0x5817C720: vex_assert_fail (main_util.c:247)
==27412==    by 0x580378D0: find_free_rreg.isra.2.constprop.8 (host_generic_reg_alloc3.c:374)
==27412==    by 0x581BF13C: doRegisterAllocation (host_generic_reg_alloc3.c:917)
==27412==    by 0x5817B0A4: libvex_BackEnd (main_main.c:1086)
==27412==    by 0x5817B0A4: LibVEX_Translate (main_main.c:1185)
==27412==    by 0x580A7D04: vgPlain_translate (m_translate.c:1794)
==27412==    by 0x580EB9F4: handle_chain_me (scheduler.c:1084)
==27412==    by 0x580EDBA4: vgPlain_scheduler (scheduler.c:1428)
==27412==    by 0x581026F4: thread_wrapper (syswrap-linux.c:103)
==27412==    by 0x581026F4: run_a_thread_NORETURN (syswrap-linux.c:156)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable (lwpid 27412)
==27412==    at 0x400617C: _dl_start_final (rtld.c:281)
==27412==    by 0x4006528: _dl_start (rtld.c:415)
==27412==    by 0x40023CC: __start (in /lib/mips64el-linux-gnuabi64/ld-2.24.so)


Note: see also the FAQ in the source distribution.
It contains workarounds to several common problems.
In particular, if Valgrind aborted or crashed after
identifying problems in your program, there's a good chance
that fixing those problems will prevent Valgrind aborting or
crashing, especially if it happened in m_mallocfree.c.

If that doesn't help, please report this bug to: www.valgrind.org

In the bug report, send all the above text, the valgrind
version, and what OS and version you are using.  Thanks.
Comment 42 Ivo Raisr 2017-08-01 07:39:23 UTC
Created attachment 107004 [details]
patch 010 - VEX part + some debugging for MIPS crashes

This patch adds some debugging statements in find_free_rreg(), where the MIPS arch was hitting the assertion at priv/host_generic_reg_alloc3.c:374.
Comment 43 Ivo Raisr 2017-08-01 07:40:41 UTC
(In reply to Petar Jovanovic from comment #41)
> With the changes (patch 010 Valgrind/VEX) applied, Memcheck is crashing on
> MIPS32/MIPS64.

Petar, thank you for testing the patches.
Please could you test again with the latest patch, called
"patch 010 - VEX part + some debugging for MIPS crashes"?
It should print a bunch of debugging statements, from which I hope to see what went wrong.
Comment 44 Julian Seward 2017-08-01 11:37:36 UTC
010 works OK on arm64-linux.  On arm(32)-linux, it fails at startup
while verifying the sanity of the register set, due to a small error in the
definition of getRRegUniverse_ARM: the first allocable_end[] is
indexed at HRcInt64 rather than HRcInt32.  After fixing that, it
works fine.
Comment 45 Ivo Raisr 2017-08-01 13:14:59 UTC
Created attachment 107008 [details]
patch 011 - Valgrind part
Comment 46 Ivo Raisr 2017-08-01 13:15:23 UTC
Created attachment 107009 [details]
patch 011 - VEX part
Comment 47 Ivo Raisr 2017-08-01 13:17:38 UTC
Patches 011 (re-)introduce the command line option --vex-regalloc-version=2|3, which allows switching between the register allocator implementations. Note that version 3 (the new implementation) is now the default.

They also fix license text in host_generic_reg_alloc3.c.

Debugging output around host_generic_reg_alloc3.c:374 is still present, to ease root-causing the assertion hit on MIPS.
Comment 48 Petar Jovanovic 2017-08-01 13:27:24 UTC
Created attachment 107010 [details]
log of MIPS execution
Comment 49 Petar Jovanovic 2017-08-01 13:28:58 UTC
(In reply to Ivo Raisr from comment #43)
> Petar, thank you for testing the patches.
> Please could you test again with the latest patch called
> "patch 010 - VEX part + some debugging for MIPS crashes".
> It should print a bunch of debugging statements and then I hope to see what
> went wrong there.

Check the log in comment #48.
Comment 50 Ivo Raisr 2017-08-01 13:37:28 UTC
Petar,
Thank you for the debugging output. Unfortunately it is not sufficient
to determine what went wrong.

Could you please set the following #defines at the beginning of host_generic_reg_alloc3.c to 1:

#define DEBUG_REGALLOC 1
#define SANITY_CHECKS_EVERY_INSTR 1

then recompile and attach the output again (it will be large).


Thank you!
Comment 51 Petar Jovanovic 2017-08-01 15:00:52 UTC
Created attachment 107011 [details]
log of MIPS execution (large)

Here is the log, compressed.
Comment 52 Ivo Raisr 2017-08-01 16:23:27 UTC
Petar, thank you for the detailed output. Now the root cause is clear and I need to fix slightly the algorithm in find_free_rreg().
I'll prepare a new set of patches...
Comment 53 Ivo Raisr 2017-08-02 10:45:53 UTC
Created attachment 107036 [details]
patch 012 - Valgrind part
Comment 54 Ivo Raisr 2017-08-02 10:46:12 UTC
Created attachment 107037 [details]
patch 012 - VEX part
Comment 55 Ivo Raisr 2017-08-02 10:47:30 UTC
Patches 012 should fix the assertion hit on the MIPS platform in the function find_free_rreg().

Petar, could you please test?
Comment 56 Petar Jovanovic 2017-08-02 13:49:04 UTC
The situation is better, yet not ideal. The majority of programs pass, but some crash.

E.g.

~/radni/valgrind/trunk$ ./vg-in-place -q --track-origins=yes --vex-regalloc-version=3 ./memcheck/tests/atomic_incs

vex: priv/host_generic_reg_alloc3.c:250 (spill_vreg): Assertion `vreg_state[v_idx].dead_before > (Short) current_ii' failed.
vex storage: T total 55230600 bytes allocated
vex storage: P total 0 bytes allocated

valgrind: the 'impossible' happened:
   LibVEX called failure_exit().

host stacktrace:
==19301==    at 0x58082A60: show_sched_status_wrk (m_libcassert.c:355)
==19301==    by 0x58082C10: report_and_quit (m_libcassert.c:426)
==19301==    by 0x58082E74: panic (m_libcassert.c:502)
==19301==    by 0x58082E74: vgPlain_core_panic_at (m_libcassert.c:507)
==19301==    by 0x58082EB8: vgPlain_core_panic (m_libcassert.c:512)
==19301==    by 0x580A53B8: failure_exit (m_translate.c:740)
==19301==    by 0x5817C9C0: vex_assert_fail (main_util.c:247)
==19301==    by 0x58037790: spill_vreg (host_generic_reg_alloc3.c:250)
==19301==    by 0x581C1ACC: doRegisterAllocation_v3 (host_generic_reg_alloc3.c:1048)
==19301==    by 0x5817B310: libvex_BackEnd (main_main.c:1095)
==19301==    by 0x5817B310: LibVEX_Translate (main_main.c:1198)
==19301==    by 0x580A7E54: vgPlain_translate (m_translate.c:1794)
==19301==    by 0x580EBBF4: handle_chain_me (scheduler.c:1084)
==19301==    by 0x580EDDA4: vgPlain_scheduler (scheduler.c:1428)
==19301==    by 0x581028F4: thread_wrapper (syswrap-linux.c:103)
==19301==    by 0x581028F4: run_a_thread_NORETURN (syswrap-linux.c:156)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable (lwpid 19301)
==19301==    at 0x4017228: _dl_determine_tlsoffset (dl-tls.c:239)
==19301==    by 0x4002268: init_tls (rtld.c:615)
==19301==    by 0x400492C: dl_main (rtld.c:1685)
==19301==    by 0x401C240: _dl_sysdep_start (dl-sysdep.c:249)
==19301==    by 0x4006220: _dl_start_final (rtld.c:307)
==19301==    by 0x4006528: _dl_start (rtld.c:415)
==19301==    by 0x40023CC: __start (in /lib/mips64el-linux-gnuabi64/ld-2.24.so)
Comment 57 Ivo Raisr 2017-08-02 14:11:19 UTC
Petar, thank you again for testing patches 012!

Could you please set the following #defines at the beginning of host_generic_reg_alloc3.c to 1:

#define DEBUG_REGALLOC 1
#define SANITY_CHECKS_EVERY_INSTR 1

then recompile and attach the output again (it will be large).
Comment 58 Petar Jovanovic 2017-08-03 11:50:00 UTC
Ivo, here is a new log.
https://transfer.pcloud.com/download.html?code=5ZbBWcZMexCPcqqmR4ZFfeRZT410zHsBPVS9h74eABPRb4ByVQIX

I could not attach it as it is too big, even after compression.
Comment 59 Ivo Raisr 2017-08-03 21:12:01 UTC
Thank you, Petar. I was able to download the file and decipher it. The root cause is now known, and I have to find an elegant way to fix it.
Comment 60 Ivo Raisr 2017-08-04 10:52:24 UTC
Created attachment 107068 [details]
patch 013 - Valgrind part
Comment 61 Ivo Raisr 2017-08-04 10:52:47 UTC
Created attachment 107069 [details]
patch 013 - VEX part
Comment 62 Ivo Raisr 2017-08-04 10:56:41 UTC
Patches 013 should fix the assertion hit on the MIPS platform
when doing MOV coalescing and the destination vreg is becoming dead.

Petar, could you please test?

Also, I think this scenario happened because the MIPS isel backend generated dead code. You can discover more if you run the program with patches 012 and options such as --trace-flags=11111110 --trace-notbelow=0 and let the program crash as before. Then please attach the debug output of only the last SB.
Comment 63 Petar Jovanovic 2017-08-04 21:42:52 UTC
(In reply to Ivo Raisr from comment #62)
> Petar, please could you test please?
> 
The latest patch (lucky 13) seems to work fine on one MIPS variant I
have tested. Thanks for the extra work.
I will test other MIPS variants early next week.

> Also I think that this scenario happened because MIPS isel backend generated
> dead code. You can discover more if you run the program with patches 012 and
> options such as --trace-flags=11111110 --trace-notbelow=0 and let the
> program crash as before. Then please attach debug output of only the last SB.

Thanks. Will look into it.
Comment 64 Rhys Kidd 2017-08-06 20:27:33 UTC
Compile and regression suite tested on AMD64/Darwin (macOS 10.11, clang-800.0.42.1). No new regressions seen with v13 of the patches.

Looks good for the macOS platform.
Comment 65 Ivo Raisr 2017-08-10 07:27:19 UTC
(In reply to Petar Jovanovic from comment #63)

> The latest patch (lucky 13) seems to work fine on one MIPS variant I
> have tested. Thanks for the extra work.
> I will test other MIPS variants early next week.

Petar, thank you for the testing so far. Have you had a chance to test other MIPS variants?
Comment 66 Ivo Raisr 2017-08-10 07:35:15 UTC
Provided no other problems arise, I would like to land the new VEX register allocator v3 after the SVN->GIT migration is completed and the dust settles, that is, in 2-3 weeks from now.

The new version v3 will be the default; it will be possible to switch to the old version v2 with command line option:
--vex-regalloc-version=2

Please shout now if you would like me to address your concerns about regalloc v3.
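For reference, switching allocators would look like the following. The flag itself is taken from this thread; the program names are placeholders, and the in-tree form mirrors the invocation used in comment 56:

```shell
# Fall back to the old v2 register allocator (v3 is the default)
valgrind --tool=memcheck --vex-regalloc-version=2 ./myprog

# In-tree testing, as used earlier in this thread
./vg-in-place -q --vex-regalloc-version=2 ./memcheck/tests/atomic_incs
```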
Comment 67 Petar Jovanovic 2017-08-10 16:54:28 UTC
(In reply to Ivo Raisr from comment #65)
> (In reply to Petar Jovanovic from comment #63)
> 
> > The latest patch (lucky 13) seems to work fine on one MIPS variant I
> > have tested. Thanks for the extra work.
> > I will test other MIPS variants early next week.
> 
> Petar, thank you for the testing so far. Have you had a chance to test other
> MIPS variants?

Yes, I can confirm now that v13 has been tested on other MIPS variants and it is good to go.

(In reply to Ivo Raisr from comment #62)
> Also I think that this scenario happened because MIPS isel backend generated
> dead code. You can discover more if you run the program with patches 012 and
> options such as --trace-flags=11111110 --trace-notbelow=0 and let the
> program crash as before. Then please attach debug output of only the last SB.


This seems to be related to the way the MIPS port uses and handles DivModU128to64 and some other Iops.
The issue comes from tricks done due to the lack of appropriate Iops, which I am now considering adding.
These would be:

      Iop_DivModU64to64, // :: I64,I64 -> I128
                         // of which lo half is div and hi half is mod
      Iop_DivModS32to32, // :: I32,I32 -> I64
                         // of which lo half is div and hi half is mod
      Iop_DivModU32to32, // :: I32,I32 -> I64
                         // of which lo half is div and hi half is mod

Does anyone object, or is there a need to create a separate issue for this?
Comment 68 Ivo Raisr 2017-08-11 08:04:25 UTC
(In reply to Petar Jovanovic from comment #67)

> This seems to be related to the way how MIPS port uses and handles
> DivModU128to64 and some other Iops.
> The issue comes from tricks done due to the lack of appropriate Iops that I
> am considering to add now.
> These would be:
> 
>       Iop_DivModU64to64, // :: I64,I64 -> I128
>                          // of which lo half is div and hi half is mod
>       Iop_DivModS32to32, // :: I32,I32 -> I64
>                          // of which lo half is div and hi half is mod
>       Iop_DivModU32to32, // :: I32,I32 -> I64
>                          // of which lo half is div and hi half is mod
> 
> Does anyone object or is there a need to create a separate issue for it?

If I understand your comment correctly, you are saying that the MIPS front end misuses DivModU128to64 for something it is not intended for.
Then please go ahead and fix that, but please do not associate it in any way with this bug,
as these are independent.
Comment 69 Petar Jovanovic 2017-08-11 12:43:16 UTC
(In reply to Ivo Raisr from comment #68)
> If I understand your comment correctly, you are saying that MIPS front end
> misuses DivModU128to64 for something which is not intended.

Yes, instead of using the (nonexistent) Iop_DivModU64to64, it uses DivModU128to64, but the backend implements it as if it were Iop_DivModU64to64.
It will be fixed.

> Then please go ahead and fix that but please do not associate it in any way
> with this bug
> as these are independent.

Don't worry. :)
I just put the explanation of the dead code issue here.
Comment 70 Ivo Raisr 2017-08-25 22:30:14 UTC
Created attachment 107524 [details]
patch 014 (Valgrind + VEX)

Patch 014 combines the Valgrind and VEX parts together, as this fits with the new git SCM.
Changes compared to 013:
- trailing whitespace removed
- some fields removed from RRegState and moved to the newly introduced structure RRegLRState; this decoupling helps to differentiate the RRegState and RRegLRState concepts, and will also help when introducing If-Then-Else support later
Comment 71 Ivo Raisr 2017-08-25 22:31:05 UTC
I am going to integrate patch 014 next week unless I hear otherwise.
Comment 72 Ivo Raisr 2017-08-28 10:41:24 UTC
Pushed as changeset efa1e5ef8d257e3b20facf6f04350d29578ae9e4.
Comment 74 Julian Seward 2018-07-24 12:41:40 UTC
Ivo, just to say, thank you for doing all this reg-alloc work.
I know it was a lot of work.  The results are good!
Comment 75 Ivo Raisr 2018-07-24 12:58:07 UTC
And thank you, Julian, for your guidance.
Unfortunately I won't be able to finish the VEX enhancement/rework residing in (I think) the vex-jit-hacks-2 branch. Perhaps some brave soul will be able to continue...