Bug 384526 - reduce number of spill instructions generated by VEX register allocator v3
Summary: reduce number of spill instructions generated by VEX register allocator v3
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: vex (show other bugs)
Version: 3.14 SVN
Platform: Compiled Sources Linux
: NOR normal
Target Milestone: ---
Assignee: Ivo Raisr
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-09-09 15:32 UTC by Ivo Raisr
Modified: 2017-09-10 09:39 UTC (History)
1 user (show)

See Also:
Latest Commit:
Version Fixed In:


Attachments
proposed patch (11.14 KB, patch)
2017-09-10 06:30 UTC, Ivo Raisr
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Ivo Raisr 2017-09-09 15:32:09 UTC
VEX register allocator v2 has a clever optimization which reduces a number of spill instructions generated.

Let's have the following vcode:
...
 97: orq %vR250,%vR570
 98: movslq %vR570d,%vR569
 99: movq %vR329,%vR571
100: orq %vR569,%vR571
101: callnz[0,RLPri_None]

and the following vreg ranges:
%vR250: [ 12, 214)
%vR329: [ 72, 162)
%vR569: [ 98, 101)
%vR570: [ 95,  99)
%vR571: [ 99, 101)

Now reg. alloc. v2 produces the following rcode:
 82   orq %rbx,%r8
 83   movslq %r8d,%rdi
 84   movq 0xB28(%rbp),%r8
 85   movq %r8,%rsi
 86   orq %rdi,%rsi
 87   movq %r9,0xAE0(%rbp)
 88   callnz[0,RLPri_None] 0x5805E8E0

However current regalloc v3 produces more rcode:
 82   orq %rbx,%r8
 83   movslq %r8d,%rdi
 84   movq 0xB28(%rbp),%r8
 85   movq %r8,%rsi
 86   orq %rdi,%rsi
 87   movq %r9,0xAE0(%rbp)
 88   movq %r8,0xB28(%rbp)
 89   callnz[0,RLPri_None] 0x5805E8E0

It can be seen that spilling %r8 (instruction #88) is useless because its value is still equal to that one in the spill slot.
Comment 1 Ivo Raisr 2017-09-10 06:30:29 UTC
Created attachment 107778 [details]
proposed patch
Comment 2 Ivo Raisr 2017-09-10 06:41:40 UTC
We can see the following performance improvement:

running Memcheck on perf/bz2:
baseline: 45,132 M instructions total; 161 M instructions doRegisterAllocation_v3
patched:  45,107 M instructions total; 168 M instructions doRegisterAllocation_v3

running Memcheck on /bin/true:
baseline: 3,511 M instructions total; 59 M instructions doRegisterAllocation_v3
patched:  3,508 M instructions total; 61 M instructions doRegisterAllocation_v3

So it can be seen that even when running the simplest program (/bin/true), small additional cost of register allocation is compensated by much higher gain in terms of total instruction cost.
Comment 3 Ivo Raisr 2017-09-10 09:39:00 UTC
Fixed by changeset 3117cd9637a843cbab5de302fb30e22153fbfc1c:

https://sourceware.org/git/?p=valgrind.git;a=commit;h=3117cd9637a843cbab5de302fb30e22153fbfc1c