Bug 384526

Summary: reduce number of spill instructions generated by VEX register allocator v3
Product: [Developer tools] valgrind Reporter: Ivo Raisr <ivosh>
Component: vexAssignee: Ivo Raisr <ivosh>
Status: RESOLVED FIXED    
Severity: normal CC: ivosh
Priority: NOR    
Version: 3.14 SVN   
Target Milestone: ---   
Platform: Compiled Sources   
OS: Linux   
See Also: https://bugs.kde.org/show_bug.cgi?id=384337
Latest Commit: Version Fixed In:
Attachments: proposed patch

Description Ivo Raisr 2017-09-09 15:32:09 UTC
VEX register allocator v2 has a clever optimization which reduces a number of spill instructions generated.

Let's have the following vcode:
...
 97: orq %vR250,%vR570
 98: movslq %vR570d,%vR569
 99: movq %vR329,%vR571
100: orq %vR569,%vR571
101: callnz[0,RLPri_None]

and the following vreg ranges:
%vR250: [ 12, 214)
%vR329: [ 72, 162)
%vR569: [ 98, 101)
%vR570: [ 95,  99)
%vR571: [ 99, 101)

Now reg. alloc. v2 produces the following rcode:
 82   orq %rbx,%r8
 83   movslq %r8d,%rdi
 84   movq 0xB28(%rbp),%r8
 85   movq %r8,%rsi
 86   orq %rdi,%rsi
 87   movq %r9,0xAE0(%rbp)
 88   callnz[0,RLPri_None] 0x5805E8E0

However current regalloc v3 produces more rcode:
 82   orq %rbx,%r8
 83   movslq %r8d,%rdi
 84   movq 0xB28(%rbp),%r8
 85   movq %r8,%rsi
 86   orq %rdi,%rsi
 87   movq %r9,0xAE0(%rbp)
 88   movq %r8,0xB28(%rbp)
 89   callnz[0,RLPri_None] 0x5805E8E0

It can be seen that spilling %r8 (instruction #88) is useless because its value is still equal to that one in the spill slot.
Comment 1 Ivo Raisr 2017-09-10 06:30:29 UTC
Created attachment 107778 [details]
proposed patch
Comment 2 Ivo Raisr 2017-09-10 06:41:40 UTC
We can see the following performance improvement:

running Memcheck on perf/bz2:
baseline: 45,132 M instructions total; 161 M instructions doRegisterAllocation_v3
patched:  45,107 M instructions total; 168 M instructions doRegisterAllocation_v3

running Memcheck on /bin/true:
baseline: 3,511 M instructions total; 59 M instructions doRegisterAllocation_v3
patched:  3,508 M instructions total; 61 M instructions doRegisterAllocation_v3

So it can be seen that even when running the simplest program (/bin/true), small additional cost of register allocation is compensated by much higher gain in terms of total instruction cost.
Comment 3 Ivo Raisr 2017-09-10 09:39:00 UTC
Fixed by changeset 3117cd9637a843cbab5de302fb30e22153fbfc1c:

https://sourceware.org/git/?p=valgrind.git;a=commit;h=3117cd9637a843cbab5de302fb30e22153fbfc1c