384526 – reduce number of spill instructions generated by VEX register allocator v3

Bug 384526 - reduce number of spill instructions generated by VEX register allocator v3

Summary: reduce number of spill instructions generated by VEX register allocator v3

Status:	RESOLVED FIXED

Alias:	None

Product:	valgrind
Classification:	Developer tools
Component:	vex (show other bugs)
Version:	3.14 SVN
Platform:	Compiled Sources Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	Ivo Raisr

URL:
Keywords:

Depends on:
Blocks:

Reported:	2017-09-09 15:32 UTC by Ivo Raisr
Modified:	2017-09-10 09:39 UTC (History)
CC List:	1 user (show)

See Also:	384337
Latest Commit:
Version Fixed In:

Attachments
proposed patch (11.14 KB, patch) 2017-09-10 06:30 UTC, Ivo Raisr	Details
View All Add an attachment

Note You need to log in before you can comment on or make changes to this bug.

Description Ivo Raisr 2017-09-09 15:32:09 UTC

VEX register allocator v2 has a clever optimization which reduces a number of spill instructions generated.

Let's have the following vcode:
...
 97: orq %vR250,%vR570
 98: movslq %vR570d,%vR569
 99: movq %vR329,%vR571
100: orq %vR569,%vR571
101: callnz[0,RLPri_None]

and the following vreg ranges:
%vR250: [ 12, 214)
%vR329: [ 72, 162)
%vR569: [ 98, 101)
%vR570: [ 95,  99)
%vR571: [ 99, 101)

Now reg. alloc. v2 produces the following rcode:
 82   orq %rbx,%r8
 83   movslq %r8d,%rdi
 84   movq 0xB28(%rbp),%r8
 85   movq %r8,%rsi
 86   orq %rdi,%rsi
 87   movq %r9,0xAE0(%rbp)
 88   callnz[0,RLPri_None] 0x5805E8E0

However current regalloc v3 produces more rcode:
 82   orq %rbx,%r8
 83   movslq %r8d,%rdi
 84   movq 0xB28(%rbp),%r8
 85   movq %r8,%rsi
 86   orq %rdi,%rsi
 87   movq %r9,0xAE0(%rbp)
 88   movq %r8,0xB28(%rbp)
 89   callnz[0,RLPri_None] 0x5805E8E0

It can be seen that spilling %r8 (instruction #88) is useless because its value is still equal to that one in the spill slot.

Comment 1 Ivo Raisr 2017-09-10 06:30:29 UTC

Created attachment 107778 [details]
proposed patch

Comment 2 Ivo Raisr 2017-09-10 06:41:40 UTC

We can see the following performance improvement:

running Memcheck on perf/bz2:
baseline: 45,132 M instructions total; 161 M instructions doRegisterAllocation_v3
patched:  45,107 M instructions total; 168 M instructions doRegisterAllocation_v3

running Memcheck on /bin/true:
baseline: 3,511 M instructions total; 59 M instructions doRegisterAllocation_v3
patched:  3,508 M instructions total; 61 M instructions doRegisterAllocation_v3

So it can be seen that even when running the simplest program (/bin/true), small additional cost of register allocation is compensated by much higher gain in terms of total instruction cost.

Comment 3 Ivo Raisr 2017-09-10 09:39:00 UTC

Fixed by changeset 3117cd9637a843cbab5de302fb30e22153fbfc1c:

https://sourceware.org/git/?p=valgrind.git;a=commit;h=3117cd9637a843cbab5de302fb30e22153fbfc1c