Bug 384987 - VEX register allocator: allocate caller-save registers for short lived vregs
Summary: VEX register allocator: allocate caller-save registers for short lived vregs
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: vex
Version: 3.14 SVN
Platform: Compiled Sources Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Ivo Raisr
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-09-23 07:45 UTC by Ivo Raisr
Modified: 2017-10-11 18:58 UTC (History)
3 users

See Also:
Latest Commit:
Version Fixed In:


Attachments
refactor MOV coalescing (25.99 KB, patch)
2017-09-23 07:45 UTC, Ivo Raisr
Details
register allocation (4.68 KB, patch)
2017-09-23 07:48 UTC, Ivo Raisr
Details
refactor MOV coalescing (53.99 KB, patch)
2017-10-03 07:21 UTC, Ivo Raisr
Details
register allocation (4.69 KB, patch)
2017-10-03 07:21 UTC, Ivo Raisr
Details
mips report (7.59 KB, application/zip)
2017-10-03 13:12 UTC, aleksandra
Details
mips report (17.64 KB, text/plain)
2017-10-04 11:45 UTC, aleksandra
Details
refactor MOV coalescing (52.85 KB, patch)
2017-10-04 20:05 UTC, Ivo Raisr
Details
register allocation (4.76 KB, patch)
2017-10-04 20:05 UTC, Ivo Raisr
Details

Description Ivo Raisr 2017-09-23 07:45:10 UTC
This is a performance improvement based on Peter Bergner's suggestion:
Allocate caller-save registers for short-lived vregs, and callee-save registers for vregs whose live ranges span across helper calls.

The idea is quite simple, but MOV coalescing got in the way.
The first patch enhances MOV coalescing so that it keeps better track of which registers form a coalescing chain. The second patch actually changes find_free_rreg().

With only minor additional complexity, we get more compact generated code.
This not only improves performance; it should also reduce i-cache misses.
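The allocation preference described above can be sketched in C as follows. This is a simplified illustration of the idea only, not VEX's actual find_free_rreg() implementation; all type and function names here are invented for the example:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch: a vreg whose live range does not span a helper
   call can take a caller-save rreg, while one that does span a call
   should prefer a callee-save rreg, which survives the call. */
typedef struct { bool is_callee_save; bool is_free; } RReg;

static int pick_rreg(const RReg *rregs, int n_rregs, bool spans_helper_call)
{
   /* First pass: look for a free register of the preferred class. */
   for (int i = 0; i < n_rregs; i++) {
      if (rregs[i].is_free && rregs[i].is_callee_save == spans_helper_call)
         return i;
   }
   /* Second pass: fall back to any free register. */
   for (int i = 0; i < n_rregs; i++) {
      if (rregs[i].is_free)
         return i;
   }
   return -1; /* none free: the caller must spill */
}
```

The two-pass structure keeps the change minimal: the preferred class is tried first, and the allocator only falls back to the other class when nothing in the preferred class is free.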

Measuring instruction count, running inner Memcheck:

perf/bz2:
vanilla:       45,112,349,784 total; reg alloc 165,978,807; ratio 15.5 (97,397 -> 1,506,163)
coalesce only: 45,120,691,777 total; reg alloc 167,035,575; ratio 15.5 (97,610 -> 1,509,396)
both patches:  44,943,765,809 total; reg alloc 167,403,237; ratio 15.3 (97,607 -> 1,493,722)

/bin/true:
vanilla:       3,514,906,884 total; reg alloc 60,779,126
coalesce only: 3,506,029,832 total; reg alloc 61,242,900
both patches:  3,503,495,437 total; reg alloc 61,395,160
Comment 1 Ivo Raisr 2017-09-23 07:45:38 UTC
Created attachment 107963 [details]
refactor MOV coalescing
Comment 2 Ivo Raisr 2017-09-23 07:48:17 UTC
Created attachment 107964 [details]
register allocation
Comment 3 Ivo Raisr 2017-09-23 07:49:18 UTC
The only remaining thing is to refactor isMove callbacks in the remaining architectures so that the information is available directly in HRegUsage.
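For illustration, the shape of that refactoring might look like the following minimal sketch. Apart from isVregVregMove, the types and names are invented for the example and are not VEX's real API:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for VEX's HReg / HRegUsage; illustrative only. */
typedef struct { unsigned index; bool is_virtual; } HReg;

typedef struct {
   bool isVregVregMove;          /* instr is a vreg-to-vreg MOV */
   HReg regMoveSrc, regMoveDst;  /* valid only if isVregVregMove */
} HRegUsage;

typedef struct { bool is_mov; HReg src, dst; } ToyInstr;

static bool hregIsVirtual(HReg r) { return r.is_virtual; }

/* Stand-in for a per-arch getRegUsage_XXXInstr(): record the move
   operands directly in HRegUsage instead of answering a separate
   isMove_XXXInstr() callback. */
static void getRegUsage_ToyInstr(HRegUsage *u, const ToyInstr *i)
{
   u->isVregVregMove = false;
   if (i->is_mov && hregIsVirtual(i->src) && hregIsVirtual(i->dst)) {
      u->isVregVregMove = true;
      u->regMoveSrc = i->src;
      u->regMoveDst = i->dst;
   }
}
```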
Comment 4 Ivo Raisr 2017-10-03 07:21:23 UTC
Created attachment 108134 [details]
refactor MOV coalescing
Comment 5 Ivo Raisr 2017-10-03 07:21:42 UTC
Created attachment 108135 [details]
register allocation
Comment 6 Ivo Raisr 2017-10-03 07:24:53 UTC
I've tested on the amd64, power8le and arm64 architectures.
In all cases the produced code was more compact and overall performance was better when running inner Memcheck on perf/bz2.
Numbers are given as instruction counts; the ratio is as reported by Memcheck with '--stats=yes'.

amd64:
vanilla:    45,112,349,784 total; 165,978,807 reg alloc; ratio 15.5
v3-reorder: 44,943,765,809 total; 167,403,237 reg alloc; ratio 15.3

power8le:
vanilla:    61,928,020,284 total; 351,285,156 reg alloc; ratio 17.0
v3-reorder: 61,919,130,481 total; 343,001,581 reg alloc; ratio 17.0

arm64 [callgrind does not work on this arch]:
vanilla:    ratio 14.7
v3-reorder: ratio 14.7
Manual inspection of the top 200 profiled SBs showed VexExpansionRatio
always a few instructions better than in vanilla.
Comment 7 Julian Seward 2017-10-03 10:50:45 UTC
(In reply to Ivo Raisr from comment #4)
> Created attachment 108134 [details]
> refactor MOV coalescing

This looks OK to me.  I must say I didn't check the actual merging logic in
host_generic_reg_alloc*.c, because I no longer know how those work and I
assume you checked/tested it fairly carefully.

I did look at the changes for each architecture, for the removal of
isMove_XXXInstr and replacement by an extension to getRegUsage_XXXInstr.
These are potentially dangerous, if we make a mistake there -- if it
mistakenly claims that something is a move when it isn't, then the generated
code will be incorrect.  If the new versions miss out some moves that the
old versions find, then the generated code will be slower.

But it all looks OK to me.  I do have one question though:

The isMove_XXXInstr functions appear to ignore the virtual/real
distinction, whereas the new getRegUsage_XXXInstr bits always do check
hregIsVirtual() before setting u->isVregVregMove = True and recording the
two operands.

I assume that is deliberate.  I would ask -- is it necessary to duplicate
the 'hregIsVirtual?' check for each architecture?  Can
host_generic_reg_alloc*.c instead do that itself just once, immediately
after calling getRegUsage_XXXInstr?
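A minimal sketch of what that centralisation could look like (all names illustrative, not VEX's actual code): each per-arch getRegUsage_XXXInstr() would report any reg-reg MOV, and the generic allocator would apply the hregIsVirtual() filter exactly once, immediately after the call.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { unsigned index; bool is_virtual; } HReg;

typedef struct {
   bool isRegRegMove;            /* any reg-to-reg MOV, real or virtual */
   HReg regMoveSrc, regMoveDst;
} HRegUsage;

static bool hregIsVirtual(HReg r) { return r.is_virtual; }

/* Called once by the generic allocator after getRegUsage_XXXInstr():
   keep the move flag only if both operands are virtual registers. */
static void filter_vreg_vreg_move(HRegUsage *u)
{
   if (u->isRegRegMove &&
       !(hregIsVirtual(u->regMoveSrc) && hregIsVirtual(u->regMoveDst)))
      u->isRegRegMove = false;
}
```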
Comment 8 Julian Seward 2017-10-03 10:53:10 UTC
(In reply to Ivo Raisr from comment #5)
> Created attachment 108135 [details]
> register allocation

Looks plausible to me, although I can't make any meaningful
assessment of it.
Comment 9 Julian Seward 2017-10-03 10:55:47 UTC
(In reply to Ivo Raisr from comment #6)
> amd64:
> vanilla:    45,112,349,784 total; 165,978,807 reg alloc; ratio 15.5
> v3-reorder: 44,943,765,809 total; 167,403,237 reg alloc; ratio 15.3

Nice.

> power8le:
> vanilla:    61,928,020,284 total; 351,285,156 reg alloc; ratio 17.0
> v3-reorder: 61,919,130,481 total; 343,001,581 reg alloc; ratio 17.0

Curious as to why this doesn't give much of an improvement.  Is it
because power has more registers than amd64, so the opportunity for
avoiding spill code is smaller?

On that theory .. I would be interested to see the equivalent numbers
for x86, to see if the gains are even larger.  Can you measure those?
Also I think it would be a good safety check.
Comment 10 aleksandra 2017-10-03 13:12:20 UTC
Created attachment 108138 [details]
mips report

Hi, I tested these patches on mips64 and mips32. I couldn't find the numbers you are looking for, so I copied everything; I hope that's OK :)
Comment 11 Julian Seward 2017-10-04 06:27:15 UTC
(In reply to aleksandra from comment #10)
> Created attachment 108138 [details]
> mips report

Aleksandra, thank you for testing on MIPS.  Please can you re-attach
the report as a plain text file?  Opening zip files in bugzilla isn't
so convenient.
Comment 12 aleksandra 2017-10-04 11:45:29 UTC
Created attachment 108160 [details]
mips report

I hope this is better.
Comment 13 Julian Seward 2017-10-04 17:43:30 UTC
(In reply to aleksandra from comment #12)
> Created attachment 108160 [details]
> mips report
> I hope this is better.

Thanks, yes.  So, what we can pull from that is

MIPS64
both    transtab: new 1,837 (44,740 -> 868,504; ratio 19.4) [0 scs] avg tce size 472
neither transtab: new 1,842 (44,992 -> 868,724; ratio 19.3) [0 scs] avg tce size 471

MIPS32
both    transtab: new 1,756 (41,684 -> 826,544; ratio 19.8) [0 scs] avg tce size 470
neither transtab: new 1,769 (42,096 -> 830,100; ratio 19.7) [0 scs] avg tce size 469

(both = with both patches, neither = with neither patch)

Which is a bit strange because it looks from that like the patch gives
a small increase in code size, rather than the expected small decrease.
Maybe I confused the before/after logs?
Comment 14 Ivo Raisr 2017-10-04 20:05:06 UTC
Created attachment 108170 [details]
refactor MOV coalescing
Comment 15 Ivo Raisr 2017-10-04 20:05:23 UTC
Created attachment 108171 [details]
register allocation
Comment 16 Ivo Raisr 2017-10-04 20:08:16 UTC
I've fixed the problem with duplicate hregIsVirtual() in host_arch_defs.c files.

Here are the performance numbers for amd64 and x86:

Running inner Memcheck on perf/bz2 (compiled with -O or -O2).
Numbers are instruction counts:

amd64 -O:    total         register allocator    ratio
vanilla:   45,102,001,869   165,978,807          15.5
patches:   44,928,390,448   169,280,398          15.3

amd64 -O2:
vanilla:   44,309,403,899   191,289,068          16.5
patches:   44,108,763,444   195,193,491          16.3

x86 -O:
vanilla:   48,711,841,474   233,097,543          14.6
patches:   48,077,926,058   235,505,230          14.4

x86 -O2:
vanilla:   50,400,479,522   253,382,282          15.1
patches:   49,953,764,975   256,000,363          14.9

In all cases, the new implementation is faster and produces better (more compact) code.

I will investigate the situation on power and post the results here.
Comment 17 Julian Seward 2017-10-05 09:05:25 UTC
(In reply to Ivo Raisr from comment #16)
> Here are the performance numbers for amd64 and x86:

Those are some nice improvements, especially for x86.  Thanks for
the measurements.

I expect arm(32) would also see quite a big improvement, because
it's also pretty low on registers.  [This is not a request to measure,
just random speculation.]

> In all cases, the new implementation is faster and produces better (compact)
> code.

Assuming power looks sane (the generated code is at least not any bigger
with the patch), and that everything is stable, I suggest you just land
it whenever you like.
Comment 18 aleksandra 2017-10-05 14:56:43 UTC
I tested these new patches and here are the results:

MIPS64
both      transtab 1,837 (44,740 -> 868,632; ratio 19.4) [0 scs] avg tce size 472
neither   transtab 1,834 (44,664 -> 865,872; ratio 19.4) [0 scs] avg tce size 472

MIPS32
both      transtab 1,756 (41,684 -> 826,544; ratio 19.8) [0 scs] avg tce size 470
neither   transtab 1,769 (42,036 -> 820,828; ratio 19.5) [0 scs] avg tce size 464
Comment 19 Ivo Raisr 2017-10-11 18:54:28 UTC
(In reply to Julian Seward from comment #17)
> I expect arm(32) would also get a quite big improvement, because
> it's also pretty low on registers.  [This is not a request to measure,
> just random speculation.]

Unfortunately I do not have access to any arm32 machine.

> Assuming power looks sane (the generated code is at least not any bigger
> with the patch), and that everything is stable, I suggest you just land
> it whenever you like.

Actually, profiling of the 200 most-used blocks in perf/bz2 shows only the following difference:
    -VexExpansionRatio 228 3228   141 :10
    +VexExpansionRatio 228 3196   140 :10
So this means that the majority of the blocks remained the same size;
one block is slightly smaller now.
Comment 20 Ivo Raisr 2017-10-11 18:58:08 UTC
Pushed as commits:
83cabd32492e6d19d483a63522e4e874fa64b617
074de238d44c0cdaf394489ea69a67b76916fbce

https://sourceware.org/git/?p=valgrind.git;a=commit;h=83cabd32492e6d19d483a63522e4e874fa64b617
https://sourceware.org/git/?p=valgrind.git;a=commit;h=074de238d44c0cdaf394489ea69a67b76916fbce
Comment 21 Ivo Raisr 2017-10-11 18:58:55 UTC
Thank you all for your responses! They were really helpful.