384337 – performance improvements to VEX register allocator v2 and v3

Bug 384337 - performance improvements to VEX register allocator v2 and v3

Summary: performance improvements to VEX register allocator v2 and v3

Status:	RESOLVED FIXED

Alias:	None

Product:	valgrind
Classification:	Developer tools
Component:	vex (show other bugs)
Version:	3.14 SVN
Platform:	Compiled Sources Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	Ivo Raisr

URL:
Keywords:

Depends on:
Blocks:

Reported:	2017-09-04 07:51 UTC by Ivo Raisr
Modified:	2018-07-30 11:46 UTC (History)
CC List:	2 users (show)

See Also:	384526 384584 384987
Latest Commit:
Version Fixed In:

Attachments
Fix a typo in function find_vreg_to_spill(). (941 bytes, patch) 2017-09-04 07:53 UTC, Ivo Raisr	Details
Track vreg usage on per-instruction precision in v3 (13.16 KB, patch) 2017-09-04 07:54 UTC, Ivo Raisr	Details
Scan 10 instructions ahead instead of just 5 (998 bytes, patch) 2017-09-04 07:56 UTC, Ivo Raisr	Details
Different algorithm for find_vreg_to_spill() in v3. (4.24 KB, patch) 2017-09-04 07:59 UTC, Ivo Raisr	Details
Small performance enhancement to v2. Also makes it the default for testing. (5.16 KB, patch) 2017-09-04 08:02 UTC, Ivo Raisr	Details
Reorder allocatable registers for AMD64 and X86 archs (6.82 KB, patch) 2017-09-06 06:15 UTC, Ivo Raisr	Details
Refactor tracking of MOV coalescing. (25.99 KB, patch) 2017-09-22 20:56 UTC, Ivo Raisr	Details
Show Obsolete (4) View All Add an attachment

Note You need to log in before you can comment on or make changes to this bug.

Description Ivo Raisr 2017-09-04 07:51:33 UTC

Philippe did some performance measurements of VEX register allocator v2 and v3 with combination of chasing on/off.

Basically his observations conclude that register allocator v3 is faster to generate code, but the generated code is less efficient,
and relatively quickly (at around 1 minute), the total becomes worse.

Here are some proof of concepts for some ideas I had how to speed up v2 and v3.

Comment 1 Ivo Raisr 2017-09-04 07:53:36 UTC

Created attachment 107674 [details]
Fix a typo in function find_vreg_to_spill().

Fixes a typo in function find_vreg_to_spill() for v3.
This should produce better/more correct spilling decisions but my measurements did not prove that...

Comment 2 Ivo Raisr 2017-09-04 07:54:53 UTC

Created attachment 107675 [details]
Track vreg usage on per-instruction precision in v3

Tracks vreg usage on per-instruction basis, using bitset. Should produce optimal spilling decisions.

Comment 3 Ivo Raisr 2017-09-04 07:56:46 UTC

Created attachment 107676 [details]
Scan 10 instructions ahead instead of just 5

Scans 10 instructions ahead in function find_vreg_to_spill() instead of just 5 in v3. Hard to say if it improves anything...

Comment 4 Ivo Raisr 2017-09-04 07:59:36 UTC

Created attachment 107677 [details]
Different algorithm for find_vreg_to_spill() in v3.

Different algorithm for find_vreg_to_spill() in v3.
Also scans 10 instructions instead of just 5.

Comment 5 Ivo Raisr 2017-09-04 08:02:33 UTC

Created attachment 107678 [details]
Small performance enhancement to v2. Also makes it the default for testing.

Small performance enhancement to VEX register allocator  v2.
Iterates only over real registers of the target hreg class.

Also makes VEX register allocator v2 the default just for the sake of testing.

Comment 6 Ivo Raisr 2017-09-04 08:09:01 UTC

Philippe, please could you measure these guys on your performance tests.
Although I was able to determine total instruction count on Memcheck+perf/bz2, I was not able to measure real time in a deterministic way using 'taskset 1'. The standard deviation on my laptop with amd64/Linux Ubuntu was over 3%.


I think https://bugsfiles.kde.org/attachment.cgi?id=107678 can be integrated right away but I'd like to see in your testing how much it improves the things.


Also https://bugsfiles.kde.org/attachment.cgi?id=107674 is quite interesting. My measurements did not show any improvement but the code (without the fix) is obviously wrong. I still do not understand what this is trying to tell me...

Comment 7 Ivo Raisr 2017-09-04 08:17:07 UTC

What I would like to do next is to compare how v2 and v3 allocate registers on some hot paths in perf/bz2. However I have difficulty how to locate such SBs.

Although Callgrind shows addresses of such hot translated blocks, I am not able to correlate them to --flags=10000110 output. Is there any clever way how to do that?

Comment 8 Julian Seward 2017-09-04 08:25:53 UTC

One thing you could try is this:

./vg-in-place --vex-regalloc-version=2 \
   --profile-flags=00000010 [whatever tool and args you like] ./perf/bz2 x

That both shows you the hot blocks and also the regalloc output for them.

and then compare against the same for the v3 allocator.  Because perf/bz2
is very deterministic, you should be able to find matching block-pairs
easily.

It might also be worth trying with perf/fbench and ffbench because they 
are small, have hot loops and use FP registers a lot.

Also, maybe worth comparing on x86 rather than amd64?  Given that 
there are fewer allocatable registers on x86, differences in
spilling strategies between v2 and v3 might be more obvious.
(Just a guess.)

Comment 9 Ivo Raisr 2017-09-06 06:15:39 UTC

Created attachment 107711 [details]
Reorder allocatable registers for AMD64 and X86 archs

Reorder allocatable registers for AMD64 and X86 so that the callee saved are listed first.

Helper calls always trash all caller saved registers. By listing the callee saved first the register allocator is more likely to pick them and does not need to spill that much before helper calls.

If this proves to improve performance (at least somewhat) then the v3 register allocation algorithm needs to be revisited.

Comment 10 Ivo Raisr 2017-09-06 11:39:11 UTC

(In reply to Ivo Raisr from comment #9)

Here are my findings for running Memcheck on perf/bz2:
v2 baseline: 45.214 G instructions
v2 with improvement: 45.183 G instructions
v3 baseline: 45.132 G instructions
v3 with reordering: 45.116 G instructions

Comment 11 Julian Seward 2017-09-06 11:46:37 UTC

(In reply to Ivo Raisr from comment #10)

Do you have information about the regalloc cost vs the cost of
the generated code?  In other words, is there a way to tell if
the patch actually changes the generated code, or whether it
only causes some find-a-register loops inside the allocator to
iterate less often?

Comment 12 Ivo Raisr 2017-09-06 11:54:01 UTC

(In reply to Ivo Raisr from comment #10)

Included is also total cost of running doRegisterAllocation_v2/3:

Here are my findings for running Memcheck on perf/bz2:
v2 baseline: 45.214 G instructions total; 209 M instructions regalloc
v2 with improvement: 45.183 G instructions; 208 M instructions regalloc
v3 baseline: 45.132 G instructions; 161 M instructions regalloc
v3 with reordering: 45.116 G instructions; 159 M instructions regalloc

Comment 13 Ivo Raisr 2017-09-09 08:16:07 UTC

Interestingly enough, VEX register allocator v2 also benefits from register reordering:

For Memcheck on perf/bz2:
ra2-baseline: 45,214 M instructions total; 209 M doRegisterAllocation_v2
ra2-reordered: 45,189 M instructions total; 205 M doRegisterAllocation_v2
ra3-baseline: 45,132 M instructions total; 160 M doRegisterAllocation_v3
ra3-reorder: 45,116 M instructions total; 159 M doRegisterAllocation_v3


For Memcheck on perf/ffbench:
ra2-baseline: 15,646 M instructions total; 100 M regalloc
ra2-reorder: 15,615 M instructions total; 99 M regalloc
ra3-baseline: 15,749 M instructions total; 79 M regalloc
ra3-reorder: 15,692 M instructions total; 78 M regalloc

Comment 14 Ivo Raisr 2017-09-22 20:56:03 UTC

Created attachment 107958 [details]
Refactor tracking of MOV coalescing.

    VReg<->VReg MOV coalescing status is now a part of the HRegUsage.
    This allows register allocation to query it two times without incurring
    a performance penalty. This in turn allows to better keep track of
    vreg<->vreg MOV coalescing so that all vregs in the coalesce chain
    get the effective |dead_before| of the last vreg.
    
    A small performance improvement has been observed because this allows
    to coalesce even spilled vregs (previously only assigned ones).

Comment 15 Julian Seward 2018-07-25 07:21:32 UTC

Ivo, can we close this?  I assume that all of these improvements
have long since landed in the v3 allocator (and also, that we're
now shipping v3 by default!)  But do let me know if I assume wrongly.

Comment 16 Ivo Raisr 2018-07-30 11:46:59 UTC

Julian, I have closed this bug now. Register allocator v3 has been the default in Valgrid for many months and all the improvements from this bug which made sense have been already implemented.