Bug 385843 - [PATCH] ARM: mark caller-save VFP registes as trashed by calls
Summary: [PATCH] ARM: mark caller-save VFP registes as trashed by calls
Status: CONFIRMED
Alias: None
Product: valgrind
Classification: Developer tools
Component: vex (other bugs)
Version First Reported In: 3.14 SVN
Platform: unspecified Linux
: NOR normal
Target Milestone: ---
Assignee: Ivo Raisr
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-10-17 04:45 UTC by Sindre Aamås
Modified: 2017-10-18 05:27 UTC (History)
1 user (show)

See Also:
Latest Commit:
Version Fixed/Implemented In:
Sentry Crash Report:


Attachments
ARMv7 NEON test program (1.27 KB, text/plain)
2017-10-17 04:45 UTC, Sindre Aamås
Details
[PATCH] VEX/ARM: mark caller-save VFP registes as trashed by calls (1.82 KB, patch)
2017-10-17 04:48 UTC, Sindre Aamås
Details
[PATCH] VEX/ARM: mark caller-save VFP registes as trashed by calls (2.04 KB, patch)
2017-10-17 05:36 UTC, Sindre Aamås
Details
Stats before (4.79 KB, text/plain)
2017-10-17 07:01 UTC, Sindre Aamås
Details
Stats after (4.79 KB, text/plain)
2017-10-17 07:02 UTC, Sindre Aamås
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Sindre Aamås 2017-10-17 04:45:05 UTC
Created attachment 108391 [details]
ARMv7 NEON test program

The attached test program gives the wrong output when run with the memcheck tool. The attached patch makes the issue disappear. The issue depends on how valgrind is built.

With the right combination of compiler and compiler flags (e.g. using -mfpu=neon makes a difference here with GCC 7.2.0), the helper functions in memcheck/mc_main.c end up using the same (caller-save) VFP registers as used by VEX.

According to http://infocenter.arm.com/help/topic/com.arm.doc.ihi0042f/IHI0042F_aapcs.pdf (§5.1.2.1 VFP register usage conventions): VFP registers d16-d31 (q8-q15), if present, do not need to preserved (by the callee).
Comment 1 Sindre Aamås 2017-10-17 04:48:24 UTC
Created attachment 108392 [details]
[PATCH] VEX/ARM: mark caller-save VFP registes as trashed by calls
Comment 2 Ivo Raisr 2017-10-17 04:56:25 UTC
The comment in getRRegUniverse_ARM() should be more explicit about which registers are caller save and callee save.

And if this is indeed the case, then the patch needs to take care of supplying different set of Q registers for register allocator use. Callee saved registers are preferred over caller saved ones.
Comment 3 Sindre Aamås 2017-10-17 05:09:35 UTC
(In reply to Ivo Raisr from comment #2)
> The comment in getRRegUniverse_ARM() should be more explicit about which
> registers are caller save and callee save.
> 
> And if this is indeed the case, then the patch needs to take care of
> supplying different set of Q registers for register allocator use. Callee
> saved registers are preferred over caller saved ones.

The callee-save registers are used up for S and D registers (which alias with the Q registers). Can we first get this fixed and then look at eventually improving the register selection? (Or do you have a suggestion for a better set of registers?)
Comment 4 Sindre Aamås 2017-10-17 05:36:45 UTC
Created attachment 108396 [details]
[PATCH] VEX/ARM: mark caller-save VFP registes as trashed by calls
Comment 5 Ivo Raisr 2017-10-17 05:46:14 UTC
Your fix touches register allocation. It is crucial that both ARMInstr_Call() and  getRRegUniverse_ARM() are kept in sync (hinted in getRRegUniverse_ARM as well)
and that register allocator is presented with the workable set of registers.

By marking all Q ones as caller saved (trashed for call), register allocator would need to spill them all before the call. This creates a performance penalty and bloats the generated r-code.

Would you try running Memcheck on a program which uses 128-bit VFP (q) registers  with '--stats=yes' and note the difference on 'ratio' reported?

How prevalent are programs utilizing 128-bit VFP registers compared to 64-bit ones? In other words, are compilers (gcc) likely to utilize those registers a lot?


A clue for your answer might lie in the document you cited first:
"Registers s16-s31 (d8-d15, q4-q7) must be preserved across subroutine calls; registers s0-s15 (d0-d7, q0-q3) do not need to be preserved (and can be used for passing arguments or returning results in standard procedure-call
variants). Registers d16-d31 (q8-q15), if present, do not need to be preserved."
Comment 6 Sindre Aamås 2017-10-17 07:00:47 UTC
Q registers should be mostly relevant for NEON (often intrinsics or assembly routines with limited control flow). I ran memcheck with and without the change on a fairly NEON-intensive component (mostly DSP and neural networks), and see little performance difference, if any, (Cortex A57 running 32-bit code, which may seem like a weird choice, but it was convenient). I have not looked at the stats, but the output is as follows.
Comment 7 Sindre Aamås 2017-10-17 07:01:59 UTC
Created attachment 108397 [details]
Stats before
Comment 8 Sindre Aamås 2017-10-17 07:02:46 UTC
Created attachment 108398 [details]
Stats after
Comment 9 Ivo Raisr 2017-10-17 07:13:06 UTC
(In reply to Sindre Aamås from comment #6)
> I have not looked at the stats, but the output is as follows.

As suspected, there is a code bloat in the generated code because of the additional spilling before helper calls.
Ratio before: 15.9
Ratio after:  16.0
Comment 10 Sindre Aamås 2017-10-18 05:16:38 UTC
The reason why I ran across this is likely that our toolchain has -mfpu=neon as default, which may be uncommon, which would explain why this has not been reported previously.
Comment 11 Ivo Raisr 2017-10-18 05:27:36 UTC
We do have a number of test programs under none/tests/arm which are supposed to test VFP.

Please could you have a look if your test program would belong into one of these; otherwise add it there.

Then post the results of running the regression test suite before and after your changes are applied. I will then take your patch in.