Bug 314365 - enable VEX to run asm helpers that do callee register saving
Status: REPORTED
Alias: None
Product: valgrind
Classification: Developer tools
Component: vex
Version: unspecified
Platform: Other
OS: All
Priority: NOR
Severity: normal
Target Milestone: ---
Assignee: Julian Seward
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-02-03 17:16 UTC by Dragos Tatulea
Modified: 2014-03-21 17:19 UTC

See Also:
Latest Commit:
Version Fixed In:


Attachments
initial VEX support for asm helpers with callee register save restore (12.18 KB, patch)
2013-02-03 20:05 UTC, Dragos Tatulea
asm version of LOADV, plain copy of what gcc generated (3.90 KB, patch)
2013-02-03 20:06 UTC, Dragos Tatulea
assembly version of mc_LOADV32le for x86-linux (2.18 KB, patch)
2013-02-07 20:51 UTC, Julian Seward
Perf improvements for arm32-linux, 21 Mar 2014 (vex) (26.42 KB, patch)
2014-03-21 17:16 UTC, Julian Seward
Perf improvements for arm32-linux, 21 Mar 2014 (val) (11.35 KB, patch)
2014-03-21 17:19 UTC, Julian Seward

Description Dragos Tatulea 2013-02-03 17:16:48 UTC
Currently VEX manages the CPU registers trashed by a helper call by saving/restoring all the standard registers of the calling ABI.

SewardJ suggested adding asm helpers that do the register saving/restoring themselves. Given that some helpers are called very frequently by tools, having the fast path written in asm and executing fewer instructions could bring some speed benefit.

Reproducible: Always
Comment 1 Dragos Tatulea 2013-02-03 17:38:06 UTC
Helpers would need to be added for each architecture, but that can be done incrementally if some benefit is expected.
Comment 2 Julian Seward 2013-02-03 17:40:12 UTC
Here are some motivating examples.  On x86-linux, the fast path through
LOADV32le -- the most commonly called helper -- looks like this.
Neither of the branches is taken, btw:

<vgMemCheck_helperc_LOADV32le>:
       53                      push   %ebx
       89 c3                   mov    %eax,%ebx
       83 ec 08                sub    $0x8,%esp
       a8 03                   test   $0x3,%al    // alignment check
       75 2b                   jne    380093b5 <vgMemCheck_helperc_LOADV32le+0x35>
       c1 e8 10                shr    $0x10,%eax
       8b 14 85 40 6a 30 38    mov    0x38306a40(,%eax,4),%edx
       0f b7 c3                movzwl %bx,%eax
       c1 e8 02                shr    $0x2,%eax
       0f b6 14 02             movzbl (%edx,%eax,1),%edx
       81 fa aa 00 00 00       cmp    $0xaa,%edx  // check expected case -- all defined
       75 07                   jne    380093ad <vgMemCheck_helperc_LOADV32le+0x2d>
       31 c0                   xor    %eax,%eax
       83 c4 08                add    $0x8,%esp
       5b                      pop    %ebx
       c3                      ret    

That's 16 instructions, containing 2 conditional branches and 2 loads,
or 3 loads and a store if we take into account the saving and
restoring of the return address.  Now, AFAICS, the adjustments to the
stack pointer -- sub $0x8,%esp and add $0x8,%esp -- are redundant on
the fast path.  Hence, coding it by hand, they could be moved to the
slow path(s) and we'd save 2 insns on the fast path.
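
To make that concrete, here is a rough sketch of what a hand-coded
version's fast path might look like, keeping the register usage and
map lookup of the gcc output above but deferring the %esp adjustment
to the slow path.  The primary_map symbol, the .Lslow label and the
slow-path contents are placeholders rather than the real Memcheck
names -- this is only an illustration of the idea, not a tested
implementation:

vgMemCheck_helperc_LOADV32le:
        push   %ebx                      /* one callee-saved temp, as before */
        mov    %eax,%ebx                 /* the address arrives in %eax */
        test   $0x3,%al                  /* alignment check */
        jne    .Lslow                    /* unaligned -> out of line */
        shr    $0x10,%eax                /* primary map index */
        mov    primary_map(,%eax,4),%edx
        movzwl %bx,%eax                  /* low 16 bits of the address */
        shr    $0x2,%eax                 /* one vabits byte per 4 bytes */
        movzbl (%edx,%eax,1),%edx
        cmp    $0xaa,%edx                /* expected case -- all defined */
        jne    .Lslow
        xor    %eax,%eax                 /* all defined: return 0 */
        pop    %ebx
        ret
.Lslow:
        sub    $0x8,%esp                 /* only the slow path pays for this */
        /* ... handle the unaligned / not-all-defined cases out of line,
           ending with add $0x8,%esp; pop %ebx; ret ... */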

For arm-linux, the code produced by gcc is poorer, and the potential
gains are larger:

<vgMemCheck_helperc_LOADV32le>:
       e2102003        ands    r2, r0, #3
(1a)   e92d4008        push    {r3, lr}
       1a00000c        bne     38008710 <vgMemCheck_helperc_LOADV32le+0x40>
       e1a0c820        lsr     ip, r0, #16
(2)    e59f1044        ldr     r1, [pc, #68]   ; 3800872c <vgMemCheck_helperc_LOADV32le+0x5c>
       e1a03800        lsl     r3, r0, #16
       e081110c        add     r1, r1, ip, lsl #2
       e5911054        ldr     r1, [r1, #84]   ; 0x54
       e7d13923        ldrb    r3, [r1, r3, lsr #18]
       e35300aa        cmp     r3, #170        ; 0xaa
       01a00002        moveq   r0, r2
(1b)   08bd8008        popeq   {r3, pc}

Saving and restoring the link register on the fast path (1a, 1b)
generates pointless memory traffic.  That could be done on the
slow paths instead.  Also, r3 is caller-saved, so it's not clear to me
why gcc is saving/restoring it here -- maybe as a way of keeping
the stack 8-aligned, once it commits to saving lr.  

Getting the address of pri_map[0] by doing a load (2) further
burdens the memory system.  We could possibly do better with
movw/movt to synthesise the address.
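
For illustration, a hand-coded fast path along those lines might look
something like this (assuming an ARMv7 target, so movw/movt are
available).  The primary_map_base symbol, the #84 offset copied from
the gcc code above, and the labels are placeholders, not the real
Memcheck names:

vgMemCheck_helperc_LOADV32le:
        ands    r2, r0, #3                       @ alignment check; r2 = 0 if aligned
        bne     .Lslow                           @ unaligned -> out of line
        movw    r1, #:lower16:primary_map_base   @ synthesise the map address ...
        movt    r1, #:upper16:primary_map_base   @ ... instead of a literal-pool load
        lsr     ip, r0, #16                      @ primary map index
        lsl     r3, r0, #16                      @ low 16 bits, shifted up
        add     r1, r1, ip, lsl #2
        ldr     r1, [r1, #84]                    @ primary map entry, as in the gcc code
        ldrb    r3, [r1, r3, lsr #18]            @ vabits byte for this word
        cmp     r3, #0xaa                        @ expected case -- all defined
        moveq   r0, r2                           @ all defined: return 0
        bxeq    lr                               @ fast path never touches the stack
.Lslow:
        push    {r3, lr}                         @ only the slow path saves lr
        @ ... handle the unaligned / not-all-defined cases out of line ...

The fast-path instruction count stays the same (movw/movt replaces one
load, bxeq replaces popeq), but the push/pop of {r3, lr} and the
literal fetch -- five data memory accesses -- disappear from it.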
Comment 3 Dragos Tatulea 2013-02-03 20:05:00 UTC
Created attachment 76887 [details]
initial VEX support for asm helpers with callee register save restore
Comment 4 Dragos Tatulea 2013-02-03 20:06:30 UTC
Created attachment 76888 [details]
asm version of LOADV, plain copy of what gcc generated

Oh, and it crashes Valgrind too!
Comment 5 Julian Seward 2013-02-06 15:05:37 UTC
> Here are some motivating examples.  On x86-linux, the fast path through

Oh, and it's saving/restoring %ebx each time, unnecessarily AFAICS.  Even more
pointless memory traffic we could nuke with a custom assembly version.
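
Putting the two observations together, the fast path can be written
using only the call-clobbered registers %eax, %ecx and %edx, so the
push/pop of %ebx disappears as well.  A purely illustrative sketch
(not the attached patch; primary_map and .Lslow are placeholders):

vgMemCheck_helperc_LOADV32le:
        mov    %eax,%edx                 /* the address arrives in %eax */
        test   $0x3,%al                  /* alignment check */
        jne    .Lslow                    /* %eax still holds the address */
        shr    $0x10,%edx                /* primary map index */
        mov    primary_map(,%edx,4),%edx
        movzwl %ax,%ecx                  /* low 16 bits of the address */
        shr    $0x2,%ecx                 /* one vabits byte per 4 bytes */
        movzbl (%edx,%ecx,1),%edx
        cmp    $0xaa,%edx                /* expected case -- all defined */
        jne    .Lslow
        xor    %eax,%eax                 /* all defined: return 0 */
        ret
.Lslow:
        /* ... set up a frame and handle the unaligned /
           not-all-defined cases out of line ... */

That trims the gcc fast path from 16 instructions to 12, with no
stack traffic at all on the fast path.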
Comment 6 Julian Seward 2013-02-07 20:51:09 UTC
Created attachment 76989 [details]
assembly version of mc_LOADV32le for x86-linux

It's worth about 3-4% in performance for Memcheck, compared to the
version created by gcc-4.7.
Comment 7 Julian Seward 2014-03-21 17:16:48 UTC
Created attachment 85672 [details]
Perf improvements for arm32-linux, 21 Mar 2014 (vex)
Comment 8 Julian Seward 2014-03-21 17:19:54 UTC
Created attachment 85673 [details]
Perf improvements for arm32-linux, 21 Mar 2014 (val)

Vex/Val perf improvements for 32-bit ARM, with some infrastructure
that could be used for other targets.  These improve performance by
9-10% running bzip2 on Memcheck on a 1.7GHz Cortex-A15, mainly by
reducing memory traffic: approximately a 15+% reduction in loads and
stores across the generated code and helper function calls.  WIP; do
not commit.