Currently VEX manages the trashed CPU registers for a helper call by saving/restoring all the registers that the standard calling ABI allows the callee to trash. SewardJ suggested adding asm helpers that do the register saving/restoring themselves. Given that some helpers are called intensely by tools, having the fast path written in asm and executing fewer instructions could bring some speed benefit.

Reproducible: Always
Helpers would need to be added for each architecture, but that can be done incrementally if some benefit is expected.
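As a rough illustration of the idea (a sketch only, not the actual VEX change; the wrapper name is made up), an asm helper could take over the save/restore duty itself, so the generated code would no longer need to spill caller-saved registers around the call. A minimal x86-linux version that simply wraps the existing C helper, assuming the address argument arrives in %eax as in the disassembly below, might look like:

  /* Hypothetical wrapper: preserves every register the C helper may trash,
     so the JIT-side call site would need no spills of its own. */
  .text
  .globl vgMemCheck_helperc_LOADV32le_wrapped     /* made-up name */
  vgMemCheck_helperc_LOADV32le_wrapped:
          push    %ecx                  /* caller-saved regs the C helper may clobber */
          push    %edx
          call    vgMemCheck_helperc_LOADV32le    /* existing C helper; arg stays in %eax */
          pop     %edx
          pop     %ecx
          ret                           /* result is returned in %eax */

The real payoff, though, comes from also hand-coding the fast path so it touches as few registers and memory locations as possible, as the examples below suggest.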
Here are some motivating examples. On x86-linux, the fast path through LOADV32le -- the most commonly called helper -- looks like this. Neither of the branches is taken, btw:

<vgMemCheck_helperc_LOADV32le>:
  53                      push   %ebx
  89 c3                   mov    %eax,%ebx
  83 ec 08                sub    $0x8,%esp
  a8 03                   test   $0x3,%al            // alignment check
  75 2b                   jne    380093b5 <vgMemCheck_helperc_LOADV32le+0x35>
  c1 e8 10                shr    $0x10,%eax
  8b 14 85 40 6a 30 38    mov    0x38306a40(,%eax,4),%edx
  0f b7 c3                movzwl %bx,%eax
  c1 e8 02                shr    $0x2,%eax
  0f b6 14 02             movzbl (%edx,%eax,1),%edx
  81 fa aa 00 00 00       cmp    $0xaa,%edx          // check expected case -- all defined
  75 07                   jne    380093ad <vgMemCheck_helperc_LOADV32le+0x2d>
  31 c0                   xor    %eax,%eax
  83 c4 08                add    $0x8,%esp
  5b                      pop    %ebx
  c3                      ret

That's 16 instructions, containing 2 conditional branches and 2 loads, or 3 loads and a store if we take into account the saving and restoring of the return address.

Now, AFAICS, the adjustments to the stack pointer -- sub $0x8,%esp and add $0x8,%esp -- seem to me to be redundant on the fast path. Hence, coding it by hand, they could be moved to the slow path(s) and we'd save 2 insns on the fast path.

For arm-linux, the code produced by gcc is poorer, and the potential gains are larger:

<vgMemCheck_helperc_LOADV32le>:
        e2102003    ands  r2, r0, #3
  (1a)  e92d4008    push  {r3, lr}
        1a00000c    bne   38008710 <vgMemCheck_helperc_LOADV32le+0x40>
        e1a0c820    lsr   ip, r0, #16
  (2)   e59f1044    ldr   r1, [pc, #68]   ; 3800872c <vgMemCheck_helperc_LOADV32le+0x5c>
        e1a03800    lsl   r3, r0, #16
        e081110c    add   r1, r1, ip, lsl #2
        e5911054    ldr   r1, [r1, #84]   ; 0x54
        e7d13923    ldrb  r3, [r1, r3, lsr #18]
        e35300aa    cmp   r3, #170        ; 0xaa
        01a00002    moveq r0, r2
  (1b)  08bd8008    popeq {r3, pc}

Saving and restoring the link register on the fast path (1a, 1b) generates pointless memory traffic. That could be done on the slow paths instead. Also, r3 is caller-saved, so I'm not clear why gcc is saving/restoring it here -- maybe as a way of keeping the stack 8-aligned, once it commits to saving lr.

Getting the address of pri_map[0] by doing a load (2) further burdens the memory system. We could possibly do better with movw/movt to synthesise the address.
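To make that concrete, here's a hedged sketch (not a drop-in replacement; the symbol names primary_map and LOADV32le_slow_path are placeholders) of what a hand-coded x86 fast path could look like: no %esp adjustment, no %ebx save/restore (the caller-saved %ecx is used as scratch instead), and the rare cases tail-call out to a C slow path:

  /* Sketch only, assuming the address argument arrives in %eax. */
  .text
  .globl vgMemCheck_helperc_LOADV32le_fast      /* hypothetical name */
  vgMemCheck_helperc_LOADV32le_fast:
          test    $0x3, %al                     /* alignment check */
          jnz     .Lslow                        /* unaligned -> slow path, %eax still intact */
          mov     %eax, %ecx                    /* keep the full address in a scratch reg */
          shr     $0x10, %eax                   /* primary map index */
          mov     primary_map(,%eax,4), %edx    /* secondary map base */
          movzwl  %cx, %eax
          shr     $0x2, %eax                    /* index of the vabits8 byte */
          movzbl  (%edx,%eax,1), %edx
          cmp     $0xaa, %edx                   /* expected case -- all defined */
          jne     .Lrestore
          xor     %eax, %eax                    /* return value: all defined */
          ret                                   /* fast path: no push/pop, no %esp traffic */
  .Lrestore:
          mov     %ecx, %eax                    /* put the address argument back */
  .Lslow:
          jmp     LOADV32le_slow_path           /* hypothetical C slow path, tail-called */

That's 12 instructions on the fast path instead of 16, with no stores. Similarly, on arm-linux the pc-relative load (2) could plausibly be replaced by synthesising the table address inline, e.g. (again with a placeholder symbol name):

          movw    r1, #:lower16:primary_map
          movt    r1, #:upper16:primary_map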
Created attachment 76887 [details]
initial VEX support for asm helpers with callee register save/restore
Created attachment 76888 [details]
asm version of LOADV, plain copy of what gcc generated

Oh, and it crashes valgrind too!
> Here are some motivating examples. On x86-linux, the fast path through

Oh, and it's saving/restoring ebx each time, unnecessarily afaics. That's even more pointless memory traffic we could nuke with a custom assembly version.
Created attachment 76989 [details]
assembly version of mc_LOADV32le for x86-linux

This is worth about 3-4% in performance for memcheck, compared to the version created by gcc-4.7.
Created attachment 85672 [details]
Perf improvements for arm32-linux, 21 Mar 2014 (vex)
Created attachment 85673 [details]
Perf improvements for arm32-linux, 21 Mar 2014 (val)

Vex/Val perf improvements for 32-bit ARM, with some infrastructure that could be used for other targets. These improve performance by 9%-10% running bzip2 on Memcheck on a 1.7GHz Cortex-A15, mainly by reducing memory traffic: approximately a 15+% reduction in loads and stores in generated code and helper function calls. WIP; do not commit.