VEX register allocator v2 has a clever optimization which reduces the number of spill instructions generated. Consider the following vcode:

   ...
   97: orq %vR250,%vR570
   98: movslq %vR570d,%vR569
   99: movq %vR329,%vR571
  100: orq %vR569,%vR571
  101: callnz[0,RLPri_None]

and the following vreg live ranges:

  %vR250: [ 12, 214)
  %vR329: [ 72, 162)
  %vR569: [ 98, 101)
  %vR570: [ 95,  99)
  %vR571: [ 99, 101)

Register allocator v2 produces the following rcode:

  82 orq %rbx,%r8
  83 movslq %r8d,%rdi
  84 movq 0xB28(%rbp),%r8
  85 movq %r8,%rsi
  86 orq %rdi,%rsi
  87 movq %r9,0xAE0(%rbp)
  88 callnz[0,RLPri_None] 0x5805E8E0

However, the current register allocator v3 produces one more instruction:

  82 orq %rbx,%r8
  83 movslq %r8d,%rdi
  84 movq 0xB28(%rbp),%r8
  85 movq %r8,%rsi
  86 orq %rdi,%rsi
  87 movq %r9,0xAE0(%rbp)
  88 movq %r8,0xB28(%rbp)
  89 callnz[0,RLPri_None] 0x5805E8E0

Spilling %r8 (instruction #88) is useless: %r8 was reloaded from spill slot 0xB28(%rbp) at instruction #84 and has only been read since, so its value is still equal to the one in the spill slot.
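To illustrate the idea behind the missing optimization, here is a minimal sketch in C. It is not the VEX code and not the attached patch; the type RRegState, the field eq_spill_slot, the counter n_spills_emitted and all function names are hypothetical names chosen for this example. Each real register remembers whether its value is known to equal the copy in its vreg's spill slot, and the store is skipped when it is.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified state of one real register. */
typedef struct {
    int  vreg;           /* vreg currently held, or -1 if the rreg is free */
    bool eq_spill_slot;  /* true if the value equals the spill slot copy   */
} RRegState;

/* Counts spill stores that would have been emitted (illustration only). */
int n_spills_emitted = 0;

/* The rreg is filled by a reload: register and slot now agree. */
void reload_from_slot(RRegState *r, int vreg)
{
    r->vreg = vreg;
    r->eq_spill_slot = true;
}

/* An instruction writes the rreg: the slot copy (if any) is stale. */
void write_reg(RRegState *r, int vreg)
{
    r->vreg = vreg;
    r->eq_spill_slot = false;
}

/* Free the rreg, e.g. before a call that clobbers it.
   Returns true if a spill store had to be emitted. */
bool spill(RRegState *r)
{
    assert(r->vreg >= 0);            /* only occupied rregs are spilled */
    bool need_store = !r->eq_spill_slot;
    if (need_store)
        n_spills_emitted++;          /* stands in for "movq %reg,slot" */
    r->vreg = -1;                    /* the rreg is free either way */
    r->eq_spill_slot = false;
    return need_store;
}
```

In the listing above, %r8 corresponds to a register filled via reload_from_slot() at instruction #84 and only read afterwards, so spill() would emit no store for it, whereas %r9 holds a value with no up-to-date slot copy and still needs instruction #87.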
Created attachment 107778: proposed patch
The patch gives the following performance improvement.

Running Memcheck on perf/bz2:
  baseline: 45,132 M instructions total; 161 M instructions in doRegisterAllocation_v3
  patched:  45,107 M instructions total; 168 M instructions in doRegisterAllocation_v3

Running Memcheck on /bin/true:
  baseline:  3,511 M instructions total;  59 M instructions in doRegisterAllocation_v3
  patched:   3,508 M instructions total;  61 M instructions in doRegisterAllocation_v3

So even when running the simplest program (/bin/true), the small additional cost of register allocation (7 M instructions for bz2, 2 M for /bin/true) is more than compensated: the total instruction count still drops by 25 M for bz2 and by 3 M for /bin/true.
Fixed by changeset 3117cd9637a843cbab5de302fb30e22153fbfc1c: https://sourceware.org/git/?p=valgrind.git;a=commit;h=3117cd9637a843cbab5de302fb30e22153fbfc1c