Currently, each execution of an (instrumented) basic block (a superblock, really) requires a table lookup, done by the assembly stubs in coregrind/m_dispatch/dispatch-arch-os.S. Despite various attempts at optimisation, this generates a cache miss when looking up in the table, and a branch mispredict when jumping to the generated code. Together these give a significant performance hit.

This patch adds chaining support. It applies to basic blocks that end in unconditional or conditional branches to destinations known at JIT time (more generally, to destinations that are known to be constant after initial IR optimisation). Such blocks will have their exits patched so as to jump directly to their destinations, without having to go through the assembly dispatcher. This avoids the associated cache misses and branch mispredicts, and also quite simply saves a substantial number of instructions, since each trip through the dispatcher costs 10 to 20 instructions. Branches to destinations not known at JIT time (returns, mostly, and indirect jumps) still have to use the dispatcher, but they are, relatively speaking, a minority.

The patch adds a second, minor but useful optimisation. Up till now, each trip through the dispatcher has decremented and tested a counter -- an "event check". If that goes to zero then we fall out of the dispatcher and do housekeeping stuff that periodically needs to be done (let another thread run, deliver signals). This costs one load and store per block. The patch restricts event checking to backward edges in the control flow graph, which removes about 2/3 of the checks and hence about 2/3 of their cost.

Currently the patch works stably for large apps, but for amd64-linux only. Initial tests show around an 8% performance improvement for Memcheck and correspondingly larger improvements for more lightweight tools. The GDB server is currently broken; this is under investigation.

This initial patch is large and intrusive, in part because it contains not only the amd64 specifics, but also the general infrastructure changes for all platforms. The basic idea sounds simple in theory, but making everything work (event checks, redirection, GDB server, code discarding) cross-architecture is fairly complex.
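For illustration, here is a minimal sketch of what patching one such exit could look like on amd64, assuming the exit was originally emitted as a load-address-and-call to a "chain me" stub (the actual scheme and the real chainXDirect_AMD64 hook are described in the summary later in this thread; the function name and encoding details below are an assumption for the sketch, not the patch's code):

   #include <stdint.h>
   #include <string.h>

   /* Illustration only; not Valgrind's actual chaining code.  Assume an
      exit was emitted as the 13-byte sequence
         movabsq $chain_me_stub, %r11    ; 49 BB <imm64>
         call    *%r11                   ; 41 FF D3
      Once the destination translation exists, the same 13 bytes are
      rewritten in place so the block jumps straight there, bypassing the
      dispatcher:
         movabsq $dest, %r11             ; 49 BB <imm64>
         jmpq    *%r11                   ; 41 FF E3                        */
   static void chain_exit_to ( uint8_t* exit_code, uint64_t dest )
   {
      exit_code[0] = 0x49;               /* movabsq $dest, %r11 */
      exit_code[1] = 0xBB;
      memcpy(&exit_code[2], &dest, 8);   /* imm64, little-endian */
      exit_code[10] = 0x41;              /* jmpq *%r11 */
      exit_code[11] = 0xFF;
      exit_code[12] = 0xE3;
   }

Unpatching ("unchaining") is simply the reverse rewrite; on a real system the instruction cache would also need to be kept coherent with these writes.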
Created attachment 69754 [details] (1 of 2) vex-side patches against 20 March 2012 trunk (r2265/r12452)
Created attachment 69755 [details]
(2 of 2) valgrind-side patches against 20 March 2012 trunk (r2265/r12452)

This, together with the immediately previous patch, gives support for amd64-linux, all tools and facilities, except gdb server support, which will assert. Support for other architectures is missing whilst the basic framework is finalised. To build, configure with --enable-only64bit.
Here's some numbers from perf/, running on two 64-bit cpus close to opposite ends of the spectrum of microarchitectural sophistication: a 3.46 GHz Core i5 670, w/ 4MB L3, 256K per-core L2s, and a 1.6 GHz dual core Atom, with, iirc, 24k/32k/512k I1/D1/L2.

What's striking is that the relative speedup is generally much larger on the Atom, in one case more than halving the run time with --tool=none (for bz2). That's not surprising considering that the Atom has fewer cache and branch predictor resources to deal with the storm of mispredicts and misses caused by the trunk's always-use-the-dispatcher scheme. IOW, chaining is a relatively bigger win for lower-end CPUs.

10 iterations on 3.47 GHz Core i5 670

-- Running tests in perf ----------------------------------------------
-- bigcode1 --
bigcode1            tchain :0.11s  no: 1.7s (15.3x, -----)  me: 3.4s (30.6x, -----)
bigcode1            trunk  :0.11s  no: 2.1s (19.4x,-26.8%)  me: 3.8s (34.3x,-11.9%)
-- bigcode2 --
bigcode2            tchain :0.11s  no: 4.2s (38.1x, -----)  me: 8.9s (80.5x, -----)
bigcode2            trunk  :0.11s  no: 4.6s (41.7x, -9.5%)  me: 9.2s (84.0x, -4.3%)
-- bz2 --
bz2                 tchain :0.63s  no: 1.8s ( 2.9x, -----)  me: 6.2s ( 9.8x, -----)
bz2                 trunk  :0.63s  no: 2.4s ( 3.8x,-31.1%)  me: 6.7s (10.7x, -8.2%)
-- fbench --
fbench              tchain :0.24s  no: 1.1s ( 4.4x, -----)  me: 3.8s (16.0x, -----)
fbench              trunk  :0.24s  no: 1.2s ( 5.0x,-12.3%)  me: 4.2s (17.3x, -8.3%)
-- ffbench --
ffbench             tchain :0.22s  no: 0.9s ( 4.2x, -----)  me: 3.0s (13.8x, -----)
ffbench             trunk  :0.22s  no: 0.9s ( 4.3x, -2.2%)  me: 3.1s (14.0x, -1.3%)
-- heap --
heap                tchain :0.08s  no: 0.6s ( 8.0x, -----)  me: 5.5s (68.2x, -----)
heap                trunk  :0.08s  no: 0.9s (11.0x,-37.5%)  me: 5.4s (68.0x,  0.4%)
-- heap_pdb4 --
heap_pdb4           tchain :0.11s  no: 0.7s ( 6.4x, -----)  me: 9.3s (84.5x, -----)
heap_pdb4           trunk  :0.11s  no: 0.9s ( 8.5x,-34.3%)  me: 9.7s (87.9x, -4.0%)
-- many-loss-records --
many-loss-records   tchain :0.01s  no: 0.2s (18.0x, -----)  me: 1.4s (138.0x, -----)
many-loss-records   trunk  :0.01s  no: 0.2s (22.0x,-22.2%)  me: 1.4s (138.0x, -0.0%)
-- many-xpts --
many-xpts           tchain :0.04s  no: 0.3s ( 6.8x, -----)  me: 1.9s (46.8x, -----)
many-xpts           trunk  :0.04s  no: 0.3s ( 8.2x,-22.2%)  me: 1.8s (46.0x,  1.6%)
-- sarp --
sarp                tchain :0.02s  no: 0.2s (11.0x, -----)  me: 2.3s (116.0x, -----)
sarp                trunk  :0.02s  no: 0.2s (11.0x,  0.0%)  me: 2.4s (121.5x, -4.7%)
-- tinycc --
tinycc              tchain :0.18s  no: 1.4s ( 7.7x, -----)  me: 9.3s (51.9x, -----)
tinycc              trunk  :0.18s  no: 2.2s (12.3x,-59.7%)  me:10.4s (57.8x,-11.2%)
-- Finished tests in perf ----------------------------------------------

5 iterations on a 1.6 GHz Atom D510

-- Running tests in perf ----------------------------------------------
-- bigcode1 --
bigcode1            tchain :0.44s  no: 7.0s (16.0x, -----)  me:14.1s (32.1x, -----)
bigcode1            trunk  :0.44s  no:15.5s (35.2x,-120.7%)  me:22.8s (51.7x,-61.3%)
-- bigcode2 --
bigcode2            tchain :0.44s  no:17.8s (40.5x, -----)  me:37.7s (85.6x, -----)
bigcode2            trunk  :0.44s  no:25.8s (58.5x,-44.5%)  me:45.8s (104.1x,-21.6%)
-- bz2 --
bz2                 tchain :2.27s  no: 7.4s ( 3.2x, -----)  me:45.6s (20.1x, -----)
bz2                 trunk  :2.27s  no:16.7s ( 7.3x,-126.1%)  me:54.2s (23.9x,-18.7%)
-- fbench --
fbench              tchain :1.71s  no: 6.5s ( 3.8x, -----)  me:25.9s (15.1x, -----)
fbench              trunk  :1.71s  no: 9.0s ( 5.3x,-39.0%)  me:27.8s (16.3x, -7.5%)
-- ffbench --
ffbench             tchain :3.38s  no: 5.7s ( 1.7x, -----)  me:20.6s ( 6.1x, -----)
ffbench             trunk  :3.38s  no: 6.7s ( 2.0x,-17.9%)  me:21.2s ( 6.3x, -3.1%)
-- heap --
heap                tchain :0.50s  no: 3.7s ( 7.3x, -----)  me:33.1s (66.3x, -----)
heap                trunk  :0.50s  no: 6.5s (12.9x,-76.0%)  me:32.2s (64.4x,  2.8%)
-- heap_pdb4 --
heap_pdb4           tchain :0.61s  no: 3.9s ( 6.4x, -----)  me:52.3s (85.7x, -----)
heap_pdb4           trunk  :0.61s  no: 6.9s (11.3x,-75.6%)  me:51.3s (84.0x,  2.0%)
-- many-loss-records --
many-loss-records   tchain :0.06s  no: 0.9s (14.5x, -----)  me: 7.1s (118.8x, -----)
many-loss-records   trunk  :0.06s  no: 1.3s (21.5x,-48.3%)  me: 7.1s (118.5x,  0.3%)
-- many-xpts --
many-xpts           tchain :0.19s  no: 1.4s ( 7.1x, -----)  me:10.0s (52.7x, -----)
many-xpts           trunk  :0.19s  no: 2.1s (11.1x,-56.3%)  me: 9.8s (51.7x,  1.8%)
-- sarp --
sarp                tchain :0.13s  no: 1.1s ( 8.2x, -----)  me:12.0s (92.0x, -----)
sarp                trunk  :0.13s  no: 1.3s (10.3x,-26.4%)  me:12.1s (93.3x, -1.4%)
-- tinycc --
tinycc              tchain :0.91s  no: 8.1s ( 8.9x, -----)  me:47.0s (51.6x, -----)
tinycc              trunk  :0.91s  no:11.2s (12.4x,-38.2%)  me:50.6s (55.6x, -7.8%)
-- Finished tests in perf ----------------------------------------------
Interesting. This might have some interesting effects for s390, since there are some heavily used instructions that are basically a for-loop around an op (so no REP prefix; the instruction itself has a length field). One example is MVC (move character -> like memcpy, but the overlap case is defined as being one character at a time, left to right). These are implemented by jumping back to themselves and are therefore much more expensive on valgrind than the native version, since every iteration currently goes through the dispatcher for each byte.

We also have to see how the invalidation works out performance-wise. Remember the EXecute instruction? That was implemented with self-checking code that will invalidate code sections.
T-chaining changes -- summary

* The code generators (host_blah_isel.c, host_blah_defs.[ch]) interact
  more closely with Valgrind than before.  In particular the instruction
  selectors must use one of 3 different kinds of control-transfer
  instructions: XDirect, XIndir and XAssisted.  All archs must use these
  in the same way; there are no more ad-hoc control-transfer
  instructions.  (more detail below)

* With T-chaining, translations can jump between each other without
  going through the dispatcher loop every time.  This means that the
  event check (counter decrement, and exit if negative) that the
  dispatcher loop previously did now needs to be compiled into each
  translation.

* The assembly dispatcher code (dispatch-arch-os.S) is still present.
  It still provides table-lookup services for indirect branches, but it
  also provides a new feature: dispatch points, to which the generated
  code jumps.  There are 5:

  VG_(disp_cp_chain_me_to_slowEP):
  VG_(disp_cp_chain_me_to_fastEP):
     These are chain-me requests, used for Boring conditional and
     unconditional jumps to destinations known at JIT time.  The
     generated code calls these (doesn't jump to them) and the stub
     recovers the return address.  These calls never return; instead
     the call is done so that the stub knows where the calling point
     is.  It needs to know this so it can patch the calling point to
     the requested destination.

  VG_(disp_cp_xindir):
     Old-style table lookup and go; used for indirect jumps.

  VG_(disp_cp_xassisted):
     Most general and slowest kind.  Can transfer to anywhere, but
     first returns to the scheduler to do some other event (eg a
     syscall) before continuing.

  VG_(disp_cp_evcheck_fail):
     Code jumps here when the event check fails.

* New instructions in the backends: XDirect, XIndir and XAssisted.

  XDirect is used for chainable jumps.  It is compiled into a call to
  VG_(disp_cp_chain_me_to_slowEP) or VG_(disp_cp_chain_me_to_fastEP).

  XIndir is used for indirect jumps.  It is compiled into a jump to
  VG_(disp_cp_xindir).

  XAssisted is used for "assisted" (do something first, then jump)
  transfers.  It is compiled into a jump to VG_(disp_cp_xassisted).

  All 3 of these may be conditional.

  More complexity: in some circumstances (no-redir translations) all
  transfers must be done with XAssisted.  In such cases the instruction
  selector will be told this.

* Patching: XDirect is compiled basically into

     %r11 = &VG_(disp_cp_chain_me_to_{slow,fast}EP)
     call *%r11

  Backends must provide a function (eg) chainXDirect_AMD64 which
  converts it into a jump to a specified destination

     jmp $delta-of-PCs

  or

     %r11 = 64-bit immediate
     jmpq *%r11

  depending on branch distance.

  Backends must provide a function (eg) unchainXDirect_AMD64 which
  restores the original call-to-the-stub version.

* Event checks.  Each translation now has two entry points, the slow
  one (slowEP) and the fast one (fastEP).  Like this:

     slowEP:
        counter--
        if (counter < 0) goto VG_(disp_cp_evcheck_fail)
     fastEP:
        (rest of the translation)

  slowEP is used for control-flow transfers that are, or might be, a
  back edge in the control flow graph.  Insn selectors are given the
  address of the highest guest byte in the block so they can determine
  which edges are definitely not back edges.

  The counter is placed in the first 8 bytes of the guest state, and
  the address of VG_(disp_cp_evcheck_fail) is placed in the next 8
  bytes.  This allows very compact checks on all targets, since no
  immediates need to be synthesised, eg:

     decq 0(%baseblock-pointer)
     jns  fastEP
     jmpq *8(%baseblock-pointer)
     fastEP:

  On amd64 a non-failing check is therefore 2 insns; all 3 occupy just
  8 bytes.  On amd64 the event check is created by a special single
  pseudo-instruction, AMD64_EvCheck.
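As a concrete reading of the slowEP/fastEP rule just described, here is a minimal sketch, in C, of the decision an instruction selector can make once it knows the address of the highest guest byte in the block. The function and parameter names are invented for illustration; this is not the actual VEX insn-selector code.

   #include <stdbool.h>
   #include <stdint.h>

   /* Sketch only: decide whether a direct transfer to guest address 'dst'
      must use the destination's slow (event-checked) entry point.  A jump
      to an address at or below the last guest byte of the current block
      could be a back edge in the CFG, so it gets the event check; a jump
      strictly beyond the block cannot be a back edge and may use fastEP. */
   static bool must_use_slowEP ( uint64_t dst, uint64_t max_guest_addr_in_block )
   {
      return dst <= max_guest_addr_in_block;
   }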
* BB profiling (for --profile-flags=).  The assembly dispatcher
  (dispatch-arch-os.S) no longer deals with this and so is much
  simplified.  Instead the profile inc is compiled into each
  translation, as the insn immediately following the event check.
  Again, on amd64 a pseudo-insn, AMD64_ProfInc, is used.  Counters are
  now 64 bit even on 32 bit hosts, to avoid overflow.

  One complexity is that the address of the counter is not known at JIT
  time.  To solve this, VexTranslateResult now returns the offset of
  the profile inc in the generated code.  When the counter address is
  known, VEX can be called again to patch it in.  Backends must supply
  eg patchProfInc_AMD64 to make this happen.

* Front end changes (guest_blah_toIR.c)

  The way the guest program counter is handled has changed
  significantly.  Previously, the guest PC was updated (in IR) at the
  start of each instruction, except for the first insn in an IRSB.
  This is inconsistent and doesn't work with the new framework.

  Now, each instruction must update the guest PC as its last IR
  statement -- not its first.  And there is no longer a special
  exemption for the first insn in the block.  As before, most of these
  updates are optimised out by ir_opt, so there are no efficiency
  concerns.

  As a logical side effect of this, exits (IRStmt_Exit) and the
  block-end transfer are both considered to write to the guest state
  (the guest PC) and so need to be told its offset.

  IR generators (eg disInstr_AMD64) are no longer allowed to set
  IRSB::next to specify the block-end transfer address.  Instead they
  now indicate, to the generic steering logic that drives them (iow,
  guest_generic_bb_to_IR.c), that the block has ended.  This then
  generates effectively a "goto GET(PC)" (which, again, is optimised
  away).  What this does mean is that if the IR generator function ends
  the IR of the last instruction in the block with an incorrect
  assignment to the guest PC, execution will transfer to an incorrect
  destination -- making the error obvious quickly.
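Going back to the BB-profiling bullet, here is a minimal sketch, in C, of the two-phase counter-patching flow described there. All names below (TransResult, offs_profInc, patch_profile_inc, install_counter) are invented for illustration; the real interfaces are VexTranslateResult and the backend hook (eg patchProfInc_AMD64), whose exact signatures are not shown in this thread.

   #include <stdint.h>
   #include <stddef.h>

   /* Sketch only: at JIT time the counter's address is unknown, so the
      translator just records where in the generated code the profile inc
      sits.  Later, once the 64-bit counter has been allocated, the inc is
      patched to point at it. */
   typedef struct {
      uint8_t* code;           /* start of the generated host code        */
      size_t   offs_profInc;   /* offset of the profile inc within 'code' */
   } TransResult;

   /* Hypothetical backend hook (in spirit, patchProfInc_AMD64): rewrite
      the inc at 'place' so that it bumps the 64-bit counter at 'counter'. */
   extern void patch_profile_inc(uint8_t* place, uint64_t* counter);

   static void install_counter(const TransResult* tr, uint64_t* counter)
   {
      patch_profile_inc(tr->code + tr->offs_profInc, counter);
   }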
(In reply to comment #4)
> We also have to see how the invalidation works out performance-wise.
> Remember the EXecute instruction? This was implemented by self checking
> code that will invalidate code sections.

Invalidation is more expensive (and much more complex, internally), since deleting a block now requires undoing (unchaining) any direct jumps to it. To do this reasonably efficiently requires m_transtab to maintain the directed graph created by chaining -- stored as mutually redundant forward and reverse edges, for efficiency.
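To make the forward/reverse-edge point concrete, here is a minimal sketch, in C, of the kind of bookkeeping this implies. All names (Edge, Translation, unchain_exit, discard_translation) are invented for illustration; this is not the real m_transtab code, and the backend hook corresponds in spirit to the unchainXDirect_AMD64 function mentioned earlier.

   #include <stddef.h>

   /* Sketch only.  Each translation remembers which other translations'
      exits have been patched to jump to it (in-edges), and which
      translations its own exits have been patched to jump to (out-edges). */
   typedef struct Edge {
      struct Translation* other;   /* the translation at the far end          */
      void*               site;    /* host-code address of the patched jump   */
      struct Edge*        next;
   } Edge;

   typedef struct Translation {
      Edge* in_edges;    /* exits elsewhere that jump directly to us */
      Edge* out_edges;   /* our exits that jump directly elsewhere   */
   } Translation;

   /* Hypothetical backend hook: restore the original call-to-the-stub
      form at 'site' (in spirit, unchainXDirect_AMD64). */
   extern void unchain_exit(void* site);

   static void discard_translation(Translation* t)
   {
      /* Nobody may jump straight into code that is about to be deleted,
         so every in-edge must be unchained first. */
      for (Edge* e = t->in_edges; e != NULL; e = e->next)
         unchain_exit(e->site);
      /* Removing t from its successors' in-edge lists (found via
         t->out_edges) is omitted here for brevity. */
   }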
Created attachment 69797 [details] (1 of 2) amd64-only diffs, as a guide for other archs (vex-side)
Created attachment 69798 [details]
(2 of 2) amd64-only diffs, as a guide for other archs (valgrind-side)

This and the previous patch show just the amd64-specific parts of the main patches (69754/69755) but omit the generic framework stuff that is the same for all architectures. This makes it a lot easier to see what needs to be changed for any specific architecture.
Created attachment 69847 [details] (1 of 2) vex-side patches against 24 March 2012 trunk (r2266/r12455)
Created attachment 69849 [details]
(2 of 2) valgrind-side patches against 24 March 2012 trunk (r2266/r12455)

Updated version of the patch-pair, adding support for x86-linux. Hence x86-linux and amd64-linux are now supported, for all tools and facilities except the gdb server.
Created attachment 70066 [details] (1 of 2) vex-side patches against 1 April 2012 trunk (r2269/r12475)
Created attachment 70067 [details]
(2 of 2) valgrind-side patches against 1 April 2012 trunk (r2269/r12475)

Adds support for arm-linux.
All of this work has been committed now at svn://svn.valgrind.org/valgrind/branches/TCHAIN (r2273/r12484)