Bug 296422 - Add translation chaining support
Summary: Add translation chaining support
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: general
Version: 3.7 SVN
Platform: unspecified Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Julian Seward
 
Reported: 2012-03-20 09:53 UTC by Julian Seward
Modified: 2012-07-01 19:56 UTC
CC List: 3 users



Attachments
(1 of 2) vex-side patches against 20 March 2012 trunk (r2265/r12452) (168.04 KB, patch)
2012-03-20 09:58 UTC, Julian Seward
(2 of 2) valgrind-side patches against 20 March 2012 trunk (r2265/r12452) (93.61 KB, patch)
2012-03-20 10:02 UTC, Julian Seward
(1 of 2) amd64-only diffs, as a guide for other archs (vex-side) (76.59 KB, patch)
2012-03-22 00:29 UTC, Julian Seward
(2 of 2) amd64-only diffs, as a guide for other archs (valgrind-side) (13.46 KB, patch)
2012-03-22 00:31 UTC, Julian Seward
(1 of 2) vex-side patches against 24 March 2012 trunk (r2266/r12455) (234.73 KB, patch)
2012-03-24 11:21 UTC, Julian Seward
(2 of 2) valgrind-side patches against 24 March 2012 trunk (r2266/r12455) (108.90 KB, patch)
2012-03-24 11:24 UTC, Julian Seward
(1 of 2) vex-side patches against 1 April 2012 trunk (r2269/r12475) (310.13 KB, patch)
2012-04-01 23:50 UTC, Julian Seward
(2 of 2) valgrind-side patches against 1 April 2012 trunk (r2269/r12475) (136.71 KB, patch)
2012-04-01 23:52 UTC, Julian Seward

Description Julian Seward 2012-03-20 09:53:38 UTC
Currently, each execution of an (instrumented) basic block (a
superblock, really) requires a table lookup, done by the assembly
stubs in coregrind/m_dispatch/dispatch-arch-os.S.  Despite various
attempts at optimisation, this incurs a cache miss when looking up in
the table, and a branch mispredict when jumping to the generated code.
Together these cause a significant performance hit.

This patch adds chaining support.  It applies to basic blocks that end
in unconditional or conditional branches to destinations known at JIT
time (more generally, to destinations that are known to be constant
after initial IR optimisation).  Such blocks will have their exits
patched so as to jump directly to their destinations, without having
to go through the assembly dispatcher.  This avoids the associated
cache misses and branch mispredicts, and also quite simply saves a
substantial number of instructions, since each trip through the
dispatcher costs 10 to 20 instructions.

Branches to destinations not known at JIT time (returns, mostly, and
indirect jumps) still have to use the dispatcher.  However, these
are, relatively speaking, in the minority.
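
To make the dispatcher cost concrete, here is a minimal C sketch of
what one trip through it conceptually does.  The real code is
hand-written assembly, and every name and size below is hypothetical:

    /* Illustrative only: one trip through a dispatcher-like loop. */
    typedef void (*TransFn)(void);

    struct FastCacheEntry { unsigned long guest_addr; TransFn host_code; };
    extern struct FastCacheEntry fast_cache[1 << 16];   /* assumed size */
    extern void handle_miss(unsigned long guest_pc);    /* slow path    */

    static void dispatcher_trip(unsigned long guest_pc)
    {
       /* Probe the translation cache: frequently a data-cache miss. */
       struct FastCacheEntry *e = &fast_cache[(guest_pc >> 2) & 0xFFFF];
       if (e->guest_addr == guest_pc)
          e->host_code();   /* indirect branch: frequently mispredicted */
       else
          handle_miss(guest_pc);
    }

Chaining bypasses this whole sequence for exits whose targets are
known at JIT time.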

This patch adds a second, minor but useful optimisation.  Up till now,
each trip through the dispatcher has decremented and tested a counter
-- an "event check".  If that goes to zero then we fall out of the
dispatcher and do housekeeping stuff that periodically needs to be
done (let another thread run, deliver signals).

This costs one load and store per block.  The patch restricts event
checking to backward edges in the control flow graph, which removes
about 2/3 of them and hence about 2/3 of their cost.

Currently the patch works stably for large apps, but for amd64-linux
only.  Initial tests show around an 8% performance improvement
for Memcheck and correspondingly larger improvements for more
lightweight tools.  The GDB server is currently broken; this is under
investigation.

This initial patch is large and intrusive, in part because it contains
not only the amd64-specifics, but also the general infrastructure
changes for all platforms.  The basic idea sounds simple in theory,
but making everything work (event checks, redirection, GDB server,
code discarding) cross-architecture is fairly complex.
Comment 1 Julian Seward 2012-03-20 09:58:44 UTC
Created attachment 69754 [details]
(1 of 2) vex-side patches against 20 March 2012 trunk (r2265/r12452)
Comment 2 Julian Seward 2012-03-20 10:02:10 UTC
Created attachment 69755 [details]
(2 of 2) valgrind-side patches against 20 March 2012 trunk (r2265/r12452)

This patch, together with the immediately preceding one, gives support
for amd64-linux, all tools and facilities, except GDB server support,
which will assert.  Support for other architectures is missing whilst
the basic framework is finalised.  To build, configure with
--enable-only64bit.
Comment 3 Julian Seward 2012-03-20 18:39:35 UTC
Here are some numbers from perf/, running on two 64-bit CPUs close to
opposite ends of the spectrum of microarchitectural sophistication: a
3.46 GHz Core i5 670, with 4MB L3 and 256K per-core L2s, and a 1.6 GHz
dual-core Atom D510 with (IIRC) 24K/32K/512K I1/D1/L2.

What's striking is that the relative speedup is generally much larger
on the Atom, in one case more than halving the run time with
--tool=none (for bz2).  That's not surprising considering that the
Atom has fewer cache and branch predictor resources to deal with the
storm of mispredicts and misses caused by the trunk's
always-use-the-dispatcher scheme.  IOW, chaining is a relatively
bigger win for lower-end CPUs.



10 iterations on 3.47 GHz Core i5 670

-- Running  tests in perf ----------------------------------------------
-- bigcode1 --
bigcode1 tchain    :0.11s  no: 1.7s (15.3x, -----)  me: 3.4s (30.6x, -----)
bigcode1 trunk     :0.11s  no: 2.1s (19.4x,-26.8%)  me: 3.8s (34.3x,-11.9%)
-- bigcode2 --
bigcode2 tchain    :0.11s  no: 4.2s (38.1x, -----)  me: 8.9s (80.5x, -----)
bigcode2 trunk     :0.11s  no: 4.6s (41.7x, -9.5%)  me: 9.2s (84.0x, -4.3%)
-- bz2 --
bz2      tchain    :0.63s  no: 1.8s ( 2.9x, -----)  me: 6.2s ( 9.8x, -----)
bz2      trunk     :0.63s  no: 2.4s ( 3.8x,-31.1%)  me: 6.7s (10.7x, -8.2%)
-- fbench --
fbench   tchain    :0.24s  no: 1.1s ( 4.4x, -----)  me: 3.8s (16.0x, -----)
fbench   trunk     :0.24s  no: 1.2s ( 5.0x,-12.3%)  me: 4.2s (17.3x, -8.3%)
-- ffbench --
ffbench  tchain    :0.22s  no: 0.9s ( 4.2x, -----)  me: 3.0s (13.8x, -----)
ffbench  trunk     :0.22s  no: 0.9s ( 4.3x, -2.2%)  me: 3.1s (14.0x, -1.3%)
-- heap --
heap     tchain    :0.08s  no: 0.6s ( 8.0x, -----)  me: 5.5s (68.2x, -----)
heap     trunk     :0.08s  no: 0.9s (11.0x,-37.5%)  me: 5.4s (68.0x,  0.4%)
-- heap_pdb4 --
heap_pdb4 tchain    :0.11s  no: 0.7s ( 6.4x, -----)  me: 9.3s (84.5x, -----)
heap_pdb4 trunk     :0.11s  no: 0.9s ( 8.5x,-34.3%)  me: 9.7s (87.9x, -4.0%)
-- many-loss-records --
many-loss-records tchain    :0.01s  no: 0.2s (18.0x, -----)  me: 1.4s (138.0x, -----)
many-loss-records trunk     :0.01s  no: 0.2s (22.0x,-22.2%)  me: 1.4s (138.0x, -0.0%)
-- many-xpts --
many-xpts tchain    :0.04s  no: 0.3s ( 6.8x, -----)  me: 1.9s (46.8x, -----)
many-xpts trunk     :0.04s  no: 0.3s ( 8.2x,-22.2%)  me: 1.8s (46.0x,  1.6%)
-- sarp --
sarp     tchain    :0.02s  no: 0.2s (11.0x, -----)  me: 2.3s (116.0x, -----)
sarp     trunk     :0.02s  no: 0.2s (11.0x,  0.0%)  me: 2.4s (121.5x, -4.7%)
-- tinycc --
tinycc   tchain    :0.18s  no: 1.4s ( 7.7x, -----)  me: 9.3s (51.9x, -----)
tinycc   trunk     :0.18s  no: 2.2s (12.3x,-59.7%)  me:10.4s (57.8x,-11.2%)
-- Finished tests in perf ----------------------------------------------



5 iterations on a 1.6 GHz Atom D510

-- Running  tests in perf ----------------------------------------------
-- bigcode1 --
bigcode1 tchain    :0.44s  no: 7.0s (16.0x, -----)  me:14.1s (32.1x, -----)
bigcode1 trunk     :0.44s  no:15.5s (35.2x,-120.7%)  me:22.8s (51.7x,-61.3%)
-- bigcode2 --
bigcode2 tchain    :0.44s  no:17.8s (40.5x, -----)  me:37.7s (85.6x, -----)
bigcode2 trunk     :0.44s  no:25.8s (58.5x,-44.5%)  me:45.8s (104.1x,-21.6%)
-- bz2 --
bz2      tchain    :2.27s  no: 7.4s ( 3.2x, -----)  me:45.6s (20.1x, -----)
bz2      trunk     :2.27s  no:16.7s ( 7.3x,-126.1%)  me:54.2s (23.9x,-18.7%)
-- fbench --
fbench   tchain    :1.71s  no: 6.5s ( 3.8x, -----)  me:25.9s (15.1x, -----)
fbench   trunk     :1.71s  no: 9.0s ( 5.3x,-39.0%)  me:27.8s (16.3x, -7.5%)
-- ffbench --
ffbench  tchain    :3.38s  no: 5.7s ( 1.7x, -----)  me:20.6s ( 6.1x, -----)
ffbench  trunk     :3.38s  no: 6.7s ( 2.0x,-17.9%)  me:21.2s ( 6.3x, -3.1%)
-- heap --
heap     tchain    :0.50s  no: 3.7s ( 7.3x, -----)  me:33.1s (66.3x, -----)
heap     trunk     :0.50s  no: 6.5s (12.9x,-76.0%)  me:32.2s (64.4x,  2.8%)
-- heap_pdb4 --
heap_pdb4 tchain    :0.61s  no: 3.9s ( 6.4x, -----)  me:52.3s (85.7x, -----)
heap_pdb4 trunk     :0.61s  no: 6.9s (11.3x,-75.6%)  me:51.3s (84.0x,  2.0%)
-- many-loss-records --
many-loss-records tchain    :0.06s  no: 0.9s (14.5x, -----)  me: 7.1s (118.8x, -----)
many-loss-records trunk     :0.06s  no: 1.3s (21.5x,-48.3%)  me: 7.1s (118.5x,  0.3%)
-- many-xpts --
many-xpts tchain    :0.19s  no: 1.4s ( 7.1x, -----)  me:10.0s (52.7x, -----)
many-xpts trunk     :0.19s  no: 2.1s (11.1x,-56.3%)  me: 9.8s (51.7x,  1.8%)
-- sarp --
sarp     tchain    :0.13s  no: 1.1s ( 8.2x, -----)  me:12.0s (92.0x, -----)
sarp     trunk     :0.13s  no: 1.3s (10.3x,-26.4%)  me:12.1s (93.3x, -1.4%)
-- tinycc --
tinycc   tchain    :0.91s  no: 8.1s ( 8.9x, -----)  me:47.0s (51.6x, -----)
tinycc   trunk     :0.91s  no:11.2s (12.4x,-38.2%)  me:50.6s (55.6x, -7.8%)
-- Finished tests in perf ----------------------------------------------
Comment 4 Christian Borntraeger 2012-03-20 19:40:49 UTC
Interesting. This might have some interesting effects for s390, since there are some heavily-used instructions that are basically a for-loop around an op (no REP prefix; the instruction itself has a length field). One example is MVC (move character -- like memcpy, but the overlap case is defined as moving one character at a time, left to right). These are implemented by jumping back to the same translation, and are therefore much more expensive on valgrind than natively, since every iteration currently goes through the dispatcher.

We also have to see how the invalidation works out performance-wise. Remember the EXecute instruction? That was implemented by self-checking code that invalidates code sections.
Comment 5 Julian Seward 2012-03-20 23:19:56 UTC
T-chaining changes -- summary

* The code generators (host_blah_isel.c, host_blah_defs.[ch]) interact
  more closely with Valgrind than before.  In particular the
  instruction selectors must use one of 3 different kinds of
  control-transfer instructions: XDirect, XIndir and XAssisted.
  All architectures must use these in the same way; there are no more
  ad-hoc control-transfer instructions.
  (more detail below)


* With T-chaining, translations can jump between each other without
  going through the dispatcher loop every time.  This means that the
  event check (counter decrement, and exit if it goes negative) that
  the dispatcher loop previously performed now needs to be compiled
  into each translation.


* The assembly dispatcher code (dispatch-arch-os.S) is still
  present.  It still provides table lookup services for 
  indirect branches, but it also provides a new feature: 
  dispatch points, to which the generated code jumps.  There
  are 5:

  VG_(disp_cp_chain_me_to_slowEP):
  VG_(disp_cp_chain_me_to_fastEP):
    These are chain-me requests, used for Boring conditional and
    unconditional jumps to destinations known at JIT time.  The
    generated code calls these (rather than jumping to them) purely so
    that the stub can recover the address of the call site from the
    return address; these calls never actually return.  The stub needs
    the call-site address so that it can patch the call site to jump
    straight to the requested destination (a C-level sketch of this
    call-and-patch trick follows this list).
  VG_(disp_cp_xindir):
    Old-style table lookup and go; used for indirect jumps
  VG_(disp_cp_xassisted):
    Most general and slowest kind.  Can transfer to anywhere, but
    first returns to scheduler to do some other event (eg a syscall)
    before continuing.
  VG_(disp_cp_evcheck_fail):
    Code jumps here when the event check fails.

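  As promised above, a C-level sketch of the chain-me call-and-patch
  trick.  The real stubs are assembly; disp_cp_chain_me stands in for
  VG_(disp_cp_chain_me_to_{slow,fast}EP), and everything else here
  (CHAIN_SITE_LEN, translation_for, current_guest_pc, chainXDirect)
  is a made-up illustration:

     extern void *translation_for(unsigned long guest_pc); /* hypothetical */
     extern unsigned long current_guest_pc(void);          /* hypothetical */
     extern void chainXDirect(void *place, const void *dest);

     /* Assumed length of "movabs $stub,%r11 ; call *%r11" (10+3 bytes). */
     #define CHAIN_SITE_LEN 13

     void disp_cp_chain_me(void)      /* conceptually; really assembly */
     {
        /* The generated code CALLed us, so the return address points
           just past the call insn, ie it identifies the patch site. */
        char *ret_addr   = (char *)__builtin_return_address(0);
        char *patch_site = ret_addr - CHAIN_SITE_LEN;
        void *dest = translation_for(current_guest_pc());
        chainXDirect(patch_site, dest);  /* rewrite call -> direct jump */
        ((void (*)(void))dest)();        /* transfer; never returns     */
     }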

* new instructions in backends: XDirect, XIndir and XAssisted.
  XDirect is used for chainable jumps.  It is compiled into a
  call to VG_(disp_cp_chain_me_to_slowEP) or
  VG_(disp_cp_chain_me_to_fastEP).

  XIndir is used for indirect jumps.  It is compiled into a jump
  to VG_(disp_cp_xindir).

  XAssisted is used for "assisted" (do something first, then jump)
  transfers.  It is compiled into a jump to VG_(disp_cp_xassisted).

  All 3 of these may be conditional.

  More complexity: in some circumstances (no-redir translations)
  all transfers must be done with XAssisted.  In such cases the
  instruction selector will be told this.

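  A hedged sketch of the resulting three-way choice the instruction
  selector makes for each transfer; the enum and function below are
  illustrative, not actual VEX API:

     /* Illustrative only: classifying a control transfer. */
     typedef enum { Xfer_Direct, Xfer_Indir, Xfer_Assisted } XferKind;

     static XferKind classify_transfer(int dst_known_at_jit_time,
                                       int needs_side_effect, /* eg syscall */
                                       int is_noredir_translation)
     {
        if (is_noredir_translation || needs_side_effect)
           return Xfer_Assisted;  /* must return to scheduler first       */
        if (dst_known_at_jit_time)
           return Xfer_Direct;    /* chainable: calls a chain-me stub     */
        return Xfer_Indir;        /* table lookup via VG_(disp_cp_xindir) */
     }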

* Patching: XDirect is compiled basically into
     %r11 = &VG_(disp_cp_chain_me_to_{slow,fast}EP)
     call *%r11
  Backends must provide a function (eg) chainXDirect_AMD64
  which converts it into a jump to a specified destination
     jmp $delta-of-PCs
  or
     %r11 = 64-bit immediate
     jmpq *%r11
  depending on branch distance.

  Backends must provide a function (eg) unchainXDirect_AMD64
  which restores the original call-to-the-stub version.

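  To make the patching concrete, a hedged sketch of what a
  chainXDirect-style function could emit on amd64.  The instruction
  encodings are standard x86-64, but the exact layout, W^X handling
  and icache flushing of the committed code may well differ:

     #include <stdint.h>
     #include <string.h>

     /* Sketch only: rewrite the 13-byte chain-me call site at 'place'
        into a direct jump to 'dest'. */
     static void chainXDirect_sketch(void *place, const void *dest)
     {
        uint8_t *p     = (uint8_t *)place;
        int64_t  delta = (int64_t)(intptr_t)dest
                         - ((int64_t)(intptr_t)place + 5);
        if (delta == (int64_t)(int32_t)delta) {
           int32_t d32 = (int32_t)delta;
           p[0] = 0xE9;                  /* jmp rel32 (5 bytes)        */
           memcpy(&p[1], &d32, 4);
        } else {
           p[0] = 0x49; p[1] = 0xBB;     /* movabs $dest, %r11         */
           memcpy(&p[2], &dest, 8);
           p[10] = 0x41; p[11] = 0xFF; p[12] = 0xE3;   /* jmpq *%r11   */
        }
        /* unchainXDirect_AMD64 would restore the bytes of the original
           movabs-$stub / call-*%r11 sequence at this site. */
     }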

* Event checks.  Each translation now has two entry points,
  the slow one (slowEP) and fast one (fastEP).  Like this:

     slowEP:
        counter--
        if (counter < 0) goto VG_(disp_cp_evcheck_fail)
     fastEP:
        (rest of the translation)

  slowEP is used for control flow transfers that are or might be
  a back edge in the control flow graph.  Insn selectors are
  given the address of the highest guest byte in the block so
  they can determine which edges are definitely not back edges.

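  That test is cheap; a one-line sketch (names hypothetical):

     /* An exit may skip the event check only if it is definitely not
        a back edge, ie it targets an address beyond this block's
        highest guest byte. */
     static int needs_event_check(unsigned long dst,
                                  unsigned long max_guest_addr_in_block)
     {
        return dst <= max_guest_addr_in_block;  /* might be a back edge */
     }
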
  The counter is placed in the first 8 bytes of the guest state,
  and the address of VG_(disp_cp_evcheck_fail) is placed in
  the next 8 bytes.  This allows very compact checks on all
  targets, since no immediates need to be synthesised, eg:

    decq 0(%baseblock-pointer)
    jns  fastEP
    jmpq *8(%baseblock-pointer)
    fastEP:

  On amd64 a non-failing check is therefore 2 insns; all 3 occupy
  just 8 bytes.

  On amd64 the event check is created by a special single
  pseudo-instruction AMD64_EvCheck.

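  The layout assumed by that check can be pictured as a C struct.  The
  field names here are hypothetical, but the offsets follow the text:
  counter at offset 0, failure address at offset 8:

     #include <stdint.h>

     /* Sketch: the first 16 bytes of the guest state, reachable at
        fixed small offsets from the baseblock pointer register. */
     typedef struct {
        int64_t  evc_counter;   /* offset 0: decremented by each check  */
        uint64_t evc_failaddr;  /* offset 8: &VG_(disp_cp_evcheck_fail) */
        /* ... rest of the guest state follows ... */
     } GuestStateHeaderSketch;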

* BB profiling (for --profile-flags=).  The assembly dispatcher
  dispatch-arch-os.S no longer deals with this and is therefore much
  simplified.  Instead the profile increment is compiled into each
  translation, as the insn immediately following the event
  check.  Again, on amd64 a pseudo-insn AMD64_ProfInc is used.
  Counters are now 64-bit even on 32-bit hosts, to avoid overflow.

  One complexity is that, at JIT time, the address of the counter
  is not yet known.  To solve this, VexTranslateResult
  now returns the offset of the profile inc in the generated
  code.  When the counter address is known, VEX can be called
  again to patch it in.  Backends must supply eg
  patchProfInc_AMD64 to make this happen.

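  A sketch of that two-phase flow.  Apart from patchProfInc_AMD64,
  which the text names (shown here with a simplified signature),
  everything below is a made-up illustration:

     typedef struct { int offs_profInc; /* ... */ } TranslateResultSketch;

     extern unsigned long long *alloc_profile_counter(void); /* hypothetical */
     extern void patchProfInc_AMD64(void *code, unsigned long long *ctr);

     static void install_translation(char *host_code,
                                     TranslateResultSketch r)
     {
        /* Only now, once the translation has a home in the table, is
           the address of its 64-bit profile counter known; patch it in. */
        patchProfInc_AMD64(host_code + r.offs_profInc,
                           alloc_profile_counter());
     }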

* Front end changes (guest_blah_toIR.c)

  The way the guest program counter is handled has changed
  significantly.  Previously, the guest PC was updated (in IR)
  at the start of each instruction, except for the first insn
  in an IRSB.  This is inconsistent and doesn't work with the
  new framework.

  Now, each instruction must update the guest PC as its last
  IR statement -- not its first -- with no special exemption for
  the first insn in the block.  As before, most of these updates
  are optimised out by ir_opt, so there is no efficiency concern.

  As a logical side effect of this, exits (IRStmt_Exit) and the
  block-end transfer are both considered to write to the guest state
  (the guest PC) and so need to be told the offset of it.

  IR generators (eg disInstr_AMD64) are no longer allowed to set the
  IRSB::next, to specify the block-end transfer address.  Instead they
  now indicate, to the generic steering logic that drives them (iow,
  guest_generic_bb_to_IR.c), that the block has ended.  This then
  generates effectively "goto GET(PC)" (which, again, is optimised
  away).  What this does mean is that if the IR generator function
  ends the IR of the last instruction in the block with an incorrect
  assignment to the guest PC, execution will transfer to an incorrect
  destination -- making the error obvious quickly.
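
  Schematically, the IR for one guest instruction now ends like this
  (addresses, offsets and temporaries invented for illustration):

     ------ IMark(0x401000, 3, 0) ------
     t1 = Add64(GET:I64(OFFSET_rax), 0x1:I64)
     PUT(OFFSET_rax) = t1
     PUT(OFFSET_RIP) = 0x401003:I64   <- guest PC written as the LAST stmt

  If the block ends here, the steering logic then emits the block-end
  transfer as, in effect, "goto GET:I64(OFFSET_RIP)", which ir_opt
  normally folds away.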
Comment 6 Julian Seward 2012-03-20 23:26:05 UTC
(In reply to comment #4)
> We also have to see how the invalidation works out performance-wise.
> Remember the EXecute instruction? That was implemented by self-checking
> code that invalidates code sections.

Invalidation is more expensive (and much more complex, internally)
since to delete a block now requires undoing (unchaining) any direct
jumps to it.  To do this reasonably efficiently requires m_transtab
to maintain the directed graph created by chaining -- as mutually
redundant forward and reverse edges, for efficiency.
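
A hedged C sketch of that bookkeeping; all structure and function
names below are hypothetical:

    struct TTEntry;   /* one translation-table entry */

    typedef struct { void *patch_site; struct TTEntry *to;   } OutEdge;
    typedef struct { void *patch_site; struct TTEntry *from; } InEdge;

    struct TTEntry {
       OutEdge out[8];  int n_out;   /* sites in us chained to others   */
       InEdge  in[8];   int n_in;    /* sites elsewhere chained into us */
    };

    extern void unchainXDirect(void *patch_site); /* back to chain-me */

    static void discard_translation(struct TTEntry *t)
    {
       /* Undo every direct jump into t before deleting its code. */
       for (int i = 0; i < t->n_in; i++)
          unchainXDirect(t->in[i].patch_site);
       /* The matching OutEdge records in each predecessor, and t's own
          out-edges in its successors' in-edge lists, must go too. */
    }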
Comment 7 Julian Seward 2012-03-22 00:29:12 UTC
Created attachment 69797 [details]
(1 of 2) amd64-only diffs, as a guide for other archs (vex-side)
Comment 8 Julian Seward 2012-03-22 00:31:55 UTC
Created attachment 69798 [details]
(2 of 2) amd64-only diffs, as a guide for other archs (valgrind-side)

This and the previous patch show just the amd64-specific parts of the main patches
(69754/69755) but omit the generic framework stuff that is the same for all
architectures.  This makes it a lot easier to see what needs to be changed for
any specific architecture.
Comment 9 Julian Seward 2012-03-24 11:21:47 UTC
Created attachment 69847 [details]
(1 of 2) vex-side patches against 24 March 2012 trunk (r2266/r12455)
Comment 10 Julian Seward 2012-03-24 11:24:57 UTC
Created attachment 69849 [details]
(2 of 2) valgrind-side patches against 24 March 2012 trunk (r2266/r12455)

Updated version of the patch-pair that adds support for x86-linux.
Hence x86-linux and amd64-linux are supported, for all tools and
facilities except the GDB server.
Comment 11 Julian Seward 2012-04-01 23:50:35 UTC
Created attachment 70066 [details]
(1 of 2) vex-side patches against 1 April 2012 trunk (r2269/r12475)
Comment 12 Julian Seward 2012-04-01 23:52:14 UTC
Created attachment 70067 [details]
(2 of 2) valgrind-side patches against 1 April 2012 trunk (r2269/r12475)

Adds support for arm-linux.
Comment 13 Julian Seward 2012-04-02 21:57:59 UTC
All of this work has been committed now at
svn://svn.valgrind.org/valgrind/branches/TCHAIN
(r2273/r12484)