Bug 402781 - Redo the cache used to process indirect branch targets
Summary: Redo the cache used to process indirect branch targets
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: general
Version: 3.15 SVN
Platform: Other Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Julian Seward
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-01-02 09:15 UTC by Julian Seward
Modified: 2019-01-25 08:32 UTC
CC List: 1 user

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:


Attachments
WIP patch (27.23 KB, patch)
2019-01-02 19:23 UTC, Julian Seward
Revised WIP patch (28.79 KB, patch)
2019-01-05 07:49 UTC, Julian Seward
Supports x86_{32,64}-linux, arm{32,64}-linux (39.29 KB, patch)
2019-01-12 18:45 UTC, Julian Seward
Supports x86_{32,64}-linux, arm{32,64}-linux, ppc{64,32}be-linux, ppc64le-linux (58.15 KB, patch)
2019-01-16 19:36 UTC, Julian Seward
WIP patch (69.10 KB, patch)
2019-01-17 14:26 UTC, Julian Seward
WIP patch (75.35 KB, patch)
2019-01-18 19:05 UTC, Julian Seward

Description Julian Seward 2019-01-02 09:15:57 UTC
In the baseline simulator, jumps to guest code addresses that are not known at
JIT time have to be looked up in a guest->host mapping table.  That means:
indirect branches, indirect calls and, most commonly, returns.  Since there are
huge numbers of these (often 10+ million/second), the mapping mechanism needs
to be extremely cheap.

Currently, this is implemented using a direct-mapped cache, VG_(tt_fast), with
2^15 (guest_addr, host_addr) pairs.  This is queried in handwritten assembly
in VG_(disp_cp_xindir) in dispatch-<arch>-<os>.S.  If there is a miss in the
cache then we fall back out to C land, and do a slow lookup using
VG_(search_transtab).
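
A minimal C sketch of the current scheme, using illustrative names
(the real table is VG_(tt_fast) and the probe itself is the handwritten
assembly in VG_(disp_cp_xindir)):

  #include <stdint.h>

  #define FAST_BITS 15
  #define FAST_SIZE (1 << FAST_BITS)              /* 2^15 entries */

  typedef struct { uintptr_t guest; uintptr_t host; } FastEntry;
  static FastEntry fast_cache[FAST_SIZE];

  /* Direct-mapped probe: one index, one compare.  Returns 0 on a miss,
     in which case the dispatcher falls back to the slow C-land lookup,
     VG_(search_transtab). */
  static uintptr_t fast_lookup ( uintptr_t guest_addr )
  {
     FastEntry* e = &fast_cache[guest_addr & (FAST_SIZE - 1)];
     return e->guest == guest_addr ? e->host : 0;
  }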

Given that the size of the translation table(s) has expanded significantly in
recent years in order to keep pace with increasing application sizes, two bad
things have happened: (1) the cost of a miss in the fast cache has risen
significantly, and (2) the miss rate on the fast cache has also increased
significantly.  This means that large (~ one-million-basic-blocks-JITted)
applications that run for a long time end up spending a lot of time in
VG_(search_transtab).

The proposed fix is to increase the associativity of the fast cache, from 1
(direct mapped) to 4.  Simulations of various cache configurations using
indirect-branch traces from a large application show that this is the best of
the configurations tried.  In an extreme case with 5.7 billion indirect
branches:

* The increase of associativity from 1 way to 4 ways, whilst keeping the
  overall cache size the same (32k guest/host pairs), reduces the miss rate by
  around a factor of 3, from 4.02% to 1.30%.

* The use of a slightly better hash function, rather than merely slicing off
  the bottom 15 bits of the address, reduces the miss rate further, from 1.30%
  to 0.53%.  (A sketch of both hash schemes follows this list.)
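
A hedged sketch of the two index/hash schemes mentioned above.  The exact
shift amount is an assumption; the text only says the improved hash adds a
shift and an xor.  With 32k pairs at 4-way associativity there are 2^13
sets, hence the 13-bit index:

  #include <stdint.h>

  #define SET_BITS 13                  /* 2^13 sets x 4 ways = 32k pairs */
  #define SET_MASK ((1u << SET_BITS) - 1)

  /* Old scheme: merely slice off the bottom bits of the guest address. */
  static inline uint32_t hash_old ( uintptr_t ga )
  {
     return (uint32_t)(ga & SET_MASK);
  }

  /* Improved scheme: fold higher address bits in with one shift and one
     xor, so addresses differing only in high bits no longer collide. */
  static inline uint32_t hash_new ( uintptr_t ga )
  {
     return (uint32_t)((ga ^ (ga >> SET_BITS)) & SET_MASK);
  }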

Overall the VG_(tt_fast) miss rate is almost unchanged on small workloads, but
reduced by a factor of up to almost 8 on large workloads.

By implementing each (4-entry) cache set using a move-to-front scheme in the
case of hits in ways 1, 2 or 3, the vast majority of hits can be made to
happen in way 0.  Hence the cost of having this extra associativity is almost
zero in the case of a hit.  The improved hash function costs an extra two ALU
operations (a shift and an xor), but overall this seems performance-neutral to
a win.
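
A C sketch of the move-to-front behaviour described above; the set layout
and names are illustrative, not the actual VG_(tt_fast) representation, and
the real way-0 hit path is handwritten assembly:

  #include <stdint.h>

  typedef struct { uintptr_t guest[4]; uintptr_t host[4]; } FastSet;

  static uintptr_t set_lookup ( FastSet* s, uintptr_t ga )
  {
     /* Way 0 is the hot path: after move-to-front, the vast majority
        of hits need only this one compare. */
     if (s->guest[0] == ga) return s->host[0];

     /* Hit in way 1, 2 or 3: move the entry to way 0, shuffling the
        intervening ways down one slot, so that the next lookup for
        this address takes the fast path. */
     for (int i = 1; i < 4; i++) {
        if (s->guest[i] == ga) {
           uintptr_t g = s->guest[i], h = s->host[i];
           for (int j = i; j > 0; j--) {
              s->guest[j] = s->guest[j-1];
              s->host[j]  = s->host[j-1];
           }
           s->guest[0] = g;
           s->host[0]  = h;
           return h;
        }
     }
     return 0;   /* miss: fall back to VG_(search_transtab) */
  }
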
Comment 1 Julian Seward 2019-01-02 19:23:45 UTC
Created attachment 117251 [details]
WIP patch

Contains all C-level changes, plus assembly changes for amd64-linux and
x86-linux only, and therefore works only on those platforms.
Comment 2 Julian Seward 2019-01-05 07:49:02 UTC
Created attachment 117291 [details]
Revised WIP patch

A more suitable baseline patch to use as a starting point for implementing the
non-{x86,amd64}-linux versions of VG_(disp_cp_xindir).
Comment 3 Julian Seward 2019-01-12 18:45:08 UTC
Created attachment 117424 [details]
Supports x86_{32,64}-linux, arm{32,64}-linux
Comment 4 Julian Seward 2019-01-16 19:36:12 UTC
Created attachment 117497 [details]
Supports x86_{32,64}-linux, arm{32,64}-linux, ppc{64,32}be-linux, ppc64le-linux
Comment 5 Julian Seward 2019-01-17 14:26:54 UTC
Created attachment 117510 [details]
WIP patch

Supports x86_{64,32}-linux, arm{64,32}-linux, ppc{64,32}be-linux,
mips{32,64}-linux, ppc64le-linux
Comment 6 Julian Seward 2019-01-18 19:05:33 UTC
Created attachment 117546 [details]
WIP patch

Supports x86_{64,32}-linux, arm{64,32}-linux, ppc{64,32}be-linux,
mips{32,64}-linux, ppc64le-linux, s390x-linux
Comment 7 Julian Seward 2019-01-25 08:20:45 UTC
Pushed, all targets except amd64-solaris and x86-solaris.
commit 50bb127b1df8d31812141aafa567d325d1fbc1b3
Comment 8 Julian Seward 2019-01-25 08:30:25 UTC
Implementation for amd64-solaris and x86-solaris (UNTESTED!)
commit f96d131ce24cb403cc7a43c19bb651dd25fbe122