In the baseline simulator, jumps to guest code addresses that are not known at JIT time have to be looked up in a guest->host mapping table. That covers indirect branches, indirect calls and, most commonly, returns. Since there are huge numbers of these (often 10+ million/second), the mapping mechanism needs to be extremely cheap.

Currently this is implemented using a direct-mapped cache, VG_(tt_fast), with 2^15 (guest_addr, host_addr) pairs. The cache is queried in handwritten assembly in VG_(disp_cp_xindir) in dispatch-<arch>-<os>.S. On a miss we fall back out to C land and do a slow lookup using VG_(search_transtab).

Because the translation table(s) have expanded significantly in recent years in order to keep pace with increasing application sizes, two bad things have happened: (1) the cost of a miss in the fast cache has risen significantly, and (2) the miss rate of the fast cache has also increased significantly. As a result, large (~ one-million-basic-blocks-JITted) applications that run for a long time end up spending a lot of time in VG_(search_transtab).

The proposed fix is to increase the associativity of the fast cache from 1 (direct-mapped) to 4. Simulations of various cache configurations using indirect-branch traces from a large application show that this is the best of the configurations tried. In an extreme case with 5.7 billion indirect branches:

* Increasing the associativity from 1-way to 4-way, whilst keeping the overall cache size the same (32k guest/host pairs), reduces the miss rate by around a factor of 3, from 4.02% to 1.30%.

* Using a slightly better hash function than merely slicing off the bottom 15 bits of the address reduces the miss rate further, from 1.30% to 0.53%.

Overall, the VG_(tt_fast) miss rate is almost unchanged on small workloads, but is reduced by a factor of up to almost 8 on large workloads. By implementing each (4-entry) cache set with a move-to-front scheme on hits in ways 1, 2 or 3, the vast majority of hits can be made to happen in way 0, so the cost of the extra associativity is almost zero in the hit case. The improved hash function costs an extra 2 ALU ops (a shift and an xor), but overall the change appears to be performance neutral or a small win.
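To make the scheme concrete, here is a minimal C sketch of the intended lookup: 2^13 sets of 4 ways (32k pairs total), a shift-and-xor hash, and move-to-front within a set on a hit in ways 1 to 3. All names (FastCacheSet, fast_lookup, slow_lookup) and the slow-path stub are illustrative only; the real implementation lives in VG_(tt_fast), VG_(disp_cp_xindir) and VG_(search_transtab), and the exact hash used in the patch may differ.

/* Illustrative sketch only -- not the identifiers or code in the patch. */
#include <stdint.h>

#define FAST_CACHE_BITS  13                       /* 2^13 sets x 4 ways = 32k pairs */
#define FAST_CACHE_SETS  (1u << FAST_CACHE_BITS)
#define FAST_CACHE_MASK  (FAST_CACHE_SETS - 1)

typedef uintptr_t Addr;

typedef struct {
   Addr guest[4];   /* guest code addresses, way 0 first */
   Addr host[4];    /* corresponding host (translated) addresses */
} FastCacheSet;

/* A real implementation would initialise the guest fields to an address
   that can never match; zero-initialisation is good enough for a sketch. */
static FastCacheSet fast_cache[FAST_CACHE_SETS];

/* One possible "slightly better" hash: fold higher-order address bits
   in with one shift and one xor, rather than just slicing off low bits. */
static inline uint32_t hash_addr ( Addr guest )
{
   return (uint32_t)((guest ^ (guest >> FAST_CACHE_BITS)) & FAST_CACHE_MASK);
}

/* Placeholder slow path; stands in for VG_(search_transtab) here. */
static Addr slow_lookup ( Addr guest )
{
   return guest;  /* placeholder result for the sketch */
}

/* Look up the host address for |guest|.  Hits in ways 1..3 move the
   entry to way 0, so almost all hits end up being cheap way-0 hits. */
static Addr fast_lookup ( Addr guest )
{
   FastCacheSet* set = &fast_cache[hash_addr(guest)];

   if (set->guest[0] == guest)        /* common case: hit in way 0 */
      return set->host[0];

   for (int i = 1; i < 4; i++) {
      if (set->guest[i] == guest) {
         /* Hit in way 1, 2 or 3: move the entry to the front. */
         Addr g = set->guest[i], h = set->host[i];
         for (int j = i; j > 0; j--) {
            set->guest[j] = set->guest[j-1];
            set->host[j]  = set->host[j-1];
         }
         set->guest[0] = g;
         set->host[0]  = h;
         return h;
      }
   }

   /* Miss: do the slow lookup, install the result at way 0,
      push the other entries down one way, and evict way 3. */
   Addr h = slow_lookup(guest);
   for (int j = 3; j > 0; j--) {
      set->guest[j] = set->guest[j-1];
      set->host[j]  = set->host[j-1];
   }
   set->guest[0] = guest;
   set->host[0]  = h;
   return h;
}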
Created attachment 117251 [details]
WIP patch

Contains all C-level changes, plus assembly changes for amd64-linux and x86-linux only, and therefore works only on those platforms.
Created attachment 117291 [details]
Revised WIP patch

A more suitable baseline patch to use as a starting point to implement the non-{x86,amd64}-linux versions of VG_(disp_cp_xindir).
Created attachment 117424 [details]
Supports x86_{32,64}-linux, arm{32,64}-linux
Created attachment 117497 [details]
Supports x86_{32,64}-linux, arm{32,64}-linux, ppc{64,32}be-linux, ppc64le-linux
Created attachment 117510 [details]
WIP patch

Supports x86_{64,32}-linux, arm{64,32}-linux, ppc{64,32}be-linux, mips{32,64}-linux, ppc64le-linux
Created attachment 117546 [details]
WIP patch

Supports x86_{64,32}-linux, arm{64,32}-linux, ppc{64,32}be-linux, mips{32,64}-linux, ppc64le-linux, s390x-linux
Pushed, all targets except amd64-solaris and x86-solaris.

commit 50bb127b1df8d31812141aafa567d325d1fbc1b3
Implementation for amd64-solaris and x86-solaris (UNTESTED!):

commit f96d131ce24cb403cc7a43c19bb651dd25fbe122