In the baseline simulator, jumps to guest code addresses that are not known at JIT time have to be looked up in a guest->host mapping table. That covers indirect branches, indirect calls and, most commonly, returns. Since there are huge numbers of these (often 10+ million/second), the mapping mechanism needs to be extremely cheap.

Currently this is implemented using a direct-mapped cache, VG_(tt_fast), with 2^15 (guest_addr, host_addr) pairs. The cache is queried in handwritten assembly in VG_(disp_cp_xindir) in dispatch-<arch>-<os>.S. On a miss we fall back out to C land and do a slow lookup using VG_(search_transtab).

Because the translation table(s) have expanded significantly in recent years in order to keep pace with increasing application sizes, two bad things have happened: (1) the cost of a miss in the fast cache has risen significantly, and (2) the miss rate of the fast cache has also increased significantly. As a result, large (~ one-million-basic-blocks-JITted) applications that run for a long time end up spending a lot of time in VG_(search_transtab).

The proposed fix is to increase the associativity of the fast cache from 1 (direct-mapped) to 4. Simulations of various cache configurations using indirect-branch traces from a large application show that this is the best of the configurations tried. In an extreme case with 5.7 billion indirect branches:

* Increasing the associativity from 1-way to 4-way, whilst keeping the overall cache size the same (32k guest/host pairs), reduces the miss rate by around a factor of 3, from 4.02% to 1.30%.

* Using a slightly better hash function than merely slicing off the bottom 15 bits of the address reduces the miss rate further, from 1.30% to 0.53%.

Overall, the VG_(tt_fast) miss rate is almost unchanged on small workloads, but is reduced by a factor of up to almost 8 on large workloads. By implementing each (4-entry) cache set with a move-to-front scheme on hits in ways 1, 2 or 3, the vast majority of hits can be made to happen in way 0, so the cost of the extra associativity is almost zero in the hit case. The improved hash function costs an extra 2 ALU ops (a shift and an xor), but overall the change appears to be performance neutral or a small win.
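To make the scheme concrete, here is a minimal C sketch of the intended lookup: 2^13 sets of 4 ways (32k pairs total), a shift-and-xor hash, and move-to-front within a set on a hit in ways 1 to 3. All names (FastCacheSet, fast_lookup, slow_lookup) and the slow-path stub are illustrative only; the real implementation lives in VG_(tt_fast), VG_(disp_cp_xindir) and VG_(search_transtab), and the exact hash used in the patch may differ.

/* Illustrative sketch only -- not the identifiers or code in the patch. */
#include <stdint.h>

#define FAST_CACHE_BITS  13                       /* 2^13 sets x 4 ways = 32k pairs */
#define FAST_CACHE_SETS  (1u << FAST_CACHE_BITS)
#define FAST_CACHE_MASK  (FAST_CACHE_SETS - 1)

typedef uintptr_t Addr;

typedef struct {
   Addr guest[4];   /* guest code addresses, way 0 first */
   Addr host[4];    /* corresponding host (translated) addresses */
} FastCacheSet;

/* A real implementation would initialise the guest fields to an address
   that can never match; zero-initialisation is good enough for a sketch. */
static FastCacheSet fast_cache[FAST_CACHE_SETS];

/* One possible "slightly better" hash: fold higher-order address bits
   in with one shift and one xor, rather than just slicing off low bits. */
static inline uint32_t hash_addr ( Addr guest )
{
   return (uint32_t)((guest ^ (guest >> FAST_CACHE_BITS)) & FAST_CACHE_MASK);
}

/* Placeholder slow path; stands in for VG_(search_transtab) here. */
static Addr slow_lookup ( Addr guest )
{
   return guest;  /* placeholder result for the sketch */
}

/* Look up the host address for |guest|.  Hits in ways 1..3 move the
   entry to way 0, so almost all hits end up being cheap way-0 hits. */
static Addr fast_lookup ( Addr guest )
{
   FastCacheSet* set = &fast_cache[hash_addr(guest)];

   if (set->guest[0] == guest)        /* common case: hit in way 0 */
      return set->host[0];

   for (int i = 1; i < 4; i++) {
      if (set->guest[i] == guest) {
         /* Hit in way 1, 2 or 3: move the entry to the front. */
         Addr g = set->guest[i], h = set->host[i];
         for (int j = i; j > 0; j--) {
            set->guest[j] = set->guest[j-1];
            set->host[j]  = set->host[j-1];
         }
         set->guest[0] = g;
         set->host[0]  = h;
         return h;
      }
   }

   /* Miss: do the slow lookup, install the result at way 0,
      push the other entries down one way, and evict way 3. */
   Addr h = slow_lookup(guest);
   for (int j = 3; j > 0; j--) {
      set->guest[j] = set->guest[j-1];
      set->host[j]  = set->host[j-1];
   }
   set->guest[0] = guest;
   set->host[0]  = h;
   return h;
}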
Created attachment 117251 [details]
WIP patch

Contains all C-level changes, plus assembly changes for amd64-linux and x86-linux only, and therefore works only on those platforms.
Created attachment 117291 [details]
Revised WIP patch

A more suitable baseline patch to use as a starting point to implement the non-{x86,amd64}-linux versions of VG_(disp_cp_xindir).
Created attachment 117424 [details]
Supports x86_{32,64}-linux, arm{32,64}-linux
Created attachment 117497 [details]
Supports x86_{32,64}-linux, arm{32,64}-linux, ppc{64,32}be-linux, ppc64le-linux
Created attachment 117510 [details]
WIP patch

Supports x86_{64,32}-linux, arm{64,32}-linux, ppc{64,32}be-linux, mips{32,64}-linux, ppc64le-linux
Created attachment 117546 [details]
WIP patch

Supports x86_{64,32}-linux, arm{64,32}-linux, ppc{64,32}be-linux, mips{32,64}-linux, ppc64le-linux, s390x-linux
Pushed, all targets except amd64-solaris and x86-solaris.

commit 50bb127b1df8d31812141aafa567d325d1fbc1b3
Implementation for amd64-solaris and x86-solaris (UNTESTED!):

commit f96d131ce24cb403cc7a43c19bb651dd25fbe122