Bug 380942 - Experimental: add MESI protocol simulation to Callgrind
Summary: Experimental: add MESI protocol simulation to Callgrind
Status: ASSIGNED
Alias: None
Product: valgrind
Classification: Developer tools
Component: callgrind (show other bugs)
Version: unspecified
Platform: Other Linux
: NOR normal
Target Milestone: ---
Assignee: Josef Weidendorfer
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-06-07 16:31 UTC by Julian Seward
Modified: 2023-02-23 09:26 UTC (History)
0 users

See Also:
Latest Commit:
Version Fixed In:


Attachments
Initial implementation (15.12 KB, patch)
2017-06-07 16:56 UTC, Julian Seward
Details
valgrind-bug380942-Callgrind-MESI.diff-2023Feb23-rebased (16.87 KB, patch)
2023-02-23 09:26 UTC, Julian Seward
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Julian Seward 2017-06-07 16:31:22 UTC
It would be really great if it was possible to use Callgrind to simulate
enough of the MESI protocol to find out where multithreaded code is
interacting badly with the protocol.  In practice that means, finding places
where cache lines are transferred between threads, and (ideally) detecting
whether those transfers are due to true or false sharing.
Comment 1 Julian Seward 2017-06-07 16:56:20 UTC
Created attachment 105965 [details]
Initial implementation

Here's an initial implementation.  Some notes:

* You *must* use --fair-sched=yes.  If you don't, you'll get garbage
  results and no warning.

* The implementation reduces SCHEDULING_QUANTUM from 100000 to 1200,
  so as to give reasonable pseudo-parallel interleaving of threads.
  That is a fine enough interleaving to pick up more than 90% of all
  RFOs from execution of parallel layout algorithms in Firefox.  If
  you have very frequent cache line transfers then you may need to
  set this even lower to get accurate stats.

* The reduction in SCHEDULING_QUANTUM makes Valgrind run more slowly,
  although not catastrophically so.

* This doesn't really simulate the MESI protocol directly.  Instead
  it keeps track of which thread has most recently accessed each
  cache line sized piece of memory, and observes transfers of ownership
  between these cache line sized pieces of memory.

* As a result you can simulate MESI (like) behaviour for any number of
  threads.  It's unrelated to the hardware in your machine.  The line
  size is currently hardwired to 64 bytes.

* Use the flag --simulate-mesi=yes.  You can't use any of the other
  Callgrind simulation facilities (caches, branches, prefetchers etc)
  at the same time.

* The simulation tries to distinguish between true and false sharing.

* This is all very experimental.  It's not really clear whether the RFO
  detection and the true-vs-false sharing detection reflect reality
  accurately enough to be useful.

* 4 different kinds of costs are computed:
     RFOr      -- requests for ownership (transfers) due to reads
     RFOw      -- RFOs due to writes
     RFOrTrue  -- RFOs due to reads, with true sharing
     RFOwTrue  -- RFOs due to writes, with true sharing

* You can display the results in KCachegrind in the normal way.

* Although it is not necessary, it is useful to program the following
  formulas into KCachegrind:

     RFO = RFOr + RFOw              -- total number of RFOs

     RFOTrue = RFOrTrue + RFOwTrue  -- total number of RFOs from true sharing

     RFOFalse = RFOr + RFOw - RFOrTrue - RFOwTrue
                                    -- total number of RFOs from false sharing
Comment 2 Josef Weidendorfer 2017-07-06 20:50:55 UTC
Nice!

State transitions look good.
Getting read of the empty Ir simulation can be done after merging.

But to really distinguish between true and false sharing, remembering the last access is not really enough. It may identify true sharing as false sharing.

E.g. multiple accesses from the same thread could cover all of a cache line, yet it is identified as false sharing if the first access of another thread does not overlap with the last access of the original thread.

The real solution would be to maintain a bitmap (on a 64-bit architecture,
using a 64-bit integer for a cacheline size of 64 bytes is a perfect match).
Updating the bitmap may be slightly slower, but checking for overlap is
just an AND operation.
I actually do that in the simulator for "cache use"; we can copy it from
there.
Comment 3 Julian Seward 2023-02-23 09:26:46 UTC
Created attachment 156636 [details]
valgrind-bug380942-Callgrind-MESI.diff-2023Feb23-rebased

Here's the same patch rebased to V sources of 23 Feb 2023.  --fair-sched=yes
has been hardwired (see comment 1) and the scheduling quantum has been reduced
from 1200 to 800, for safety.

TL;DR to build is now:

  git clone https://sourceware.org/git/valgrind.git mesi2023
  cd mesi2023
  patch -p1 < valgrind-bug380942-Callgrind-MESI.diff-2023Feb23-rebased
  ./autogen.sh
  ./configure --prefix=`pwd`/Inst --enable-only64bit
  make -j 8 all
  make -j 8 install

To run:

  /path/to/mesi2023/Inst/bin/valgrind \
     --tool=callgrind --simulate-mesi=yes <program and args>
  kcachegrind callgrind.out.<PID>

See comment 1 for more details on the RFO events.  These are the "requests for
ownership" (cache line transfers) which contain the info of interest here.