Bug 465465

Summary: Eviction Emulation in Cachegrind for Detecting Cache Conflicts
Product: [Developer tools] valgrind Reporter: Ivica <ibogosavljevic>
Component: cachegrindAssignee: Nicholas Nethercote <njn>
Status: REPORTED ---    
Severity: wishlist CC: mark, n.nethercote, pjfloyd
Priority: NOR    
Version First Reported In: 3.20 GIT   
Target Milestone: ---   
Platform: Other   
OS: All   
Latest Commit: Version Fixed/Implemented In:
Sentry Crash Report:
Attachments: Eviction Simulation Patch
Cache conflicts log file

Description Ivica 2023-02-08 11:44:25 UTC
Created attachment 156067 [details]
Eviction Simulation Patch

Let me first give some background:

I am looking for a tool that can detect cache conflicts, but I am not
finding any. There are a few that are mostly academic, and thus not
maintained. I think it is important for the performance analysis
community to have a tool that to some extent can detect cache
conflicts. Is it possible to implement support for detecting source
code lines where cache conflicts occur? More info on cache conflicts
below.

=== What are cache conflicts? ===
Cache conflict happens when a cache line is brought up from the memory
to the cache, but very soon has to be evicted to the main memory
because another cache line is mapped to the same entry.

The problem with detecting cache conflicts is that it is normal that
one cache line gets evicted because it is replaced by another cache
line. Therefore, a cache conflict is an outlier: the cache line spent
very little time in the cache before it got evicted.

=== How to detect cache conflicts? ===
There are several ways to detect cache conflicts, but I here propose one
that is easy to implement in cachegrind.

When an instruction that touches the data cache executes, there is 
a possibility that this instruction evicts a certain line from the cache.
When the line enters cache, we timestamp it. When it gets evicted, we 
can measure how much time it the line spent in the cache.

Then, for each instruction we count the number of evictions and total
time line spent in cache to calculate the average time spent in cache
before eviction. The instructions where the average time before eviction is low
compared to the rest of the program are lines where cache conflicts
happen.

== Advantages and disadvantages ==
Advantages of this model is that it is very straightforward to implement
and the results are good enough. The disadvantage is that Valgrind is
not cycle-accurate, so the times spent in cache will not be given in 
cycles, but in memory accesses (e.g. the average cache line spent
800.000 loads in the LL data cache before it got evicted).

== Does it work ==

I already implemented this in cachegrind and ran it on a matrix
multiplication example with matrix sizes 1008, 1024 and 1040. The 
tool discovered the following anomaly:
average stay in cache, LL cache, data reads for different matrix size:
1008: 890,688 memory accesses
1024: 98,394 memory accesses
1040: 890,965 memory accesses


== Implementation ==
The first draft of the implementation is attached as a patch and 
should be applied on top of VALGRIND 3.20 tag in Github.
Comment 1 Paul Floyd 2023-02-08 21:09:36 UTC
Do you have any testcases?

Have you thought of how the new eviction event types will affect other tools (cg_annotate, cg_merge, cg_merge, kcachegrind).

It looks like you've changed the default to be for eviction. I suppose that will affect the existing test (perl tests/vg_regtest cachegrind/tests).

This will also need documentation change (and probably the web site as well).

I just had a very quick look at the diff. I'll try to look at it properly next week.
Comment 2 Ivica 2023-02-09 08:31:42 UTC
Ok, first, this is a proof of concept patch that I will need to fix.

The eviction emulation is NOT the default, cache emulation should remain the default. All the existing tests will need to pass without change. I need to add new command-line switches for eviction emulation.

I have two examples with cache conflicts (matrix multiplication and binary tree lookup). I could also try to find other examples, e.g. hashsort or in-array tree. I will upload them soon.

I could write the documentation for the new features. cg_annotate works with new command line, but I don't know about other tools. Also, I don't know how to write in PERL.
Comment 3 Ivica 2023-02-09 08:52:00 UTC
Also, if there is some way to send the patch to an official code review (similar to GitHub), let me know.
Comment 4 Paul Floyd 2023-02-09 10:02:05 UTC
(In reply to Ivica from comment #3)
> Also, if there is some way to send the patch to an official code review
> (similar to GitHub), let me know.

Not at present. This is the place for patch submission. Review can be slow!
Comment 5 Ivica 2023-02-09 14:33:45 UTC
I created a github mirror of valgrind, and created a patch as a Pull Request, so it can be reviewed more easily. I sent you an invitation on GitHub for the repository, so we can perform the review there.
Comment 6 Mark Wielaard 2023-02-09 14:59:20 UTC
(In reply to Ivica from comment #5)
> I created a github mirror of valgrind, and created a patch as a Pull
> Request, so it can be reviewed more easily. I sent you an invitation on
> GitHub for the repository, so we can perform the review there.

Please do make sure that all patches and reviews/comments are also added to this bug or at least posted to the valgrind-developers mailinglist.
Comment 7 Ivica 2023-02-09 15:26:41 UTC
Will do!

On Thu, Feb 9, 2023 at 3:59 PM Mark Wielaard <bugzilla_noreply@kde.org> wrote:
>
> https://bugs.kde.org/show_bug.cgi?id=465465
>
> Mark Wielaard <mark@klomp.org> changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |mark@klomp.org
>
> --- Comment #6 from Mark Wielaard <mark@klomp.org> ---
> (In reply to Ivica from comment #5)
> > I created a github mirror of valgrind, and created a patch as a Pull
> > Request, so it can be reviewed more easily. I sent you an invitation on
> > GitHub for the repository, so we can perform the review there.
>
> Please do make sure that all patches and reviews/comments are also added to
> this bug or at least posted to the valgrind-developers mailinglist.
>
> --
> You are receiving this mail because:
> You reported the bug.
Comment 8 Paul Floyd 2023-02-09 15:42:18 UTC
(In reply to Mark Wielaard from comment #6)

> Please do make sure that all patches and reviews/comments are also added to
> this bug or at least posted to the valgrind-developers mailinglist.

Specifically, a few devs to use GH (myself, Nick and Bart). But from what I see nobody else uses it.

Nicolas Nethercode has done the most work on cachegrind (or at least, pushed the most changes to git, not necessarily the same thing)
Comment 9 Ivica 2023-02-13 11:48:27 UTC
Created attachment 156192 [details]
Cache conflicts log file

Hi!

As promised, attached is the cachegrind cg_annotate output on a code that has cache conflicts. Cachegrind was running on Raspberry Pi Ubuntu. The source code is available here:
https://github.com/ibogosavljevic/johnysswlab/tree/master/2023-02-cache-conflicts

The test includes two scenarios:
* Matrix multiplication, with matrices sizes 496, 512 and 528 doubles (corresponding to 3968, 4096 and 4224 bytes).
* Binary search with sizes 15.96 MB, 16 MB and 16.04 MB.

For each dataset size there is a separate function, e.g.
matrix_mul1, matrix_mul2, matrix_mul3
binary_search1, binary_search2, binary_search3

matrix_mul2 and binary_search2 have cache conflicts. matrix_mul2 also has 4K aliasing, which is related to the Intel CPU.

If you look at the eviction statistics, for matrix_mul2 and binary_search2 data gets evicted much faster compared to the other cases.
Comment 10 Ivica 2023-02-16 16:57:56 UTC
Hi!

I would like to know the status of this issue. Is it simply waiting because other higher priority stuff is pending, or something is missing in the description of the issue or it has been silently abandoned. Please be honest with me, I don't like wasting energy trying to move the unmovable.

Ivica
Comment 11 Paul Floyd 2023-02-16 17:27:05 UTC
I've started looking at the code.

Things can be very slow with Valgrind. There are only a few of us active at the moment. Some of us just work in our spare time and everyone else has lots of other commitments.

No problems building the code. I did a test with kcachegrind. It worked, but the menu items for the new events don't have any description - we'll need to work with the kcachegrind maintainers to update that. There was an error with cg_merge, something about inconsistent totals.

I need to look more at the time/memory overhead that gets added. I also want to try a few more examples to get more of a feel for the new events that get produced.
Comment 12 Ivica 2023-02-16 19:41:39 UTC
Thanks for the reply. In that case I will let you do your thing. In case there is any way I can help you, write me a note.

Ivica
Comment 13 Ivica 2023-07-17 07:57:29 UTC
Any new development on this?
Comment 14 Paul Floyd 2023-08-28 08:48:02 UTC
I think that the idea is good.

Nick, what do you think of this?

My main question is how useful this is with the known inaccurate cache modeling. Cache and branch simulation have been turned off by default.

Ivica, could you rebase your patch from git head?
Comment 15 Nick Nethercote 2023-08-29 04:29:08 UTC
The existing cache and branch simulations are very simplistic, and about 20 years out of date. I think hardware counters are a much better way of getting cache statistics. I recently made `--cache-sim=no` the default for this reason.

I understand that the statistics you've added aren't available via hardware counters. But still I worry that using an unrealistic simulation as the foundation could lead to inaccurate results.

Another thing: in the example output file you have lines like these:

> 4,496,605,023 (51.80%) 47,791,341,277 (5545.9%) 127,816 (2819.7%) 147,917 (23112.0%) 73,046 (1014.2%) 154,413 (203.7%)   8,192 (51.35%)  binary_search_test


The percentages in the all the new columns (everything other than `Ir`) greatly exceed 100%, which seems wrong.