Created attachment 156067 [details]
Eviction Simulation Patch

Let me first give some background: I am looking for a tool that can detect cache conflicts, but I am not finding any. There are a few, but they are mostly academic and thus not maintained. I think it is important for the performance analysis community to have a tool that can, to some extent, detect cache conflicts. Is it possible to implement support for detecting the source code lines where cache conflicts occur? More info on cache conflicts below.

=== What are cache conflicts? ===

A cache conflict happens when a cache line is brought from memory into the cache, but very soon has to be evicted back to main memory because another cache line maps to the same entry. The problem with detecting cache conflicts is that it is normal for one cache line to be evicted when it is replaced by another. A cache conflict is therefore an outlier: the cache line spent very little time in the cache before it got evicted.

=== How to detect cache conflicts? ===

There are several ways to detect cache conflicts; here I propose one that is easy to implement in cachegrind. When an instruction that touches the data cache executes, it may evict a certain line from the cache. When a line enters the cache, we timestamp it. When it gets evicted, we measure how much time the line spent in the cache. Then, for each instruction, we count the number of evictions and the total time its lines spent in the cache, and from these we calculate the average time spent in the cache before eviction. The instructions where the average time before eviction is low compared to the rest of the program are the lines where cache conflicts happen.

== Advantages and disadvantages ==

The advantage of this model is that it is very straightforward to implement and the results are good enough. The disadvantage is that Valgrind is not cycle-accurate, so the time spent in the cache is measured not in cycles but in memory accesses (e.g. the average cache line spent 800,000 loads in the LL data cache before it got evicted).

== Does it work? ==

I already implemented this in cachegrind and ran it on a matrix multiplication example with matrix sizes 1008, 1024 and 1040. The tool discovered the following anomaly (average stay in the LL cache, data reads, for the different matrix sizes):

1008: 890,688 memory accesses
1024:  98,394 memory accesses
1040: 890,965 memory accesses

== Implementation ==

The first draft of the implementation is attached as a patch and should be applied on top of the Valgrind 3.20 tag on GitHub.
Do you have any testcases? Have you thought about how the new eviction event types will affect the other tools (cg_annotate, cg_merge, kcachegrind)? It looks like you've changed the default to be eviction simulation. I suppose that will affect the existing tests (perl tests/vg_regtest cachegrind/tests). This will also need documentation changes (and probably the web site as well). I just had a very quick look at the diff. I'll try to look at it properly next week.
OK, first, this is a proof-of-concept patch that I will need to fix. The eviction emulation is NOT the default; cache emulation should remain the default. All the existing tests will need to pass unchanged. I need to add new command-line switches for the eviction emulation. I have two examples with cache conflicts (matrix multiplication and binary tree lookup). I could also try to find other examples, e.g. hash sort or an in-array tree. I will upload them soon. I could write the documentation for the new features. cg_annotate works with the new command line, but I don't know about the other tools. Also, I don't know how to write Perl.
Also, if there is some way to send the patch to an official code review (similar to GitHub), let me know.
(In reply to Ivica from comment #3) > Also, if there is some way to send the patch to an official code review > (similar to GitHub), let me know. Not at present. This is the place for patch submission. Review can be slow!
I created a github mirror of valgrind, and created a patch as a Pull Request, so it can be reviewed more easily. I sent you an invitation on GitHub for the repository, so we can perform the review there.
(In reply to Ivica from comment #5) > I created a github mirror of valgrind, and created a patch as a Pull > Request, so it can be reviewed more easily. I sent you an invitation on > GitHub for the repository, so we can perform the review there. Please do make sure that all patches and reviews/comments are also added to this bug or at least posted to the valgrind-developers mailinglist.
Will do!
(In reply to Mark Wielaard from comment #6) > Please do make sure that all patches and reviews/comments are also added to > this bug or at least posted to the valgrind-developers mailinglist. Specifically, a few devs do use GH (myself, Nick and Bart), but from what I see nobody else uses it. Nicholas Nethercote has done the most work on cachegrind (or at least, pushed the most changes to git, which is not necessarily the same thing).
Created attachment 156192 [details]
Cache conflicts log file

Hi! As promised, attached is the cachegrind cg_annotate output for a code that has cache conflicts. Cachegrind was run on a Raspberry Pi running Ubuntu. The source code is available here: https://github.com/ibogosavljevic/johnysswlab/tree/master/2023-02-cache-conflicts

The test includes two scenarios:
* Matrix multiplication, with matrix sizes of 496, 512 and 528 doubles (corresponding to row sizes of 3968, 4096 and 4224 bytes).
* Binary search with dataset sizes of 15.96 MB, 16 MB and 16.04 MB.

For each dataset size there is a separate function, e.g. matrix_mul1, matrix_mul2, matrix_mul3 and binary_search1, binary_search2, binary_search3. matrix_mul2 and binary_search2 have cache conflicts. matrix_mul2 also has 4K aliasing, which is specific to Intel CPUs. If you look at the eviction statistics, for matrix_mul2 and binary_search2 data gets evicted much faster compared to the other cases.
Hi! I would like to know the status of this issue. Is it simply waiting because other, higher-priority work is pending, is something missing from the description of the issue, or has it been silently abandoned? Please be honest with me; I don't like wasting energy trying to move the unmovable. Ivica
I've started looking at the code. Things can be very slow with Valgrind. There are only a few of us active at the moment. Some of us just work in our spare time and everyone else has lots of other commitments. No problems building the code. I did a test with kcachegrind. It worked, but the menu items for the new events don't have any description - we'll need to work with the kcachegrind maintainers to update that. There was an error with cg_merge, something about inconsistent totals. I need to look more at the time/memory overhead that gets added. I also want to try a few more examples to get more of a feel for the new events that get produced.
Thanks for the reply. In that case I will let you do your thing. In case there is any way I can help you, write me a note. Ivica
Any new development on this?
I think the idea is good. Nick, what do you think of this? My main question is how useful this is given the known inaccuracies of the cache modeling. Cache and branch simulation have been turned off by default. Ivica, could you rebase your patch onto git HEAD?
The existing cache and branch simulations are very simplistic and about 20 years out of date. I think hardware counters are a much better way of getting cache statistics; I recently made `--cache-sim=no` the default for this reason. I understand that the statistics you've added aren't available via hardware counters, but I still worry that using an unrealistic simulation as the foundation could lead to inaccurate results. Another thing: in the example output file you have lines like these: > 4,496,605,023 (51.80%) 47,791,341,277 (5545.9%) 127,816 (2819.7%) 147,917 (23112.0%) 73,046 (1014.2%) 154,413 (203.7%) 8,192 (51.35%) binary_search_test The percentages in all the new columns (everything other than `Ir`) greatly exceed 100%, which seems wrong.