| Summary: | prefetch instructions are ignored | | |
|---|---|---|---|
| Product: | [Developer tools] valgrind | Reporter: | David Mathog <mathog> |
| Component: | cachegrind | Assignee: | Julian Seward <jseward> |
| Status: | RESOLVED INTENTIONAL | | |
| Severity: | normal | CC: | info |
| Priority: | NOR | | |
| Version First Reported In: | 2.1.1 | | |
| Target Milestone: | --- | | |
| Platform: | Compiled Sources | | |
| OS: | Linux | | |
| Latest Commit: | | Version Fixed/Implemented In: | |
| Sentry Crash Report: | | | |
Cachegrind completely ignores the prefetch instructions. Doing anything else is currently tricky, because Valgrind ignores prefetch instructions and doesn't represent them at all in the intermediate representation that it uses; therefore, Cachegrind never even sees them.

That's too bad: cachegrind shows a cache-miss problem and then won't show how well the prefetch put in to correct it works. Optimization on Athlons (and, I guess, Pentiums) is such a black art. It would really be a big help if valgrind/cachegrind could implement prefetch. Even better if they could show the state of various memory blocks in cache, line by line, through the program. I've seen way too many cases where trivial code changes (such as changing the order of two nominally "independent" lines, or slightly resizing an array) cause run times to explode due to cache misses or pipeline stalls.

On Mon, 7 Jun 2004, David Mathog wrote:

> > Cachegrind completely ignores the prefetch instructions. Doing
> > anything else is currently tricky, because Valgrind ignores prefetch
> > instructions, and doesn't represent them in the intermediate
> > representation that it uses at all; therefore, Cachegrind never even
> > sees them.
>
> Giving up on prefetch for a minute, does Valgrind have a way
> to determine what parts of memory are in cache? Currently I
> can tell where cache misses are taking place, but I cannot
> determine where else in a program some other chunk of memory
> is coming in to force out the one which subsequently cache misses.
> (I don't know what to call it when memory is swapped out of
> cache, but a "pre-cache miss" pretty much sums up the end result.)

It's usually called an "eviction".

> Valgrind, as I understand it, runs an emulator, including a cache
> emulator. When it moves lines of memory into data cache, and especially
> when it moves them out again, could it be coerced into logging these
> events?
> In many of the applications I've been trying to optimize I keep
> running into the same problem. I see cache misses on a[] in some
> part of the code, but prefetch doesn't work well to resolve this,
> because something else (b[], c[], whatever) is coming in elsewhere
> in the loop and forcing a[] back out of cache. It's really,
> really hard to guess where this is happening in code that has 7
> or 8 arrays in play (and isn't my code in the first place). A tool
> which would illuminate the flow of data into/out of the data caches
> would be a huge help in this regard. For instance, if it could
> just watch a[] and report both "cache miss" and "pre-cache miss"
> events, it would pretty much point out the other parts of the code
> which are competing for cache.

Cachegrind can't do this in its current form, though I can see why it would be useful. Cachegrind certainly could be modified to record evictions, but it would be a large change. In particular, you'd have to choose a way to specify which chunks of memory (eg. arrays) you are interested in, and record evictions only for them. You could file another bug report/wishlist item about evictions, but it won't be implemented unless somebody decides it's important to them.

Forgot to set this as RESOLVED.
Below my signature are some snippets from code I've been trying to optimize on an Athlon MP2000. When the prefetch block is present (as shown here) the code runs 15% faster than when it is absent, so the prefetch must be working and avoiding cache misses. Unfortunately, this is not at all what cachegrind shows. Instead, cachegrind shows exactly the same (large) number of cache misses either way. For instance, this line:

```
700,000,000 0 0 400,000,000 12,500,004 12,500,004 50,000,000 0 0  if((aE[i] - aD[i]) >= (aC[i] - aB[i])){
```

Am I doing something wrong, or does cachegrind just not emulate prefetch instructions? Or maybe it counts a prefetch of an address not in L1/L2 cache as a cache miss? I'm trying to determine how well the prefetch is working (ie, is it far enough "upstream" to complete in time) by counting cache misses, but the information is not being revealed by cachegrind.

Thanks,

David Mathog
mathog@caltech.edu

Compiled with:

```
% gcc -pg -O3 -Wall -ansi -pedantic -march=athlon-mp -m3dnow -msse -mmmx \
      -o testtransit -g testtransit.c
```

gcc 3.3.3

```c
static __inline__ void CPU_prefetch(const void *s)
{
    __asm__ ("prefetch 128(%0)" :: "r" (s));
}

int *array;
int *aA, *aB, *aC, *aD, *aE;
int A, B, C, D, E;
int i;
int sum = 0;

array = malloc(asize * sizeof(int));
(void) initarray(array, asize);

E = asize/2;
D = E/2 + 1;
C = D/2 + 1;
B = C/2 + 1;
A = 0;
aE = &(array[asize + E - E]);
aD = &(array[asize + D - E]);
aC = &(array[asize + C - E]);
aB = &(array[asize + B - E]);
aA = &(array[asize + A - E]);

(void) fprintf(stdout, "DEBUG loop variant\n");
i = -(asize - E);

toploop:
    if((aE[i] - aD[i]) >= (aC[i] - aB[i])){
        sum += aA[i];
    } else if((aE[i] - aB[i]) >= (aD[i] - aC[i])){
        sum += aB[i];
    } else {
        sum += aC[i];
    }
    i++;
    /* hit each 64-byte cache line once, 128 bytes upstream */
    if(!(i & 15)){
        CPU_prefetch(&(aA[i]));
        CPU_prefetch(&(aB[i]));
        CPU_prefetch(&(aC[i]));
        CPU_prefetch(&(aD[i]));
        CPU_prefetch(&(aE[i]));
    }
    if(i) goto toploop;
```