Bug 82654

Summary: prefetch instructions are ignored
Product: valgrind [Developer tools]
Component: cachegrind
Reporter: David Mathog <mathog>
Assignee: Julian Seward <jseward>
Status: RESOLVED INTENTIONAL
Severity: normal
Priority: NOR
CC: info
Version First Reported In: 2.1.1
Platform: Compiled Sources
OS: Linux

Description David Mathog 2004-06-01 20:22:36 UTC
Below my signature are some snippets from code I've been trying to optimize
on an Athlon MP2000.  When the prefetch block is present
(as shown here) the code runs 15% faster than when it is absent,
so the prefetch must be working and avoiding cache misses.  Unfortunately,
this is not at all what cachegrind shows.  Instead, cachegrind shows exactly
the same (large) number of cache misses either way.  For instance, this
line:

 700,000,000   0   0  400,000,000  12,500,004  12,500,004  50,000,000   0   0
       if((aE[i] - aD[i]) >=  (aC[i] - aB[i])){

Am I doing something wrong, or does cachegrind just not emulate prefetch
instructions?  Or maybe it counts a prefetch of an address not
in L1/L2 cache as a cache miss?  I'm trying to determine how well the
prefetch is working (i.e., whether it is issued far enough "upstream" to
complete in time) by counting cache misses, but cachegrind is not
revealing that information.

Thanks,

David Mathog
mathog@caltech.edu

Compiled with:

%  gcc -pg -O3 -Wall -ansi -pedantic -march=athlon-mp -m3dnow -msse -mmmx  -o
testtransit -g testtransit.c

gcc 3.3.3


static __inline__ void CPU_prefetch(const void *s) {
    __asm__ ("prefetch 128(%0)" :: "r" (s));
}
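For comparison, the same hint can be expressed portably with GCC's __builtin_prefetch builtin rather than inline asm; this sketch keeps the 128-byte lookahead from the snippet above (the wrapper name is illustrative):

```c
/* Portable equivalent of the inline asm above: __builtin_prefetch
 * emits the target's prefetch instruction, or nothing if the target
 * has none.  Arguments: address, rw (0 = read), locality (0-3). */
static __inline__ void CPU_prefetch_builtin(const void *s) {
    __builtin_prefetch((const char *)s + 128, 0, 3);
}
```

Unlike a hand-written "prefetch" opcode, the builtin also survives compilation for targets without 3DNow!.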

int *array;
int *aA,*aB,*aC,*aD,*aE;
int A,B,C,D,E;
int i;
int sum=0;

    array = malloc(asize*sizeof(int));
    (void) initarray(array, asize);

    E = asize/2;
    D = E/2 + 1;
    C = D/2 + 1;
    B = C/2 + 1;
    A = 0;

    aE = &(array[asize+E-E]);
    aD = &(array[asize+D-E]);
    aC = &(array[asize+C-E]);
    aB = &(array[asize+B-E]);
    aA = &(array[asize+A-E]);
    (void)fprintf(stdout,"DEBUG loop variant\n");
    i = -(asize-E);
toploop:
    if((aE[i] - aD[i]) >= (aC[i] - aB[i])){
      sum += aA[i];
    }
    else if((aE[i] - aB[i]) >= (aD[i] - aC[i])){
      sum += aB[i];
    }
    else {
      sum += aC[i];
    }
    i++;
    /* hit each 64-byte cache line once, 128 bytes upstream */
    if(!(i&15)){
      CPU_prefetch(&(aA[i]));
      CPU_prefetch(&(aB[i]));
      CPU_prefetch(&(aC[i]));
      CPU_prefetch(&(aD[i]));
      CPU_prefetch(&(aE[i]));
    }
    if(i) goto toploop;
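The same traversal can be written as a plain for loop; this sketch uses the portable __builtin_prefetch, and the function name and parameters are illustrative, not the reporter's actual test harness:

```c
/* 128 bytes upstream of the given address, read, high locality. */
static __inline__ void prefetch_ahead(const void *s) {
    __builtin_prefetch((const char *)s + 128, 0, 3);
}

/* The goto loop above, reworked as a for loop.  i runs from
 * -(asize-E) up to -1, so aX[i] indexes back into array[]. */
long sum_with_prefetch(const int *array, int asize) {
    int E = asize/2, D = E/2 + 1, C = D/2 + 1, B = C/2 + 1, A = 0;
    const int *aE = &array[asize + E - E];
    const int *aD = &array[asize + D - E];
    const int *aC = &array[asize + C - E];
    const int *aB = &array[asize + B - E];
    const int *aA = &array[asize + A - E];
    long sum = 0;

    for (int i = -(asize - E); i != 0; i++) {
        /* hit each 64-byte cache line (16 ints) once */
        if (!(i & 15)) {
            prefetch_ahead(&aA[i]); prefetch_ahead(&aB[i]);
            prefetch_ahead(&aC[i]); prefetch_ahead(&aD[i]);
            prefetch_ahead(&aE[i]);
        }
        if ((aE[i] - aD[i]) >= (aC[i] - aB[i]))
            sum += aA[i];
        else if ((aE[i] - aB[i]) >= (aD[i] - aC[i]))
            sum += aB[i];
        else
            sum += aC[i];
    }
    return sum;
}
```

The prefetch here is issued at the top of the iteration rather than just after i++, which shifts the hint by at most one element but leaves the 16-iteration stride intact.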
Comment 1 Nicholas Nethercote 2004-06-01 21:21:10 UTC
Cachegrind completely ignores prefetch instructions.  Doing anything
else is currently tricky: Valgrind itself ignores prefetch instructions
and doesn't represent them at all in the intermediate representation it
uses, so Cachegrind never even sees them.

Comment 2 David Mathog 2004-06-01 22:32:42 UTC
That's too bad - cachegrind shows a cache-miss
problem but then can't show how well the prefetch added
to correct that problem is working.

Optimization on Athlons (and, I'd guess, Pentiums) is such a black art.
It would really be a big help if valgrind/cachegrind could
implement prefetch.  Even better if they could
show the state of various memory blocks in cache, line by line,
through the program.  I've seen way too many cases where
trivial code changes (such as swapping the order
of two nominally "independent" lines, or slightly resizing an array)
cause run times to explode due to cache misses or pipeline stalls.
Comment 3 Nicholas Nethercote 2004-06-15 14:31:23 UTC
On Mon, 7 Jun 2004, David Mathog wrote:

> > Cachegrind completely ignores the prefetch instructions.  Doing
> > anything else is currently tricky, because Valgrind ignores prefetch
> > instructions, and doesn't represent them in the intermediate
> > representation that it uses at all;  therefore, Cachegrind never even
> > sees them.
>
> Giving up on Prefetch for a minute, does Valgrind have a way
> to determine what parts of memory are in cache?  Currently I
> can tell where cache misses are taking place but I cannot
> determine where else in a program some other chunk of memory
> is coming in to force the one which subsequently cache misses
> out of cache.  (I don't know what to call it when memory
> is swapped out of cache, but a "pre-cache miss" pretty
> much sums up the end result.)

It's usually called an "eviction".

> Valgrind, as I understand it, runs an emulator, including a cache
> emulator.  When it moves lines of memory into data cache, and especially
> when it moves them out again, could be it coerced into logging these
> events?
>
> In many of the applications I've been trying to optimize I keep
> running into the same problem.  I see cache misses on a[] in some
> part of the code but prefetch doesn't work well to resolve this,
> because something else, b[], c[], whatever is coming in elsewhere
> in the loop and forcing a[] back out of cache.  It's really,
> really hard to guess where this is happening in code that has 7
> or 8 arrays in play (and isn't my code in the first place.)  A tool
> which would illuminate the flow of data into/out of the data caches
> would be a huge help in this regard.  For instance, if it could
> just watch a[] and report both "cache miss" and "pre-cache miss"
> events it would pretty much point out the other parts of the code
> which are competing for cache.

Cachegrind can't do this in its current form.  I can see why it would be
useful.  Cachegrind certainly could be modified to record evictions, but
it would be a large change.  In particular, you'd have to choose a way to
specify which chunks of memory (eg. arrays) you are interested in, and
only record evictions about them.

You could file another bug report/wishlist item about evictions, but it
won't be implemented unless somebody decides it's important to them.
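As a sketch of what such eviction logging could look like, here is a toy direct-mapped cache model (64-byte lines, 256 sets) that reports which line each access displaces.  The geometry and all names are illustrative; Cachegrind's real simulator is set-associative and far more detailed:

```c
#include <stdint.h>

#define LINE_BITS 6                 /* 64-byte lines */
#define SET_BITS  8                 /* 256 sets -> 16 KB direct-mapped */
#define NSETS     (1u << SET_BITS)

typedef struct {
    uintptr_t tag[NSETS];           /* line number cached in each set */
    int       valid[NSETS];
    long      misses;
    long      evictions;
} toy_cache;

/* Simulate one access.  Returns the base address of the line that was
 * evicted to make room, or 0 when nothing valid was displaced. */
uintptr_t toy_access(toy_cache *c, uintptr_t addr) {
    uintptr_t line = addr >> LINE_BITS;
    unsigned  set  = line & (NSETS - 1);
    if (c->valid[set] && c->tag[set] == line)
        return 0;                   /* hit */
    c->misses++;
    uintptr_t evicted = 0;
    if (c->valid[set]) {            /* conflict: log the loser */
        c->evictions++;
        evicted = c->tag[set] << LINE_BITS;
    }
    c->valid[set] = 1;
    c->tag[set]   = line;
    return evicted;
}
```

Filtering the returned addresses against the ranges of a[], b[], etc. would give exactly the "who pushed a[] out" report asked for above.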

Comment 4 T I Z E N 2025-02-17 18:54:36 UTC
Forgot to set this as RESOLVED.