Bug 176628

Summary: Sampling?
Product: [Developer tools] valgrind
Reporter: Mike Dunlavey <mdunlavey>
Component: callgrind
Assignee: Josef Weidendorfer <josef.weidendorfer>
Status: REPORTED
Severity: wishlist
CC: jonjo442, mdunlavey
Priority: NOR
Version First Reported In: unspecified
Target Milestone: ---
Platform: Unlisted Binaries
OS: Unspecified
Attachments: My article in Dr. Dobbs
             2nd half of article
             my SIGPLAN article

Description Mike Dunlavey 2008-12-01 14:54:43 UTC
Are there any plans to implement sampling (of the PC and call stack) as a performance analysis technique? That is my favorite method, and I've been doing it manually on all software, large and small, for at least 30 years. Some others have also discovered the method and swear by it, while those who have not cannot imagine that it works.

The basic method is that a number of samples of the call stack are taken during the interval of interest. Then, for each instruction that appears on those stacks, the fraction of stack samples on which it appears is calculated. The instructions are then displayed in descending order by that fraction.

Any such instruction, if it could be removed, would save that fraction of the execution time. Any bottleneck taking significant time will appear in that list.

Pros:
  - Captures cost at the fine-grain instruction level (not function). It is easily summarized to the function level if desired.
  - It is immune to cycles. A cycle appears as a call instruction appearing more than once on a stack sample, but that still only counts as one sample.

Cons:
  - It slows down the program to take the samples. This is only a problem if there is confusion between the ends (executing fewer instructions) and means (finding the unnecessary instructions).
  - It consumes more storage for the samples. Offsetting this is the fact that it is not necessary to achieve high statistical accuracy with a large number of samples. Typically the smallest bottlenecks consume 5% of execution cycles, and are easily found with 50-100 samples.
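The counting scheme described above, including the cycle immunity noted in the Pros list, can be sketched in a few lines. This is a minimal illustration, not Callgrind code; the sample data and function name are hypothetical, and it assumes stack samples have already been collected as lists of instruction locations.

```python
from collections import Counter

def rank_by_sample_fraction(stack_samples):
    """Rank instructions by the fraction of stack samples they appear on.

    stack_samples: a list of call-stack samples, each a list of
    instruction locations. An instruction appearing more than once on
    one sample (a cycle) still counts only once for that sample.
    """
    counts = Counter()
    for stack in stack_samples:
        for insn in set(stack):  # de-duplicate within one sample
            counts[insn] += 1
    n = len(stack_samples)
    # Descending by fraction of samples on which the instruction appears.
    return sorted(((insn, c / n) for insn, c in counts.items()),
                  key=lambda pair: -pair[1])

# Hypothetical samples: "call_foo" recurses on the first sample,
# but is counted only once for that sample.
samples = [["main", "call_foo", "call_foo", "memcpy"],
           ["main", "call_foo", "memcpy"],
           ["main", "print"]]
```

Removing an instruction with fraction f would, by the argument above, save roughly that fraction of execution time, which is why the list is sorted descending.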

For UI presentation, a butterfly view is easy to present, and quickly leads the user to the problem.

I would be interested in helping to implement this, if that is appropriate. I built such a sampler and UI in the early 90s, for DOS programs, but it used a TSR and is way out of date.

Thanks,
Mike Dunlavey

P.S. I also wrote a couple of articles
Comment 1 Mike Dunlavey 2008-12-01 15:06:12 UTC
Created attachment 28979 [details]
My article in Dr. Dobbs

It's just a winzip file. Nothing fancy.
Comment 2 Mike Dunlavey 2008-12-01 15:08:49 UTC
Created attachment 28980 [details]
2nd half of article

zip file containing page images
Comment 3 Mike Dunlavey 2008-12-01 15:18:51 UTC
Created attachment 28981 [details]
my SIGPLAN article
Comment 4 Mike Dunlavey 2008-12-01 15:26:03 UTC
P.S. It also captures time spent in function calls that are invisible in the source code, such as compiler-inserted calls.
Comment 5 Josef Weidendorfer 2008-12-01 18:46:17 UTC
Is this wish about implementing call-path profiling with sampling
as part of Callgrind, or as a separate profiling tool?

Sampling needs some counter or timer that measures events in order
to trigger the samples, and it is usually done to get very low
overhead by using hardware timers or hardware performance counters.
However, in Callgrind/Cachegrind the cache simulator generates the
events, and hardware resources are not used at all. So by using sampling
here, you will not really get the speedup you would expect: the dynamic
instrumentation, the cache simulator, and the updating of the shadow
call stack (in the case of Callgrind) need to be done all the time.

Regarding call path profiling:
If you run callgrind with e.g. "--separate-callers=50", you get
all events attributed to call chains up to length 50.
The symbols shown in KCachegrind are just the concatenation of
all symbols in the call chain. Not the best visualization, but
a start...
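A concrete invocation of the approach Josef describes might look like the following. This is a usage sketch; the program name is a placeholder, and the flags are taken from the comment above and the standard Valgrind/Callgrind tooling.

```shell
# Attribute events to call chains up to length 50 (per the comment above).
valgrind --tool=callgrind --separate-callers=50 ./myprog

# Inspect the result in KCachegrind, or annotate on the console.
# The <pid> placeholder stands for the actual process id in the filename.
callgrind_annotate callgrind.out.<pid>
```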

Comment 6 Jonas Josefsson 2018-05-17 13:36:34 UTC
This is something we also request.
I have made a modification that does something similar and posted a patch in bug 394307