Are there any plans to implement sampling (of the PC and call stack) as a performance analysis technique? That is my favorite method; I've been doing it manually on software large and small for at least 30 years. Some others have also discovered the method and swear by it, while those who have not cannot imagine that it works.

The basic method: take a number of samples of the call stack during the interval of interest. Then, for each instruction that appears on those stacks, calculate the fraction of stack samples on which it appears, and display the instructions in descending order by that fraction. Any such instruction, if it could be removed, would save that fraction of the execution time. Any bottleneck taking significant time will appear in that list.

Pros:
- Captures cost at the fine-grained instruction level (not the function level). It is easily summarized to the function level if desired.
- It is immune to cycles. A cycle appears as a call instruction occurring more than once on a stack sample, but that still counts as only one sample.

Cons:
- It slows down the program to take the samples. This is only a problem if there is confusion between the end (executing fewer instructions) and the means (finding the unnecessary instructions).
- It consumes more storage for the samples. Offsetting this is the fact that high statistical accuracy from a large number of samples is not necessary. Typically the smallest bottlenecks worth finding consume about 5% of execution cycles, and those are easily found with 50-100 samples.

For UI presentation, a butterfly view is easy to present, and quickly leads the user to the problem.

I would be interested in helping to implement this, if that is appropriate. I built such a sampler and UI in the early 90s, for DOS programs, but it used a TSR and is badly out of date.

Thanks,
Mike Dunlavey

P.S. I also wrote a couple of articles.
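The counting scheme described above can be sketched in a few lines of Python. This is a hypothetical illustration only: it uses the standard library's `signal.setitimer` to sample the interpreter's own call stack on a CPU-time timer, whereas a real tool would sample the PC and native stack. The function names `take_sample`, `report`, and `hot` are made up for the demo.

```python
import collections
import signal

samples = []  # each sample: the set of (function, line) locations on the stack

def take_sample(signum, frame):
    # Record each stack location at most once per sample, so a recursive
    # cycle still contributes only one count (the "immune to cycles" property).
    stack = set()
    while frame is not None:
        stack.add((frame.f_code.co_name, frame.f_lineno))
        frame = frame.f_back
    samples.append(stack)

def report():
    # Fraction of samples on which each location appeared, in descending order.
    counts = collections.Counter()
    for stack in samples:
        counts.update(stack)
    total = len(samples)
    return [(loc, n / total) for loc, n in counts.most_common()]

# Demo: sample a deliberately hot loop once per millisecond of CPU time.
def hot():
    s = 0
    for i in range(3_000_000):
        s += i * i
    return s

signal.signal(signal.SIGVTALRM, take_sample)
signal.setitimer(signal.ITIMER_VIRTUAL, 0.001, 0.001)
hot()
signal.setitimer(signal.ITIMER_VIRTUAL, 0, 0)  # stop sampling
```

The line that calls `hot()` is on every sample, so its fraction comes out at 1.0: removing that call would remove essentially all of the run time, which is exactly the signal the method is after.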
Created attachment 28979 [details]
My article in Dr. Dobbs

It's just a winzip file. Nothing fancy.
Created attachment 28980 [details]
2nd half of article

Zip file containing page images.
Created attachment 28981 [details]
My SIGPLAN article
P.S. It also captures time spent in function calls that are invisible in the source code, such as compiler-inserted calls.
Is this wish about implementing call-path profiling with sampling as part of Callgrind, or as a separate profiling tool?

Sampling needs some counter or timer measuring events to trigger the samples, and it is usually done to get very low overhead, using hardware timers or hardware performance counters. However, in Callgrind/Cachegrind the cache simulator generates the events, and hardware resources are not used at all. So by using sampling here you would not really get the speedup you might assume: the dynamic instrumentation, the cache simulator, and the updating of the shadow call stack (in the case of Callgrind) all need to run all the time.

Regarding call-path profiling: if you run Callgrind with e.g. "--separate-callers=50", all events get attributed to the call chain up to length 50. The symbols shown in KCachegrind are then just the concatenation of all symbols in the call chain. Not the best visualization, but a start...
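For concreteness, a typical invocation would look like the following (the program name and the PID in the output file name are placeholders; the `--separate-callers` flag is the one mentioned above):

```shell
# Attribute events to call chains up to 50 callers deep
valgrind --tool=callgrind --separate-callers=50 ./myprog

# Inspect the result; symbols appear as concatenated call chains
kcachegrind callgrind.out.<pid>
```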
This is something we also request. I have made a modification that does something similar and posted a patch in bug 394307.