Are there any plans to implement sampling (of the PC and call stack) as a performance analysis technique? That is my favorite method; I've been doing it manually on software large and small for at least 30 years. Some others have also discovered the method and swear by it, while those who have not cannot imagine that it works.

The basic method: take a number of samples of the call stack during the interval of interest. Then, for each instruction that appears on those stacks, calculate the fraction of stack samples on which it appears, and display the instructions in descending order by that fraction. Any such instruction, if it could be removed, would save that fraction of the execution time. Any bottleneck taking significant time will appear in that list.

Pros:
- Captures cost at the fine-grained instruction level (not the function level). It is easily summarized to the function level if desired.
- It is immune to cycles. A cycle appears as a call instruction occurring more than once on a stack sample, but that still counts as only one sample.

Cons:
- It slows down the program to take the samples. This is only a problem if there is confusion between the end (executing fewer instructions) and the means (finding the unnecessary instructions).
- It consumes more storage for the samples. Offsetting this is the fact that high statistical accuracy from a large number of samples is not necessary. Typically the smallest bottlenecks worth finding consume about 5% of execution cycles, and those are easily found with 50-100 samples.

For UI presentation, a butterfly view is easy to present, and quickly leads the user to the problem.

I would be interested in helping to implement this, if that is appropriate. I built such a sampler and UI in the early 90s, for DOS programs, but it used a TSR and is badly out of date.

Thanks,
Mike Dunlavey

P.S. I also wrote a couple of articles.
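The counting scheme described above can be sketched in a few lines of Python. This is a hypothetical illustration only: it uses the standard library's `signal.setitimer` to sample the interpreter's own call stack on a CPU-time timer, whereas a real tool would sample the PC and native stack. The function names `take_sample`, `report`, and `hot` are made up for the demo.

```python
import collections
import signal

samples = []  # each sample: the set of (function, line) locations on the stack

def take_sample(signum, frame):
    # Record each stack location at most once per sample, so a recursive
    # cycle still contributes only one count (the "immune to cycles" property).
    stack = set()
    while frame is not None:
        stack.add((frame.f_code.co_name, frame.f_lineno))
        frame = frame.f_back
    samples.append(stack)

def report():
    # Fraction of samples on which each location appeared, in descending order.
    counts = collections.Counter()
    for stack in samples:
        counts.update(stack)
    total = len(samples)
    return [(loc, n / total) for loc, n in counts.most_common()]

# Demo: sample a deliberately hot loop once per millisecond of CPU time.
def hot():
    s = 0
    for i in range(3_000_000):
        s += i * i
    return s

signal.signal(signal.SIGVTALRM, take_sample)
signal.setitimer(signal.ITIMER_VIRTUAL, 0.001, 0.001)
hot()
signal.setitimer(signal.ITIMER_VIRTUAL, 0, 0)  # stop sampling
```

The line that calls `hot()` is on every sample, so its fraction comes out at 1.0: removing that call would remove essentially all of the run time, which is exactly the signal the method is after.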
Created attachment 28979 [details]
My article in Dr. Dobbs

It's just a winzip file. Nothing fancy.
Created attachment 28980 [details]
2nd half of article

Zip file containing page images.
Created attachment 28981 [details]
My SIGPLAN article
P.S. It also captures time spent in function calls that are invisible in the source code, such as compiler-inserted calls.
Is this wish about implementing call-path profiling with sampling as part of Callgrind, or as a separate profiling tool?

Sampling needs some counter or timer measuring events to trigger the samples, and it is usually done to get very low overhead, using hardware timers or hardware performance counters. However, in Callgrind/Cachegrind the cache simulator generates the events, and hardware resources are not used at all. So by using sampling here you would not really get the speedup you might assume: the dynamic instrumentation, the cache simulator, and the updating of the shadow call stack (in the case of Callgrind) all need to run all the time.

Regarding call-path profiling: if you run Callgrind with e.g. "--separate-callers=50", all events get attributed to the call chain up to length 50. The symbols shown in KCachegrind are then just the concatenation of all symbols in the call chain. Not the best visualization, but a start...
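For concreteness, a typical invocation would look like the following (the program name and the PID in the output file name are placeholders; the `--separate-callers` flag is the one mentioned above):

```shell
# Attribute events to call chains up to 50 callers deep
valgrind --tool=callgrind --separate-callers=50 ./myprog

# Inspect the result; symbols appear as concatenated call chains
kcachegrind callgrind.out.<pid>
```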
This is something we also request. I have made a modification that does something similar and posted a patch in bug 394307.