Demo:

$ cat demo.c
#include <malloc.h>
char *ptr;
void allocate_reachable() { ptr = malloc(42); }
int main(void) { allocate_reachable(); }

$ cat demo.supp
{
   Shouldn't be here!
   Memcheck:Leak
   fun:malloc
   fun:allocate_reachable
   fun:main
}

$ gcc demo.c
$ valgrind --leak-check=full --suppressions=demo.supp -v ./a.out 2>&1 | tail
==6744==    still reachable: 0 bytes in 0 blocks
==6744==         suppressed: 42 bytes in 1 blocks
==6744==
==6744== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 4 from 4)
--6744--
--6744-- used_suppression:      1 Shouldn't be here!
--6744-- used_suppression:      2 U1004-ARM-_dl_relocate_object
--6744-- used_suppression:      2 glibc-2.5.x-on-SUSE-10.2-(PPC)-2a
==6744==
==6744== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 4 from 4)
Created attachment 64731 [details]
Proposed fix
Hm, I wonder what perf improvement this gives us - assuming we have a large amount of still-reachable allocations at the end of the run on something like Chromium...
We ARE observing a speed-up with this patch applied: http://code.google.com/p/chromium/issues/detail?id=101098
Created attachment 64749 [details]
Proposed fix #2

Looking from the user PoV, IMO we shouldn't bother matching suppressions against the leak types we don't print out (e.g. still-reachable, --show-possibly-lost=no).

Attached is the new patch which accounts for that.

Test, run with "--leak-check=yes --show-possibly-lost=no":

----demo.c-----
#include <malloc.h>
char *ptr_reachable, *ptr_possible;
void allocate_reachable() { ptr_reachable = malloc(42); }
void allocate_possible() { ptr_possible = 4 + (char*)malloc(42); }
int main(void) {
  allocate_reachable();
  allocate_possible();
}
---------------

---demo.supp---
{
   Shouldn't be here! - reachable
   Memcheck:Leak
   fun:malloc
   fun:allocate_reachable
   fun:main
}
{
   Shouldn't be here! - possible
   Memcheck:Leak
   fun:malloc
   fun:allocate_possible
   fun:main
}
---------------
Perf data here. Looks like the comment #1 patch improved the total performance of Chromium ui_tests by 10%.

ui_tests consist of ~100 test cases, each doing the following:
1) start a new Chromium process
2) trigger a couple of hooks emulating user input
3) check the results of (2)
4) stop Chromium (and perform the leak check when running under Valgrind)
Each test case takes about 1-2 minutes under Valgrind.

Right before the update:
http://build.chromium.org/p/chromium.memory.fyi/builders/Linux%20Tests%20%28valgrind%29%281%29/builds/4476/
-> Chromium ui_tests take 33m42s + 30m28s

Right after the update:
http://build.chromium.org/p/chromium.memory.fyi/builders/Linux%20Tests%20%28valgrind%29%281%29/builds/4476/
-> Chromium ui_tests take 30m23s + 27m09s

Same holds for the other shards:
http://build.chromium.org/p/chromium.memory.fyi/builders/Linux%20Tests%20%28valgrind%29%282%29/builds/5050
-> 27m01s + 29m13s
vs
http://build.chromium.org/p/chromium.memory.fyi/builders/Linux%20Tests%20%28valgrind%29%282%29/builds/5051
-> 23m42s + 26m05s
The idea is interesting (at work, we start a big system once, and then run a bunch of tests with a leak search between each test). So, optimising the leak search is a good idea.

The change however creates a slightly confusing situation. Let's imagine a (typical?) user that does --leak-check=full --show-reachable=no (for the below, the test program definitely leaks 1000 blocks of 1 byte and has one block of 3 bytes still reachable).

Without the patch, the "reachable suppressed" are always counted and shown in the suppressed part:
==29035==    still reachable: 0 bytes in 0 blocks
==29035==         suppressed: 1,003 bytes in 1,001 blocks

With the change, the "reachable suppressed" will be counted and shown in the still reachable when --show-reachable=no:
==29255==    still reachable: 3 bytes in 1 blocks
==29255==         suppressed: 1,000 bytes in 1,000 blocks

If the user then worries about this "still reachable" and re-runs with --show-reachable=yes, then the below is obtained:
==30195==    still reachable: 0 bytes in 0 blocks
==30195==         suppressed: 1,003 bytes in 1,001 blocks

On a simple example, the behaviour is understandable. In a complex big application with many suppressions and a not necessarily 100% reproducible run, this non-intuitive behaviour is somewhat annoying.

So, despite the fact that the speed improvement for leak search would be welcome (it would probably provide a speedup at my work similar to the speedup on Chromium, or maybe more), I am not that enthusiastic to commit the change. I do not see how to present the above behaviour in a non-confusing way to the user and/or without introducing some more complexity in the combinations reachable/possibly/definitely X suppressed/not suppressed.

Any idea / suggestion?
Created attachment 72351 [details]
patch of Timur updated to a recent 3.8.0
Created attachment 72352 [details]
patch of Timur updated to a recent 3.8.0 (only the relevant changes)
> I do not see how to present the above behaviour in a non-confusing way
> to the user and/or without introducing some more complexity in the
> combinations reachable/possibly/definitely X suppressed/not suppressed.

Personally I think splitting "suppressed" into reachable/possibly/definitely is the way to go. It's not adding complexity but actually making the already-complex thing understandable in more than just trivial cases.
(In reply to comment #9)
> > I do not see how to present the above behaviour in a non-confusing way
> > to the user and/or without introducing some more complexity in the
> > combinations reachable/possibly/definitely X suppressed/not suppressed.
> Personally I think splitting "suppressed" into reachable/possibly/definitely
> is the way to go. It's not adding complexity but actually making the
> already-complex thing understandable in more than just trivial cases.

Even if we would split the suppressed per reachable/possibly/definitely, the behaviour would not be straightforward: when the matching between a loss record and a suppression is not done (due to show-reachable=no), where would we put this info? Now, it is in "not suppressed" (without suppr matching, we cannot decide). If we split the suppressed in 3, then we would put it in "reachable not suppressed". To make it (somewhat) understandable, we would need a third state, so as to have: suppressed, not suppressed, undetermined, and have up to 12 different combinations reachable/possible/indirectly/definitely X the new third state.

A possible solution would be to have:
--show-reachable=no|yes|suppress
where suppress would mean to automatically consider a reachable loss record as part of the suppression.

However, this is still some additional complexity, for a gain of about 10% on the Chromium tests. Even if --show-reachable=suppress is not that ugly, it looks to me preferable to just work on optimising the leak search and/or the matching between a loss record and the suppressions.

Do you have some statistics about chromium? In particular: how many reachable loss records, how many suppression entries in your suppression file, and how many Mb of memory is scanned?
(In reply to comment #10)
> Do you have some statistics about chromium?
> In particular: how many reachable loss records, how many suppression entries
> in your suppression file, and how many Mb of memory is scanned?

With about 50_000 loss records and 1024 suppression entries, on my slow pc, I get about 24 seconds for suppr matching. With 512 suppressions, about 12 seconds. With 128 suppressions, about 3 seconds. So, performance seems to be linear in the nr of suppressions (and I guess is also linear in the nr of loss records). Of course, a lot of loss records multiplied by a lot of suppressions will take some time.

From reading the suppr matching code, I suppose most of the time is spent in the function supp_pattEQinp. This function will be called at least once for each suppression kind matching the loss record. It will be called as many times as there is something matching. supp_pattEQinp has to translate an IP address into a function name, and that might be costly (e.g. accessing the symtab).

So, suppr matching might be improved in two ways:
* avoid scanning suppression kinds that have no chance to match (this is probably not a big win, as that is basically just an integer comparison).
* avoid repetitively translating an IP address into a soname and/or fnname: still do a lazy evaluation, but keep the result once computed for a loss record, to re-use it for the next suppression matching.
> A possible solution would be to have:
> --show-reachable=no|yes|suppress
> where suppress would mean to automatically consider
> a reachable loss record as part of the suppression.

I like this one. Probably better names would be:
--show-reachable=no|yes|stats

> However, this is still some additional complexity, for a gain of about 10%
> on the Chromium tests.

I'd rather say it speeds up leak checking by Xx where X is large :)
On Chromium, the leak checking was so slow that the multi-process stuff was often thinking child processes were hung. Increasing the timeouts didn't help enough, or it was just slowing down everything a lot, especially on tests with a lot of processes.

> Do you have some statistics about chromium?
> In particular: how many reachable loss records, how many suppression entries
> in your suppression file,

I don't remember exactly anymore. Roughly, A LOT.

> and how many Mb of memory is scanned?

About 1-2GB of address space, I suppose.

You can look at hacking the code at http://code.google.com/p/valgrind-variant/wiki/LeakCheckingOn32bits as one more benchmark.

> * avoid repetitively translating an IP address into a soname and/or fnname:
> still do a lazy evaluation, but keep the result once computed for a loss record,
> to re-use it for the next suppression matching.

This sounds like a good improvement to be done in the Valgrind framework (in the lookup_pc() API) if it's not done yet.
(In reply to comment #12)
> > However, this is still some additional complexity, for a gain of about 10%
> > on the Chromium tests.
> I'd rather say it speeds up leak checking by Xx where X is large :)

??? I am not too sure I understand how much we really gain in total. Checking on the chromium src website, the memcheck suppression file has about 350 suppressions. If chromium has

> On Chromium, the leak checking was so slow that the multi-process stuff was
> often thinking child processes were hung.
> Increasing the timeouts didn't help enough, or it was just slowing down
> everything a lot, especially on tests with a lot of processes.
>
> > Do you have some statistics about chromium?
> > In particular: how many reachable loss records, how many suppression
> > entries in your suppression file,
> I don't remember exactly anymore. Roughly, A LOT.
>
> > and how many Mb of memory is scanned?
> About 1-2GB of address space, I suppose.

On an Intel(R) Xeon(R) CPU X5670 @ 2.93GHz, Valgrind scans 1 Gb of memory, searching for 2 million malloc-ed blocks, in about 16 seconds (based on the test leak program you have referenced). Adding a file with 1024 suppressions almost does not change this time (this is explained by the fact that your test program creates very few different stack traces, and so there are very few matches to do between a loss record and the 1024 suppressions).

If I use the same 1024 suppressions with the test program perf/many-loss-records, it has to do about 50_000 loss records X 1024 suppressions. This takes about 8 seconds. I suspect these 8 seconds might become a lot more if a big executable containing many shared libs and/or many debug symbols is used (chromium?).

Also, I believe that suppressions starting with ... or having one or more ... might often cause all of the IPs of the stack trace to be translated to a function name or object name. If the above is true, then the below optimisation should help significantly.
> > * avoid repetitively translating an IP address into a soname and/or fnname:
> > still do a lazy evaluation, but keep the result once computed for a loss record,
> > to re-use it for the next suppression matching.
> This sounds like a good improvement to be done in the Valgrind framework (in
> the lookup_pc() API) if it's not done yet.

I am more thinking about doing a specific optimised structure for the error/suppression matching code, as this looks simpler than introducing a result cache in the debug info/symtab handling (and I guess that the most frequent usage of ip-to-object/fn translation is for error -> suppr matching). The other usages should be a lot less frequent (e.g. when really printing an error).
> On an Intel(R) Xeon(R) CPU X5670 @ 2.93GHz,
> Valgrind scans 1 Gb of memory, searching for 2 million malloc-ed blocks
> in about 16 seconds (based on the test leak program you have referenced).

That's a lot compared to native execution of the same test in just 1 second; theoretically it should take 20-40 seconds to run under Valgrind (assuming the 20x-40x slowdown). Adding an extra 10-20 seconds *for each process* makes things up to 1.5x-2x slower overall...

> Also, I believe that suppressions starting with ... or having one or more ...
> might often cause all of the IPs of the stack trace to be translated
> to a function name or object name.

This sounds very much possible.

> > This sounds like a good improvement to be done in the Valgrind framework
> > (in the lookup_pc() API) if it's not done yet.
> I am more thinking about doing a specific optimised structure for the
> error/suppression matching code, as this looks simpler than introducing a
> result cache in the debug info/symtab handling (and I guess that the most
> frequent usage of ip-to-object/fn translation is for error -> suppr matching).
> The other usages should be a lot less frequent (e.g. when really printing an error).

In ThreadSanitizer we look up a PC for every basic block being executed. We ended up having our own caching for that. Not sure about the other tools, though; I agree they may only need to look up when printing out a report.
> Adding an extra 10-20 seconds *for each process* makes things up to 1.5x-2x
> slower overall...

And again, the native shutdown timeout is hundreds of milliseconds; it's OK to multiply it (assuming the 20x Valgrind slowdown), but it might not be OK to increase it by adding a 16-second constant :)
(In reply to comment #14)
> > On an Intel(R) Xeon(R) CPU X5670 @ 2.93GHz,
> > Valgrind scans 1 Gb of memory, searching for 2 million malloc-ed blocks
> > in about 16 seconds (based on the test leak program you have referenced).
> That's a lot compared to native execution of the same test in just 1 second;
> theoretically it should take 20-40 seconds to run under Valgrind
> (assuming the 20x-40x slowdown).
> Adding an extra 10-20 seconds *for each process* makes things up to 1.5x-2x
> slower overall...

The 16 seconds above is the time needed to scan the 1 Gb of memory. This bug report (and its patch) is not improving the scan time.

Basically, the leak search has the following steps:
1. build a list of all allocated blocks (this is very fast)
2. scan the memory (this is the 16 seconds above) and mark all allocated blocks that are still reachable
3. regroup all blocks into loss records, i.e. same status (reachable, definitely lost, ...) and same stack trace
4. for each loss record, see if it matches a suppression; output the non-matching loss records

Steps 1 to 3 are taking about 16 seconds for 1 Gb of memory and 200000 blocks. These 16 seconds look to be linear in terms of memory size. It should be n log n in terms of the nr of blocks (but I could not detect the n log n impact, so it looks like a very "flat" n log n).

Step 4 is very small for your test program (as there are only a few loss records). When step 4 is working on about 50_000 loss records and 1024 suppressions, it takes 8 seconds. This bug speaks about optimising step 4.

What I do not understand yet is how the Chromium tests can imply a huge time for step 4. Having 50_000 loss records is really a lot, as you need 50_000 different stack traces for malloc calls (having recursive functions helps to obtain that). 50_000 * 1024 suppressions is not that horrible (only 8 seconds compared to the 16 seconds for the leak scan).
At work, a big application with many recursive algorithms, a big pile of code, and several hundred thousand allocated blocks takes between 8 and 20 seconds for the total leak search. So, it is still not clear to me what exactly has to be optimised.

I suspect the speed of the suppression matching code is badly influenced by the kind of suppressions (... or not ...?) and by the nr of shared libs and/or debug info, which might explain the 8 seconds for the small test case and the (huge) timeouts for the chromium tests (but how much is "huge"? do you have an idea, or can you re-measure?).

Before (prematurely?) optimising a part of the leak search, it would be nice to understand more where the problem is coming from: is it better to work on reducing the 16 seconds (maybe not trivial)? or to optimise the suppression matching code? or to "enrich/complexify" --show-reachable by adding a third value (suppress or stats)?
Created attachment 72738 [details]
small optimisation for suppr matching

Performance of leak scanning has been improved in revision 12756. The attached patch (somewhat) improves the suppression matching (following a suggestion from Julian). On a test case with 1024 suppressions and 50_000 loss records, the patch improves the total leak search time by 10 to 15%.
Created attachment 72739 [details]
micro_optimisation of the cmp functions in m_libcbase.c

Looking at the asm code, the code is slightly smaller and contains 6 jumps instead of 9.
(In reply to comment #17)
> Created attachment 72738 [details]
> small optimisation for suppr matching

Seems fine, except for one logic problem, easy to fix: a pattern string can also have '?' as a wildcard that matches exactly one character. In this case, a string containing a '?' but no '*' will be classified as a simple_string and so will only match against a string with literally '?' in that position ==> a change in semantics.
(In reply to comment #18)
> Created attachment 72739 [details]
> micro_optimisation of the cmp functions in m_libcbase.c

Ok iff you are 110% sure these don't change the semantics :-)
Created attachment 72814 [details]
implements lazy conversion of stack trace to fun/obj name during suppression matching

The attached patch optimises suppr matching by avoiding redoing the debug info search for each suppression match with a stack trace. Running many-loss-records with a 1024-suppression file (all suppressing leaks, i.e. pattern matching is needed) gives the following times in seconds (also compared with 3.7.0):

        amd64   x86
3.7.0   9.253   34.0
trunk   9.635   28.5
patch   3.252   10.6

So, basically, a factor 3 improvement on this (rather artificial) test. Would be nice to test with a real setup (e.g. the chromium tests).
3x sounds cool! I'll give it a try when I'm back from vacation (late August)
suppr matching improvement committed in rev 12824.
Wait, why have you changed the subject and closed the bug? The original idea was that the output logic is confusing (please re-read comments #0, #1, #4). Your fix doesn't address that, right? So I'm reopening.

The original optimization (comment #5) was a nice side-effect, and it's good to see that your optimizations made things much faster even without my patch, but it deserves a separate bug (266390?).

Also, just recalled:
> What I do not understand yet is how the Chromium tests
> can imply a huge time for step 4.

Do you know that we use --num-callers=30 and many suppressions like this:
{
   blah
   Memcheck:Leak
   ...
   fun:someFunction1
   fun:someFunction2
}
?

I'm pretty sure it makes things much slower than with --num-callers=10.
(In reply to comment #24)
> Wait, why have you changed the subject and closed the bug?
> The original idea was that the output logic is confusing (please re-read
> comments #0, #1, #4). Your fix doesn't address that, right?
> So I'm reopening.

Ok, I misunderstood the bug (I thought the main problem was the efficiency linked with the useless matching of suppressions against reachable loss records). I will re-read more carefully :)

Is it ok to close bug 266390 based on these patches?
https://bugs.kde.org/attachment.cgi?id=72738
https://bugs.kde.org/attachment.cgi?id=72739
https://bugs.kde.org/attachment.cgi?id=72814

> The original optimization (comment #5) was a nice side-effect and it's good
> to see that your optimizations made things much faster even without my
> patch, but it deserves a separate bug (266390?).
>
> Also, just recalled:
> > What I do not understand yet is how the Chromium tests
> > can imply a huge time for step 4.
> Do you know that we use --num-callers=30 and many suppressions like this:
> {
>    blah
>    Memcheck:Leak
>    ...
>    fun:someFunction1
>    fun:someFunction2
> }
> ?
>
> I'm pretty sure it makes things much slower than with --num-callers=10.

I suspect that effectively such ... will imply always translating the full stack traces. The just-committed patch should improve this significantly if you have a lot of such suppression patterns: an IP address will be translated at most once.
> Is it ok to close bug 266390

Let me check it first. I've added a comment there.

> The just-committed patch should improve this significantly if you have
> a lot of such suppression patterns: an IP address will be translated
> at most once.

Yes, sounds right.
(In reply to comment #24)
> The original idea was that the output logic is confusing (please re-read
> comments #0, #1, #4).

I have re-read, and I do not see what is confusing.

A leak check produces loss records. These loss records can be of different kinds:
  reachable
  possibly-lost
  definitely-lost
Some of these loss records can be suppressed.

The --show-possibly-lost=yes (or no) only indicates whether to show the non-suppressed possibly lost (similar for --show-reachable). It does not influence the fact that a suppression is or is not suppressing the loss record. If it did influence that, then we are back to the (real, IMO) confusion described in comments 6 and 10.

I think the current behaviour is simple: a suppression entry always suppresses what it has to suppress. Then the user can decide to show (or not) the non-suppressed parts. It is not clear to me what we can gain by introducing new user-visible categories.

Maybe I am missing the point? Can you tell exactly what is confusing in the current behaviour (i.e. what information the user is missing, or what wrong info is being given to the user)?
> The --show-possibly-lost=yes (or no) only indicates whether to show the
> non-suppressed possibly lost (similar for --show-reachable).

That's the confusion by itself too.

In Chromium we don't care at all about still-reachable stuff, as there are too many of them by design. (*)
We have a lot of suppressions due to the "commit; oh, Valgrind found something; suppress; fix someday later" development process. We often rely on the "used suppressions" section to judge whether a particular leak/bug is still present. (**)
Due to (*) and (**), we don't want the still-reachable allocations matching the suppressions to show up in the stats.

--

Re: comment 6 + comment 10:
1) Splitting the definitely/possibly/reachable stats is a good thing as it helps to prioritize fixing stuff.
2) Then, matching the suppressions and showing the number of suppressed allocations the way you want is OK if
   a) it doesn't re-introduce the Chromium problem described above, and
   b) it's OK performance-wise after r12824.
   We don't want the # of suppressed still-reachable/possibly-lost allocations to show up in the used_suppressions section, though.
3) If either (2a) or (2b) is false, we can probably just skip the uninteresting stats sections - under a flag if you want.
(In reply to comment #28)
> In Chromium we don't care at all about still-reachable stuff, as there are
> too many of them by design. (*)
> We have a lot of suppressions due to the "commit; oh, Valgrind found
> something; suppress; fix someday later" development process.
> We often rely on the "used suppressions" section to judge whether a
> particular leak/bug is still present. (**)
> Due to (*) and (**), we don't want the still-reachable allocations matching
> the suppressions to show up in the stats.

To be sure I understand: the problem is that a suppression entry can match definitely-lost loss records, possibly-lost loss records and still-reachable loss records. Due to this, once you have added a suppression to handle a known definitely-lost, it will also "collect" the reachable loss records, which means you cannot track precisely anymore the amount of definitely lost.

Is that a correct understanding?

(If correct, the above looks to me to be a real problem: effectively, when you suppress a leak, it is based on the assumption that this leak is leaking a small amount of memory, and so it must be possible to differentiate the amount of "suppressed definitely-lost (real) leaks" from the amount of "suppressed reachable (not real) leaks".)

> b) it's OK performance-wise after r12824

I committed the changes in r12824 as the perf. improvement for the artificial test case was confirmed on a big application at work. I am curious to see what it gives on Chromium. It will not be easy to isolate specifically the perf. improvement brought by the suppression matching, as the leak search itself has also been improved in 3.8.0.
> To be sure I understand: the problem is that a suppression entry can match
> definitely-lost loss records, possibly-lost loss records and still-reachable
> loss records. Due to this, once you have added a suppression to handle
> a known definitely-lost, it will also "collect" the reachable loss records,
> which means you cannot track precisely anymore the amount of
> definitely lost.
>
> Is that a correct understanding?

Yes, exactly!

> > b) it's OK performance-wise after r12824
> I committed the changes in r12824 as the perf. improvement
> for the artificial test case was confirmed on a big application at work.
> I am curious to see what it gives on Chromium.

It gives a lot; here are some numbers: https://bugs.kde.org/show_bug.cgi?id=266390#c6
https://bugs.kde.org/show_bug.cgi?id=307465#c5 should provide the flexibility needed to specifically show and/or count as error and/or suppress precisely what is desired.
Too many words here and some are off-topic; there's a more focused report #307465 with pretty much the same topic as c#29. *** This bug has been marked as a duplicate of bug 307465 ***