Debug info is lost for .so files when they are dlclose'd. This makes some errors much less informative than they should be. It's mostly a problem for errors that are detected and reported long after the relevant code has executed, particularly memory leaks -- if the allocating code was in a dlclose'd .so, the trace won't identify the allocation point.
*** Bug 79875 has been marked as a duplicate of this bug. ***
*** Bug 84541 has been marked as a duplicate of this bug. ***
The problem ultimately comes from the fact that ExeContexts (ie. stack traces) use an imperfect way of identifying code locations -- they use the memory address where the instruction of note is stored. This uniquely identifies the instruction while it's in memory, but is no good once it's out of memory (ie. unloaded).

Better would be to use an object code location (eg. object file name + offset) or even a source code location (eg. source file name + function + line number; if these aren't known fall back on "???"). Esp. with source file locations, if the lookup is done immediately the debug info will be present and fresh and all these problems would disappear.

The problem with these is that they don't have the single best property of the memory addresses -- which are extremely compact and easy to get, ie. a single pointer represents the code location. By comparison, source code locations require looking up debug info (which is wasteful if some of the ExeContexts are not subsequently printed) and storing strings.

So it's a matter of coming up with an efficient way to represent unique code locations.
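To make the representation concrete, here is a minimal sketch (hypothetical names, not Valgrind's actual code) of an ExeContext-style store: a stack trace is just an array of instruction addresses, deduplicated so that each distinct trace is kept only once. The compactness is obvious -- and so is the weakness: nothing in this structure survives the addresses becoming meaningless when an object is unloaded.

```c
/* Minimal sketch of an address-only stack-trace store (names are
 * hypothetical, not Valgrind's real code).  A trace is an array of
 * code addresses; duplicates are folded onto one canonical record. */
#include <stdlib.h>
#include <string.h>

typedef unsigned long Addr;

typedef struct ExeCtx {
    struct ExeCtx *next;
    unsigned n_ips;
    Addr ips[8];          /* fixed cap just to keep the sketch small */
} ExeCtx;

static ExeCtx *ctx_list = NULL;

/* Return the canonical record for this trace, creating it on first sight. */
ExeCtx *record_trace(const Addr *ips, unsigned n)
{
    for (ExeCtx *c = ctx_list; c; c = c->next)
        if (c->n_ips == n && memcmp(c->ips, ips, n * sizeof(Addr)) == 0)
            return c;                     /* seen before: reuse it */
    ExeCtx *c = malloc(sizeof *c);
    c->n_ips = n;
    memcpy(c->ips, ips, n * sizeof(Addr));
    c->next = ctx_list;
    ctx_list = c;
    return c;
}
```

Note that the only key is the addresses themselves; once an object is unmapped, nothing here says what those addresses used to mean.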
Nicholas Nethercote wrote:
> ------- Additional Comments From njn25 cam ac uk 2004-07-06 10:46 -------
> The problem ultimately comes from the fact that ExeContexts (ie. stack traces)
> use an imperfect way of identifying code locations -- they use the memory
> address where the instruction of note is stored. This uniquely identifies the
> instruction while it's in memory, but is no good once it's out of memory (ie.
> unloaded).
>
> Better would be to use an object code location (eg. object file name + offset)
> or even a source code location (eg. source file name + function + line number;
> if these aren't known fall back on "???"). Esp. with source file locations, if
> the lookup is done immediately the debug info will be present and fresh and
> all these problems would disappear.

I saw this suggestion on the -developer list. I think it's already been shown to be not a great idea: you'll introduce a lot of redundant symbol-table searching by doing that, and with the fallback required, you've just introduced a special case to your data format.

> The problem with these is that they don't have the single best property of the
> memory-addresses -- which are extremely compact and easy to get, ie. a single
> pointer represents the code location. By comparison, source code locations
> require looking up debug info (which is wasteful if some of the ExeContexts
> are not subsequently printed) and storing strings.
>
> So it's a matter of coming up with an efficient way to represent unique code
> locations.

Please consider my suggestion from 84541. It only adds one element (a timestamp or sequence number; a sequence number is probably better) to the current context information. Then add a handler to intercept calls to _dl_debug_state() to update valgrind's notion of what objects are mapped in memory, and maintain a list of all the objects that have been added and removed, recording the current sequence number at load and unload time.
Parsing the ELF linker map is pretty easy, I wrote similar code for FunctionCheck to do the same thing. You need to record the object name, the base address and size, and the start and end times. When it comes time to display the symbols, you may need to reread the symbol tables of the unloaded objects but I think you already have code to do that.
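The bookkeeping suggested above can be sketched in a few lines (names are hypothetical, not from FunctionCheck or Valgrind): one record per mapped object, stamped with the sequence number current at load and at unload time. An (address, sequence-number) pair then resolves unambiguously even if the same address range is reused by a later dlopen.

```c
/* Sketch of the per-object record suggested above (hypothetical names).
 * A still-loaded object has unload_seq == UINT_MAX. */
#include <limits.h>
#include <stddef.h>

typedef struct ObjRecord {
    const char   *name;          /* object file name                     */
    unsigned long base, size;    /* mapped address range                 */
    unsigned      load_seq;      /* sequence number current at load time */
    unsigned      unload_seq;    /* UINT_MAX while still mapped          */
} ObjRecord;

/* Find the object that covered 'addr' at time 'seq', or NULL. */
const ObjRecord *resolve(const ObjRecord *tab, size_t n,
                         unsigned long addr, unsigned seq)
{
    for (size_t i = 0; i < n; i++)
        if (addr >= tab[i].base && addr < tab[i].base + tab[i].size
            && seq >= tab[i].load_seq && seq <= tab[i].unload_seq)
            return &tab[i];
    return NULL;
}
```

The point of the two timestamps is visible in the lookup: two objects can legitimately share the same address range as long as their [load_seq, unload_seq] intervals don't overlap.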
> I think it's already been shown to be not a great idea Where was this shown?
Nicholas Nethercote wrote:
> ------- Additional Comments From njn25 cam ac uk 2004-07-06 11:15 -------
>> I think it's already been shown to be not a great idea
> Where was this shown?

As pointed out here http://sourceforge.net/mailarchive/message.php?msg_id=8864868 object_file + offset doesn't work because ELF files aren't that simple. I guess you can make the assumption that there's only one Text segment of interest in any object file, and that only Text is interesting in a stack trace. But functions like VG_(describe_addr) may need to look at the Data or BSS as well.

Mapping immediately to file/fn/line loses information that is often essential to diagnosing an error. Simple things like multi-line preprocessor macros are impossible to debug with just a single source line. (They're a pain to debug anyway, but with actual addresses it's possible to debug in assembler, if nothing else.)

There's also the very odd possibility that a particular module gets reloaded multiple times over the life of a process, and due to other activity, gets mapped to a different address each time. I've seen this happen with Cyrus servers as they load and unload authentication mechanisms. I guess your three-level table approach still works here, but it'll have redundant strings in this case.

Looking at the problem again from your first statement: you want an ExeContext to uniquely identify an instruction. Right now you treat memory as a one-dimensional address space, with the assumption that a single Address uniquely identifies any instruction. But because this space actually can change over time, it is really a two-dimensional address space.
...clearly in a 2D space, you cannot uniquely represent anything with only a single coordinate (Address). And using more than two coordinates is overkill. You already pointed out that it's wasteful to do full symbol resolution every time, because the majority of symbols won't ever be displayed.
There may be a work-around for this problem that will be good enough for most applications. Try using LD_PRELOAD to load the libraries into memory as valgrind is started. That seems to ensure that the symbols are still around when valgrind gets around to doing symbol resolution. Example:

LD_PRELOAD=/usr/local/lib/libbuggy.so valgrind --leak-check=full /my/app -x -y -z

This isn't a perfect fix for the problem. If the application is sensitive to the order in which libraries are loaded, this can cause more problems than it solves.
Confirmed to still be a bug in valgrind 3.2.3
I just ran into this problem also. Besides stack traces with lots of ??'s in them coming from memory leaks in modules that were since unloaded, I also had an even trickier issue: one of the shared libraries registered a "destructor" with pthread_key_create(). By the time the thread exited, the shared library was gone. At least "valgrind -v" allowed me to compare the memory address where the crash occurred against the valgrind messages "discarding syms at [...] due to munmap()".

How about the following option, made in ignorance of the valgrind internals: at munmap() time, don't discard the symbols associated with this memory region. Just preserve the region as is, marking it non-readable, non-writable and non-executable by the program under test. When another (or the same) shared library is loaded (again), it is assigned a different address. That way, when an error occurs, there is still a unique mapping from memory address to symbol (if not vice versa).

That would certainly add to the memory usage, especially if many shared libraries are loaded, unloaded and re-loaded, but I think it would be of immense value. Maybe there could be a command-line option to preserve symbols for unloaded modules or not.
Created attachment 33376 [details]
so-thread-test.tar.gz

Here is a test case for the issue that I experienced. Tested on linux-x86_64.

The main file implements a simple facility to do work asynchronously in a different thread, like a one-thread thread pool. It then loads a shared library and runs a job from that shared library. The job uses pthread_key_create() to arrange for a destructor to be called at thread exit time. But the life of the thread exceeds the life of the shared library, so when thread exit time comes, the shared library and its destructor are long gone.

In the test.log file at line 80, the location of the crash is reported as "??", and the stack trace doesn't give you much else to go on. The only way to make some sense of it is to look at line 75, which shows that the address of the crash was within the range of the shared library. Obviously, I would like valgrind to report the symbol associated with the address ("test_destructor"), possibly with some useful hints like: "This address used to belong to function [xx] in file [yy], which was unloaded."

Yes, the test_job() function is buggy. But if it wasn't, I wouldn't need valgrind.
Created attachment 33377 [details] so-thread-test.tar.gz Sorry, forgot a pthread_join(). Updated test case.
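The mechanism the test case above relies on can be sketched in a few lines (hypothetical names; unlike the attached test case, this sketch keeps the destructor in the main executable so that it runs safely): a destructor registered with pthread_key_create() fires at thread exit, potentially long after the code that registered it has been dlclose'd.

```c
/* Demo of the pthread_key_create() destructor timing described above.
 * The destructor runs when the thread exits -- if it lived in a
 * dlclose'd .so, that call would land in unmapped memory. */
#include <pthread.h>

static pthread_key_t key;
static int destructor_ran = 0;

static void destructor(void *value)
{
    (void)value;
    destructor_ran = 1;   /* invoked by the runtime at thread exit */
}

static void *worker(void *arg)
{
    (void)arg;
    pthread_setspecific(key, &key);  /* any non-NULL value arms the dtor */
    return NULL;                     /* thread exit -> destructor fires  */
}

int run_demo(void)
{
    pthread_t t;
    pthread_key_create(&key, destructor);
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);           /* destructor has run by now */
    return destructor_ran;
}
```

POSIX guarantees the destructor is called at thread termination for every key with a non-NULL value, which is exactly why the thread's lifetime, not the library's, determines when the .so code is reached.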
I have created a patch that resolves this issue for me. Basically, when debug info is going to be discarded, I place it in a separate list. I then search both lists in many places. A few improvements might be to tweak things so that function names (or perhaps file names) get decorated with something like *UNLOADED*, to give the user a better idea of what happened. I have only tested this very briefly. One other approach might be to just move the debuginfo to the end of the list (possibly also marking a new flag in the structure indicating it as discarded), but I was reluctant to do this given that I noticed another function in the file that messes with the order.
Created attachment 55565 [details] potential patch to allow memory leak stack traces for unloaded .so files
Created attachment 55670 [details] update patch with regression tests Same patch as before, but now with +inf% more regression tests.
How do you decide which set of debuginfo to use to decode a given address if there are multiple possibilities? The other obvious question is that of possible unbounded memory usage in a long running process that repeatedly opens and closes shared libraries, possibly even opening the same one repeatedly - do you do anything special to avoid storing duplicates for example?
Currently I stop at the first match I find. I search the list of currently valid symbols, then the list of symbols that have been unloaded. The unloaded list is ordered from most recently unloaded to least recent.

The unbounded memory issue does exist, and AFAICT there is no really good solution if we are going to put off resolving leak stack traces until the process ends. I guess when adding to the unloaded symbols list I could delete anything that was completely shadowed by the new addition. I was reluctant to do this since I was thinking that a follow-on to this might be to add a 'generation number' to the address. I would then update the gen# when things get dlopen/dlclose'd, which would theoretically allow me to find the right symbols even if a new dlopen used the same address space as something that was dlclosed.

That all being said, my (personal) experience with valgrind is that I don't normally use it on long-running stuff (in large part b/c it already slows/bloats things); I work hard to make the shortest/smallest case to reproduce the problem. So I don't see the slow 'leak' I am causing as all that problematic.
Created attachment 57256 [details] Create new CLO for keeping unloaded symbols Since the only real feedback I got was that this change would cause a showstopper for programs that repeatedly dlopen/dlclose, I turned this into a CLO so that the user could opt in to the extra memory usage if they need it.
Nonsense - as far as I can see you haven't addressed, in any reasonable way, the question of how to resolve stack traces when multiple different ELF objects may have been loaded at a given address over the life of a program.
I thought I had responded to this a while back; somehow my post never showed up... anyway. I think that you are letting better be the enemy of good-enough.

This patch is not perfect; there are cases where it will print the wrong answer, when multiple objects get mapped to the same AVMA over the life of the program. In my experience (which is by no means comprehensive) this is not a common enough situation to worry about. Having it give the right answer most of the time is a huge win (at least for me).

The LD_PRELOAD hack mentioned above only kind-of works for me, and introduces a number of new false positives since dlclose doesn't actually clean up completely (since the object is still loaded). And having the object still loaded causes other issues, e.g. if you dlopen/dlclose the same object twice, static variables normally get reinitialized (on the second dlopen), but will still have stale data in them with LD_PRELOAD.

I have looked into making a patch that would always get the right answer. This means teaching valgrind that addresses (or at least ExeContexts) are not only about where, but also when. Two potential starting points are: 1) extend Addr to include the 'when'; 2) extend ExeContext to include the 'when'. Option (1) would have far-reaching implications all over the code base, so I gave up on that one completely. Option (2) is a little less invasive, but it does involve changing a number of function signatures (or adding a number of new functions): everything that is in the path of turning an ExeContext into a printed stack trace.
(In reply to comment #20)
> I have looked into making a patch that would always get the right answer. This
> means teaching valgrind that addresses (or at least ExeContext's) are not only
> about where, but also when. Two potential starting points are 1) Extend Addr to
> include the 'when' 2) Extend ExeContext to include the when. Option (1) would
> have far reaching implications all over the code base, so I gave up on that one
> completely. (2) Is a little less invasive, but it does involve changing a
> number of function signatures (or adding a number of new functions). Everything
> that is in the path of turning an ExeContext into a printed stack trace.

Actually the ExeContexts themselves don't need any changing; they already include a monotonically increasing serial number, "ecu". It might be ugly, but you could export an "ec_current_ecu" as a global to make it available to the debuginfo code, so that you can reference it while registering a symbol table, and again when doing symbol lookups. Then no function signatures need to change.
Actually they would still need to change: the full ExeContext does not get passed all the way down; it quickly devolves into a list of Addrs, and then each Addr is treated independently at the point that things actually get translated from an address into file/line information. So we would either need to change things so that ExeContexts get passed further down the stack, or have the "ecu" (or some other indication of 'when') get passed along.
(In reply to comment #22)
> Actually they would still need to change, the full ExeContext does not get
> passed all the way down, it quickly devolves into a list of Addr's and then
> each Addr is treated independently at the point that things actually get
> translated from an address into file/line information. So we would either need
> to change things so that ExeContext's passed further down the stack, or have
> the "ecu" (or some other indication of 'when') get passed along.

Pretty sure you're overlooking something (although I haven't tried this with your patch yet, will try it in the next day or so). ExeContexts are processed one at a time. All you need to do is set the global ec_current_ecu when you're creating a new ExeContext and when you're processing one (e.g. pp_ExeContext). It doesn't need to get passed along at all.
Created attachment 57564 [details] Addition for ecu tracking This is what I had in mind. Add it on top of your latest patch. It doesn't appear to break anything, anyway.
OOOOoooohhhh. I totally misunderstood where you were going in the previous post. This seems to get the job done with very little new code. Unfortunately I suspect it has some nasty side effects. In particular, there are some places that print out stack traces without going through pp_ExeContext, so those would end up using a stale value in VG_(ec_curr_ecu).

Also, is it a true statement that only one thread is ever printing stack traces at a time? I don't have a good feel for all the guarantees of valgrind yet. The case I am thinking about is two threads doing something bad (e.g. accessing off the end of a malloc'ed array) at the same time. They would both print a current stack trace (i.e. where the bad access happened) and a past stack trace (i.e. where the allocation of said array occurred).
(In reply to comment #25)
> OOOOoooohhhh. I totally misunderstood where you were going in the previous
> post. This seems to get the job done with very little new code. Unfortunately I
> suspect it has some nasty side effects. In particular, there are some places
> that print out stacktraces without going through pp_ExeContext. So those would
> end up using a stale value in VG_(ec_curr_ecu).

I was wondering about that. In those cases, we have no idea what ecu applies, so it's hard to say that any particular value is more right or wrong than any other. I was thinking of zeroing VG_(ec_curr_ecu) before pp_ExeContext returns, and then we can skip the ecu checks in debuginfo when VG_(ec_curr_ecu) is zero. But regardless, the odds of returning a wrong value are the same.

> Also, is it a true statement that only 1 thread is ever printing stack traces
> at a time? I don't have a good feel for all the guarantees of valgrind yet. The
> case I am thinking about is two threads doing something bad (e.g. accessing off
> the end of a malloc'ed array) at the same time. They would both print a current
> stack trace (i.e. where the bad access happened) and a past stack trace (i.e.
> where the allocation of said array occurred).

valgrind itself is single-threaded, no matter how many threads it simulates on the target.
(In reply to comment #26)
> (In reply to comment #25)
> > OOOOoooohhhh. I totally misunderstood where you were going in the previous
> > post. This seems to get the job done with very little new code. Unfortunately
> > I suspect it has some nasty side effects. In particular, there are some
> > places that print out stacktraces without going through pp_ExeContext. So
> > those would end up using a stale value in VG_(ec_curr_ecu).
>
> I was wondering about that. In those cases, we have no idea what ecu applies,
> so it's hard to say that any particular value is more right or wrong than any
> other. I was thinking of zeroing VG_(ec_curr_ecu) before pp_ExeContext
> returns, and then we can skip the ecu checks in debuginfo when
> VG_(ec_curr_ecu) is zero. But regardless, the odds of returning a wrong value
> are the same.

Having looked at it some more, all of the calls to pp_StackTrace are just using a "current" stack on a current thread. There's probably no reason to examine the unloadedDebugInfo in these cases. Flagging this case by zeroing VG_(ec_curr_ecu) would be fine. (I guess it would be possible for a function to cause its caller to get munmap'd. Good way to crash yourself.)
Created attachment 57625 [details] Revised ecu tracking patch Added changes from comment #27
Created attachment 57651 [details]
Cleaned up version of patch to keep syms of unloaded code

This contains all the latest work; it should properly decode all symbols, even when multiple objects get loaded at the same AVMA. Some improved testing is probably worthwhile. This version passes all the same tests (make regtest) that a clean checkout of trunk does for me. I get 6 failures on both:

  memcheck/tests/linux/stack_switch (stderr)
  helgrind/tests/pth_barrier3 (stderr)
  helgrind/tests/tc06_two_races_xml (stderr)
  drd/tests/pth_barrier3 (stderr)
  drd/tests/tc09_bad_unlock (stderr)
  exp-ptrcheck/tests/bad_percentify (stderr)

I get very similar numbers (to within .1 sec) on "perl perf/vg_perf perf/" for clean trunk vs this patch, even when I tweaked the patch to keep symbols by default.
(In reply to comment #29)
> Created an attachment (id=57651) [details]
> Cleaned up version of patch to keep syms of unloaded code
>
> This contains all the latest work, it should properly decode all symbols, even
> when multiple objects get loaded at the same AVMA. Some improved testing is
> probably worth while. This version passes all the same tests (make regtest)
> that a clean checkout of trunk does for me.

Probably not significant, but why are you using <= instead of < here? Just curious for the reasoning.

+  for (di = list; di != NULL; di = di->next) {
+     if ( ( VG_(ec_curr_ecu) <= di->first_ecu) ||
+          VG_(ec_curr_ecu) > di->last_ecu) {
(In reply to comment #30)
> Probably not significant, but why are you using <= instead of < here? Just
> curious for the reasoning.
>
> +  for (di = list; di != NULL; di = di->next) {
> +     if ( ( VG_(ec_curr_ecu) <= di->first_ecu) ||
> +          VG_(ec_curr_ecu) > di->last_ecu) {

Assuming I understood the ExeContext code correctly, first_ecu actually contains the last ecu that was allocated before this debuginfo was created, and last_ecu contains the last ecu that was allocated before this debuginfo was removed (potentially the same value as first_ecu, if no ExeContexts were allocated between load and unload).
(In reply to comment #31)
> Assuming I understood the ExeContext code correctly, first_ecu actually
> contains the last ecu that was allocated before this debuginfo was created,
> and last_ecu contains the last ecu that was allocated before this debuginfo
> was removed

OK, makes sense.

> (potentially the same value as first_ecu, if no ExeContexts were
> allocated between load and unload).

Which raises another interesting point. If first_ecu and last_ecu are the same at the time of unmap, then we know that this debuginfo is totally unused and could be freed instead of saved.
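The half-open semantics discussed above can be made concrete with a small sketch (field names follow the discussion; this is not the patch itself): since first_ecu is the last ecu issued before the object loaded and last_ecu the last one issued before it unloaded, an ExeContext belongs to the object exactly when first_ecu < ecu <= last_ecu, and equal endpoints mean no ExeContext ever referenced the object.

```c
/* Sketch of the ecu-interval match rule from this discussion.
 * first_ecu: last ecu allocated before the debuginfo was created.
 * last_ecu:  last ecu allocated before the debuginfo was removed. */
typedef unsigned UInt;

/* An ExeContext with this ecu was created while the object was loaded. */
int ecu_matches(UInt ecu, UInt first_ecu, UInt last_ecu)
{
    return ecu > first_ecu && ecu <= last_ecu;
}

/* If no ecu falls strictly inside the interval, no ExeContext can ever
 * reference this debuginfo, so it can be freed instead of archived. */
int debuginfo_unused(UInt first_ecu, UInt last_ecu)
{
    return first_ecu == last_ecu;
}
```

This also answers the "<= vs <" question: because first_ecu was allocated before the load, an ecu equal to first_ecu must be excluded, which is why the patch's filter skips the debuginfo when VG_(ec_curr_ecu) <= di->first_ecu.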
Comment on attachment 57625 [details] Revised ecu tracking patch Merged into attachment #57651 [details]
Created attachment 58003 [details] keep syms of unloaded code - less dirty hackish Finally got through adding all the new functions with new function signatures. This patch removes the dirty hack in the previous patch where global vars were used to avoid having extra function arguments.
(In reply to comment #34)
> Created an attachment (id=58003) [details]
> keep syms of unloaded code - less dirty hackish
>
> Finally got through adding all the new functions with new function signatures.
> This patch removes the dirty hack in the previous patch where global vars were
> used to avoid having extra function arguments.

The comments added in this chunk are no longer relevant, just drop them.

>Index: coregrind/m_execontext.c
>===================================================================
>--- coregrind/m_execontext.c (revision 11643)
>+++ coregrind/m_execontext.c (working copy)
>@@ -168,8 +169,11 @@
>
> /* Print an ExeContext. */
> void VG_(pp_ExeContext) ( ExeContext* ec )
>-{
>-   VG_(pp_StackTrace)( ec->ips, ec->n_ips );
>+{ /* Set the current ecu to decode, this hack avoids changing a number of
>+     function signatures in order to pass this info all the way down to
>+     debuginfo */
>+   VG_(pp_StackTrace_with_ecu)( ec->ecu, ec->ips, ec->n_ips );
>+   /* And reset to decoding current stack traces */
> }
*** Bug 281449 has been marked as a duplicate of this bug. ***
Any more action on this bug? Why is its status still "Unconfirmed" when the issue is obviously real, and we have a working patch for it?
We don't generally bother with the NEW/ASSIGNED states very much, so don't read anything into the UNCONFIRMED. I've set it to NEW now though if that makes you happier. As I have made clear I do not consider the current patch suitable, so I won't be merging it. If somebody else with commit access wants to then that is up to them.
(In reply to comment #38)
> We don't generally bother with the NEW/ASSIGNED states very much, so don't
> read anything into the UNCONFIRMED. I've set it to NEW now though if that
> makes you happier.

OK.

> As I have made clear I do not consider the current patch suitable, so I won't
> be merging it. If somebody else with commit access wants to then that is up to
> them.

Looking back, you expressed objections in comment #16 and comment #19. Comment #19 is a non-issue; the patch tracks all addresses unambiguously. That only leaves the question of unbounded memory growth in comment #16. I don't believe that's a reasonable complaint. If a program does a continuous sequence of mallocs and frees, valgrind's tracking of those events must also be unbounded. What's the difference between these two cases?
Sorry, I hadn't realised there was a version of the patch that handled the multiple objects at one address problem. I'll try and review that when I get a chance.
Hi, Is there any progress on this issue? I would like to have this supported in Valgrind. Thanks
*** Bug 291380 has been marked as a duplicate of this bug. ***
Hi! The patch seems to be based on an old baseline (it fails to apply). Would it be possible to get it for 3.7.0? That would be really appreciated. Kind Regards, Kaspars
Created attachment 68205 [details] Refreshed patch against svn r12356 Obsoletes attachment id 58003.
Thanks! There is a missing "{" in m_debuginfo/debuginfo.c:1656. But I'm quite sure it's not working, I still don't get debuginfo:

==20034== 123 bytes in 1 blocks are definitely lost in loss record 3 of 3
==20034==    at 0x40281BB: operator new[](unsigned int) (vg_replace_malloc.c:343)
==20034==    by 0x4034319: ???
==20034==    by 0x40347C6: ???
==20034==    by 0x8064695: main (in /home/kaspars/Projects/device/bin/cfw)

P.S. If I leave the library open it will have the traces as needed:

==20081== 123 bytes in 1 blocks are definitely lost in loss record 6 of 8
==20081==    at 0x40281BB: operator new[](unsigned int) (vg_replace_malloc.c:343)
==20081==    by 0x4034319: function() (in /home/kaspars/Projects/device/bin/libcfw_example.so)
==20081==    by 0x40347C6: example::start() (in /home/kaspars/Projects/device/bin/libcfw_example.so)
==20081==    by 0x8064695: main (in /home/kaspars/Projects/device/bin/cfw)
Created attachment 68213 [details] Refreshed patch against svn r12356 Fixed typo in previous copy
(In reply to comment #45)
> Thanks! There is a missing "{" in m_debuginfo/debuginfo.c:1656

Sorry about that. Fixed in updated attachment.

> But I'm quite sure it's not working, I still don't get debuginfo

It isn't enabled by default; you must use the --keep-unloaded-syms=yes option.
Great! It really does work with all the line numbers and everything :)

Thanks a lot! /Kaspars

==30057== 123 bytes in 1 blocks are definitely lost in loss record 3 of 3
==30057==    at 0x40281BB: operator new[](unsigned int) (vg_replace_malloc.c:343)
==30057==    by 0x4034319: function() (example.cpp:8)
==30057==    by 0x40347C6: example::start() (example.cpp:24)
==30057==    by 0x8064695: main (core.cpp:76)
(In reply to comment #48)
> Great! It really does work with all the line numbers and everything :)
>
> Thanks a lot! /Kaspars

Yes, it works perfectly well. I just hope someone commits it before the patch bitrots again.

> ==30057== 123 bytes in 1 blocks are definitely lost in loss record 3 of 3
> ==30057==    at 0x40281BB: operator new[](unsigned int) (vg_replace_malloc.c:343)
> ==30057==    by 0x4034319: function() (example.cpp:8)
> ==30057==    by 0x40347C6: example::start() (example.cpp:24)
> ==30057==    by 0x8064695: main (core.cpp:76)
Created attachment 68609 [details]
refresh patch against r12368

This patch should work against r12368; it includes testing-related files that were missing in the previous refresh. I am still very interested in getting this merged into the mainline. I am very willing to put more work into it, but at the moment I don't know what to do to push it forward. Any feedback would be greatly appreciated.
Are there any plans to include this patch in Valgrind?
8 years.. a working patch.. what is the problem?
Still very interested to see the patch be accepted..... Please please please :)
Created attachment 75779 [details] Refresh patch against r13174. In my testing (x86-64 linux) this patch does not introduce any more failures, and it passes newly introduced tests.
Thank you very much, Kevin Fitch, for this patch. I sincerely hope that it will be accepted into the mainline of valgrind someday.

I did find an issue with it, however: it doesn't seem to work with --gen-suppressions, at least for memory leaks. All the stack frames in the .so are reported in the error report, but the generated suppression block just has "obj:*" for each of them. This of course could easily suppress completely different leaks (in completely different libraries, even) in addition to the one that triggered the report.

I haven't yet tried manually creating a suppressions file with the real symbols in it as reported by --keep-unloaded-syms=yes. Hopefully, if that does work, then expanding your patch to also put those symbols into --gen-suppressions output will be straightforward. If that doesn't work, though, I imagine it gets a little hairier.
*** Bug 351536 has been marked as a duplicate of this bug. ***
Tom Hughes wrote:
> *** Bug 351536 has been marked as a duplicate of this bug. ***

When are you actually going to merge this fix? The bug has been open for over a decade and a clean fix has been available for nearly 5 years.
Please fix this? It causes significant pain for many people using valgrind.
(In reply to Howard Chu from comment #57)
> a clean fix has been available for nearly 5 years.

I'm trying to understand the proposed fix. What I've gathered is:

(1) Each ExeContext has a 32-bit unique ID.

(2) m_execontext.c issues unique IDs in a monotonically increasing sequence, from ec_next_ecu.

(3) The patch extends struct DebugInfo so as to record the unique ID (2) at both the time the DebugInfo came into being and the point at which the associated .so was unmapped.

(4) DebugInfos for objects that have been unmapped are moved into their own linked list (unloadedDebugInfo_list).

(5) When symbolicating an ExeContext, I assume that you use its unique to identify which set of DebugInfos in unloadedDebugInfo_list can participate in its symbolisation (viz, should be used to look up symbol names etc).

This seems somewhat plausible. But what I don't understand is how this interacts with the fact that m_execontext doesn't necessarily issue a new ExeContext with a new unique every time a stack unwind is requested. Instead, it unwinds the stack, then checks to find if it already has an ExeContext with that stack. If so it merely gives out a pointer to the existing version rather than creating a new one.

I'm not saying the patch is wrong. I am saying I don't understand it. Was the m_execontext duplicate-stack-removal behaviour taken into account in the design of the fix?

AFAICS, the only thing that you can conclude from the fact that execontext1.ecu > execontext2.ecu is that execontext1's stack was observed for the first time before execontext2's stack was observed for the first time. And that the two stacks are different, since the .ecus would be the same if the stacks were the same.
(In reply to Julian Seward from comment #59) > AFAICS, the only thing that you can conclude from the fact that > > execontext1.ecu > execontext2.ecu > > is that execontext1's stack was observed for the first time before > execontext2's stack Duh. I meant "after", not "before".
Julian Seward wrote:
> I'm not saying the patch is wrong. I am saying I don't understand it. Was
> the m_execontext duplicate-stack-removal behaviour taken into account in
> the design of the fix?

It's been so long, I don't remember the fine details any more. But I think it's safe to say that, most of the time, if a shared object has been unloaded and a different one has been loaded, even if it is mapped at the same base address as the previous object, it is unlikely to have the same sizes and ordering of functions as the previous object, and thus unlikely to produce duplicates of any existing stack traces.
Created attachment 106984 [details]
epochs-3.diff

Here's a patch against 3.13.0. It works only for the tools "none" and "memcheck" -- currently it disables building of all other tools. It should be able to generate suppressions for unloaded libraries too, although I haven't tested that.

This is a from-scratch implementation. I couldn't convince myself that the use of ExeContext uniques was reliable, and the more I looked into it, the more doubts I had.

This patch adds two new types:

* DiEpoch, which is just a wrapper around a 32-bit unsigned integer. These mark points in time where debuginfo is loaded or unloaded.

* ExeContextAndEpoch. This is a pairing of an ExeContext (an array of code addresses that is a stack trace) and the DiEpoch in which it should be interpreted. The core idea is that an ExeContext is still a stack trace, but one that can't be symbolised. To symbolise it, we need to specify a DiEpoch too, hence this pairing. In many places, the patch replaces uses of ExeContext* with ExeContextAndEpoch.

Each DebugInfo structure acquires two new DiEpoch fields which indicate the first and last epochs for which it is valid. There is still only one top-level list of DebugInfo structures, which holds both the "active" and "archived" DebugInfos. They are distinguished by having the last-epoch field set to DiEpoch_INVALID() for an active structure, and to a valid value for an archived structure.

From a user's point of view, the default behaviour of the patch is still to discard debuginfo -- at least for now. To retain it, use --keep-debuginfo=yes. You can see the difference if you run with -v. Then you will see messages of the form

   Discarding syms at [..] due to munmap()

with --keep-debuginfo=no, but

   Archiving syms at [..] due to munmap()

with --keep-debuginfo=yes.

Two things that I don't know if I kludged correctly or not are the GDB server support and the XTree support.
In both cases these just symbolise against the current epoch, when it is necessary to do so. So their behaviour should at least be unchanged from before, even if they are not yet "epoch-aware".
I quickly read the patch; here are a few comments.

* For what concerns the gdbserver code and the xtree code: for sure, using the current epoch will be as good (or as bad) as the current behaviour. But once the patch is committed, I will update the code to use a real epoch where it is better (basically, I think xtree should store an epoch, while gdbserver should normally be good enough with the current epoch). But see below the idea to store the epoch inside the ExeContext, which should probably solve the xtree case.

* There are some tl_assert occurrences in coregrind, where vg_assert should be used instead.

* For the --help text of --keep-debuginfo=...: as I understand it, this is not limited to memory leak stack traces, but applies to all stack traces recorded as an ExeContext. E.g., IIUC, this is also valid for other types of errors. Same for the user manual.

* For ExeContext: instead of just storing the ExeContext (e.g. in an MC_Chunk allocated by memcheck for each block), we now store an ExeContext + epoch. An alternative approach that would reduce the memory would be to store the epoch of an ExeContext in the ExeContext itself (by default, the invalid epoch, indicating 'ExeContext valid in the current set of DIs'). Then, when a DI is archived, scan the ExeContexts and check whether an ExeContext contains some addresses that are in the debuginfo being archived. If yes, change the ExeContext epoch from invalid to the current epoch (before unloading the DI). This should reduce the impact on all ExeContext users, both in terms of code changes and in terms of memory increase where an ExeContext has to be stored. This approach spares memory (and I think has less impact on tools), but each unload of a DI means scanning the ExeContexts. With this approach, when searching for an ExeContext that corresponds to a (fresh) stack trace, only ExeContexts that have an invalid epoch should be returned (as a fresh stack trace cannot match an ExeContext that has been marked with a real epoch).
This should also avoid the problem of two errors being considered the same while in reality their ExeContexts are different, as a new error will never be given back an 'old' ExeContext with a non-invalid epoch. Similarly, IIUC, origin tracking will store an ec, not an ec + epoch. Storing the epoch inside the ExeContext will also provide the correct DI epoch for origin tracking.
Created attachment 107001 [details]
epochs-4.diff

Now builds for all tools again, although only Memcheck is properly "epoch-aware". All other tools just use the current epoch when needed, so their behaviour should be unchanged relative to pre-patch.
(In reply to Philippe Waroquiers from comment #63)

Thanks for the feedback. I'll fix the small stuff.

> * For execontext, instead of just storing the execcontext
>   (e.g. in an mc chunk allocated by memecheck for each block), we now store
>   an execontext + epoch.

I did consider this design, in some detail, but decided not to do that. Instead I kept the semantics of ExeContext unchanged and added a new type, ExeContextAndEpoch.

The problem is that we would lose the ability to do pointer comparisons on ExeContext*s. Currently, if two ExeContext*s (the addresses) are different then we are sure that the vectors of addresses are different (and vice versa), because m_execontext.c carefully de-duplicates them. And I was unsure what to do about maintaining those semantics if we add an epoch to ExeContext.

That said, fixing XTree, the GDB server and all the other tools (if it is possible) would be a good thing.
(In reply to Julian Seward from comment #66)
> I did consider this design, in some detail, but decided not to do that.
> Instead I kept semantics of ExeContext unchanged and added a new type
> ExeContextAndEpoch.
>
> The problem is that we would lose the ability to do pointer-comparisons
> on ExeContext*s.

What I suggest still allows pointer comparison (or ecu comparison) to fully identify the epoch + addresses: m_execontext.c would de-duplicate the (epoch + addresses). So, even if after an unmap and a remap of some other lib a bunch of addresses happened (by lack of chance) to be the same, they would differ by their epoch, and so would give another execontext, differing only by the epoch.

> Currently, if two ExeContext*s (the addresses) are
> different then we are sure that the vectors of addresses are different
> (and vice versa), because m_execontext.c carefully de-duplicates them.
> And I was unsure what to do about maintaining those semantics if we add
> an epoch to ExeContext.

What I suggest is to consider the epoch part of the execontext itself, so that the same set of addresses with another epoch gives another execontext. To ensure this, the idea is to store the 'invalid epoch' in an execontext when adding it. When a debug info is unloaded, all execontexts with an invalid epoch that have an address inside the unloaded debug info are marked with the current epoch. With that idea, an execontext reference (either a ptr or an ecu) will uniquely identify a set of addresses and its epoch. That decreases the memory use (e.g. for memcheck in an MC_Chunk) and helps origin tracking (which can only store an ecu).
> That said, fixing XTree, the GDB server and all the other tools (if it
> is possible) would be a good thing.
Created attachment 107003 [details]
epochs-5.diff

Same as epochs-4.diff, but with regression tests from Kevin Fitch's original patch (afaics, comment 15).
Just a quick note: once this functionality is integrated, remove the corresponding entry from the www page http://valgrind.org/help/projects.html:

-----------------------
Preserving debugging information

Currently, Valgrind unloads debugging information for shared objects when they are unloaded with dlclose(). If the shared object has a memory leak, the stack trace for Memcheck's error message at termination will be missing entries, which makes fixing the leak difficult. This is described in bug report #79362.

One way to fix this would be to change things so that Valgrind records source-level information for stack traces, rather than code addresses. It is not entirely clear how to do this in a way that avoids unnecessary debug information lookups, and nor how to avoid increasing the amount of memory used. An implementation would help clarify the problem and show if this approach is feasible.

This project is of intermediate difficulty. (Added August 27, 2005)
-----------------------
Following discussion on irc of the suggestion discussed in comments 63 to 67, I will attach a patch that adds some measurement code for the scanning of the ExeContexts that would be needed at each unload of a .so.

The measurement is done by applying the patch and then doing:

for n in 10 11 12 13 14 15 16 17 18 19 20 25
do
   ./vg-in-place --stats=yes --time-stamp=yes --num-callers=$n ./perf/many-xpts
done |& grep -e 'exe.*scanning'

Detailed results below. Basically (as expected), the time to scan depends on the nr of execontexts and the avg size of each execontext. For a huge data set (260_000 EC, 6_300_000 ips), a scan takes about 0.005 seconds. For a medium-size data set (65_000 EC, 1_100_000 ips), a scan takes about 0.001 seconds. (debian 8, Intel i5-6402P CPU @ 2.80GHz)

So the cost seems reasonable, e.g. 5 seconds of CPU in total for a big application that would do 1000 load/unloads of a .so.

We might possibly avoid this scanning if it were ever needed, e.g. by maintaining a list of DIs valid for a certain epoch, and another data structure that associates the current epoch with past 'compatible' epochs. But that seems more complex than the simple scanning solution at each unload.
--00:00:00:01.298 11342-- exectx: scanning 1000 times 515 contexts/5,133 ips
--00:00:00:01.299 11342-- exectx: finished scanning 515 contexts/5,133 ips
--00:00:00:01.308 11345-- exectx: scanning 1000 times 1,027 contexts/11,277 ips
--00:00:00:01.310 11345-- exectx: finished scanning 1,027 contexts/11,277 ips
--00:00:00:01.341 11348-- exectx: scanning 1000 times 2,051 contexts/24,589 ips
--00:00:00:01.347 11348-- exectx: finished scanning 2,051 contexts/24,589 ips
--00:00:00:01.357 11351-- exectx: scanning 1000 times 4,099 contexts/53,261 ips
--00:00:00:01.381 11351-- exectx: finished scanning 4,099 contexts/53,261 ips
--00:00:00:01.403 11354-- exectx: scanning 1000 times 8,195 contexts/114,701 ips
--00:00:00:01.485 11354-- exectx: finished scanning 8,195 contexts/114,701 ips
--00:00:00:01.420 11357-- exectx: scanning 1000 times 16,387 contexts/245,773 ips
--00:00:00:01.584 11357-- exectx: finished scanning 16,387 contexts/245,773 ips
--00:00:00:01.405 11360-- exectx: scanning 1000 times 32,771 contexts/524,301 ips
--00:00:00:01.787 11360-- exectx: finished scanning 32,771 contexts/524,301 ips
--00:00:00:01.494 11363-- exectx: scanning 1000 times 65,539 contexts/1,114,125 ips
--00:00:00:02.447 11363-- exectx: finished scanning 65,539 contexts/1,114,125 ips
--00:00:00:01.576 11373-- exectx: scanning 1000 times 131,076 contexts/2,359,327 ips
--00:00:00:03.952 11373-- exectx: finished scanning 131,076 contexts/2,359,327 ips
--00:00:00:01.782 11376-- exectx: scanning 1000 times 262,149 contexts/4,980,787 ips
--00:00:00:06.319 11376-- exectx: finished scanning 262,149 contexts/4,980,787 ips
--00:00:00:01.798 11379-- exectx: scanning 1000 times 262,149 contexts/5,242,933 ips
--00:00:00:06.832 11379-- exectx: finished scanning 262,149 contexts/5,242,933 ips
--00:00:00:01.825 11382-- exectx: scanning 1000 times 262,149 contexts/6,291,514 ips
--00:00:00:06.545 11382-- exectx: finished scanning 262,149 contexts/6,291,514 ips
Created attachment 107073 [details]
(hack): patch that adds measurement code to scan the ECs for a .so unload
(In reply to Philippe Waroquiers from comment #71)
> Created attachment 107073 [details]
> (hack) : patch that adds measurement code to scan the EC for a .so unload

+   for (j = 0; j < ec->n_ips; j++) {
+      if (UNLIKELY(ec->ips[j] >= from && ec->ips[j] <= to)) {
+         break;
+      }
+   }

This is a side-effect-free loop whose only computed value (j) is unused, and it provably terminates. I think it's likely that gcc noticed all 3 facts and deleted the loop. Do you get different results if you change it like this?

+   for (j = 0; j < ec->n_ips; j++) {
+      if (UNLIKELY(ec->ips[j] >= from && ec->ips[j] <= to)) {
+         __asm__ __volatile__("":::"cc","memory");
+         break;
+      }
+   }

The (empty) inline assembly claims to make arbitrary changes to memory, and might scare gcc into keeping the loop alive.
(In reply to Julian Seward from comment #72)
> This is a side-effect-free loop whose only computed value (j) is unused,
> and provably terminates. I think it's likely that gcc noticed all 3 facts
> and deleted the loop.

Yes, you are correct. I will attach a new patch which avoids the loop elimination.

With this, the scan is slower: for big applications (260_000 EC, 6_300_000 IP), a scan takes between 0.020 and 0.030 seconds. So an application that does 1000 load/unloads will use around 20 seconds more CPU for the scanning.

Assuming these measurements are now ok, I think this still looks acceptable, as:
* a large majority of users will not use this feature, so we had better reduce the impact of the feature when it is not used (typically in memory);
* for users of the functionality, we had better make --track-origins functionally correct;
* probably not many applications are doing load/unload at a high frequency.

If we then find an application that suffers heavily from this scanning, we can always implement a 'lazy scan': instead of scanning all ECs when a lib is unloaded, we just scan an EC when the EC is symbolised (or when it is 're-acquired' following the capture of a new stack trace). The logic of this 'lazy scanning' would be:

   if EC epoch is not the current epoch then
      scan all DIs that were unloaded between the EC epoch and the current epoch
      if an address of the EC matches one of these DIs then
         mark the EC as archived (i.e. it should not be used for anything
         other than output)
      else
         change the EC epoch to be the current epoch

With this, very few ECs should have to be verified: only the 'active' ECs and the ECs related to an error.
But I think we had better first do the simple approach of scanning, and if a user of --keep-debuginfo complains about performance, optimise by doing the lazy scanning.

--00:00:00:01.288 9148-- exectx: scanning 1000 times 515 contexts/5,133 ips
--00:00:00:01.294 9148-- exectx: finished scanning. Match 4 scanned 5,133,000 515 contexts/5,133 ips
--00:00:00:01.295 9151-- exectx: scanning 1000 times 1,027 contexts/11,277 ips
--00:00:00:01.309 9151-- exectx: finished scanning. Match 4 scanned 11,277,000 1,027 contexts/11,277 ips
--00:00:00:01.308 9154-- exectx: scanning 1000 times 2,051 contexts/24,589 ips
--00:00:00:01.345 9154-- exectx: finished scanning. Match 4 scanned 24,589,000 2,051 contexts/24,589 ips
--00:00:00:01.315 9157-- exectx: scanning 1000 times 4,099 contexts/53,261 ips
--00:00:00:01.417 9157-- exectx: finished scanning. Match 4 scanned 53,261,000 4,099 contexts/53,261 ips
--00:00:00:01.331 9160-- exectx: scanning 1000 times 8,195 contexts/114,701 ips
--00:00:00:01.576 9160-- exectx: finished scanning. Match 4 scanned 114,701,000 8,195 contexts/114,701 ips
--00:00:00:01.346 9163-- exectx: scanning 1000 times 16,387 contexts/245,773 ips
--00:00:00:01.860 9163-- exectx: finished scanning. Match 4 scanned 245,773,000 16,387 contexts/245,773 ips
--00:00:00:01.381 9166-- exectx: scanning 1000 times 32,771 contexts/524,301 ips
--00:00:00:02.576 9166-- exectx: finished scanning. Match 4 scanned 524,301,000 32,771 contexts/524,301 ips
--00:00:00:01.441 9169-- exectx: scanning 1000 times 65,539 contexts/1,114,125 ips
--00:00:00:05.809 9169-- exectx: finished scanning. Match 4 scanned 1,114,125,000 65,539 contexts/1,114,125 ips
--00:00:00:01.559 9172-- exectx: scanning 1000 times 131,076 contexts/2,359,327 ips
--00:00:00:15.639 9172-- exectx: finished scanning. Match 4 scanned 2,359,327,000 131,076 contexts/2,359,327 ips
--00:00:00:01.777 9175-- exectx: scanning 1000 times 262,149 contexts/4,980,787 ips
--00:00:00:28.796 9175-- exectx: finished scanning.
Match 4 scanned 4,980,787,000 262,149 contexts/4,980,787 ips
--00:00:00:01.792 9178-- exectx: scanning 1000 times 262,149 contexts/5,242,933 ips
--00:00:00:32.021 9178-- exectx: finished scanning. Match 4 scanned 5,242,933,000 262,149 contexts/5,242,933 ips
--00:00:00:01.821 9182-- exectx: scanning 1000 times 262,149 contexts/6,291,514 ips
--00:00:00:21.031 9182-- exectx: finished scanning. Match 524296 scanned 6,291,514,000 262,149 contexts/6,291,514 ips

Note that I am slightly amazed that the last run is faster than the two previous ones. I cannot explain this (I checked, and the Match-es are all due to the 'last ip' below main, which has a small value). So the last run scans the same nr of ECs, but about 1_000_000 more IPs (1000 times), and is faster???
Created attachment 107129 [details]
measure the time to scan the execontexts, but do not let gcc remove the loop
(In reply to Philippe Waroquiers from comment #73)

Thank you for redoing the measurements.

> With this, the scan is slower : for big applications (260_000 EC,
> 6_300_000 IP), a scan takes between 0.020 to 0.030 seconds.

That is about 10-15 cycles/IP, which sounds more realistic.

> Note that I am slightly amazed by the fact that the last run is faster
> than the 2 previous one.

Who knows. Maybe it relates to the layout of the workloads in memory? (In other words, is it a caching effect?) Anyway, not a problem, I think.

> Assuming these measurements are now ok; I think this still looks
> acceptable,

I agree.

> But I think we better do first the simple approach of
> scanning, and if a user of --keep-debuginfo complains
> about performance, optimise by doing the lazy scanning.

I agree here too. So the only question now is how to actually get the work done :-) Do you have any time/enthusiasm to hack on this?
(In reply to Julian Seward from comment #75)
> So the only question now is how to actually get the work done :-)
> Do you have any time/enthusiasm to hack on this?

Enthusiasm, yes. Time (in the short term), not much. I have added this to my list of things to do, and should be able to look at it sometime in September or October (I really need to finalise the helgrind unwind optimisation first, which has been lingering since FOSDEM).
Landed, with major reworking by PhilippeW. cceed053ce876560b9a7512125dd93c7fa059778
Fixed now, after almost 14 years :)
Pushed a followup fix at f8ae2f95d6d717aa6d3923635b9f6f87af9b7cf1. This fixes a problem spotted by Matthias Schwarzott, in which dlopening an object that was previously dlopened and then dlclosed caused an assertion failure in check_CFSI_related_invariants().