Bug 79362 - Debug info is lost for .so files when they are dlclose'd
Summary: Debug info is lost for .so files when they are dlclose'd
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: general (show other bugs)
Version: 2.1 CVS
Platform: unspecified Linux
: NOR wishlist
Target Milestone: ---
Assignee: Julian Seward
URL:
Keywords:
: 79875 84541 281449 291380 351536 (view as bug list)
Depends on:
Blocks:
 
Reported: 2004-04-10 03:07 UTC by Nicholas Nethercote
Modified: 2018-01-15 10:35 UTC (History)
21 users (show)

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:


Attachments
so-thread-test.tar.gz (3.42 KB, application/octet-stream)
2009-05-05 21:40 UTC, Frank Pilhofer
Details
so-thread-test.tar.gz (3.43 KB, application/octet-stream)
2009-05-05 21:47 UTC, Frank Pilhofer
Details
potential patch to allow memory leak stack traces for unloaded .so files (7.35 KB, patch)
2011-01-04 16:27 UTC, Kevin Fitch
Details
update patch with regression tests (12.38 KB, patch)
2011-01-07 05:08 UTC, Kevin Fitch
Details
Create new CLO for keeping unloaded symbols (15.98 KB, patch)
2011-02-14 18:50 UTC, Kevin Fitch
Details
Addition for ecu tracking (3.35 KB, patch)
2011-02-28 05:57 UTC, Howard Chu
Details
Revised ecu tracking patch (3.46 KB, patch)
2011-03-02 22:34 UTC, Howard Chu
Details
Cleaned up version of patch to keep syms of unloaded code (24.41 KB, patch)
2011-03-03 20:13 UTC, Kevin Fitch
Details
keep syms of unloaded code - less dirty hackish (33.99 KB, patch)
2011-03-14 21:48 UTC, Kevin Fitch
Details
Refreshed patch against svn r12356 (28.10 KB, patch)
2012-01-26 19:48 UTC, Howard Chu
Details
Refreshed patch against svn r12356 (28.06 KB, patch)
2012-01-26 23:22 UTC, Howard Chu
Details
refresh patch again r12368 (32.96 KB, patch)
2012-02-08 04:59 UTC, Kevin Fitch
Details
Refresh patch against r13174. (34.52 KB, patch)
2012-12-11 05:22 UTC, Kevin Fitch
Details
epochs-3.diff (140.60 KB, patch)
2017-07-31 14:09 UTC, Julian Seward
Details
epochs-4.diff (182.82 KB, patch)
2017-08-01 06:41 UTC, Julian Seward
Details
epochs-5.diff (189.79 KB, patch)
2017-08-01 07:39 UTC, Julian Seward
Details
(hack) : patch that adds measurement code to scan the EC for a .so unload (1.05 KB, text/plain)
2017-08-04 13:06 UTC, Philippe Waroquiers
Details
measure time to scan the execcontext, but do not let gcc remove the loop (1.24 KB, text/plain)
2017-08-07 19:30 UTC, Philippe Waroquiers
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Nicholas Nethercote 2004-04-10 03:07:37 UTC
Debug info is lost for .so files when they are dlclose'd.  This makes some
errors much less informative than they should be.  Its mostly a problem for
errors that are detected and reported long after the relevant code has executed,
particularly
for memory leaks -- if the allocating code was in a dlclose'd .so, the trace won't
identify the allocation point.
Comment 1 Tom Hughes 2004-06-13 16:26:45 UTC
*** Bug 79875 has been marked as a duplicate of this bug. ***
Comment 2 Nicholas Nethercote 2004-07-06 10:39:38 UTC
*** Bug 84541 has been marked as a duplicate of this bug. ***
Comment 3 Nicholas Nethercote 2004-07-06 10:46:25 UTC
The problem ultimately comes from the fact that ExeContexts (ie. stack traces)
use an imperfect way of identifying code locations -- they use the memory address where the instruction of note is stored.  This uniquely identifies the instruction while it's in memory, but is no good once it's out of memory (ie. unloaded).

Better would be to use an object code location (eg. object file name + offset)
or even a source code location (eg. source file name + function + line number;  if these aren't known fall back on "???").  Esp. with source file locations, if the lookup is done immediately the debug info will be present and fresh and all these problems would disappear.

The problem with these is that they don't have the single best property of the memory-addresses -- which are extremely compact and easy to get, ie. a single pointer represents the code location.  By comparison, source code locations require looking up debug info (which is wasteful if some of the ExeContexts are not subsequently printed) and storing strings.

So it's a matter of coming up with an efficient way to represent unique code locations.

Comment 4 Howard Chu 2004-07-06 11:08:40 UTC
Nicholas Nethercote wrote:

> ------- Additional Comments From njn25 cam ac uk  2004-07-06 10:46 -------
> The problem ultimately comes from the fact that ExeContexts (ie. stack traces)
> use an imperfect way of identifying code locations -- they use the memory address where the instruction of note is stored.  This uniquely identifies the instruction while it's in memory, but is no good once it's out of memory (ie. unloaded).
> 
> Better would be to use an object code location (eg. object file name + offset)
> or even a source code location (eg. source file name + function + line number;  if these aren't known fall back on "???").  Esp. with source file locations, if the lookup is done immediately the debug info will be present and fresh and all these problems would disappear.

I saw this suggestion on the -developer list. I think it's already been 
shown to be not a great idea, you'll introduce a lot of redundant 
symbol-table searching by doing that, and with the fallback required, 
then you've just introduced a special case to your data format.

> The problem with these is that they don't have the single best property of the memory-addresses -- which are extremely compact and easy to get, ie. a single pointer represents the code location.  By comparison, source code locations require looking up debug info (which is wasteful if some of the ExeContexts are not subsequently printed) and storing strings.

> So it's a matter of coming up with an efficient way to represent unique code locations.

Please consider my suggestion from 84541. It only adds one element 
(timestamp or sequence number, sequence number is probably better) to 
the current context information. Then add a handler to intercept calls 
to _dl_debug_state() to update valgrind's notion of what objects are 
mapped in memory, and maintain a list of all the objects that have been 
added and removed, recording the current sequence number at load and 
unload time. Parsing the ELF linker map is pretty easy, I wrote similar 
code for FunctionCheck to do the same thing. You need to record the 
object name, the base address and size, and the start and end times.

When it comes time to display the symbols, you may need to reread the 
symbol tables of the unloaded objects but I think you already have code 
to do that.
Comment 5 Nicholas Nethercote 2004-07-06 11:15:26 UTC
> I think it's already been shown to be not a great idea

Where was this shown?

Comment 6 Howard Chu 2004-07-06 11:47:20 UTC
Nicholas Nethercote wrote:
------- Additional Comments From njn25 cam ac uk  2004-07-06 11:15 -------
>>I think it's already been shown to be not a great idea

> Where was this shown?

As pointed out here
http://sourceforge.net/mailarchive/message.php?msg_id=8864868

object_file + offset doesn't work because ELF files aren't that simple. 
I guess you can make the assumption that there's only one Text segment 
of interest in any object file, and that only Text is interesting in a 
stack trace. But functions like VG_(describe_addr) may need to look at 
the Data or BSS as well.

Mapping immediately to file/fn/line loses information that is often 
essential to diagnosing an error. Simple things like multi-line 
preprocessor macros are impossible to debug with just a single source 
line. (They're a pain to debug anyway, but with actual addresses it's 
possible to debug in assembler, if nothing else.)

There's also the very odd possibility that a particular module gets 
reloaded multiple times over the life of a process, and due to other 
activity, gets mapped to a different address each time. I've seen this 
happen with Cyrus servers as they load and unload authentication 
mechanisms. I guess your three-level table approach still works here, 
but it'll have redundant strings in this case.

Looking at the problem again from your first statement: you want an 
ExeContext to uniquely identify an instruction. Right now you treat 
memory as a one-dimensional address space, with the assumption that a 
single Address uniquely identifies any instruction. But because this 
space actually can change over time, it is really a two-dimensional 
address space.
Comment 7 Howard Chu 2004-07-06 11:55:15 UTC
...clearly in a 2D space, you cannot uniquely represent anything with only a single coordinate (Address). And using more than two coordinates is overkill. You already pointed out that it's wasteful to do full symbol resolution every time, because the majority of symbols won't ever be displayed.
Comment 8 Brian 2005-09-29 19:53:16 UTC
There may be a work-around for this problem that will be good enough for most applications.  Try using LD_PRELOAD to load the libraries into memory as valgrind is started.  That seems to insure that the symbols are still around when valgrind gets around to doing symbol resolution.

Example:

LD_PRELOAD=/usr/local/lib/libbuggy.so valgrind --leak-check=full /my/app -x -y -z

This isn't a perfect fix for the problem.  If the application is sensitive to the order in which libraries are loaded this can cause more problems than it solves.

Comment 9 Allan Sandfeld 2008-02-11 13:53:50 UTC
Confirmed to still be a bug in valgrind 3.2.3
Comment 10 Frank Pilhofer 2009-05-05 20:05:45 UTC
I just ran into this problem also.  Besides stack traces with lots of ??'s in them coming from memory leaks in modules that were since unloaded, I also had an even trickier issue: one of the shared libraries registered a "destructor" with pthread_key_create().  By the time the thread exited, the shared library was gone.  At least "valgrind -v" allowed me to compare the memory address where the crash occured against the valgrind messages "discarding syms at [...] due to munmap()".

How about the following option, made in ignorance of the valgrind internals: At munmap() time, don't discard the symbols associated with this memory region.  Just preserve the region as is, marking it non-readable, non-writable and non-executable by the program under test.  When another (or the same) shared library is loaded (again), it is assigned a different address.  That way, when an error occurs, there is still a unique mapping from memory address to symbol (if not vice versa).

That would certainly add to the memory usage, especially if many shared libraries are loaded, unloaded and re-loaded, but I think it would be of immense value.  Maybe there could be a command-line option to preserve symbols for unloaded modules or not.
Comment 11 Frank Pilhofer 2009-05-05 21:40:26 UTC
Created attachment 33376 [details]
so-thread-test.tar.gz

Here is a test case for the issue that I experienced.  Tested on linux-x86_64.

The main file implements a simple facility to do work asynchronously in a different thread, like a one-thread thread pool.  It then loads a shared library and runs a job from that shared library.

The job uses pthread_key_create() to arrange for a destructor to be called at thread exit time.  But the life of the thread exceeds the life of the shared library, so when thread exit time comes, the shared library and its destructor are long gone.

In the test.log file at line 80, the location of the crash is reported as "??", and the stack trace doesn't give you much else to go on.  The only way to make some sense of it is to look at line 75, which shows that the address of the crash was within the range of the shared library.

Obviously, I would like valgrind to report the symbol associated with the address ("test_destructor"), possibly with some useful hints like: "This address used to belong to function [xx] in file [yy] which was unloaded."

Yes, the test_job() function is buggy.  But if it wasn't, I wouldn't need valgrind.
Comment 12 Frank Pilhofer 2009-05-05 21:47:43 UTC
Created attachment 33377 [details]
so-thread-test.tar.gz

Sorry, forgot a pthread_join().  Updated test case.
Comment 13 Kevin Fitch 2011-01-04 16:23:44 UTC
I have created a patch that resolves this issue for me. Basically when debug info is going to be discarded I place it in a separate list. I then search both lists in many places. A few improvements might be to tweak things so that function name (or perhaps file names) get decorated with something like *UNLOADED*, to give the user a better idea about what happened. This is only tested this very briefly.

One other approach might be to just move the debuginfo to the end of the list (possibly also marking a new flag in the structure indicating it as discarded, but I was reluctant to do this given that I noticed another function in the file that messes with order.
Comment 14 Kevin Fitch 2011-01-04 16:27:05 UTC
Created attachment 55565 [details]
potential patch to allow memory leak stack traces for unloaded .so files
Comment 15 Kevin Fitch 2011-01-07 05:08:22 UTC
Created attachment 55670 [details]
update patch with regression tests

Same patch as before, but now with +inf% more regression tests.
Comment 16 Tom Hughes 2011-01-10 22:48:45 UTC
How do you decide which set of debuginfo to use to decode a given address if there are multiple possibilities?

The other obvious question is that of possible unbounded memory usage in a long running process that repeatedly opens and closes shared libraries, possibly even opening the same one repeatedly - do you do anything special to avoid storing duplicates for example?
Comment 17 Kevin Fitch 2011-01-10 23:05:52 UTC
Currently I stop at the first match I find. I search the list of currently valid symbols, then the list of symbols that have been unloaded. The unloaded list is order from most recently unloaded to least recent.

The unbounded memory issue does exist. And AFAICT there is no real good solution if we are going to put off resolving leak stack traces until the process ends. I guess when adding to the unloaded symbols list I could delete anything that was completer shadowed by the new addition. I was reluctant to do this since I was thinking that a follow on to this might be to add a 'generation number' to the address. I would then update the gen# when things get dlopen/dlclose'd which would theoretically allow me to find the right symbols even if a new dlopen used the same address space as something that was dlclosed.

That all being said, my (personal) experience with valgrind is that I don't normally use it on long running stuff (in large part b/c it already slows/bloats things) I work hard to make the shortest/smallest case to reproduce the problem. So I don't see the slow 'leak' I am causing as all that problematic.
Comment 18 Kevin Fitch 2011-02-14 18:50:56 UTC
Created attachment 57256 [details]
Create new CLO for keeping unloaded symbols

Since the only real feedback I got was that this change would cause a showstopper for programs that repeatedly dlopen/dlclose, I turned this into a CLO so that the user could opt in to the extra memory usage if they need it.
Comment 19 Tom Hughes 2011-02-15 06:19:55 UTC
Nonsense - as far as I can see you haven't addressed, in any reasonable way, the question of how to resolve stack traces when multiple different ELF objects may have been loaded at a given address over the life of a program.
Comment 20 Kevin Fitch 2011-02-25 18:54:22 UTC
I thought I had responded to this a while back, somehow my post never showed up ... anyway.

I think that you are letting better be the enemy of good-enough

  This patch is not perfect, there are cases where it will print the wrong answer, when multiple objects get mapped to the same AVMA over the life of the program. In my experience (which is by no means comprehensive) this is not a common enough situation to worry about. Having it give the right answer most of the time is a huge win (at least for me). The LD_PRELOAD hack mentioned above only kind-of works for me, and introduces a number of new false positives since dlclose doesn't actually clean up completely (since the object is still loaded). And having the object still loaded causes other issues, e.g. if you dlopen/dlclose the same object twice, static variables normally get reinitialized (on the second dlopen), but will still have stale data in them with LD_PRELOAD.

I have looked into making a patch that would always get the right answer. This means teaching valgrind that addresses (or at least ExeContext's) are not only about where, but also when. Two potential starting points are 1) Extend Addr to include the 'when' 2) Extend ExeContext to include the when. Option (1) would have far reaching implications all over the code base, so I gave up on that one completely. (2) Is a little less invasive, but it does involve changing a number of function signatures (or adding a number of new functions). Everything that is in the path of turning an ExeContext into a printed stack trace.
Comment 21 Howard Chu 2011-02-25 23:17:22 UTC
(In reply to comment #20)
> I have looked into making a patch that would always get the right answer. This
> means teaching valgrind that addresses (or at least ExeContext's) are not only
> about where, but also when. Two potential starting points are 1) Extend Addr to
> include the 'when' 2) Extend ExeContext to include the when. Option (1) would
> have far reaching implications all over the code base, so I gave up on that one
> completely. (2) Is a little less invasive, but it does involve changing a
> number of function signatures (or adding a number of new functions). Everything
> that is in the path of turning an ExeContext into a printed stack trace.

Actually the ExeContexts themselves don't need any changing, they already include a monotonically increasing serial number "ecu".

It might be ugly, but you could export an "ec_current_ecu" as a global to make it available to the debuginfo code, so that you can reference it while registering a symbol table, and again when doing symbol lookups. Then no function signatures need to change.
Comment 22 Kevin Fitch 2011-02-26 05:50:52 UTC
Actually they would still need to change, the full ExeContext does not get passed all the way down, it quickly devolves into a list of Addr's and then each Addr is treated independently at the point that things actually get translated from an address into file/line information. So we would either need to change things so that ExeContext's passed further down the stack, or have the "ecu" (or some other indication of 'when') get passed along.
Comment 23 Howard Chu 2011-02-27 12:00:48 UTC
(In reply to comment #22)
> Actually they would still need to change, the full ExeContext does not get
> passed all the way down, it quickly devolves into a list of Addr's and then
> each Addr is treated independently at the point that things actually get
> translated from an address into file/line information. So we would either need
> to change things so that ExeContext's passed further down the stack, or have
> the "ecu" (or some other indication of 'when') get passed along.

Pretty sure you're overlooking something (although I haven't tried this with your patch yet, will try it in the next day or so). ExeContexts are processed one at a time. All you need to do is set the global ec_current_ecu when you're creating a new ExeContext and when you're processing one (e.g. pp_ExeContext). It doesn't need to get passed along at all.
Comment 24 Howard Chu 2011-02-28 05:57:20 UTC
Created attachment 57564 [details]
Addition for ecu tracking

This is what I had in mind. Add it on top of your latest patch. It doesn't appear to break anything, anyway.
Comment 25 Kevin Fitch 2011-02-28 17:42:05 UTC
OOOOoooohhhh. I totally misunderstood where you were going in the previous post. This seems to get the job done with very little new code. Unfortunately I suspect it has some nasty side effects. In particular, there are some places that print out stacktraces without going through pp_ExeContext. So those would end up using a stale value in VG_(ec_curr_ecu).

Also, is it a true statement that only 1 thread is ever printing stack traces at a time? I don't have a good feel for all the guarantees of valgrind yet. The case I am thinking about is two threads doing something bad (e.g. accessing off the end of a malloc'ed array) at the same time. They would both print a current stack trace (i.e. where the bad access happened) and a past stack trace (i.e. where the allocation of said array occurred).
Comment 26 Howard Chu 2011-03-01 00:12:16 UTC
(In reply to comment #25)
> OOOOoooohhhh. I totally misunderstood where you were going in the previous
> post. This seems to get the job done with very little new code. Unfortunately I
> suspect it has some nasty side effects. In particular, there are some places
> that print out stacktraces without going through pp_ExeContext. So those would
> end up using a stale value in VG_(ec_curr_ecu).

I was wondering about that. In those cases, we have no idea what ecu applies, so it's hard to say that any particular value is more right or wrong than any other. I was thinking of zeroing VG_(ec_curr_ecu) before pp_ExeContext returns, and then we can skip the ecu checks in debuginfo when VG_(ec_curr_ecu) is zero. But regardless, the odds of returning a wrong value are the same.

> Also, is it a true statement that only 1 thread is ever printing stack traces
> at a time? I don't have a good feel for all the guarantees of valgrind yet. The
> case I am thinking about is two threads doing something bad (e.g. accessing off
> the end of a malloc'ed array) at the same time. They would both print a current
> stack trace (i.e. where the bad access happened) and a past stack trace (i.e.
> where the allocation of said array occurred).

valgrind itself is single-threaded, no matter how many threads it simulates on the target.
Comment 27 Howard Chu 2011-03-01 01:56:26 UTC
(In reply to comment #26)
> (In reply to comment #25)
> > OOOOoooohhhh. I totally misunderstood where you were going in the previous
> > post. This seems to get the job done with very little new code. Unfortunately I
> > suspect it has some nasty side effects. In particular, there are some places
> > that print out stacktraces without going through pp_ExeContext. So those would
> > end up using a stale value in VG_(ec_curr_ecu).
> 
> I was wondering about that. In those cases, we have no idea what ecu applies,
> so it's hard to say that any particular value is more right or wrong than any
> other. I was thinking of zeroing VG_(ec_curr_ecu) before pp_ExeContext returns,
> and then we can skip the ecu checks in debuginfo when VG_(ec_curr_ecu) is zero.
> But regardless, the odds of returning a wrong value are the same.

Having looked at it some more, all of the calls to pp_StackTrace are just using a "current" stack on a current thread. There's probably no reason to examine the unloadedDebugInfo in these cases. Flagging this case by zeroing VG_(ec_curr_ecu) would be fine. (I guess it would be possible for a function to cause its caller to get munmap'd. Good way to crash yourself.)
Comment 28 Howard Chu 2011-03-02 22:34:47 UTC
Created attachment 57625 [details]
Revised ecu tracking patch

Added changes from comment #27
Comment 29 Kevin Fitch 2011-03-03 20:13:05 UTC
Created attachment 57651 [details]
Cleaned up version of patch to keep syms of unloaded code

This contains all the latest work, it should properly decode all symbols, even when multiple objects get loaded at the same AVMA. Some improved testing is probably worth while. This version passes all the same tests (make regtest) that a clean checkout of trunk does for me. I get 6 failures on both:
memcheck/tests/linux/stack_switch        (stderr)
helgrind/tests/pth_barrier3              (stderr)
helgrind/tests/tc06_two_races_xml        (stderr)
drd/tests/pth_barrier3                   (stderr)
drd/tests/tc09_bad_unlock                (stderr)
exp-ptrcheck/tests/bad_percentify        (stderr)

I get very similar numbers (to within .1 sec) on "perl perf/vg_perf perf/" for clean trunk vs this patch. Even when I tweaked the patch to keep symbols by default.
Comment 30 Howard Chu 2011-03-04 02:19:06 UTC
(In reply to comment #29)
> Created an attachment (id=57651) [details]
> Cleaned up version of patch to keep syms of unloaded code
> 
> This contains all the latest work, it should properly decode all symbols, even
> when multiple objects get loaded at the same AVMA. Some improved testing is
> probably worth while. This version passes all the same tests (make regtest)
> that a clean checkout of trunk does for me.

Probably not significant, but why are you using <= instead of < here? Just curious for the reasoning.

+   for (di = list; di != NULL; di = di->next) {
+      if ( ( VG_(ec_curr_ecu) <= di->first_ecu) || 
+             VG_(ec_curr_ecu) > di->last_ecu) {
Comment 31 Kevin Fitch 2011-03-04 03:20:34 UTC
(In reply to comment #30)
> Probably not significant, but why are you using <= instead of < here? Just
> curious for the reasoning.
> 
> +   for (di = list; di != NULL; di = di->next) {
> +      if ( ( VG_(ec_curr_ecu) <= di->first_ecu) || 
> +             VG_(ec_curr_ecu) > di->last_ecu) {

Assuming I understood the ExeContext code correctly, first_ecu actually contains the last ecu that was allocated before this debuginfo was created, and last_ecu contains the last ecu that was allocated before this debuginfo was removed (potentially the same value as first_ecu, if no ExeContexts were allocated between load and unload).
Comment 32 Howard Chu 2011-03-04 08:06:44 UTC
(In reply to comment #31)
> (In reply to comment #30)
> > Probably not significant, but why are you using <= instead of < here? Just
> > curious for the reasoning.
> > 
> > +   for (di = list; di != NULL; di = di->next) {
> > +      if ( ( VG_(ec_curr_ecu) <= di->first_ecu) || 
> > +             VG_(ec_curr_ecu) > di->last_ecu) {
> 
> Assuming I understood the ExeContext code correctly, first_ecu actually
> contains the last ecu that was allocated before this debuginfo was created, and
> last_ecu contains the last ecu that was allocated before this debuginfo was
> removed

OK, makes sense.

> (potentially the same value as first_ecu, if no ExeContexts were
> allocated between load and unload).

Which raises another interesting point. If first_ecu and last_ecu are the same at the time of unmap, then we know that this debuginfo is totally unused and could be freed instead of saved.
Comment 33 Howard Chu 2011-03-08 13:24:23 UTC
Comment on attachment 57625 [details]
Revised ecu tracking patch

Merged into attachment #57651 [details]
Comment 34 Kevin Fitch 2011-03-14 21:48:36 UTC
Created attachment 58003 [details]
keep syms of unloaded code - less dirty hackish

Finally got through adding all the new functions with new function signatures. This patch removes the dirty hack in the previous patch where global vars were used to avoid having extra function arguments.
Comment 35 Howard Chu 2011-03-15 03:29:30 UTC
(In reply to comment #34)
> Created an attachment (id=58003) [details]
> keep syms of unloaded code - less dirty hackish
> 
> Finally got through adding all the new functions with new function signatures.
> This patch removes the dirty hack in the previous patch where global vars were
> used to avoid having extra function arguments.

The comments added in this chunk are no longer relevant, just drop them.

>Index: coregrind/m_execontext.c
>===================================================================
>--- coregrind/m_execontext.c	(revision 11643)
>+++ coregrind/m_execontext.c	(working copy)
>@@ -168,8 +169,11 @@
> 
> /* Print an ExeContext. */
> void VG_(pp_ExeContext) ( ExeContext* ec )
>-{
>-   VG_(pp_StackTrace)( ec->ips, ec->n_ips );
>+{  /* Set the current ecu to decode, this hack avoids changing a number of
>+   function signatures in order to pass this info all the way down to
>+   debuginfo */
>+   VG_(pp_StackTrace_with_ecu)( ec->ecu, ec->ips, ec->n_ips );
>+   /* And reset to decoding current stack traces */
> }
>
Comment 36 Tom Hughes 2011-09-06 07:16:29 UTC
*** Bug 281449 has been marked as a duplicate of this bug. ***
Comment 37 Howard Chu 2011-11-07 19:54:09 UTC
Any more action on this bug? Why is its status still "Unconfirmed" when the issue is obviously real, and we have a working patch for it?
Comment 38 Tom Hughes 2011-11-07 20:20:35 UTC
We don't generally bother with the NEW/ASSIGNED states very much, so don't read anything into the UNCONFIRMED. I've set it to NEW now though if that makes you happier.

As I have made clear I do not consider the current patch suitable, so I won't be merging it. If somebody else with commit access wants to then that is up to them.
Comment 39 Howard Chu 2011-11-07 20:34:35 UTC
(In reply to comment #38)
> We don't generally bother with the NEW/ASSIGNED states very much, so don't read
> anything into the UNCONFIRMED. I've set it to NEW now though if that makes you
> happier.

OK.

> As I have made clear I do not consider the current patch suitable, so I won't
> be merging it. If somebody else with commit access wants to then that is up to
> them.

Looking back, you expressed objections in comment #16 and comment #19. Comment #19 is a non-issue, the patch tracks all addresses unambiguously. That only leaves the question of unbounded memory growth in comment #16. I don't believe that's a reasonable complaint. If a program does a continuous sequence of mallocs and frees, valgrind's tracking of those events must also be unbounded. What's the difference between these two cases?
Comment 40 Tom Hughes 2011-11-07 21:04:04 UTC
Sorry, I hadn't realised there was a version of the patch that handled the multiple objects at one address problem. I'll try and review that when I get a chance.
Comment 41 Alexander B 2011-12-08 15:51:02 UTC
Hi,

Is there any progress on this issue? I would like to have this supported in Valgrind.

Thanks
Comment 42 Tom Hughes 2012-01-12 16:10:52 UTC
*** Bug 291380 has been marked as a duplicate of this bug. ***
Comment 43 kaspars.skels 2012-01-26 18:56:05 UTC
Hi! The patch seams to be based on old baseline (fails to apply..)
Would it be possible to get it for 3.7.0? That would be really appreciated.

Kind Regards, Kaspars
Comment 44 Howard Chu 2012-01-26 19:48:00 UTC
Created attachment 68205 [details]
Refreshed patch against svn r12356

Obsoletes attachment id 58003.
Comment 45 kaspars.skels 2012-01-26 23:15:34 UTC
Thanks! There is a missing "{" in m_debuginfo/debuginfo.c:1656
But I'm quite sure it's not working, I still don't get debuginfo

==20034== 123 bytes in 1 blocks are definitely lost in loss record 3 of 3
==20034==    at 0x40281BB: operator new[](unsigned int) (vg_replace_malloc.c:343)
==20034==    by 0x4034319: ???
==20034==    by 0x40347C6: ???
==20034==    by 0x8064695: main (in /home/kaspars/Projects/device/bin/cfw)

P.S. If I leave the library open it will have the traces as needed

==20081== 123 bytes in 1 blocks are definitely lost in loss record 6 of 8
==20081==    at 0x40281BB: operator new[](unsigned int) (vg_replace_malloc.c:343)
==20081==    by 0x4034319: function() (in /home/kaspars/Projects/device/bin/libcfw_example.so)
==20081==    by 0x40347C6: example::start() (in /home/kaspars/Projects/device/bin/libcfw_example.so)
==20081==    by 0x8064695: main (in /home/kaspars/Projects/device/bin/cfw)
Comment 46 Howard Chu 2012-01-26 23:22:53 UTC
Created attachment 68213 [details]
Refreshed patch against svn r12356

Fixed typo in previous copy
Comment 47 Howard Chu 2012-01-26 23:24:21 UTC
(In reply to comment #45)
> Thanks! There is a missing "{" in m_debuginfo/debuginfo.c:1656

Sorry about that. Fixed in updated attachment.

> But I'm quite sure it's not working, I still don't get debuginfo

It isn't enabled by default, you must use the --keep-unloaded-syms=yes option.
Comment 48 kaspars.skels 2012-01-26 23:45:06 UTC
Great! It really does work with all the line numbers and everything :) 

Thanks a lot! /Kaspars



==30057== 123 bytes in 1 blocks are definitely lost in loss record 3 of 3
==30057==    at 0x40281BB: operator new[](unsigned int) (vg_replace_malloc.c:343)
==30057==    by 0x4034319: function() (example.cpp:8)
==30057==    by 0x40347C6: example::start() (example.cpp:24)
==30057==    by 0x8064695: main (core.cpp:76)
Comment 49 Howard Chu 2012-01-26 23:49:50 UTC
(In reply to comment #48)
> Great! It really does work with all the line numbers and everything :) 
> 
> Thanks a lot! /Kaspars

Yes, it works perfectly well. I just hope someone commits it before the patch bitrots again.

> ==30057== 123 bytes in 1 blocks are definitely lost in loss record 3 of 3
> ==30057==    at 0x40281BB: operator new[](unsigned int)
> (vg_replace_malloc.c:343)
> ==30057==    by 0x4034319: function() (example.cpp:8)
> ==30057==    by 0x40347C6: example::start() (example.cpp:24)
> ==30057==    by 0x8064695: main (core.cpp:76)
Comment 50 Kevin Fitch 2012-02-08 04:59:56 UTC
Created attachment 68609 [details]
refresh patch again r12368

This patch should work against r12368, it includes testing related files that were missing in the the previous refresh.

I am still very interested in getting this merged into the mainline. I am very willing to put more work into it. But, at the moment I don't know what to do to push it forward. Any feedback would be greatly appreciated.
Comment 51 Endre Szalai 2012-10-09 13:00:00 UTC
Are there any plans to include this patch in Valgrind?
Comment 52 Magnus 2012-11-05 11:15:05 UTC
8 years.. a working patch.. what is the problem?
Comment 53 Endre Szalai 2012-11-22 12:25:52 UTC
Still very interested to see the patch be accepted..... Please please please :)
Comment 54 Kevin Fitch 2012-12-11 05:22:23 UTC
Created attachment 75779 [details]
Refresh patch against r13174.

In my testing (x86-64 linux) this patch does not introduce any more failures, and it passes newly introduced tests.
Comment 55 Chris Hillery 2013-06-08 07:25:54 UTC
Thank you very much, Kevin Fitch, for this patch. I sincerely hope that it will be accepted into the mainline of valgrind someday.

I did find an issue with it, however - it doesn't seem to work with --gen-suppressions, at least for memory leaks. All the stack frames in the .so are reported in the error report, but the generated suppression block just has "obj:*" for each of them. This of course could easily suppress completely different leaks (in completely different libraries even) in addition to the one that triggered the report.

I haven't yet tried manually creating a suppressions file with the real symbols in it as reported by --keep-unloaded-syms=yes. Hopefully if that does work, then expanding your patch to also put those symbols into --gen-suppressions output will be straightforward. If that doesn't work, though, I imagine it gets a little hairier.
Comment 56 Tom Hughes 2015-08-20 21:38:29 UTC
*** Bug 351536 has been marked as a duplicate of this bug. ***
Comment 57 Howard Chu 2015-08-20 22:02:26 UTC
Tom Hughes wrote:
> https://bugs.kde.org/show_bug.cgi?id=79362
>
> Tom Hughes <tom@compton.nu> changed:

When are you actually going to merge this fix? The bug has been open for over 
a decade and a clean fix has been available for nearly 5 years.


>             What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                   CC|                            |thiago@kde.org
>
> --- Comment #56 from Tom Hughes <tom@compton.nu> ---
> *** Bug 351536 has been marked as a duplicate of this bug. ***
>
Comment 58 Quanah Gibson-Mount 2016-03-10 18:06:19 UTC
Please fix this?  It causes significant pain for many people using valgrind.
Comment 59 Julian Seward 2017-07-25 11:19:17 UTC
(In reply to Howard Chu from comment #57)

> a clean fix has been available for nearly 5 years.

I'm trying to understand the proposed fix.  What I've gathered is:

(1) Each ExeContext has a 32-bit unique ID.

(2) m_execontext.c issues unique IDs in a monotonically increasing sequence,
    from ec_next_ecu.

(3) The patch extends struct DebugInfo so as to record the unique ID (2) at
    both the time the DebugInfo came into being and the point at which the
    associated .so was unmapped.

(4) DebugInfos for objects that have been unmapped are moved into their own
    linked list (unloadedDebugInfo_list)

(5) When symbolicating an ExeContext, I assume that you use its unique to
    identify which set of DebugInfos in unloadedDebugInfo_list can
    participate in its symbolisation (viz, should be used to look up symbol
    names etc).

This seems somewhat plausible.  But what I don't understand is how this
interacts with the fact that m_execontext doesn't necessarily issue a new
ExeContext with a new unique every time a stack unwind is requested.
Instead, it unwinds the stack, then checks to find if it already has an
ExeContext with that stack.  If so it merely gives out a pointer to the
existing version rather than creating a new one.

I'm not saying the patch is wrong.  I am saying I don't understand it.  Was
the m_execontext duplicate-stack-removal behaviour taken into account in the
design of the fix?

AFAICS, the only thing that you can conclude from the fact that

  execontext1.ecu > execontext2.ecu

is that execontext1's stack was observed for the first time before
execontext2's stack was observed for the first time.  And that the
two stacks are different, since the .ecus would be the same if the
stacks were the same.
Comment 60 Julian Seward 2017-07-25 11:33:58 UTC
(In reply to Julian Seward from comment #59)
> AFAICS, the only thing that you can conclude from the fact that
> 
>   execontext1.ecu > execontext2.ecu
> 
> is that execontext1's stack was observed for the first time before
> execontext2's stack

Duh.  I meant "after", not "before".
Comment 61 Howard Chu 2017-07-25 11:57:05 UTC
Julian Seward wrote:
> https://bugs.kde.org/show_bug.cgi?id=79362
> 
> --- Comment #59 from Julian Seward <jseward@acm.org> ---
> (In reply to Howard Chu from comment #57)
> 
>> a clean fix has been available for nearly 5 years.
> 
> I'm trying to understand the proposed fix.  What I've gathered is:
> 
> (1) Each ExeContext has a 32-bit unique ID.
> 
> (2) m_execontext.c issues unique IDs in a monotonically increasing sequence,
>      from ec_next_ecu.
> 
> (3) The patch extends struct DebugInfo so as to record the unique ID (2) at
>      both the time the DebugInfo came into being and the point at which the
>      associated .so was unmapped.
> 
> (4) DebugInfos for objects that have been unmapped are moved into their own
>      linked list (unloadedDebugInfo_list)
> 
> (5) When symbolicating an ExeContext, I assume that you use its unique to
>      identify which set of DebugInfos in unloadedDebugInfo_list can
>      participate in its symbolisation (viz, should be used to look up symbol
>      names etc).
> 
> This seems somewhat plausible.  But what I don't understand is how this
> interacts with the fact that m_execontext doesn't necessarily issue a new
> ExeContext with a new unique every time a stack unwind is requested.
> Instead, it unwinds the stack, then checks to find if it already has an
> ExeContext with that stack.  If so it merely gives out a pointer to the
> existing version rather than creating a new one.
> 
> I'm not saying the patch is wrong.  I am saying I don't understand it.  Was
> the m_execontext duplicate-stack-removal behaviour taken into account in the
> design of the fix?

It's been so long I don't remember the fine details any more. But I think it's 
safe to say, most of the time, if a shared object has been unloaded and a 
different one has been loaded, even if mapped to the same base address as the 
previous object, it is unlikely to have the same sizes and ordering of 
functions as the previous object, and thus unlikely to produce duplicates of 
any existing stack traces.
Comment 62 Julian Seward 2017-07-31 14:09:41 UTC
Created attachment 106984 [details]
epochs-3.diff

Here's a patch against 3.13.0.  It works only for the tools "none" and
"memcheck" -- currently it disables building of all other tools.  It should
be able to generate suppressions for unloaded libraries too, although I
haven't tested that.

This is a from-scratch implementation.  I couldn't convince myself that the
use of ExeContext uniques was reliable and the more I looked into it, the
more doubts I had.  This patch adds two new types:

* DiEpoch, which is just a wrapper around a 32 bit unsigned integer.
  These mark points in time where debuginfo is loaded or unloaded.

* ExeContextAndEpoch.  This is a pairing of an ExeContext (an array of code
  addresses that is a stack trace) and the DiEpoch in which it should be
  interpreted.

The core idea is that an ExeContext is still a stack trace, but one that
can't be symbolised.  To symbolise it, we need to specify a DiEpoch too,
hence this pairing.  In many places, the patch replaces uses of ExeContext*
with ExeContextAndEpoch.

Each DebugInfo structure acquires two new DiEpoch fields which indicate the
first and last epochs for which they are valid.  There is still only one top
level list of DebugInfo structures, which holds both the "active" and
"archived" DebugInfos.  They are distinguished by having the last-epoch
field be set to DiEpoch_INVALID() for an active structure, and to a valid
value to an archived structure.

From a user's point of view, the default behaviour of the patch is still to
discard debuginfo -- at least for now.  To retain it, use
--keep-debuginfo=yes.  You can see the difference if you run with -v.  Then
you will see messages of the form

  Discarding syms at [..] due to munmap()    with --keep-debuginfo=no

but

  Archiving syms at [..] due to munmap()    with --keep-debuginfo=no

Two things that I don't know if I kludged correctly or not, are the GDB
server support and the XTree support.  In both cases these just symbolise
against the current epoch, when it is necessary to do so.  So their
behaviour should at least be unchanged from before, even if they are not yet
"epoch-aware".
Comment 63 Philippe Waroquiers 2017-07-31 21:34:19 UTC
I quickly read the patch, here are a few comments

* For what concerns the gdbserver code and the xtree code: for sure, 
  using the current epoch will be as good (or as bad) as the current
  behaviour. But once the patch is committed, I will update the code
  to use a real epoch where it is better (basically, I think xtree should store
  an epoch, while gdbserver should normally be good enough with current epoch)
  But see below the idea to store epoch inside execontext,which should probably
  solve the xtree case.

* there are some tl_assert occurences in coregrind, where vg_assert should
  be used instead.

* for the --help of  --keep-debuginfo=..  this is not limited to memory leak
  stack traces as I understand, but all stack traces recorded as an execontext.
  E.g. IIUC, this is also valid for other types of errors.
  Same for the user manual.

* For execontext, instead of just storing the execcontext
  (e.g. in an mc chunk allocated by memecheck for each block), we now store
  an execontext + epoch.

  An alternative approach that will reduce the memory would be to store
  the epoch of an execontext in the execcontext itself.
  (by default, invalid epoch, indicating 'execontext valid in the current
  set of di').
  Then when a DI is archived, scan the execontext, and check if an execontext
  contains some addressed that are in the debuginfo being archived.
  If yes, change the execontext epoch from invalid to the current epoch
  (before unloading the DI).
  This should reduce the impact on all execontext users, both in terms
  of code change, and in terms of memory increase where an execontext
  has to be stored.
  This approach spares memory (and I think less impact on tools), but
  each unload of a DI means scanning the execontext.
  With this approach, when searching for an execontext that corresponds to
  a (fresh) stacktrace, only execontext that have an invalid epoch should
  be returned (as a fresh stack trace cannot match an execontext that
  has been marked with a real epoch). This should also avoid the
  problem of having 2 errors being the same, while in reality their 
  execontext are different, as a new error will never give back an 
  'old execontext' with a non invalid epoch.
  Similarly, IIUC, origin tracking will store an ec, not an ec + epoch.
  Storing the epoch inside the execcontext will also provide correct
  DI epoch for track origin.
Comment 64 Julian Seward 2017-08-01 06:41:44 UTC
Created attachment 107001 [details]
epochs-4.diff

Now builds for all tools again, although only Memcheck is properly
"epoch-aware".  All other tools just use the current epoch when needed,
so their behaviour should be unchanged related to pre-patch.
Comment 65 Julian Seward 2017-08-01 06:48:16 UTC
(In reply to Philippe Waroquiers from comment #63)

Thanks for the feedback.  I'll fix the small stuff.

> * For execontext, instead of just storing the execcontext
>   (e.g. in an mc chunk allocated by memecheck for each block), we now store
>   an execontext + epoch.

I did consider this design, in some detail, but decided not to do that.
Instead I kept semantics of ExeContext unchanged and added a new type
ExeContextAndEpoch.

The problem is that we would lose the ability to do pointer-comparisons
on ExeContext*s.  Currently, if two ExeContext*s (the addresses) are
different then we are sure that the vectors of addresses are different
(and vice versa), because m_execontext.c carefully de-duplicates them.
And I was unsure what to do about maintaining those semantics if we add
an epoch to ExeContext.

That said, fixing XTree, the GDB server and all the other tools (if it
is possible) would be a good thing.
Comment 66 Julian Seward 2017-08-01 06:49:42 UTC
(In reply to Philippe Waroquiers from comment #63)

Thanks for the feedback.  I'll fix the small stuff.

> * For execontext, instead of just storing the execcontext
>   (e.g. in an mc chunk allocated by memecheck for each block), we now store
>   an execontext + epoch.

I did consider this design, in some detail, but decided not to do that.
Instead I kept semantics of ExeContext unchanged and added a new type
ExeContextAndEpoch.

The problem is that we would lose the ability to do pointer-comparisons
on ExeContext*s.  Currently, if two ExeContext*s (the addresses) are
different then we are sure that the vectors of addresses are different
(and vice versa), because m_execontext.c carefully de-duplicates them.
And I was unsure what to do about maintaining those semantics if we add
an epoch to ExeContext.

That said, fixing XTree, the GDB server and all the other tools (if it
is possible) would be a good thing.
Comment 67 Philippe Waroquiers 2017-08-01 07:22:31 UTC
(In reply to Julian Seward from comment #66)
> (In reply to Philippe Waroquiers from comment #63)
> 
> Thanks for the feedback.  I'll fix the small stuff.
> 
> > * For execontext, instead of just storing the execcontext
> >   (e.g. in an mc chunk allocated by memecheck for each block), we now store
> >   an execontext + epoch.
> 
> I did consider this design, in some detail, but decided not to do that.
> Instead I kept semantics of ExeContext unchanged and added a new type
> ExeContextAndEpoch.
> 
> The problem is that we would lose the ability to do pointer-comparisons
> on ExeContext*s.  
What I suggest still allows to do pointer comparison (or ecu comparison)
and fully identify the epoch+addresses.

m_execontext.c would de-duplicate the (epoch + addresses).
So, even if after unmap and remap of some other lib, a bunch of addresses
would (by lack of chance) be the same, they would differ by their epoch,
and so that would give another execontext, only differing by the epoch.


> Currently, if two ExeContext*s (the addresses) are
> different then we are sure that the vectors of addresses are different
> (and vice versa), because m_execontext.c carefully de-duplicates them.
> And I was unsure what to do about maintaining those semantics if we add
> an epoch to ExeContext.
What I suggest is to consider the epoch part of the execontext itself,
and so, the same set of addresses but another epoch would give another
execontext.

To ensure this, the idea is to store 'invalid epoch' in an execontext
when adding it. When an debug info is  unloaded, all execontext with
an invalid epoch that have an address inside the unloaded debug info
are marked with the current epoch.

With that idea, an execontext reference (so either a ptr or an ecu)
will uniquely identify a set of address and its epoch.
That decreases the memory (e.g. for memcheck in an MC_Chunk)
or in the track origin (which can only store an ecu).

> 
> That said, fixing XTree, the GDB server and all the other tools (if it
> is possible) would be a good thing.
Comment 68 Julian Seward 2017-08-01 07:39:16 UTC
Created attachment 107003 [details]
epochs-5.diff

Same as epochs-4.diff with with regression tests from Kevin Fitch's
original patch (afaics, comment 15).
Comment 69 Ivo Raisr 2017-08-01 13:04:03 UTC
Just a quick note:
Once this functionality is integrated, remove the corresponding entry from www page http://valgrind.org/help/projects.html:

-----------------------
Preserving debugging information
Currently, Valgrind unloads debugging information for shared objects when they are unloaded with dlclose(). If the shared object has a memory leak, the stack trace for Memcheck's error message at termination will be missing entries, which makes fixing the leak difficult. This is described in bug report #79362.

One way to fix this would be to change things so that Valgrind records source-level information for stack traces, rather than code addresses. It is not entirely clear how to do this in a way that avoids unnecessary debug information lookups, and nor how to avoid increasing the amount of memory used. An implementation would help clarify the problem and show if this approach is feasible. This project is of intermediate difficulty. (Added August 27, 2005)
-----------------------
Comment 70 Philippe Waroquiers 2017-08-04 13:04:58 UTC
Following discussion on irc of the suggestion discussed in comment 63 till comment 67,
I will attach a patch that adds some measurement code for the scanning of the EC contexts
that would be needed at each unload of a .so.

The measurement is done by applying the patch, and then doing:

for n in 10 11 12 13 14 15 16 17 18 19 20 25
do
   ./vg-in-place --stats=yes --time-stamp=yes --num-callers=$n ./perf/many-xpts 
done |& grep -e 'exe.*scanning'


Detailed results below. Basically, (as expected), the time to scan depends on the 
nr of execontexts, and the avg size of each execontext.
For a huge data set (260_000 EC, 6_300_000 ips) , a scan takes about 0.005 seconds
A medium size data set (65_000 EC, 1_100_000 ips), a scan takes about 0.001 seconds.
(debian 8, Intel i5-6402P CPU @ 2.80GHz)
So, the cost seems reasonable, e.g. 5 seconds CPU in total for a big application
that would do 1000 load/unload of a .so.

We might possibly avoid this scanning if ever it would be needed, e.g. by maintaining
a list of DI valid for a certain epoch, and another data structure that associates
the current epoch with past 'compatible' epoch.
But that seems more complex than the simple scanning solution at each unload.


--00:00:00:01.298 11342--    exectx: scanning 1000 times 515 contexts/5,133 ips
--00:00:00:01.299 11342--    exectx: finished scanning 515 contexts/5,133 ips
--00:00:00:01.308 11345--    exectx: scanning 1000 times 1,027 contexts/11,277 ips
--00:00:00:01.310 11345--    exectx: finished scanning 1,027 contexts/11,277 ips
--00:00:00:01.341 11348--    exectx: scanning 1000 times 2,051 contexts/24,589 ips
--00:00:00:01.347 11348--    exectx: finished scanning 2,051 contexts/24,589 ips
--00:00:00:01.357 11351--    exectx: scanning 1000 times 4,099 contexts/53,261 ips
--00:00:00:01.381 11351--    exectx: finished scanning 4,099 contexts/53,261 ips
--00:00:00:01.403 11354--    exectx: scanning 1000 times 8,195 contexts/114,701 ips
--00:00:00:01.485 11354--    exectx: finished scanning 8,195 contexts/114,701 ips
--00:00:00:01.420 11357--    exectx: scanning 1000 times 16,387 contexts/245,773 ips
--00:00:00:01.584 11357--    exectx: finished scanning 16,387 contexts/245,773 ips
--00:00:00:01.405 11360--    exectx: scanning 1000 times 32,771 contexts/524,301 ips
--00:00:00:01.787 11360--    exectx: finished scanning 32,771 contexts/524,301 ips
--00:00:00:01.494 11363--    exectx: scanning 1000 times 65,539 contexts/1,114,125 ips
--00:00:00:02.447 11363--    exectx: finished scanning 65,539 contexts/1,114,125 ips
--00:00:00:01.576 11373--    exectx: scanning 1000 times 131,076 contexts/2,359,327 ips
--00:00:00:03.952 11373--    exectx: finished scanning 131,076 contexts/2,359,327 ips
--00:00:00:01.782 11376--    exectx: scanning 1000 times 262,149 contexts/4,980,787 ips
--00:00:00:06.319 11376--    exectx: finished scanning 262,149 contexts/4,980,787 ips
--00:00:00:01.798 11379--    exectx: scanning 1000 times 262,149 contexts/5,242,933 ips
--00:00:00:06.832 11379--    exectx: finished scanning 262,149 contexts/5,242,933 ips
--00:00:00:01.825 11382--    exectx: scanning 1000 times 262,149 contexts/6,291,514 ips
--00:00:00:06.545 11382--    exectx: finished scanning 262,149 contexts/6,291,514 ips
Comment 71 Philippe Waroquiers 2017-08-04 13:06:09 UTC
Created attachment 107073 [details]
(hack) : patch that adds measurement code to scan the EC for a .so unload
Comment 72 Julian Seward 2017-08-04 13:36:36 UTC
(In reply to Philippe Waroquiers from comment #71)
> Created attachment 107073 [details]
> (hack) : patch that adds measurement code to scan the EC for a .so unload

+            for (j = 0; j < ec->n_ips; j++) {
+               if (UNLIKELY(ec->ips[j] >= from && ec->ips[j] <= to)) {
+                  break;
+               }
+            }

This is a side-effect-free loop whose only computed value (j) is unused,
and provably terminates.  I think it's likely that gcc noticed all 3 facts
and deleted the loop.

Do you get different results if you change it like this?

+            for (j = 0; j < ec->n_ips; j++) {
+               if (UNLIKELY(ec->ips[j] >= from && ec->ips[j] <= to)) {
                   __asm__ __volatile__("":::"cc","memory");
+                  break;
+               }
+            }

The (empty) inline assembly claims to make arbitrary changes to memory and
might scare gcc into keeping the loop alive.
Comment 73 Philippe Waroquiers 2017-08-07 19:28:38 UTC
(In reply to Julian Seward from comment #72)
> (In reply to Philippe Waroquiers from comment #71)
> > Created attachment 107073 [details]
> > (hack) : patch that adds measurement code to scan the EC for a .so unload
> 
> +            for (j = 0; j < ec->n_ips; j++) {
> +               if (UNLIKELY(ec->ips[j] >= from && ec->ips[j] <= to)) {
> +                  break;
> +               }
> +            }
> 
> This is a side-effect-free loop whose only computed value (j) is unused,
> and provably terminates.  I think it's likely that gcc noticed all 3 facts
> and deleted the loop.
Yes, you are correct.
I will attach a new patch which acvoids the loop elimination.
With this, the scan is slower : for big applications (260_000 EC,
6_300_000 IP), a scan takes between 0.020 to 0.030 seconds.
So, an application that does 1000 load/unload will use around
20 seconds more cpu for the scanning.

Assuming these measurements are now ok; I think this still looks acceptable,
as:
  * a large majority of the users will not use this feature, so we
    better reduce the impact of this feature when not used
    (typically in memory)
  * for users of the functionality, we better make 
    --track-origins functionally correct
  * probably not many applications are doing load/unload at a
    high frequency.

If then we find an application that suffers heavily from this scannning,
then we can always implement a 'lazy scan': instead of scanning
all EC when a lib is unloaded, we just scan an EC when
EC is symbolised (or when it is 're-acquired' following the capture
of a new stack trace).
The logic of this 'lazy scanning' will be:
   if EC epoch is not the current epoch then
      scan all DI that were unloaded between EC epoch
       and current epoch.
      if an address of the EC matches one of these DI,
         mark the EC as archived (i.e. it should not
             be used for something else than output)
      else
         change EC epoch to be current epoch
With this, very few EC should have to be verified :
   only the 'active EC' and the EC related to an error.

But I think we better do first the simple approach of
scanning, and if a user of --keep-debuginfo complains
about performance, optimise by doing the lazy scanning.


--00:00:00:01.288 9148--    exectx: scanning 1000 times 515 contexts/5,133 ips
--00:00:00:01.294 9148--    exectx: finished scanning. Match 4 scanned 5,133,000  515 contexts/5,133 ips
--00:00:00:01.295 9151--    exectx: scanning 1000 times 1,027 contexts/11,277 ips
--00:00:00:01.309 9151--    exectx: finished scanning. Match 4 scanned 11,277,000  1,027 contexts/11,277 ips
--00:00:00:01.308 9154--    exectx: scanning 1000 times 2,051 contexts/24,589 ips
--00:00:00:01.345 9154--    exectx: finished scanning. Match 4 scanned 24,589,000  2,051 contexts/24,589 ips
--00:00:00:01.315 9157--    exectx: scanning 1000 times 4,099 contexts/53,261 ips
--00:00:00:01.417 9157--    exectx: finished scanning. Match 4 scanned 53,261,000  4,099 contexts/53,261 ips
--00:00:00:01.331 9160--    exectx: scanning 1000 times 8,195 contexts/114,701 ips
--00:00:00:01.576 9160--    exectx: finished scanning. Match 4 scanned 114,701,000  8,195 contexts/114,701 ips
--00:00:00:01.346 9163--    exectx: scanning 1000 times 16,387 contexts/245,773 ips
--00:00:00:01.860 9163--    exectx: finished scanning. Match 4 scanned 245,773,000  16,387 contexts/245,773 ips
--00:00:00:01.381 9166--    exectx: scanning 1000 times 32,771 contexts/524,301 ips
--00:00:00:02.576 9166--    exectx: finished scanning. Match 4 scanned 524,301,000  32,771 contexts/524,301 ips
--00:00:00:01.441 9169--    exectx: scanning 1000 times 65,539 contexts/1,114,125 ips
--00:00:00:05.809 9169--    exectx: finished scanning. Match 4 scanned 1,114,125,000  65,539 contexts/1,114,125 ips
--00:00:00:01.559 9172--    exectx: scanning 1000 times 131,076 contexts/2,359,327 ips
--00:00:00:15.639 9172--    exectx: finished scanning. Match 4 scanned 2,359,327,000  131,076 contexts/2,359,327 ips
--00:00:00:01.777 9175--    exectx: scanning 1000 times 262,149 contexts/4,980,787 ips
--00:00:00:28.796 9175--    exectx: finished scanning. Match 4 scanned 4,980,787,000  262,149 contexts/4,980,787 ips
--00:00:00:01.792 9178--    exectx: scanning 1000 times 262,149 contexts/5,242,933 ips
--00:00:00:32.021 9178--    exectx: finished scanning. Match 4 scanned 5,242,933,000  262,149 contexts/5,242,933 ips
--00:00:00:01.821 9182--    exectx: scanning 1000 times 262,149 contexts/6,291,514 ips
--00:00:00:21.031 9182--    exectx: finished scanning. Match 524296 scanned 6,291,514,000  262,149 contexts/6,291,514 ips


Note that I am slightly amazed by the fact that the last run is faster than
the 2 previous one. I cannot explain this (I checked, and the Match-es are all because of the
'last ip' below main, which has a small value. So, the last run scans the same nr of EC,
but about 1_000_000 IPs more (1000 times), and is faster ???
Comment 74 Philippe Waroquiers 2017-08-07 19:30:25 UTC
Created attachment 107129 [details]
measure time to scan the execcontext, but do not let gcc remove the loop
Comment 75 Julian Seward 2017-08-08 07:54:54 UTC
(In reply to Philippe Waroquiers from comment #73)

Thank you for redoing the measurements.

> With this, the scan is slower : for big applications (260_000 EC,
> 6_300_000 IP), a scan takes between 0.020 to 0.030 seconds.

That is about 10-15 cycles/IP, which sounds more realistic.

> Note that I am slightly amazed by the fact that the last run is faster
> than the 2 previous one. I cannot explain this (I checked, and the
> Match-es are all because of the 'last ip' below main, which has a small
> value. So, the last run scans the same nr of EC, but about 1_000_000 IPs
> more (1000 times), and is faster ???

Who knows.  Maybe it relates to the layout of the workloads in memory?
(in other words, it is a caching effect?)  Anyway, not a problem, I think.

> Assuming these measurements are now ok; I think this still looks
> acceptable,

I agree.

> But I think we better do first the simple approach of
> scanning, and if a user of --keep-debuginfo complains
> about performance, optimise by doing the lazy scanning.

I agree here too.

So the only question now is how to actually get the work done :-)
Do you have any time/enthusiasm to hack on this?
Comment 76 Philippe Waroquiers 2017-08-08 21:19:01 UTC
(In reply to Julian Seward from comment #75)
> So the only question now is how to actually get the work done :-)
> Do you have any time/enthusiasm to hack on this?
Enthusiasm, yes. Time (at short term), not much.

I have added this to my list of things to do, and
should be able to look at this sometime in September or October
(I really need to finalise first the helgrind unwind optimisation, which
is lingering since FOSDEM).
Comment 77 Julian Seward 2018-01-11 19:01:54 UTC
Landed, with major reworking by PhilippeW.
cceed053ce876560b9a7512125dd93c7fa059778
Comment 78 Philippe Waroquiers 2018-01-14 11:29:13 UTC
Fixed now, after almost 14 years :)
Comment 79 Julian Seward 2018-01-15 10:35:59 UTC
Pushed a followup fix at f8ae2f95d6d717aa6d3923635b9f6f87af9b7cf1.
This fixes a problem spotted by Matthias Schwarzott, in which dlopening
an object that was previously dlopened and then dlclosed caused an 
assertion failure in check_CFSI_related_invariants().