Bug 69511 - Valgrind can call wrong function
Summary: Valgrind can call wrong function
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: general (show other bugs)
Version: 2.0.0
Platform: unspecified Linux
: NOR major
Target Milestone: ---
Assignee: Julian Seward
URL:
Keywords:
: 69532 (view as bug list)
Depends on:
Blocks:
 
Reported: 2003-12-02 17:39 UTC by Lee Kindness
Modified: 2005-07-07 15:32 UTC (History)
1 user (show)

See Also:
Latest Commit:
Version Fixed In:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Lee Kindness 2003-12-02 17:39:41 UTC
Given the example source below, which uses nested functions, Valgrind calls the
wrong nested function.

#include <stdio.h>

static void call_func(void (*sel)(void))
{
   sel();
}

void test1()
{
   void test1_inner()
   {
      printf( "Inside test1\n" );
   }
   call_func( test1_inner );
}

void test2()
{
   void test2_inner()
   {
      printf( "Inside test2\n" );
   }
   call_func( test2_inner );
}

int main(int argc, char** argv)
{
   test1();
   test2();
   return( 0 );
}

The implementation of nested functions by GCC makes use of dynamically modified
code, which Valgrind does not detect. This then causes it to call the incorrect
function.
Comment 1 Lee Kindness 2003-12-02 17:40:36 UTC
Discussed on the mailing list on "Incorrect function called under Valgrind" and "Nested Functions" threads.

L.
Comment 2 Jeremy Fitzhardinge 2003-12-02 23:42:02 UTC
I filed bug 69532 to propose a general fix for this.
Comment 3 Dirk Mueller 2003-12-05 17:41:59 UTC
*** Bug 69532 has been marked as a duplicate of this bug. ***
Comment 4 Jeremy Fitzhardinge 2005-01-29 10:08:36 UTC
Ah, its our old friend, the infinitely recursive allocation:

#0  0xb003825f in SkipList__Find (l=0xb00902a0, k=0xb02e802c, prevs=0x0)
    at vg_skiplist.c:292
#1  0xb00382f4 in vgPlain_SkipList_Find_Before (l=0xb00902a0, k=0xb02e802c)
    at vg_skiplist.c:345
#2  0xb002ae8b in vgPlain_find_map_space (addr=2955837484, len=1056768,
    for_client=0 '\0') at vg_memory.c:555
#3  0xb002c50b in vgPlain_mmap (start=0xb0000000, length=1048576, prot=7,
    flags=34, sf_flags=0, fd=4294967295, offset=0) at vg_mylibc.c:289
#4  0xb002e05e in vgPlain_get_memory_from_mmap (nBytes=1048576,
    who=0xb007a5aa "newSuperblock") at vg_mylibc.c:1701
#5  0xb0028b73 in newSuperblock (a=0xb01a1360, cszB=1048576) at vg_malloc2.c:478
#6  0xb00297d4 in vgPlain_arena_malloc (aid=0, req_pszB=2954498912)
    at vg_malloc2.c:979
#7  0xb0038106 in vgPlain_SkipNode_Alloc (l=0xb00902a0) at vg_skiplist.c:226
#8  0xb002ab26 in vgPlain_map_file_segment (addr=2956996608, len=1048576,
    prot=7, flags=8201, dev=0, ino=0, off=0, filename=0x0) at vg_memory.c:120
#9  0xb002ad18 in vgPlain_map_fd_segment (addr=2956996608, len=1048576, prot=7,
    flags=8201, fd=-1, off=0, filename=0x0) at vg_memory.c:468
#10 0xb002c5f1 in vgPlain_mmap (start=0xb0000000, length=1048576, prot=7,
    flags=50, sf_flags=8201, fd=4294967295, offset=0) at vg_mylibc.c:313
#11 0xb002e05e in vgPlain_get_memory_from_mmap (nBytes=1048576,
    who=0xb007a5aa "newSuperblock") at vg_mylibc.c:1701
#12 0xb0028b73 in newSuperblock (a=0xb01a1360, cszB=1048576) at vg_malloc2.c:478
#13 0xb00297d4 in vgPlain_arena_malloc (aid=0, req_pszB=2954498912)
    at vg_malloc2.c:979
#14 0xb0038106 in vgPlain_SkipNode_Alloc (l=0xb00902a0) at vg_skiplist.c:226
#15 0xb002ab26 in vgPlain_map_file_segment (addr=2956996608, len=1048576,
    prot=7, flags=8201, dev=0, ino=0, off=0, filename=0x0) at vg_memory.c:120
#16 0xb002ad18 in vgPlain_map_fd_segment (addr=2956996608, len=1048576, prot=7,
    flags=8201, fd=-1, off=0, filename=0x0) at vg_memory.c:468
#17 0xb002c5f1 in vgPlain_mmap (start=0xb0000000, length=1048576, prot=7,
    flags=50, sf_flags=8201, fd=4294967295, offset=0) at vg_mylibc.c:313
...
Comment 5 Jeremy Fitzhardinge 2005-01-29 10:10:02 UTC
Um, er, not this one.
Comment 6 Jeremy Fitzhardinge 2005-03-02 18:13:48 UTC
How about this as a fix: if we see a code fetch from a writable page, then mark
the page read-only.  If we later see a protection fault again that page, it means someone is trying to modify the page with code, so we mark the page read-write and flush the transtab for code covering that page.

This would be horribly slow, but its pretty simple and correct, and it doesn't require a big change.
Comment 7 Nicholas Nethercote 2005-03-02 20:53:51 UTC
I was about to say "that's the fix that's always suggested, and it won't work
for code on the stack".  But then I realised you are suggested doing the read-only marking at code fetch time, rather than translation time.  That might be enough
to make all the difference.

When would the check-if-fetching-from-a-writable-page check occur?  Would you have to do it at the start of every BB executed?  I wonder if this will get difficult if Vex joins together basic blocks from different parts of the program into a single block?  (Ie. part of a block could be from a read-only page, and part from a read-write page?)
Comment 8 Jeremy Fitzhardinge 2005-03-02 21:04:19 UTC
Erm, I'm not quite sure what you mean by "code fetch time".

I'm proposing that the translator does it as it is translating; if the instruction it is disassembling on an RW page, then it marks that page RO and carries on.

Hm, this doesn't actually work.  If the instruction doing the modification is on that page, then you'll end up with infinite protect/fault/unprotect+invalidate/restart/protect loop.

OK, scratch that idea.  I think the idea I proposed in bug 69532 is the best bet.
Comment 9 John Reiser 2005-05-02 23:05:13 UTC
For the specific case of on-stack code thunks for local functions compiled by gcc (the translation becomes stale when the stack is trimmed back beyond the location of the untranslated thunk), there is a solution that is not extravagantly expensive.  As remarked in bug #69532 (current marked "Closed: Duplicate of bug #69511" [this present bug]), such a translation must become self-checking.  The decision to generate the self-checking code is made on the basis of nearness to the user stack pointer at the time of translation: within the current frame, or perhaps within some fixed distance if the frame cannot be determined easily.

The code for the self-check is:
     cmpl $ sp_at_translation, & sp_array[k]  # compare immediate to memory
     blo stale  # if stack has been trimmed beyond value at translation
where sp_array is an array of values of the stack pointer, and k is an index unique to the current translation.  Any (other) translation that trims the stack must update  sp_array[k] = max(sp_array[k], new_sp);  for each self-checking translation that could be active.  This can be expensive if there are many of them, but usually there are only a very few, if any at all.  Also, various optimizations are possible: once any sp_array[k] becomes stale, then it need not be updated any more [until it is re-generated by being called]; the k can be dynamically managed to reduce their range; the updating code can be dynamic itself; etc.  The runtime cost for a program that does _not_ use this feature is at most two instructions per stack trim: skip the update of sp_array[] if the number of selfcheckers is zero.  By keeping a list of translations that trim the stack, the onset of the two-instruction dynamic cost can be delayed until seeing the first translation that needs it, then invalidating all the known stack-trimming translations.

Switching stacks (for instance, by using threads) can result in "false positive" invalidations.  These are safe (never "false negative"), and the cost is proportional to the frequency of thread switching.  The self-checking is thread safe, although updating sp_array[] must be serialized.

-- 
John Reiser, jreiser@BitWagon.com
Comment 10 Nicholas Nethercote 2005-05-03 14:52:09 UTC
I now agree with #69532.  I'm not sure if I didn't see it previously or just didn't understand it.  I'm not keen on John's proposal in comment #9, since it
doesn't restrict the cost of this to just the translations that require it.

We could just do something simple: store a copy of the code for the translations deemed short-lived, and call out to a C function at the start of these translations, which would just do a simple byte-by-byte check to see if the code has changed.  (Hmm, you'd probably first test that the original code is within the bounds of the stack, and automatically retranslate if not.)  Some questions:  where would the copy of the code be kept?  And is this a Vex or a Valgrind thing?
Comment 11 Nicholas Nethercote 2005-05-03 17:45:28 UTC
Here's another data point, from an Ada user:

> By the way, local procedures are much more widely used in a language          
> like Ada, than in C.    
>
> As a data point, I ran all the ACATS tests (see the gcc testsuite/ada         
> directory) under valgrind, and the majority of failures were spurious,        
> coming from the trampoline problem.  On the other hand, there are about       
> 1800 tests, and only about 50 triggered the trampoline problem (I don't       
> recall the exact number), so it's not that common either.         
Comment 12 Julian Seward 2005-05-17 01:15:48 UTC
It's clear that self-checking translations are the way to go.  John's
scheme (comment #9) sounds over-complex, though.  I like #10 better, but
that requires storing a complete copy of the original bytes somewhere,
which is inconvenient.  A variant of #10 is to compute a 32-bit CRC of
the bytes, store that, and check it at BB-entry time.  I don't care if
translations close to the stack take a significant performance hit;
they are going to be pretty rare.  The only thing that worries me is
if someone screwed with page permissions in such a way that the self-check
causes a page fault.  Not sure how to fix that.
Comment 13 Jeremy Fitzhardinge 2005-05-17 01:19:35 UTC
We can always fix the page permissions before doing the check, assuming it is possible to have an executable non-readable page (plain x86 doesn't allow it, but x86-64 does).
Comment 14 Julian Seward 2005-05-17 01:43:17 UTC
Can you clarify what you mean by "fix" ?  Do you mean, before each
self-test, force the permissions of the page(s) concerned to be 
whatever they were at the time the translation was made?
Comment 15 Jeremy Fitzhardinge 2005-05-17 02:43:37 UTC
Well, you could just go and try to do the check; if it faults, you can make it readable in the fault handler and try again.  Once the check has finished, you can restore the page permissions.

On the other hand, this is a pretty rare case.  In general, a dynamic piece of code is going to be in RWX memory.
Comment 16 Nicholas Nethercote 2005-07-07 15:32:50 UTC
Julian just fixed this in the 3.0 repository at valgrind.org!