Bug 327427

Summary: ifunc wrapper crashes when symbols are discarded because of false mmap overlaps
Product: [Developer tools] valgrind Reporter: Mark Wielaard <mark>
Component: general Assignee: Julian Seward <jseward>
Status: REPORTED ---    
Severity: normal    
Priority: NOR    
Version First Reported In: 3.8.0   
Target Milestone: ---   
Platform: unspecified   
OS: Linux   

Description Mark Wielaard 2013-11-10 21:38:26 UTC
I am seeing this during elfutils make check when configured with --enable-valgrind, which will run all tests under valgrind memcheck.

valgrind: m_redir.c:700 (vgPlain_redir_add_ifunc_target): Assertion 'old' failed.
==26262==    at 0x38059B6F: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==26262==    by 0x38059CB2: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==26262==    by 0x3806A40D: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==26262==    by 0x3809F787: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==26262==    by 0x380AE0FC: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable
==26262==    at 0x48017CC: _vgnU_ifunc_wrapper (in /usr/lib64/valgrind/vgpreload_core-amd64-linux.so)
==26262==    by 0x3384E0E99E: _dl_fixup (dl-irel.h:32)
==26262==    by 0x3384E152D4: _dl_runtime_resolve (dl-trampoline.S:45)
==26262==    by 0xB9FEB30: x86_64_core_note (linux-core-note.c:210)
==26262==    by 0x4C3ECF2: ebl_core_note (eblcorenote.c:54)
==26262==    by 0x4C3B286: __libdwfl_attach_state_for_core (linux-core-attach.c:333)
==26262==    by 0x4C38547: dwfl_core_file_report@@ELFUTILS_0.158 (core-file.c:565)
==26262==    by 0x4C2E320: parse_opt (argp-std.c:317)
==26262==    by 0x3385302987: ??? (in /usr/lib64/libc-2.17.so)
==26262==    by 0x401A74: main (addr2line.c:149)

I don't have a simpler reproducer yet. The following workaround fixes it, but might not be correct:

--- a/coregrind/m_debuginfo/debuginfo.c
+++ b/coregrind/m_debuginfo/debuginfo.c
@@ -903,6 +903,10 @@ ULong VG_(di_notify_mmap)( Addr a, Bool allow_SkFileV, Int use_fd )
    di = find_or_create_DebugInfo_for( filename );
    vg_assert(di);
 
+   /* Already processed? */
+   if (di->have_dinfo)
+      return 0;
+
    /* Note the details about the mapping. */
    struct _DebugInfoMapping map;
    map.avma = a;

The problem is that one of the tests reads a core file and mmaps libc.so again (read only, not executable) to inspect the headers. As the code above shows, that adds the new avma mapping to the existing DebugInfo for that file. The program then unmaps this mapping of libc.so, but apparently that doesn't remove the mapping from the DebugInfo. Next, another shared library is loaded and happens to be mmapped at the same address. This causes discard_DebugInfos_which_overlap_with to remove all symbols for the libc DebugInfo. When the new library's code is called and triggers an ifunc (for libc memcmp in this case), the above assert is triggered.
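
To make the sequence easier to follow, here is a small toy model in C. It is only a sketch of the behaviour described above, not valgrind's code: the ToyDebugInfo and ToyMapping types, all toy_* functions and the addresses are made up for illustration, and toy_notify_munmap deliberately does nothing in order to model the suspected missing cleanup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: hypothetical types, NOT valgrind's real data structures. */
#define MAX_MAPS 4

typedef struct {
    unsigned long avma;   /* start address of the mapping */
    unsigned long size;   /* length in bytes */
} ToyMapping;

typedef struct {
    const char *filename;
    ToyMapping  maps[MAX_MAPS];
    size_t      n_maps;
    bool        have_symbols;   /* false once the entry was discarded */
} ToyDebugInfo;

/* Record a new mapping against an existing entry. */
static void toy_notify_mmap(ToyDebugInfo *di, unsigned long avma,
                            unsigned long size)
{
    assert(di->n_maps < MAX_MAPS);
    di->maps[di->n_maps].avma = avma;
    di->maps[di->n_maps].size = size;
    di->n_maps++;
}

/* Models the reported behaviour: the recorded mapping is left in place. */
static void toy_notify_munmap(ToyDebugInfo *di, unsigned long avma,
                              unsigned long size)
{
    (void)di; (void)avma; (void)size;
}

/* True if [avma, avma+size) intersects any recorded mapping. */
static bool toy_overlaps(const ToyDebugInfo *di, unsigned long avma,
                         unsigned long size)
{
    for (size_t i = 0; i < di->n_maps; i++) {
        const ToyMapping *m = &di->maps[i];
        if (avma < m->avma + m->size && m->avma < avma + size)
            return true;
    }
    return false;
}

/* Drop the entry's symbols if a new mapping overlaps a recorded one. */
static void toy_discard_if_overlapping(ToyDebugInfo *di, unsigned long avma,
                                       unsigned long size)
{
    if (toy_overlaps(di, avma, size)) {
        di->have_symbols = false;
        di->n_maps = 0;
    }
}

/* Replays the reported sequence; returns 1 if libc's symbols survive. */
static int toy_scenario(void)
{
    ToyDebugInfo libc = { .filename = "libc.so", .have_symbols = true };

    toy_notify_mmap(&libc, 0x10000, 0x2000);    /* loaded by ld.so */
    toy_notify_mmap(&libc, 0x80000, 0x2000);    /* mapped again for inspection */
    toy_notify_munmap(&libc, 0x80000, 0x2000);  /* unmapped, record stays */

    /* Another library happens to land on the stale address. */
    toy_discard_if_overlapping(&libc, 0x80000, 0x1000);
    return libc.have_symbols ? 1 : 0;
}
```

Calling toy_scenario() from a main() returns 0 in this model: the stale record makes the unrelated new mapping look like an overlap, so the libc symbols are dropped even though the real libc mapping at 0x10000 was never touched.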

Reproducible: Always
Comment 1 Mark Wielaard 2014-05-09 17:41:25 UTC
On IRC Julian correctly pointed out that we are probably not just opening the file read-only. We are indeed opening rw (MAP_PRIVATE) because we might want to do some relocations in the file (if it turns out to be an ET_REL file). Unfortunately it is not possible to know beforehand whether we need the file ro or rw. We have to open the file first to examine whether it is usable as is or whether we might also need to write to the mapping.

So if at all possible I would like to see if we can have valgrind detect this isn't an executable mapping. Or at least add some smarts so that valgrind detects this double opening/mapping case and doesn't destroy symbols when discard_DebugInfos_which_overlap_with is run on a subsequent file mapping.
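
For illustration, the kind of access pattern meant here could look roughly like the C sketch below. This is not the elfutils code: toy_elf_type is a hypothetical helper, and it assumes the file descriptor is opened read-only with only the private mapping being writable, that the file uses the host byte order, and that reading e_type through Elf64_Ehdr is good enough for a demonstration.

```c
#include <elf.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper, not elfutils code: map a file privately read-write,
   read e_type from the ELF header, then unmap it again.
   Returns e_type (for example ET_REL or ET_DYN), or -1 on any error. */
static int toy_elf_type(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size < (off_t)sizeof(Elf64_Ehdr)) {
        close(fd);
        return -1;
    }

    /* A MAP_PRIVATE mapping may be writable even on a read-only descriptor:
       writes go to private copies and never reach the file. Note that this
       mapping is readable and writable but not executable. */
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE, fd, 0);
    close(fd);
    if (p == MAP_FAILED)
        return -1;

    int type = -1;
    Elf64_Ehdr ehdr;
    memcpy(&ehdr, p, sizeof ehdr);
    if (memcmp(ehdr.e_ident, ELFMAG, SELFMAG) == 0)
        type = ehdr.e_type;   /* only now do we know whether it is ET_REL */

    munmap(p, (size_t)st.st_size);
    return type;
}
```

The point of the sketch is that the decision (ET_REL or not) can only be made after the rw mapping already exists, so from valgrind's side the mmap of libc.so looks like a writable, non-executable mapping of a file it already has a DebugInfo for.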
Comment 2 Julian Seward 2015-02-02 23:22:15 UTC
Is this still alive?  If yes is there anything we can or should do about it?
Comment 3 Mark Wielaard 2015-04-01 15:14:22 UTC
(In reply to Julian Seward from comment #2)
> Is this still alive?  If yes is there anything we can or should do about it?

Yes, I can still trigger this issue. But it might be I am the only one :)
In my use case we mmap glibc.so twice. Once through ld.so because the program uses libc.so itself and then because it wants to inspect that same libc.so file.  I doubt that is a common usage scenario.
Comment 4 Mark Wielaard 2018-01-22 10:06:10 UTC
I had hoped the fix for Bug 79362 - Debug info is lost for .so files when they are dlclose'd, would have also fixed this issue. But it didn't.

Note that this is a somewhat weird/special case; it happens with the following elfutils testcases (when configured with --enable-valgrind to run all tests under valgrind):

  tests/run-backtrace-demangle.sh
  tests/run-stack-d-test.sh
  tests/run-stack-demangled-test.sh
  tests/run-stack-i-test.sh

They all have the following workaround for now:

 # Disable valgrind while dumping because of a bug unmapping libc.so.
 # https://bugs.kde.org/show_bug.cgi?id=327427
 SAVED_VALGRIND_CMD="$VALGRIND_CMD"
 unset VALGRIND_CMD

They are somewhat special in that they try to create a backtrace for a different process, so they map (and unmap) libc.so into their address space twice (once because they are linked against it themselves and then another time because the target process is linked against it). Normal processes would obviously never do that.
Comment 5 Mark Wielaard 2018-01-26 19:52:44 UTC
BTW. The valgrind crash does change when using --keep-debuginfo=yes for these cases. Now it crashes because of:

valgrind: m_debuginfo/debuginfo.c:452 (discard_or_archive_DebugInfo): Assertion 'is_DebugInfo_active(di)' failed.

host stacktrace:
==7114==    at 0x5804AAF5: show_sched_status_wrk (m_libcassert.c:355)
==7114==    by 0x5804AC24: report_and_quit (m_libcassert.c:426)
==7114==    by 0x5804ADA0: vgPlain_assert_fail (m_libcassert.c:492)
==7114==    by 0x5808891E: discard_or_archive_DebugInfo (debuginfo.c:452)
==7114==    by 0x5808A376: discard_or_archive_marked_DebugInfos (debuginfo.c:589)
==7114==    by 0x5808A376: discard_DebugInfos_which_overlap_with (debuginfo.c:614)
==7114==    by 0x5808A376: di_notify_ACHIEVE_ACCEPT_STATE (debuginfo.c:897)
==7114==    by 0x5808A376: vgPlain_di_notify_mmap (debuginfo.c:1231)
==7114==    by 0x580BDD5F: vgModuleLocal_generic_PRE_sys_mmap (syswrap-generic.c:2388)
==7114==    by 0x580F45A2: vgSysWrap_amd64_linux_sys_mmap_before (syswrap-amd64-linux.c:400)
==7114==    by 0x580B8C7E: vgPlain_client_syscall (syswrap-main.c:1857)
==7114==    by 0x580B52EA: handle_syscall (scheduler.c:1176)
==7114==    by 0x580B6CAE: vgPlain_scheduler (scheduler.c:1498)
==7114==    by 0x580C8936: thread_wrapper (syswrap-linux.c:103)
==7114==    by 0x580C8936: run_a_thread_NORETURN (syswrap-linux.c:156)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable (lwpid 7114)
==7114==    at 0x4018C8A: mmap (mmap.c:34)
==7114==    by 0x400669A: _dl_map_object_from_fd (dl-load.c:1347)