Bug 492663 - Valgrind ignores debug info for some binaries
Summary: Valgrind ignores debug info for some binaries
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: general (show other bugs)
Version: 3.24 GIT
Platform: Other Linux
: NOR normal
Target Milestone: ---
Assignee: Paul Floyd
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-09-05 03:40 UTC by mh
Modified: 2024-09-07 08:06 UTC (History)
2 users (show)

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:


Attachments
Test case archive (.tar.bz2) (3.77 MB, application/x-bzip)
2024-09-05 03:40 UTC, mh
Details
patch to always merge when called from nsegments (3.62 KB, patch)
2024-09-06 19:57 UTC, Paul Floyd
Details

Note You need to log in before you can comment on or make changes to this bug.
Description mh 2024-09-05 03:40:45 UTC
Created attachment 173332 [details]
Test case archive (.tar.bz2)

SUMMARY
Some binary layouts make it so that valgrind ignores their debug info.

STEPS TO REPRODUCE
1. Download and unpack the attached test case archive
2. run `valgrind --leak-check=full testcase/firefox-bad` on Linux x86_64.

OBSERVED RESULT
A leak is reported with the second frame non-symbolicated

EXPECTED RESULT
The second frame should be symbolicated

ADDITIONAL INFORMATION
The archive also contains a `firefox-good` binary that does get the second frame symbolicated.

The only difference in the source code between `firefox-bad` and `firefox-good` is the content and length of a single string. It's enough to push the boundaries of PT_LOADs to a place that makes `check_elf_and_get_rw_loads` not agree with how coregrind handles segments-merges on mmap, leading `di_notify_mmap` to bail with `no dinfo loaded firefox-bad (no rx or no rw mapping)` on the last firefox-bad segment, and the debug info not being loaded, ultimately leading to the lack of symbolication.

The following patch "fixes" this problem:
```
diff --git a/coregrind/m_debuginfo/readelf.c b/coregrind/m_debuginfo/readelf.c
index 735f83044..326a1c59d 100644
--- a/coregrind/m_debuginfo/readelf.c
+++ b/coregrind/m_debuginfo/readelf.c
@@ -3954,7 +3954,6 @@ Bool ML_(check_elf_and_get_rw_loads) ( Int fd, const HChar* filename, Int * rw_l
                 * with or without rounding up.
                 * */
                if (previous_rw_a_phdr.p_memsz > 0 &&
-                   ehdr_m.e_type == ET_EXEC &&
                    previous_rw_a_phdr.p_vaddr + previous_rw_a_phdr.p_filesz
                      == a_phdr.p_vaddr)
                {
```
but the whole ordeal is incredibly fragile and making assumptions that I'm sure future linkers and/or dynamic loaders will keep breaking.
Comment 1 Paul Floyd 2024-09-05 06:48:34 UTC
(In reply to mh from comment #0)

> but the whole ordeal is incredibly fragile and making assumptions that I'm
> sure future linkers and/or dynamic loaders will keep breaking.

I think that the code is overly cautious. There are two possible errors. One would be false positive reading debuginfo for a binary that is getting loaded other than as part of execution, link loading or dlopen. I don't imagine user code mmapping binaries is at all common, so a false positive isn't a disaster.

Then there are all the false negatives where a binary really does get loaded but we fail to read the debuginfo. That's much more common. For the loading of the tool and the guest exe I think we should always trigger the read of debuginfo. The tool can't have failed to load and if the guest exe fails to load it will terminate very quickly. Loading guest shared libraries is trickier. The problem is that we are counting mmaps to see when the library has been fully loaded. There are a lot of things that can go wrong there.
Comment 2 Paul Floyd 2024-09-06 06:55:39 UTC
For the bad exe

  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000025f4c 0x0000000000025f4c  R      0x1000
  LOAD           0x0000000000025f50 0x0000000000026f50 0x0000000000026f50
                 0x00000000000899b0 0x00000000000899b0  R E    0x1000
  LOAD           0x00000000000af900 0x00000000000b1900 0x00000000000b1900
                 0x0000000000001700 0x0000000000001700  RW     0x1000
  LOAD           0x00000000000b1000 0x00000000000b3000 0x00000000000b3000
                 0x0000000000000018 0x0000000000001fa8  RW     0x1000

So one RO one RX and two RW PT_LOADS. So I expect 3 or 4 calls to VG_(di_notify_mmap) depending on whether the RW segments were merged or not.

First, the RO segment

--15892-- di_notify_mmap-0:
--15892-- di_notify_mmap-1: 0x108000-0x12dfff r--
--15892-- di_notify_mmap-2: /home/paulf/scratch/vg_example/bug492663/testcase/firefox-bad
--15892-- di_notify_mmap-3: is_rx_map 0, is_rw_map 0, is_ro_map 1
--15892-- check_elf_and_get_rw_loads: ++*rw_load_count to 1 for /home/paulf/scratch/vg_example/bug492663/testcase/firefox-bad p_vaddr 0xb1900 p_offset 719104, p_filesz 5888
--15892-- check_elf_and_get_rw_loads: ++*rw_load_count to 2 for /home/paulf/scratch/vg_example/bug492663/testcase/firefox-bad p_vaddr 0xb3000 p_offset 724992, p_filesz 24
--15892-- di_notify_mmap-4: noting details in DebugInfo* at 0x100287C4D0
--15892-- di_notify_mmap-6: no dinfo loaded /home/paulf/scratch/vg_example/bug492663/testcase/firefox-bad (no rx or no rw mapping)


Next the RX segment

--15892-- di_notify_mmap-1: 0x12e000-0x1b8fff r-x
--15892-- di_notify_mmap-2: /home/paulf/scratch/vg_example/bug492663/testcase/firefox-bad
--15892-- di_notify_mmap-3: is_rx_map 1, is_rw_map 0, is_ro_map 0
--15892-- check_elf_and_get_rw_loads: ++*rw_load_count to 1 for /home/paulf/scratch/vg_example/bug492663/testcase/firefox-bad p_vaddr 0xb1900 p_offset 719104, p_filesz 5888
--15892-- check_elf_and_get_rw_loads: ++*rw_load_count to 2 for /home/paulf/scratch/vg_example/bug492663/testcase/firefox-bad p_vaddr 0xb3000 p_offset 724992, p_filesz 24
--15892-- di_notify_mmap-4: noting details in DebugInfo* at 0x100287C4D0
--15892-- di_notify_mmap-6: no dinfo loaded /home/paulf/scratch/vg_example/bug492663/testcase/firefox-bad (no rx or no rw mapping)

Then there is one RW segment

--15892-- di_notify_mmap-0:
--15892-- di_notify_mmap-1: 0x1b9000-0x1bbfff rw-
--15892-- di_notify_mmap-2: /home/paulf/scratch/vg_example/bug492663/testcase/firefox-bad
--15892-- di_notify_mmap-3: is_rx_map 0, is_rw_map 1, is_ro_map 0
--15892-- check_elf_and_get_rw_loads: ++*rw_load_count to 1 for /home/paulf/scratch/vg_example/bug492663/testcase/firefox-bad p_vaddr 0xb1900 p_offset 719104, p_filesz 5888
--15892-- check_elf_and_get_rw_loads: ++*rw_load_count to 2 for /home/paulf/scratch/vg_example/bug492663/testcase/firefox-bad p_vaddr 0xb3000 p_offset 724992, p_filesz 24
--15892-- di_notify_mmap-4: noting details in DebugInfo* at 0x100287C4D0
--15892-- di_notify_mmap-6: no dinfo loaded /home/paulf/scratch/vg_example/bug492663/testcase/firefox-bad (no rx or no rw mapping)

After a bit more debugging it looks like the problem is that 'firefox-bad' has an object file type of ET_DYN (shared object file). readelf -e says

  Type:                              DYN (Position-Independent Executable file)

This all makes sense now. We were only handling ET_EXEC (non-PIE exes).

One thing to do would be to change the condition to be

                   (ehdr_m.e_type == ET_EXEC || ehdr_m.e_type == ET_DYN) &&

(slightly more restrictive than the suggested patch).

I'm afraid that might break loading shared libraries. All of this horrible mess is unnecessary for libraries since they get loaded by ld.so and I don't think that the loader merges calls to mmap so there is always a 1 to 1 correlation between the number of segments reported in the ELF header and the number of calls to mmap that we see.

That just leaves the valgrind tool and the guest exe. The tool never has two RW PT_LOADs - it's statically linked and the second RW PT_LOAD is used to split data and bss from all the gubbins like the GOT.

I'll look to see if we can simply not merge the mmaps and segments for the guest exe. I'm willing to sacrifice one NSegment to avoid all this complexity.
Comment 3 Paul Floyd 2024-09-06 19:57:00 UTC
Created attachment 173389 [details]
patch to always merge when called from nsegments

Could you try this patch?

In the end I gave up on trying to modify add_segment. There's no obvious way to temporarily turn off preening because segments get added for many things. I added a flag to ML_(check_elf_and_get_rw_loads) to tell apart visiting the nsegments to look for mapped exes/libs from later calls to mmap to load libs.
Comment 4 mh 2024-09-06 20:27:27 UTC
It works. Thank you.
Comment 5 Paul Floyd 2024-09-07 07:32:29 UTC
commit 1076dc0912e68b085de2bbf52e9e31cb32866f85 (HEAD -> master, origin/master, origin/HEAD)
Author: Paul Floyd <pjfloyd@wanadoo.fr>
Date:   Sat Sep 7 09:06:03 2024 +0200

    Bug 492663 - Valgrind ignores debug info for some binaries
    
    ML_(check_elf_and_get_rw_loads) now always checks for
    merged PT_LOADs when called from valgrind_main when
    iterating over nsegments.
    
    Updated comments and changed variable names and the
    debug message when the number of expected RW PT_LOADs
    hasn't been reached.