Created attachment 173332 [details] Test case archive (.tar.bz2) SUMMARY Some binary layouts make it so that valgrind ignores their debug info. STEPS TO REPRODUCE 1. Download and unpack the attached test case archive 2. run `valgrind --leak-check=full testcase/firefox-bad` on Linux x86_64. OBSERVED RESULT A leak is reported with the second frame non-symbolicated EXPECTED RESULT The second frame should be symbolicated ADDITIONAL INFORMATION The archive also contains a `firefox-good` binary that does get the second frame symbolicated. The only difference in the source code between `firefox-bad` and `firefox-good` is the content and length of a single string. It's enough to push the boundaries of PT_LOADs to a place that makes `check_elf_and_get_rw_loads` not agree with how coregrind handles segments-merges on mmap, leading `di_notify_mmap` to bail with `no dinfo loaded firefox-bad (no rx or no rw mapping)` on the last firefox-bad segment, and the debug info not being loaded, ultimately leading to the lack of symbolication. The following patch "fixes" this problem: ``` diff --git a/coregrind/m_debuginfo/readelf.c b/coregrind/m_debuginfo/readelf.c index 735f83044..326a1c59d 100644 --- a/coregrind/m_debuginfo/readelf.c +++ b/coregrind/m_debuginfo/readelf.c @@ -3954,7 +3954,6 @@ Bool ML_(check_elf_and_get_rw_loads) ( Int fd, const HChar* filename, Int * rw_l * with or without rounding up. * */ if (previous_rw_a_phdr.p_memsz > 0 && - ehdr_m.e_type == ET_EXEC && previous_rw_a_phdr.p_vaddr + previous_rw_a_phdr.p_filesz == a_phdr.p_vaddr) { ``` but the whole ordeal is incredibly fragile and making assumptions that I'm sure future linkers and/or dynamic loaders will keep breaking.
(In reply to mh from comment #0) > but the whole ordeal is incredibly fragile and making assumptions that I'm > sure future linkers and/or dynamic loaders will keep breaking. I think that the code is overly cautious. There are two possible errors. One would be false positive reading debuginfo for a binary that is getting loaded other than as part of execution, link loading or dlopen. I don't imagine user code mmapping binaries is at all common, so a false positive isn't a disaster. Then there are all the false negatives where a binary really does get loaded but we fail to read the debuginfo. That's much more common. For the loading of the tool and the guest exe I think we should always trigger the read of debuginfo. The tool can't have failed to load and if the guest exe fails to load it will terminate very quickly. Loading guest shared libraries is trickier. The problem is that we are counting mmaps to see when the library has been fully loaded. There are a lot of things that can go wrong there.
For the bad exe LOAD 0x0000000000000000 0x0000000000000000 0x0000000000000000 0x0000000000025f4c 0x0000000000025f4c R 0x1000 LOAD 0x0000000000025f50 0x0000000000026f50 0x0000000000026f50 0x00000000000899b0 0x00000000000899b0 R E 0x1000 LOAD 0x00000000000af900 0x00000000000b1900 0x00000000000b1900 0x0000000000001700 0x0000000000001700 RW 0x1000 LOAD 0x00000000000b1000 0x00000000000b3000 0x00000000000b3000 0x0000000000000018 0x0000000000001fa8 RW 0x1000 So one RO one RX and two RW PT_LOADS. So I expect 3 or 4 calls to VG_(di_notify_mmap) depending on whether the RW segments were merged or not. First, the RO segment --15892-- di_notify_mmap-0: --15892-- di_notify_mmap-1: 0x108000-0x12dfff r-- --15892-- di_notify_mmap-2: /home/paulf/scratch/vg_example/bug492663/testcase/firefox-bad --15892-- di_notify_mmap-3: is_rx_map 0, is_rw_map 0, is_ro_map 1 --15892-- check_elf_and_get_rw_loads: ++*rw_load_count to 1 for /home/paulf/scratch/vg_example/bug492663/testcase/firefox-bad p_vaddr 0xb1900 p_offset 719104, p_filesz 5888 --15892-- check_elf_and_get_rw_loads: ++*rw_load_count to 2 for /home/paulf/scratch/vg_example/bug492663/testcase/firefox-bad p_vaddr 0xb3000 p_offset 724992, p_filesz 24 --15892-- di_notify_mmap-4: noting details in DebugInfo* at 0x100287C4D0 --15892-- di_notify_mmap-6: no dinfo loaded /home/paulf/scratch/vg_example/bug492663/testcase/firefox-bad (no rx or no rw mapping) Next the RX segment --15892-- di_notify_mmap-1: 0x12e000-0x1b8fff r-x --15892-- di_notify_mmap-2: /home/paulf/scratch/vg_example/bug492663/testcase/firefox-bad --15892-- di_notify_mmap-3: is_rx_map 1, is_rw_map 0, is_ro_map 0 --15892-- check_elf_and_get_rw_loads: ++*rw_load_count to 1 for /home/paulf/scratch/vg_example/bug492663/testcase/firefox-bad p_vaddr 0xb1900 p_offset 719104, p_filesz 5888 --15892-- check_elf_and_get_rw_loads: ++*rw_load_count to 2 for /home/paulf/scratch/vg_example/bug492663/testcase/firefox-bad p_vaddr 0xb3000 p_offset 724992, p_filesz 24 --15892-- di_notify_mmap-4: noting details in DebugInfo* at 0x100287C4D0 --15892-- di_notify_mmap-6: no dinfo loaded /home/paulf/scratch/vg_example/bug492663/testcase/firefox-bad (no rx or no rw mapping) Then there is one RW segment --15892-- di_notify_mmap-0: --15892-- di_notify_mmap-1: 0x1b9000-0x1bbfff rw- --15892-- di_notify_mmap-2: /home/paulf/scratch/vg_example/bug492663/testcase/firefox-bad --15892-- di_notify_mmap-3: is_rx_map 0, is_rw_map 1, is_ro_map 0 --15892-- check_elf_and_get_rw_loads: ++*rw_load_count to 1 for /home/paulf/scratch/vg_example/bug492663/testcase/firefox-bad p_vaddr 0xb1900 p_offset 719104, p_filesz 5888 --15892-- check_elf_and_get_rw_loads: ++*rw_load_count to 2 for /home/paulf/scratch/vg_example/bug492663/testcase/firefox-bad p_vaddr 0xb3000 p_offset 724992, p_filesz 24 --15892-- di_notify_mmap-4: noting details in DebugInfo* at 0x100287C4D0 --15892-- di_notify_mmap-6: no dinfo loaded /home/paulf/scratch/vg_example/bug492663/testcase/firefox-bad (no rx or no rw mapping) After a bit more debugging it looks like the problem is that 'firefox-bad' has an object file type of ET_DYN (shared object file). readelf -e says Type: DYN (Position-Independent Executable file) This all makes sense now. We were only handling ET_EXEC (non-PIE exes). One thing to do would be to change the condition to be (ehdr_m.e_type == ET_EXEC || ehdr_m.e_type == ET_DYN) && (slightly more restrictive than the suggested patch). I'm afraid that might break loading shared libraries. All of this horrible mess is unnecessary for libraries since they get loaded by ld.so and I don't think that the loader merges calls to mmap so there is always a 1 to 1 correlation between the number of segments reported in the ELF header and the number of calls to mmap that we see. That just leaves the valgrind tool and the guest exe. The tool never has two RW PT_LOADs - it's statically linked and the second RW PT_LOAD is used to split data and bss from all the gubbins like the GOT. I'll look to see if we can simply not merge the mmaps and segments for the guest exe. I'm willing to sacrifice one NSegment to avoid all this complexity.
Created attachment 173389 [details] patch to always merge when called from nsegments Could you try this patch? In the end I gave up on trying to modify add_segment. There's no obvious way to temporarily turn off preening because segments get added for many things. I added a flag to ML_(check_elf_and_get_rw_loads) to tell apart visiting the nsegments to look for mapped exes/libs from later calls to mmap to load libs.
It works. Thank you.
commit 1076dc0912e68b085de2bbf52e9e31cb32866f85 (HEAD -> master, origin/master, origin/HEAD) Author: Paul Floyd <pjfloyd@wanadoo.fr> Date: Sat Sep 7 09:06:03 2024 +0200 Bug 492663 - Valgrind ignores debug info for some binaries ML_(check_elf_and_get_rw_loads) now always checks for merged PT_LOADs when called from valgrind_main when iterating over nsegments. Updated comments and changed variable names and the debug message when the number of expected RW PT_LOADs hasn't been reached.