Bug 395682 - Reading debug info of binaries with readonly PT_LOAD segments
Summary: Reading debug info of binaries with readonly PT_LOAD segments
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Unclassified
Component: general
Version: 3.14 SVN
Platform: Other Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Mark Wielaard
URL:
Keywords:
Duplicates: 384727
Depends on:
Blocks: 396476
Reported: 2018-06-21 09:28 UTC by Дилян Палаузов
Modified: 2018-07-16 13:14 UTC (History)
CC List: 6 users

See Also:
Latest Commit:
Version Fixed In:


Attachments
The produced binary (17.77 KB, application/x-executable)
2018-06-21 09:28 UTC, Дилян Палаузов
Accept read-only PT_LOAD segments and .rodata by ld -z separate-code (4.44 KB, patch)
2018-07-12 13:50 UTC, Mark Wielaard
A test (6.94 KB, application/octet-stream)
2018-07-12 15:01 UTC, H.J. Lu

Description Дилян Палаузов 2018-06-21 09:28:05 UTC
Created attachment 113482 [details]
The produced binary

I have this program:
  #include <stdio.h>
  int main() {
    printf("a\n");
    int i = 7 /0;
    printf("b\n");
  }
which I compile with "gcc -g t.c -o t"

Running `valgrind -v  --memcheck:track-origins=yes --read-var-info=yes --memcheck:show-leak-kinds=all --vgdb=no t` I expect to see the line where there are problems, but it prints:

==24781== Memcheck, a memory error detector
==24781== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==24781== Using Valgrind-3.14.0.GIT-90daa486e8-20180620X and LibVEX; rerun with -h for copyright info
==24781== Command: t
==24781== 
--24781-- Valgrind options:
--24781--    -v
--24781--    --memcheck:track-origins=yes
--24781--    --read-var-info=yes
--24781--    --memcheck:show-leak-kinds=all
--24781--    --vgdb=no
--24781-- Contents of /proc/version:
--24781--   Linux version 3.16.0-4-amd64 (debian-kernel@lists.debian.org) (gcc version 4.8.4 (Debian 4.8.4-1) ) #1 SMP Debian 3.16.7-ckt25-2 (2016-04-08)
--24781-- 
--24781-- Arch and hwcaps: AMD64, LittleEndian, amd64-cx16-sse3
--24781-- Page sizes: currently 4096, max supported 4096
--24781-- Valgrind library directory: /usr/local/lib/valgrind
--24781-- Reading syms from /home/me/t
--24781-- ELF section outside all mapped regions
--24781-- Reading syms from /lib/x86_64-linux-gnu/ld-2.19.so
--24781--   Considering /lib/x86_64-linux-gnu/ld-2.19.so ..
--24781--   .. CRC mismatch (computed c067370a wanted 8c45d3ea)
--24781--   Considering /usr/lib/debug/lib/x86_64-linux-gnu/ld-2.19.so ..
--24781--   .. CRC is valid
--24781-- warning: addVar: unknown size (buf)
--24781-- warning: addVar: unknown size (buf)
--24781-- warning: addVar: unknown size (buf)
--24781-- warning: addVar: unknown size (loadcmds)
--24781-- warning: addVar: unknown size (loadcmds)
--24781-- warning: addVar: unknown size (loadcmds)
--24781-- warning: addVar: unknown size (loadcmds)
--24781-- warning: addVar: unknown size (loadcmds)
--24781-- warning: addVar: unknown size (loadcmds)
--24781-- warning: addVar: unknown size (loadcmds)
--24781-- Reading syms from /usr/local/lib/valgrind/memcheck-amd64-linux
--24781--    object doesn't have a symbol table
--24781--    object doesn't have a dynamic symbol table
--24781-- Scheduler: using generic scheduler lock implementation.
--24781-- Reading suppressions file: /usr/local/lib/valgrind/default.supp
--24781-- REDIR: 0x4017b50 (ld-linux-x86-64.so.2:strlen) redirected to 0x581df42e (???)
--24781-- REDIR: 0x4017900 (ld-linux-x86-64.so.2:index) redirected to 0x581df448 (???)
--24781-- Reading syms from /usr/local/lib/valgrind/vgpreload_core-amd64-linux.so
--24781--    object doesn't have a symbol table
--24781-- Reading syms from /usr/local/lib/valgrind/vgpreload_memcheck-amd64-linux.so
--24781--    object doesn't have a symbol table
==24781== WARNING: new redirection conflicts with existing -- ignoring it
--24781--     old: 0x04017b50 (strlen              ) R-> (0000.0) 0x581df42e ???
--24781--     new: 0x04017b50 (strlen              ) R-> (2007.0) 0x0402d490 strlen
--24781-- REDIR: 0x4017b20 (ld-linux-x86-64.so.2:strcmp) redirected to 0x402ed40 (strcmp)
--24781-- REDIR: 0x4018850 (ld-linux-x86-64.so.2:mempcpy) redirected to 0x4037390 (mempcpy)
--24781-- Reading syms from /lib/x86_64-linux-gnu/libc-2.19.so
--24781--   Considering /lib/x86_64-linux-gnu/libc-2.19.so ..
--24781--   .. CRC mismatch (computed 8b555f82 wanted cd9b3228)
--24781--   Considering /usr/lib/debug/lib/x86_64-linux-gnu/libc-2.19.so ..
--24781--   .. CRC is valid
--24781-- REDIR: 0x4aa8dc0 (libc.so.6:strcasecmp) redirected to 0x4023720 (_vgnU_ifunc_wrapper)
--24781-- REDIR: 0x4aab0b0 (libc.so.6:strncasecmp) redirected to 0x4023720 (_vgnU_ifunc_wrapper)
--24781-- REDIR: 0x4aa8590 (libc.so.6:memcpy@GLIBC_2.2.5) redirected to 0x4023720 (_vgnU_ifunc_wrapper)
--24781-- REDIR: 0x4aa6910 (libc.so.6:rindex) redirected to 0x402ce10 (rindex)
--24781-- REDIR: 0x4aa4c10 (libc.so.6:strlen) redirected to 0x402d3d0 (strlen)
==24781== 
==24781== Process terminating with default action of signal 8 (SIGFPE)
==24781==  Integer divide by zero at address 0x10090584FE
==24781==    at 0x401149: ??? (in /home/me/t)
==24781==    by 0x4A44B44: (below main) (libc-start.c:287)
a
--24781-- REDIR: 0x4a9f600 (libc.so.6:free) redirected to 0x402afd0 (free)
==24781== 
==24781== HEAP SUMMARY:
==24781==     in use at exit: 0 bytes in 0 blocks
==24781==   total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==24781== 
==24781== All heap blocks were freed -- no leaks are possible
==24781== 
==24781== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==24781== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Why isn't the code line shown at
==24781==  Integer divide by zero at address 0x10090584FE
==24781==    at 0x401149: ??? (in /home/me/t)

How shall I compile in order to see the lines in valgrind's output?  It worked in the past and I don't know when and why it stopped working.

I use gcc (GCC) 7.3.1 20180618 and valgrind-3.14.0.GIT-90daa486e8-20180620X.
Comment 1 Дилян Палаузов 2018-06-30 09:00:45 UTC
It turns out that valgrind can read the debug information when the program is linked with ld.bfd 2.30 or gold (2.31.51.20180630) 1.16, but not when ld.bfd 2.31.51.20180630 is used.

I described the whole case at https://sourceware.org/bugzilla/show_bug.cgi?id=23357 and uploaded the produced binaries there.  Please read it; I don't want to repeat that text here, in order to keep the total text for the case shorter.  While I filed a PR for ld.bfd, I don't state whether the problem is in ld.bfd or in valgrind.
Comment 2 Дилян Палаузов 2018-06-30 18:38:49 UTC
For the time being there are three workarounds.  Link explicitly with gold:
  gcc -fuse-ld=gold
or switch off separate-code, which is enabled implicitly on Linux/x86:
  gcc -fuse-ld=bfd -Wl,-z,noseparate-code
or change the default linker by replacing '/usr/local/x86_64-pc-linux-gnu/bin/ld' (a copy of /usr/local/x86_64-pc-linux-gnu/bin/ld.bfd; the path can be different on your system) with /usr/local/x86_64-pc-linux-gnu/bin/ld.gold.
Comment 3 H.J. Lu 2018-07-11 13:40:31 UTC
Here is a tiny program:

https://github.com/hjl-tools/simple-linux/tree/divide-by-zero

Valgrind can't read its DWARF debug info:

[hjl@gnu-cfl-1 simple-linux]$ make LD=ld.gold
gcc -g -O0   -c -o test.o test.c
test.c: In function ‘main’:
test.c:22:9: warning: division by zero [-Wdiv-by-zero]
   i = 7 / 0;
         ^
gcc -g   -c -o start.o start.S
gcc -g   -c -o syscall.o syscall.S
ld.gold  -o test test.o start.o syscall.o
./test hello world
a
make: *** [Makefile:12: all] Floating point exception
[hjl@gnu-cfl-1 simple-linux]$ valgrind ./test
==27555== Memcheck, a memory error detector
==27555== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==27555== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==27555== Command: ./test
==27555== 
a
==27555== 
==27555== Process terminating with default action of signal 8 (SIGFPE)
==27555==  Integer divide by zero at address 0x1002B95532
==27555==    at 0x40015E: ??? (in /export/ssd/git/github/simple-linux/test)
==27555==    by 0x4001A8: ??? (in /export/ssd/git/github/simple-linux/test)
==27555== 
==27555== HEAP SUMMARY:
==27555==     in use at exit: 0 bytes in 0 blocks
==27555==   total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==27555== 
==27555== All heap blocks were freed -- no leaks are possible
==27555== 
==27555== For counts of detected and suppressed errors, rerun with: -v
==27555== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Floating point exception
[hjl@gnu-cfl-1 simple-linux]$ gdb test
GNU gdb (GDB) Fedora 8.1-19.fc28
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from test...done.
(gdb) r
Starting program: /export/ssd/git/github/simple-linux/test 
a

Program received signal SIGFPE, Arithmetic exception.
0x000000000040015c in main (argc=1, argv=0x7fffffffd708) at test.c:22
22	  i = 7 / 0;
(gdb)
Comment 4 Mark Wielaard 2018-07-11 14:14:28 UTC
The root cause is that ld -z separate-code introduces various new PT_LOAD segments, and the code in valgrind that interprets the PT_LOAD mappings is very specific about what it does and what it doesn't consider a code or data mapping to match symbols and debuginfo against.

There are various bugs for this:
fedora: https://bugzilla.redhat.com/show_bug.cgi?id=1600034
debian: http://bugs.debian.org/903389
binutils: https://sourceware.org/bugzilla/show_bug.cgi?id=23357

The following valgrind bug is also somewhat related: https://bugs.kde.org/show_bug.cgi?id=390871 "ELF debug info reader confused with multiple .rodata* sections"

The issue can most easily be seen using --trace-symtab=yes which will show the PT_LOAD program headers and flags.

There are two places in the code which interpret the mappings/flags and try to map things to debuginfo.

coregrind/m_debuginfo/debuginfo.c (di_notify_mmap) and
coregrind/m_debuginfo/readelf.c (read_elf_debug_info)

Note in the first that it doesn't set up is_ro_map unless compiled for darwin.
Note in the second that it seems to only consider the first PT_LOAD segment and bails out when that doesn't match.
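
To see which kinds of PT_LOAD segment a given binary carries without going through --trace-symtab=yes, something like the following may help. It is only a sketch, not valgrind code: it handles little-endian ELF64 only, and the names classify and load_segments are made up for illustration. It reads the program headers and labels each PT_LOAD as rx, rw, or ro from its permission flags.

```python
import struct

PT_LOAD = 1
PF_X, PF_W, PF_R = 0x1, 0x2, 0x4

EHDR = struct.Struct("<16sHHIQQQIHHHHHH")  # ELF64 file header, little-endian
PHDR = struct.Struct("<IIQQQQQQ")          # ELF64 program header, little-endian


def classify(flags):
    """Name a PT_LOAD segment by its permission bits (hypothetical helper)."""
    r, w, x = flags & PF_R, flags & PF_W, flags & PF_X
    if r and x and not w:
        return "rx"
    if r and w and not x:
        return "rw"
    if r and not w and not x:
        return "ro"
    return "other"


def load_segments(data):
    """Return (kind, vaddr, offset, filesz) for every PT_LOAD in an ELF64 image."""
    if data[:4] != b"\x7fELF" or data[4] != 2 or data[5] != 1:
        raise ValueError("this sketch only handles little-endian ELF64")
    fields = EHDR.unpack_from(data, 0)
    phoff, phentsize, phnum = fields[5], fields[9], fields[10]
    segs = []
    for i in range(phnum):
        p_type, p_flags, p_offset, p_vaddr, _, p_filesz, _, _ = PHDR.unpack_from(
            data, phoff + i * phentsize
        )
        if p_type == PT_LOAD:
            segs.append((classify(p_flags), p_vaddr, p_offset, p_filesz))
    return segs
```

Calling load_segments(open("t", "rb").read()) on a binary linked with -z separate-code should list at least one "ro" entry next to the usual "rx" and "rw" ones; that is the layout the mapping code does not expect.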
Comment 5 H.J. Lu 2018-07-11 14:52:40 UTC
(In reply to Mark Wielaard from comment #4)
> The root cause is that ld -z separate-code introduces various new PT_LOAD

This is very misleading.  This simple test:

https://github.com/hjl-tools/simple-linux/tree/divide-by-zero

doesn't use -z separate-code:

[hjl@gnu-cfl-1 simple-linux]$ readelf -lW test

Elf file type is EXEC (Executable file)
Entry point 0x40019a
There are 3 program headers, starting at offset 64

Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  LOAD           0x000000 0x0000000000400000 0x0000000000400000 0x000240 0x000240 R E 0x1000
  LOAD           0x001000 0x0000000000401000 0x0000000000401000 0x000000 0x000000 RW  0x1000
  GNU_STACK      0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RWE 0x10

 Section to Segment mapping:
  Segment Sections...
   00     .text .rodata .eh_frame 
   01     .data .bss 
   02     
[hjl@gnu-cfl-1 simple-linux]$
Comment 6 Mark Wielaard 2018-07-11 15:45:54 UTC
(In reply to H.J. Lu from comment #5)
> (In reply to Mark Wielaard from comment #4)
> > The root cause is that ld -z separate-code introduces various new PT_LOAD
> 
> This is very misleading.

I don't know why you say that, or why you removed the rest of the sentence "... and the code in valgrind interpreting the PT_LOAD mappings is very specific about what it does and what it doesn't consider a code or data mapping to match symbols and debuginfo against."

The point is simply that if people want to work around it the simplest thing to do for now is to use ld -z noseparate-code, which is what fedora is doing for now till we get valgrind fixed.

I didn't say that it is the only way that valgrind might get confused by different PT_LOAD segment setups. In fact if you look at the code pointed out in comment #4 you'll see that there are lots of ways to confuse valgrind. And that the code is horribly architecture and OS specific. It just doesn't work with -z separate-code at the moment. We have to fix that (and hopefully fix other things while we do it). But for now just don't use -z separate-code or build binutils with --disable-separate-code.
Comment 7 H.J. Lu 2018-07-11 22:40:14 UTC
The problem is with readonly PT_LOAD segments.  Please try
users/hjl/pr395682/master branch at

https://github.com/hjl-tools/valgrind/tree/users/hjl/pr395682/master
Comment 8 Tom Hughes 2018-07-11 23:05:16 UTC
That looks very similar to my version at https://github.com/tomhughes/valgrind/commit/40a30ca68769c0825e078b731cd115849b1a6744 but I think you've got something a bit extra to cope with multiple ro segments?
Comment 9 Mark Wielaard 2018-07-12 13:50:34 UTC
Created attachment 113898 [details]
Accept read-only PT_LOAD segments and .rodata by ld -z separate-code

I combined Tom's fix for separate read-only segments with the .rodata reading fix from H.J., but changed it to always use the bias from the loaded range. I wasn't comfortable with all the changes in the asserts especially because they would contradict the comments directly before them. And they seemed unnecessary.

This patch fixes the issue with the reported binary in this bug and with the i386 glibc ld.so created on fedora (when built with ld -z separate-code).
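
For readers unfamiliar with the term, "bias" here is taken to mean the difference between where a segment actually ends up in memory and the address stated in the ELF file. The following is a rough sketch of deriving it from the mapping that covers a segment's file range. The names Mapping, Segment and bias_for are hypothetical, and the formula is an assumption about what "the bias from the loaded range" amounts to, not a copy of the patch.

```python
from collections import namedtuple

# Where the kernel placed a range of the file: address, length, file offset.
Mapping = namedtuple("Mapping", "avma size foff")
# The matching PT_LOAD fields as stated in the ELF file.
Segment = namedtuple("Segment", "offset vaddr filesz")


def bias_for(seg, mappings):
    """Return actual-minus-stated address for seg, or None if nothing covers it."""
    for m in mappings:
        covers_start = m.foff <= seg.offset
        covers_end = seg.offset + seg.filesz <= m.foff + m.size
        if covers_start and covers_end:
            # Actual address of the segment start, minus the stated address.
            return m.avma - m.foff + seg.offset - seg.vaddr
    return None
```

For a non-PIE executable mapped at its stated address the result is 0; for a shared object or PIE it is the load base. Taking the value from the mapping that actually contains the segment keeps it correct however many separate read-only mappings precede it.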
Comment 10 H.J. Lu 2018-07-12 14:57:50 UTC
(In reply to Mark Wielaard from comment #9)
> Created attachment 113898 [details]
> Accept read-only PT_LOAD segments and .rodata by ld -z separate-code
> 
> I combined Tom's fix for separate read-only segments with the .rodata
> reading fix from H.J., but changed it to always use the bias from the loaded
> range. I wasn't comfortable with all the changes in the asserts especially
> because they would contradict the comments directly before them. And they
> seemed unnecessary.
> 
> This patch fixes the issue with the reported binary in this bug and with the
> i386 glibc ld.so created on fedora (when built with ld -z separate-code).

It doesn't fix:

https://github.com/hjl-tools/simple-linux/tree/divide-by-zero
Comment 11 H.J. Lu 2018-07-12 15:01:19 UTC
Created attachment 113899 [details]
A test

With the proposed fix:

[hjl@gnu-cfl-1 build-x86_64-linux]$   ./.in_place/memcheck-amd64-linux    /export/gnu/import/git/github/simple-linux/test 
==30545== Memcheck, a memory error detector
==30545== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==30545== Using Valgrind-3.14.0.GIT and LibVEX; rerun with -h for copyright info
==30545== Command: /export/gnu/import/git/github/simple-linux/test
==30545== 
a
==30545== 
==30545== Process terminating with default action of signal 8 (SIGFPE)
==30545==  Integer divide by zero at address 0x100387C52A
==30545==    at 0x40015E: ??? (in /export/ssd/git/github/simple-linux/test)
==30545==    by 0x4001A8: ??? (in /export/ssd/git/github/simple-linux/test)
==30545== 
==30545== HEAP SUMMARY:
==30545==     in use at exit: 0 bytes in 0 blocks
==30545==   total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==30545== 
==30545== All heap blocks were freed -- no leaks are possible
==30545== 
==30545== For counts of detected and suppressed errors, rerun with: -v
==30545== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Floating point exception

With my fixes:

[hjl@gnu-cfl-1 build-x86_64-linux]$   ./.in_place/memcheck-amd64-linux    /export/gnu/import/git/github/simple-linux/test 
==31384== Memcheck, a memory error detector
==31384== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==31384== Using Valgrind-3.14.0.GIT and LibVEX; rerun with -h for copyright info
==31384== Command: /export/gnu/import/git/github/simple-linux/test
==31384== 
a
==31384== 
==31384== Process terminating with default action of signal 8 (SIGFPE)
==31384==  Integer divide by zero at address 0x100388C52A
==31384==    at 0x40015E: main (test.c:22)
==31384== 
==31384== HEAP SUMMARY:
==31384==     in use at exit: 0 bytes in 0 blocks
==31384==   total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==31384== 
==31384== All heap blocks were freed -- no leaks are possible
==31384== 
==31384== For counts of detected and suppressed errors, rerun with: -v
==31384== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Floating point exception
[hjl@gnu-cfl-1 build-x86_64-linux]$
Comment 12 Mark Wielaard 2018-07-12 21:35:53 UTC
(In reply to H.J. Lu from comment #10)
> (In reply to Mark Wielaard from comment #9)
> > This patch fixes the issue with the reported binary in this bug and with the
> > i386 glibc ld.so created on fedora (when built with ld -z separate-code).
> 
> It doesn't fix:
> 
> https://github.com/hjl-tools/simple-linux/tree/divide-by-zero

Yes, you are right, but isn't that a totally different case? Your example seems to have everything in a single rx segment followed by a zero-size rw segment.
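
One way to tell this layout apart from the separate-code one is to look for LOAD entries with zero size in the readelf -lW output, as in comment #5. A minimal sketch follows; the function names are invented here, and it assumes the column layout readelf prints with -W.

```python
def parse_program_headers(text):
    """Parse the data rows of a `readelf -lW` program header table into dicts."""
    rows = []
    for line in text.splitlines():
        tok = line.split()
        # A data row has a type, five hex fields, one or more flag tokens
        # and an alignment; anything else (titles, blank lines) is skipped.
        if len(tok) < 8 or not tok[1].startswith("0x"):
            continue
        rows.append({
            "type": tok[0],
            "vaddr": int(tok[2], 16),
            "filesz": int(tok[4], 16),
            "memsz": int(tok[5], 16),
            "flags": "".join(tok[6:-1]),  # "R E" is printed with a gap, so rejoin it
        })
    return rows


def empty_load_segments(rows):
    """Return LOAD rows that occupy no bytes in the file or in memory."""
    return [
        r for r in rows
        if r["type"] == "LOAD" and r["filesz"] == 0 and r["memsz"] == 0
    ]
```

Fed the table from comment #5, parse_program_headers yields rows with flags "RE", "RW" and "RWE", and empty_load_segments returns only the RW entry at 0x401000.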

Is that something that is specific to the way you created that testcase without actually linking against any library, or does that happen normally in some configurations?

I think it is best to create a new bug for this case and post the specific patch that solves this issue in that bug.

Even though your patches might be totally correct and help fix this issue, I found it hard to reason about them because some just seemed to disable various asserts without updating the comments that explained why we needed those asserts (if the comments are wrong, please update them together with the code changes).
Comment 13 H.J. Lu 2018-07-13 14:54:08 UTC
(In reply to Mark Wielaard from comment #12)
> yes, you are right, but isn't that a totally different case? Your example
> seems to have everything in a single rx segment and then a zero size rw
> segment.
> 
> Is that something that is specific to they way you created that testcase
> without actually linking against any library, or does that happen normally
> in some configurations?

The gold linker generates it.

> I think it is best to create a new bug for this case and post the specific
> patch that solves this issue in that bug.
> 

I opened:

https://bugs.kde.org/show_bug.cgi?id=396476
Comment 14 Ivo Raisr 2018-07-15 11:13:37 UTC
*** Bug 384727 has been marked as a duplicate of this bug. ***
Comment 15 Mark Wielaard 2018-07-16 13:14:16 UTC
commit 64aa729bfae71561505a40c12755bd6b55bb3061
Author: Mark Wielaard <mark@klomp.org>
Date:   Thu Jul 12 13:56:00 2018 +0200

    Accept read-only PT_LOAD segments and .rodata.
    
    The new binutils ld -z separate-code option creates multiple read-only
    PT_LOAD segments and might place .rodata in a non-executable segment.
    
    Allow and keep track of separate read-only segments and allow a readonly
    page with .rodata section.
    
    Based on patches from Tom Hughes <tom@compton.nu> and
    H.J. Lu <hjl.tools@gmail.com>.
    
    https://bugs.kde.org/show_bug.cgi?id=395682
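
To illustrate the "keep track of separate read-only segments" part of the commit message, here is a toy model of the bookkeeping. It is not the valgrind implementation: the class MapTracker and its method are hypothetical, and the readiness rule (at least one rx and one rw mapping seen) is an assumption made for the sketch.

```python
class MapTracker:
    """Hypothetical per-file record of which kinds of mapping have been seen."""

    def __init__(self):
        self.seen = {"rx": 0, "rw": 0, "ro": 0}

    def notify_mmap(self, readable, writable, executable):
        """Record one mapping; return True once debuginfo reading could start."""
        if readable and executable and not writable:
            self.seen["rx"] += 1
        elif readable and writable and not executable:
            self.seen["rw"] += 1
        elif readable and not writable and not executable:
            # Read-only mappings are counted rather than ignored.
            self.seen["ro"] += 1
        # Assumption: a code mapping and a data mapping are both required.
        return self.seen["rx"] > 0 and self.seen["rw"] > 0
```

A binary linked with -z separate-code would typically be reported as r--, r-x, r--, rw-. The tracker stays not ready through the first three mappings, records two ro entries along the way, and becomes ready on the fourth.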