Bug 272189 - Valgrind: Killed by fatal signal when Valgrinding Wine compiled with Clang
Summary: Valgrind: Killed by fatal signal when Valgrinding Wine compiled with Clang
Status: RESOLVED NOT A BUG
Alias: None
Product: valgrind
Classification: Developer tools
Component: general (show other bugs)
Version: 3.7 SVN
Platform: Unlisted Binaries Linux
: NOR crash
Target Milestone: ---
Assignee: Julian Seward
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-05-01 23:25 UTC by Austin English
Modified: 2011-05-10 00:52 UTC (History)
1 user (show)

See Also:
Latest Commit:
Version Fixed In:


Attachments
valgrind output (62 bytes, text/plain)
2011-05-01 23:25 UTC, Austin English
Details
problem file, bzip2 -9'ed (915.40 KB, application/x-bzip)
2011-05-03 18:29 UTC, Austin English
Details
valgrind verbose output (8.03 KB, text/plain)
2011-05-03 18:29 UTC, Austin English
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Austin English 2011-05-01 23:25:18 UTC
Created attachment 59514 [details]
valgrind output

Looking at a wine bug that only occurs when compiling with Clang trunk (http://bugs.winehq.org/show_bug.cgi?id=26754), I found that Valgrind crashes:
preloader: Warning: failed to reserve range 00110000-68000000
--9831-- VALGRIND INTERNAL ERROR: Valgrind received a signal 11 (SIGSEGV) - exiting
--9831-- si_code=2;  Faulting address: 0x805BF6E8;  sp: 0x6848b574

valgrind: the 'impossible' happened:
   Killed by fatal signal
==9831==    at 0x380A6DDB: vgModuleLocal_read_debuginfo_dwarf3 (readdwarf.c:280)
==9831==    by 0x3805AE45: vgModuleLocal_read_elf_debug_info (readelf.c:2208)
==9831==    by 0x38055E57: vgPlain_di_notify_mmap (debuginfo.c:835)
==9831==    by 0x38076F0F: vgModuleLocal_generic_PRE_sys_mmap (syswrap-generic.c:2071)
==9831==    by 0x3809CA02: vgSysWrap_x86_linux_sys_mmap2_before (syswrap-x86-linux.c:1381)
==9831==    by 0x3806BE4D: vgPlain_client_syscall (syswrap-main.c:1498)
==9831==    by 0x38068292: handle_syscall (scheduler.c:900)
==9831==    by 0x380693DE: vgPlain_scheduler (scheduler.c:1096)
==9831==    by 0x38097254: run_a_thread_NORETURN (syswrap-linux.c:95)

this is with wine from git (e594268421fbaba5e8db4ee8a929beaae0effcf0), clang from trunk (clang version 3.0 (trunk 130453)), and valgrind from svn (revision 11719).

To reproduce:
$ compile wine with:
#!/bin/bash
export CC="clang"
export CXX="$CC"
export CFLAGS="-std=gnu89 -g -O0 -no-integrated-as"
./configure
make

then run:
valgrind -q --trace-children=yes --track-origins=yes --gen-suppressions=all --suppressions=valgrind-suppressions --leak-check=full --num-callers=20  --workaround-gcc296-bugs=yes --vex-iropt-precise-memory-exns=yes loader/wine regedit

valgrind output attached.

If Wine is compiled with gcc 4.4.5, works fine.

Machine is running gentoo 32-bit, kernel 2.6.37, glibc 2.11.3.
Comment 1 Julian Seward 2011-05-03 13:33:33 UTC
This happened when V tried to read line number info from something
clang compiled, either a .so or an executable.  Easiest for me would
be if you could figure out what file is involved and email it to me
(or put it somewhere I can download, if large).

You can identify the offending file by running Valgrind with -v.
Comment 2 Austin English 2011-05-03 18:29:05 UTC
Created attachment 59582 [details]
problem file, bzip2 -9'ed
Comment 3 Austin English 2011-05-03 18:29:31 UTC
Created attachment 59583 [details]
valgrind verbose output
Comment 4 Austin English 2011-05-03 18:29:58 UTC
(In reply to comment #1)
> This happened when V tried to read line number info from something
> clang compiled, either a .so or an executable.  Easiest for me would
> be if you could figure out what file is involved and email it to me
> (or put it somewhere I can download, if large).
> 
> You can identify the offending file by running Valgrind with -v.

(In reply to comment #1)
> This happened when V tried to read line number info from something
> clang compiled, either a .so or an executable.  Easiest for me would
> be if you could figure out what file is involved and email it to me
> (or put it somewhere I can download, if large).
> 
> You can identify the offending file by running Valgrind with -v.

==22366== Warning: set address range perms: large range [0x82000000, 0xbee10000) (noaccess)
--22366-- REDIR: 0x42180e0 (bcmp) redirected to 0x40276e0 (bcmp)
--22366-- REDIR: 0x42127b0 (calloc) redirected to 0x402448f (calloc)
--22366-- Reading syms from /home/austin/src/wine-clang/dlls/ntdll/ntdll.dll.so (0x7bc00000)
--22366-- VALGRIND INTERNAL ERROR: Valgrind received a signal 11 (SIGSEGV) - exiting
--22366-- si_code=2;  Faulting address: 0x805BF6E8;  sp: 0x6848f574

ntdll.dll.so attached, as well as the output of:
valgrind -v --trace-children=yes --track-origins=yes --gen-suppressions=all --suppressions=valgrind-suppressions --leak-check=full --num-callers=20 --workaround-gcc296-bugs=yes --vex-iropt-precise-memory-exns=yes loader/wine regedit
Comment 5 Julian Seward 2011-05-03 19:43:19 UTC
(In reply to comment #2)
> Created an attachment (id=59582) [details]
> problem file, bzip2 -9'ed

Thanks.  I can successfully reproduce with this.  Will investigate.
Comment 6 Julian Seward 2011-05-04 01:43:20 UTC
This somehow pertains to r11705, a fix in which (I thought) I fixed
reading of line number info from LLVM-2.9 generated dwarf.

There's something very odd about this.  The problem pertains to the
DW_AT_stmt_list fields in DW_TAG_compile_units.  These specify the
offset in the .debug_line section of a line number "program" for the
compilation unit.  There are multiple of these.  The first one
provides line numbers for an "actctx.c" and is sane enough:

 <0><b>: Abbrev Number: 1 (DW_TAG_compile_unit)
    < c>   DW_AT_producer    : clang version 3.0 (trunk 130453) 
    <2d>   DW_AT_language    : 1        (ANSI C)
    <2f>   DW_AT_name        : actctx.c 
    <38>   DW_AT_entry_pc    : 0x0      
    <3c>   DW_AT_stmt_list   : 0x0      
    <40>   DW_AT_comp_dir    : /home/austin/src/wine-clang/dlls/ntdll   

so the line number program is at 0x0 offset in .debug_line, viz, right
at the start.  The problem is the next one:

 <0><8c0b>: Abbrev Number: 1 (DW_TAG_compile_unit)
    <8c0c>   DW_AT_producer    : clang version 3.0 (trunk 130453)       
    <8c2d>   DW_AT_language    : 1      (ANSI C)
    <8c2f>   DW_AT_name        : atom.c 
    <8c36>   DW_AT_entry_pc    : 0x0    
    <8c3a>   DW_AT_stmt_list   : 0x7bc018cf     
    <8c3e>   DW_AT_comp_dir    : /home/austin/src/wine-clang/dlls/ntdll 

The DW_AT_stmt_list is claimed to be at offset 0x7bc018cf in
.debug_line, that is, around about 2GB past the start.  This is
completely bogus, especially considering the entire ntdll.dll.so is
only about 3MB long.  Valgrind duly tries to access at the stated
offset in .debug_line and segfaults immediately.

I noticed, when doing the fix in r11705, that LLVM-2.9 encodes
DW_AT_stmt_list using DW_FORM_addr, whereas gcc has always used
DW_FORM_data4.  I don't know why 2.9 uses that (2.8 didn't iirc), nor
whether it's relevant.  The dwarf3 spec says that DW_FORM_addr is for
holding an address on the target machine, and is subject to
relocation.  Hence it sounds completely inappropriate for holding an
offset into the .debug_info section.

/me confused.
Comment 7 Julian Seward 2011-05-04 01:45:16 UTC
Forgot to add: even stranger:
readelf --debug-dump=line manages to read all the line number info
despite the apparently bogus DW_AT_stmt_list field.  No idea how/why.
Comment 8 Tom Hughes 2011-05-04 11:09:16 UTC
Well dwarfdump doesn't seem to like it - it decodes the offsets in the same way that you report and when asked to dump line numbers it complains and gives up:

vauxhall [~/vgtest] % dwarfdump -l ntdll.dll.so         

.debug_line: line number info for a single cu
dwarfdump ERROR:  dwarf_srclines:  DW_DLE_ATTR_FORM_BAD (114)
Comment 9 Tom Hughes 2011-05-04 11:19:56 UTC
Latest dwarfdump/libdwarf built from source also objects, though slightly differently:

vauxhall [~/vgtest] % ~/src/libdwarf/dwarfdump/dwarfdump -l ntdll.dll.so

.debug_line: line number info for a single cu

/home/thh/src/libdwarf/dwarfdump/dwarfdump ERROR:  dwarf_srclines:  DW_DLE_BAD_REF_FORM (118)

CU Name = 
CU Producer = "clang version 3.0 (trunk 130453)"
DIE OFF = 0x00000000 GOFF = 0x00000000, Low PC = 0x00000000, High PC = 0x00000000

It is basically complaining about the use of DW_FORM_addr for the stmt_list which it doesn't think is valid.
Comment 10 Tom Hughes 2011-05-04 11:49:06 UTC
Looking at binutils, the reason that readelf works is that it doesn't actually go via the CU's in the debug_info section when dumping line number data - it just reads the debug_line section straight through from one end to the other.
Comment 11 Julian Seward 2011-05-04 12:19:34 UTC
(In reply to comment #10)

Thanks for investigating.

> Looking at binutils, the reason that readelf works is that it doesn't actually
> go via the CU's in the debug_info section when dumping line number data - it
> just reads the debug_line section straight through from one end to the other.

Is that an approach that we could switch to?  Our current scheme we
appears to scan through the main DIE list, to find
DW_TAG_compile_units, and read only the parts of debug_line pointed to
by DW_AT_stmt_list entries.  I don't know why it's done this way
instead of the binutils way -- I didn't write this code.  It does
presumably have the advantage that we don't end up reading line number
info for compilation units not referenced by the DIE list.  It also
doesn't assume that the blocks of line number info in debug_line are
placed back-to-back w/ no holes in between.  That said I have no idea
if either of those points are important.

This seems to me like a bug in LLVM 2.9.  I don't remember any such
problems with 2.8.  Much as I'd like to forget about this and tell the
LLVM crew to fix their compiler, I'm sure we'll get a bunch of hassle
along the lines of "binutils reads it, so what's the problem?"  So if
we can hack up a workaround, perhaps we should.
Comment 12 Tom Hughes 2011-05-04 12:39:45 UTC
Well readelf works because it is only trying to dump the whole section.

Whether other users of binutils work is a whole different question - there are routines in bfd to do things like finding line number information for a given address which look like they would fail.

That said "objdump -dl" does work somehow, which I thought would fail...
Comment 13 Tom Hughes 2011-05-04 12:58:09 UTC
The change in llvm to using DW_FORM_addr was made in r112678:

  Use absolute label for DW_AT_stmt_list if a target does not prefer offset here.
  This patch was developed on top of original patch by Artur Pietrek.

The only platform where addr4 is still used is Darwin. This change seems to have been made as a result of this mailing list discussion:

http://groups.google.com/group/llvm-dev/browse_thread/thread/8964413cbab85774?pli=1
Comment 14 Tom Hughes 2011-05-04 13:02:31 UTC
The earlier discussion is also relevant:

http://groups.google.com/group/llvm-dev/browse_thread/thread/aeae5b63349fab6b

They seem to recognise that it isn't strictly valid (and that it doesn't work on Darwin) but they are doing it so that link time optimisation can adjust the locations by changing the relocations or something.
Comment 15 Julian Seward 2011-05-04 13:38:09 UTC
(In reply to comment #13)
> The change in llvm to using DW_FORM_addr was made in r112678:
> 
>   Use absolute label for DW_AT_stmt_list if a target does not prefer offset
> here.
>   This patch was developed on top of original patch by Artur Pietrek.
> 
> The only platform where addr4 is still used is Darwin.

Uh, but the problem we're seeing on Linux is because it's using
DW_FORM_addr, not DW_FORM_data4 like gcc does.
Comment 16 Julian Seward 2011-05-04 13:39:45 UTC
(In reply to comment #12)
> Whether other users of binutils work is a whole different question - there are
> routines in bfd to do things like finding line number information for a given
> address which look like they would fail.

austinenglish, does GDB have any difficulty working with the
offending file?
Comment 17 Tom Hughes 2011-05-04 13:51:31 UTC
(In reply to comment #15)
> (In reply to comment #13)
> > The change in llvm to using DW_FORM_addr was made in r112678:
> > 
> >   Use absolute label for DW_AT_stmt_list if a target does not prefer offset
> > here.
> >   This patch was developed on top of original patch by Artur Pietrek.
> > 
> > The only platform where addr4 is still used is Darwin.
> 
> Uh, but the problem we're seeing on Linux is because it's using
> DW_FORM_addr, not DW_FORM_data4 like gcc does.

Exactly - they have changed all platforms except Darwin to use addr. They only didn't change Darwin because doing so broke the debugger.
Comment 18 Julian Seward 2011-05-04 22:48:26 UTC
Rafael Espindola tells me that it "Should be fixed in r130846."
(which presumably happened in the last few hours).  Austin, can
you give it a go?
Comment 19 Austin English 2011-05-05 00:45:33 UTC
(In reply to comment #18)
> Rafael Espindola tells me that it "Should be fixed in r130846."
> (which presumably happened in the last few hours).  Austin, can
> you give it a go?

Yep, looks good here, thanks.

Resolved upstream?
Comment 20 Julian Seward 2011-05-05 00:59:17 UTC
(In reply to comment #19)
> Resolved upstream?

Not sure what you're referring to here.  Is what resolved
in whose upstream?
Comment 21 Austin English 2011-05-05 01:04:58 UTC
(In reply to comment #20)
> (In reply to comment #19)
> > Resolved upstream?
> 
> Not sure what you're referring to here.  Is what resolved
> in whose upstream?

Your comment 18. I updated built clang from svn (r130866), recompiled Wine, and ran:
valgrind -q --trace-children=yes --track-origins=yes --gen-suppressions=all
--suppressions=valgrind-suppressions --leak-check=full --num-callers=20 
--workaround-gcc296-bugs=yes --vex-iropt-precise-memory-exns=yes loader/wine
regedit

which no longer crashes valgrind. I never updated valgrind, so it seems the problem was fixed upstream in LLVM/Clang, unless I'm misunderstanding something.
Comment 22 Julian Seward 2011-05-09 10:10:51 UTC
(In reply to comment #21)
Austin,

Thanks for the feedback.

> which no longer crashes valgrind. I never updated valgrind, so it seems the
> problem was fixed upstream in LLVM/Clang, unless I'm misunderstanding
> something.

You understand correctly: it was fixed in LLVM upstream.

There's one final thing to check though.  Valgrind already contained
a half-baked attempt to fix this (ineffective, as you saw), and I need
to back it out now (svn rev 11705).  Can you try the diff below and
verify that you still get line numbers on LLVM generated code?
Thanks.

Index: coregrind/m_debuginfo/readdwarf.c
===================================================================
--- coregrind/m_debuginfo/readdwarf.c	(revision 11731)
+++ coregrind/m_debuginfo/readdwarf.c	(working copy)
@@ -1111,9 +1111,9 @@
                                                unconditionally to cval? */
 
             case 0x01: /* FORM_addr */      if (addr_size == 4) {
-                                               cval = *(UInt*)p;
+                                               //cval = *(UInt*)p;
                                             } else if (addr_size == 8) {
-                                               cval = *(ULong*)p;
+                                               //cval = *(ULong*)p;
                                             } else {
                                                /* wtf, Houston? */
                                             }
Comment 23 Austin English 2011-05-09 22:52:52 UTC
Yep, still looks good. This was on top of r11726.
Comment 24 Julian Seward 2011-05-10 00:52:39 UTC
Thanks for the followup check.  r11705 is now backed out by r11738.
Closing as Invalid (since it's not our bug).