Bug 193917

Summary: valgrind fails to run on Mac OS 10.5.7, gets a SIGTRAP
Product: [Developer tools] valgrind Reporter: FX <fxcoudert>
Component: generalAssignee: Julian Seward <jseward>
Status: RESOLVED NOT A BUG    
Severity: crash CC: caulfield.holden, gparker, mike, njn, rjwalsh, tom, tph
Priority: NOR    
Version: unspecified   
Target Milestone: ---   
Platform: Compiled Sources   
OS: Other   
Latest Commit: Version Fixed In:
Attachments: Basic system profile

Description FX 2009-05-24 17:47:59 UTC
Version:            (using Devel)
Compiler:          gcc version 4.0.1 Apple Inc. build 5490
OS:                OS X
Installed from:    Compiled sources

I've built the DARWIN branch of valgrind today on Mac OS 10.5.7, Intel Core 2 Duo; after installation, running valgrind results in a SIGTRAP. Please find below the detailed report:

  1. I got svn sources from svn://svn.valgrind.org/valgrind/branches/DARWIN, revision 10129
  2. Run autogen.sh in the source directory
  3. Configure with no option other than prefix (./configure --prefix=/opt/valgrind-svn)
  4. make && make install
  5. Run /opt/valgrind-svn/bin/valgrind, see the SIGTRAP (shell says "zsh: trace trap")

Now, for some system information:

$ uname -a
Darwin lumo.local 9.7.0 Darwin Kernel Version 9.7.0: Tue Mar 31 22:52:17 PDT 2009; root:xnu-1228.12.14~1/RELEASE_I386 i386

$ gcc -v
Using built-in specs.
Target: i686-apple-darwin9
Configured with: /var/tmp/gcc/gcc-5490~1/src/configure --disable-checking -enable-werror --prefix=/usr --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-transform-name=/^[cg][^.-]*$/s/$/-4.0/ --with-gxx-include-dir=/include/c++/4.0.0 --with-slibdir=/usr/lib --build=i686-apple-darwin9 --with-arch=apple --with-tune=generic --host=i686-apple-darwin9 --target=i686-apple-darwin9
Thread model: posix
gcc version 4.0.1 (Apple Inc. build 5490)


I tried to run the valgrind binary in gdb, but it doesn't give useful information:

(gdb) r
Starting program: /opt/valgrind-svn/bin/valgrind 
Reading symbols for shared libraries ++. done

Program received signal SIGTRAP, Trace/breakpoint trap.
0xf0ad9024 in ?? ()
(gdb) b
Breakpoint 1 at 0xf0ad9024


Please tell me how I can provide more (or more relevant) information.
Comment 1 Robert Walsh 2009-05-24 20:34:17 UTC
I tried this using revision 10130 (no actual changes in the DARWIN branch from 10129, though).  The only thing I did different was use /usr/local/valgrind as my prefix.  Worked fine for me - no SIGTRAP, and I was able to run memcheck on a few simple test programs.
Comment 2 Nicholas Nethercote 2009-05-25 00:13:48 UTC
Can you give us the command line you invoked and Valgrind's output?
Comment 3 FX 2009-05-25 08:43:07 UTC
It happens whatever arguments I use on the command-line (including when I use no arguments). It happens too early for valgrind to have output anything at all:

$ /opt/valgrind-svn/bin/valgrind
zsh: trace trap  /opt/valgrind-svn/bin/valgrind
Comment 4 Nicholas Nethercote 2009-05-25 09:07:09 UTC
Can you show the output of "valgrind -d -d -v -v"?
Comment 5 Nicholas Nethercote 2009-05-25 09:07:45 UTC
You could also try running "vg-in-place" from within the valgrind directory, see if that works any better.
Comment 6 FX 2009-05-25 09:48:48 UTC
$ /opt/valgrind-svn/bin/valgrind -d -d -v -v
--80107:1:debuglog DebugLog system started by Stage 1, level 2 logging requested
--80107:1:launcher no tool requested, defaulting to 'memcheck'
--80107:1:launcher arch 'x86' IS installed
--80107:1:launcher arch 'amd64' IS NOT installed
--80107:1:launcher arch 'arm' IS NOT installed
--80107:1:launcher arch 'ppc32' IS NOT installed
--80107:1:launcher arch 'ppc64' IS NOT installed
--80107:1:launcher no client specified, defaulting arch to 'x86'
--80107:1:launcher launcher_name = /opt/valgrind-svn/bin/valgrind
--80107:1:launcher launching /opt/valgrind-svn/lib/valgrind/memcheck-x86-darwin
zsh: trace trap  /opt/valgrind-svn/bin/valgrind -d -d -v -v
Comment 7 FX 2009-05-25 09:52:14 UTC
And I forgot: "vg-in-place" doesn't work any better.
Comment 8 Nicholas Nethercote 2009-05-25 10:05:33 UTC
Very mysterious.  What happens if you run "memcheck/memcheck-x86-linux -d" directly?
Comment 9 FX 2009-05-25 10:28:31 UTC
$ /opt/valgrind-svn/lib/valgrind/memcheck-x86-darwin -d 
zsh: trace trap  /opt/valgrind-svn/lib/valgrind/memcheck-x86-darwin -d

The gdb backtrace is, once again, useless:

Program received signal SIGTRAP, Trace/breakpoint trap.
0xf0ad92a0 in ?? ()
(gdb) bt
#0  0xf0ad92a0 in ?? ()
#1  0xfffffffe in ?? ()
#2  0x14c483cc in ?? ()
Comment 10 Nicholas Nethercote 2009-05-26 00:20:40 UTC
It's crashing extremely early.  The startup for memcheck/memcheck-x86-linux is a bit unusual.  There is no main() function, instead, AIUI, control starts at the symbol __start, in coregrind/m_start-x86-linux.S.  It sets up the stack and then calls _start_in_C_darwin(), in coregrind/m_main.c, which then calls valgrind_main().

I suggest using GDB to step through those functions.  Set a breakpoint at _start, and step through until the trap occurs, and let us know where the problem is.  Thanks.

This is all quite strange because it's worked fine on a number of Mac systems, and Mac systems tend to be very homogeneous (unlike Linux systems).  For example, I'm running exactly the same kernel as you and an extremely similar GCC.
Comment 11 FX 2009-05-26 08:59:17 UTC
It gets the SIGTRAP even before entering __start:

(gdb) b __start 
Breakpoint 1 at 0xf012f5db: file m_start-x86-darwin.S, line 74.
(gdb) r
Starting program: /opt/valgrind-svn/lib/valgrind/memcheck-x86-darwin 
Reading symbols for shared libraries + done
Re-enabling shared library breakpoint -4

Program received signal SIGTRAP, Trace/breakpoint trap.
0xf0ad92a0 in ?? ()


I guess it's really in the startup code. On linux, I'd try to strace it, or run ld.so in gdb, but I don't know how to do the equivalent of these on Mac.
Comment 12 Nicholas Nethercote 2009-05-26 09:17:46 UTC
Stranger and stranger.  The Mac equivalent to strace is called "dtruss".  You have to run it as root.
Comment 13 FX 2009-05-26 10:24:55 UTC
$ sudo dtruss /opt/valgrind-svn/lib/valgrind/memcheck-x86-darwin -v
dtrace: failed to execute /opt/valgrind-svn/lib/valgrind/memcheck-x86-darwin: Could not create symbolicator for task

It actually fails so earlier that even dyld doesn't have time to finish its own output:

$ DYLD_PRINT_STATISTICS=1 /opt/valgrind-svn/lib/valgrind/memcheck-x86-darwin
zsh: trace trap  DYLD_PRINT_STATISTICS=1 /opt/valgrind-svn/lib/valgrind/memcheck-x86-darwin

(while it should print dyld statistics).


One more thing I realised is that for each time I run memcheck-x86-darwin, I get a line in dmesg:

shared_region: enter(0x8211c1c,0x81e2530,0,7,0): vm_map_enter(0x90000000,0x20000000,0x754ac94) error 0x3

Google tells me it's from the vm_shared_region_enter() function of the xnu kernel, and that it means that vm_map_enter_mem_object() has failed.
Comment 14 Tom Hughes 2009-05-26 10:46:47 UTC
If the Darwin build is anything like the Linux build then memcheck-x86-darwin will be statically linked so the dynamic linker won't be present which is probably why DYLD_PRINT_STATISTICS isn't doing anything.
Comment 15 Nicholas Nethercote 2009-05-26 12:18:53 UTC
AIUI there's no such thing as a truly static build on Darwin, the dynamic linker is always involved.  But I could be wrong.
Comment 16 FX 2009-05-26 12:34:39 UTC
I was about to say the same thing as coment #15: you can link static versions of all your libraries, but the final executable always involves dyld.
Comment 17 Tom Harrington 2009-05-28 19:52:06 UTC
Since this is still marked "unconfirmed" I thought I'd mention that I'm getting exactly the same symptoms as FX describes, all the way down the line, on a system with the same CPU and OS.  I also looked in the system console in case there was anything useful there.  FWIW, when I try "valgrind -d -d -v -v" I get the following console messages, in addition to the output FX previously posted:

5/28/09 11:48:40 AM kernel shared_region: enter(0x7f954d8,0x50c2594,0,7,0): vm_map_enter(0x90000000,0x20000000,0x42cc8c0) error 0x3 
5/28/09 11:48:40 AM ReportCrash[33716] Failed to create VMUSymbolicator for memcheck-x86-darwin[33715] 
5/28/09 11:48:40 AM ReportCrash[33716] Formulating crash report for process memcheck-x86-darwin[33715] 
5/28/09 11:48:40 AM ReportCrash[33716] Saved crashreport to /Users/tph/Library/Logs/CrashReporter/memcheck-x86-darwin_2009-05-28-114840_atomic-air.crash using uid: 502 gid: 20, euid: 502 egid: 20 
5/28/09 11:48:40 AM ReportCrash[33716] Saved crashreport to /Users/tph/Library/Logs/CrashReporter/memcheck-x86-darwin_2009-05-28-114840_atomic-air.crash using uid: 502 gid: 20, euid: 502 egid: 20 

Digging up the crash log mentioned above, I find it contains this (which is probably not useful, but you never know):

Process:         memcheck-x86-darwin [33715]
Path:            /usr/local/lib/valgrind/memcheck-x86-darwin
Identifier:      memcheck-x86-darwin
Version:         ??? (???)
Code Type:       X86 (Native)
Parent Process:  bash [3693]

Date/Time:       2009-05-28 11:48:40.728 -0600
OS Version:      Mac OS X 10.5.7 (9J61)
Report Version:  6
Anonymous UUID:  B985D7B6-C449-4A39-AA55-4DEBB1DF8EB9

Exception Type:  EXC_BREAKPOINT (SIGTRAP)
Exception Codes: 0x0000000000000002, 0x0000000000000000
Crashed Thread:  Unknown

Backtrace not available

Unknown thread crashed with X86 Thread State (32-bit):
  eax: 0x00000000  ebx: 0xf0a33286  ecx: 0x00000000  edx: 0xf09ef5a0
  edi: 0xf0a3c9b0  esi: 0xf0a33278  ebp: 0xf007f458  esp: 0xf007f440
   ss: 0x0000001f  efl: 0x00000246  eip: 0xf0a332a0   cs: 0x00000017
   ds: 0x0000001f   es: 0x0000001f   fs: 0x00000000   gs: 0x00000037
  cr2: 0xf09ef5a0

Binary images description not available

More info is available, just let me know what might be useful.
Comment 18 Julian Seward 2009-05-28 20:07:45 UTC
I think we're all pretty much at sea here.

It strikes me as a bit odd that it died with SIGTRAP.  Usually
if something really bad happens we bite out with SIGSEGV, or
(much) more commonly an assert.

So two possibilities come to mind.  One is that it went into a
syscall and xnu promptly bashed it on the head with SIGTRAP.
The second is that it really ran into an instruction which
causes it to SIGTRAP -- perhaps an int $3 or some such.

In both cases, it would be a good first step, to know where it
was when this happened.  If either of you could attach GDB to
the crashed-out process image, or disassemble memcheck-x86-linux
at the stated EIP point when it died, kinda thing, that would be
useful.
Comment 19 Tom Harrington 2009-05-29 02:02:03 UTC
Julian, I'd like to do that, but I have to admit I don't know how.  If I run valgrind with gdb I get the same results FX noted above.  If there's something else useful I could do in gdb, just tell me what.
Comment 20 Nicholas Nethercote 2009-05-29 02:38:35 UTC
I googled for VMUSymbolicator and the only hit I got was this:

http://ufoai.ninex.info/forum/index.php?topic=3391.0

which provides some vague evidence that it's a kernel bug?  No idea.
Comment 21 Nicholas Nethercote 2009-05-29 23:29:18 UTC
I asked Greg Parker about this, he had some comments...

----
Off the top of my head:
* valgrind does run a copy of dyld for itself, but iirc the valgrind launcher hides DYLD_* environment variables from V's dyld and restores them for the inferior's dyld. I wouldn't expect DYLD_PRINT_STATISTICS to print anything for valgrind itself.
* gdb is confused by valgrind; it sees only the inferior's symbols. To get valgrind's own symbols in the backtrace, run `add-symbol-file /path/to/memcheck-<arch>-darwin`.
* CrashReporter can be similarly confused, but usually it doesn't fail as badly as shown.
* A breakpoint on __start might not work. Historically gdb has had trouble with breakpoints set before the gdb/dyld handshake.
* The 0xfffffffe frame in the backtrace might indicate the arrival of a signal, and the 0xf0ad92a0 frame is probably in valgrind's own code. Try looking for a __builtin_trap() from a valgrind signal handler.
----

With respect to the last point, there were some __builtin_trap calls in Greg's original port, but they've since been removed.
Comment 22 Tom Harrington 2009-05-30 01:24:32 UTC
I tried the add-symbol-file idea but it doesn't seem to provide any more information.  My results are below-- if I've misunderstood, please let me know what I should be doing differently.

Incidentally the messages below about the object files being newer than the executable occur even when starting from a clean slate.  I removed valgrind stuff from /usr/local/, did a "make distclean", and then went through the build & install flow again.  Still I get those messages.  I don't know if it means anything.  I haven't had any trouble with my system clock, and "ls -l" shows the executable as being newer than the object files.

$ gdb valgrind
GNU gdb 6.3.50-20050815 (Apple version gdb-962) (Sat Jul 26 08:14:40 UTC 2008)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-apple-darwin"...No symbol table is loaded.  Use the "file" command.

(gdb) add-symbol-file /usr/local/lib/valgrind/memcheck-x86-darwin
add symbol table from file "/usr/local/lib/valgrind/memcheck-x86-darwin"? (y or n) y
Reading symbols from /usr/local/lib/valgrind/memcheck-x86-darwin...warning: .o file "/private/tmp/valgrind/memcheck/../VEX/libvex-x86-darwin.a(hdefs.o)" more recent than executable timestamp
warning: .o file "/private/tmp/valgrind/memcheck/../VEX/libvex-x86-darwin.a(hdefs.o)" more recent than executable timestamp
warning: .o file "/private/tmp/valgrind/memcheck/../VEX/libvex-x86-darwin.a(ghelpers.o)" more recent than executable timestamp
warning: .o file "/private/tmp/valgrind/memcheck/../VEX/libvex-x86-darwin.a(ghelpers.o)" more recent than executable timestamp
warning: .o file "/private/tmp/valgrind/memcheck/../VEX/libvex-x86-darwin.a(toIR.o)" more recent than executable timestamp
warning: .o file "/private/tmp/valgrind/memcheck/../VEX/libvex-x86-darwin.a(ghelpers.o)" more recent than executable timestamp
warning: .o file "/private/tmp/valgrind/memcheck/../VEX/libvex-x86-darwin.a(toIR.o)" more recent than executable timestamp
warning: .o file "/private/tmp/valgrind/memcheck/../VEX/libvex-x86-darwin.a(isel.o)" more recent than executable timestamp
warning: .o file "/private/tmp/valgrind/memcheck/../VEX/libvex-x86-darwin.a(isel.o)" more recent than executable timestamp
done.
(gdb) run -d -d -d -v
Starting program: /usr/local/bin/valgrind -d -d -d -v
Reading symbols for shared libraries ++. done
Breakpoint 2 at 0x9660b372
--51922:1:debuglog DebugLog system started by Stage 1, level 3 logging requested
--51922:1:launcher no tool requested, defaulting to 'memcheck'
--51922:1:launcher arch 'x86' IS installed
--51922:1:launcher arch 'amd64' IS NOT installed
--51922:1:launcher arch 'arm' IS NOT installed
--51922:1:launcher arch 'ppc32' IS NOT installed
--51922:1:launcher arch 'ppc64' IS NOT installed
--51922:1:launcher no client specified, defaulting arch to 'x86'
--51922:1:launcher launcher_name = /usr/local/bin/valgrind
--51922:1:launcher launching /usr/local/lib/valgrind/memcheck-x86-darwin

Program received signal SIGTRAP, Trace/breakpoint trap.
0xf0a33024 in ?? ()
(gdb) where
#0  0xf0a33024 in ?? ()
Cannot access memory at address 0x4
Comment 23 Julian Seward 2009-05-30 02:59:59 UTC
I can't reproduce this.  MacOSX 10.5.7, Mac Mini 1.83 GHz, 2 GB ram,
svn trunk r10169, clean build.
Comment 24 Nicholas Nethercote 2009-06-01 01:40:52 UTC
FX sent Julian and I his memcheck-x86-darwin binary via email.  It ran fine on my machine, up to the "valgrind: You cannot run 'memcheck-x86-darwin' directly." point.  Ie. the result was the same as a Valgrind built on my system.

This means that the problem is with FX's system, not with the built executable.  Which gives us a little more information, but not a lot.
Comment 25 Nicholas Nethercote 2009-06-01 02:24:53 UTC
In relation to comment #14, DYLD_PRINT_STATISTICS doesn't print anything for me on my memcheck-x86-darwin before the "valgrind: You cannot run 'memcheck-x86-darwin' directly."  So that's not very significant.
Comment 26 Nicholas Nethercote 2009-06-01 02:59:26 UTC
Julian and I were just looking at the kernel code.  If you look at line 920 of bsd/kern/kern_exec.c, in the xnu-1228.12.14 source code (available at http://www.opensource.apple.com/release/mac-os-x-1057/), that's the most suspicious-looking place where a SIGTRAP is raised that we could find. 

AFAWCT, this occurs if P_LTRACED is set, which is the case if the executable is run under ptrace() with the PT_TRACE_ME flag specified.  Which doesn't seem likely -- Valgrind doesn't use ptrace() on Mac.  But it's worth asking, because it's the best idea we have right now:  FX and Tom Harrington, on your systems, could you have something set up which causes normal programs to be run under ptrace()?
Comment 27 Nicholas Nethercote 2009-06-01 03:00:45 UTC
Also, this message mentioned previously:

shared_region: enter(0x7f954d8,0x50c2594,0,7,0):
vm_map_enter(0x90000000,0x20000000,0x42cc8c0) error 0x3

doesn't seem to be significant;  I get it on my machine where everything works.
Comment 28 Tom Harrington 2009-06-01 05:31:52 UTC
I don't believe I have anything installed that would cause ptrace() to be run on every process.  Am I right in thinking that this would mean the process could detect the situation by looking up information on itself via sysctl()?  I ask because I have some code that looks to see if P_TRACED is set on it, and it doesn't detect that flag.
Comment 29 Nicholas Nethercote 2009-06-02 02:18:40 UTC
Another question for FX and Tom:  you are both getting this with 10.5.7.  Did either of you run Valgrind with 10.5.6?  If so, did you hit this problem?

Thanks.
Comment 30 Robert Walsh 2009-06-02 02:35:29 UTC
Might be helpful to see your system profile, too.  You can generate this by going to About This Mac (under the Apple menu), clicking on the "More Info..." button and hitting Command-S in the System Profiler app.  Take a look through this file before sending it in case you're worried about any information that's in it being attached to this bug.
Comment 31 Tom Harrington 2009-06-02 06:29:17 UTC
I had the problem on another Mac running 10.5.6.  I've since upgraded it to 10.5.7 but that had no impact on valgrind.

I saved a system profile, but at 2.2MB of text I haven't yet figured out if it contains any information I wouldn't want posted publicly.  I don't know why it would but that it's hard to imagine what's taking up that much space.
Comment 32 Tom Harrington 2009-06-02 06:36:10 UTC
Created attachment 34186 [details]
Basic system profile
Comment 33 Tom Harrington 2009-06-02 06:38:13 UTC
Comment on attachment 34186 [details]
Basic system profile

Result of running "system_profiler -detailLevel basic" to get a reduced system profile.  I'll look over the full profile and see if it looks postable.
Comment 34 Mike Ash 2009-06-04 04:14:47 UTC
Well, looks like is our (Rogue Amoeba's) fault. At least it is on my system, where I can reproduce this easily. It's due to our Instant Hijack extension conflicting somehow with Valgrind.

We will look into a fix. Right now it seems that this may be a bug in our code, but we won't be sure until the Instant Hijack guru has a chance to look at things after WWDC.

The good news is that you can easily work around it, either by uninstalling Instant Hijack from one of RA's apps, or by simply running this command to temporarily disable it:

sudo /usr/local/hermes/bin/hermesctl unload

And then once you're done using Valgrind for the moment you can re-enable it like so:

sudo /usr/local/hermes/bin/hermesctl load

My apologies for the trouble and wasted time. I'll post back when we know more, but again, that won't be until after WWDC.
Comment 35 Nicholas Nethercote 2009-06-04 04:21:37 UTC
FX, Tom:  can you confirm that Mike is right (comment #34), ie. whether this fixes the problem for you?
Comment 36 Nicholas Nethercote 2009-06-04 04:22:20 UTC
BTW, Mike: can I ask how did you determine Instant Hijack is the problem?

Thanks.
Comment 37 Mike Ash 2009-06-04 04:24:24 UTC
Nicholas, it was basically intuition. I saw the crash, googled for it, found this bug. When I saw that Valgrind did interesting stuff with dyld, I thought it might be related because we do interesting stuff with dyld too. I disabled it and tried Valgrind again, and suddenly it worked. I wish I could have a more comprehensive investigational process for you, but it was pretty much just a wild guess that happened to be right.
Comment 38 Greg Parker 2009-06-04 04:26:27 UTC
Aha! Yes, Valgrind and Instant Hijack will be surprised to discover each other as they mangle the process's initial state.
Comment 39 Tom Harrington 2009-06-04 06:11:39 UTC
Mike Ash's fix works for me.  This resolves my problem.  Thanks, Mike.
Comment 40 FX 2009-06-04 08:09:25 UTC
Unloading hermes (Mike's workaround) works for me too.
Comment 41 Mike Ash 2009-07-02 03:20:29 UTC
We've fixed, or at least worked around, this bug internally now. Fixes will be going out as part of the apps' normal update cycles. For those of you who'd like to get your hands on it now, you'll need to download development builds here:

For Audio Hijack Pro: http://www.rogueamoeba.com/audiohijackpro/download-development.php
For Airfoil: http://www.rogueamoeba.com/airfoil/mac/download-development.php
For Nicecast: http://www.rogueamoeba.com/nicecast/download-development.php

Note that Instant Hijack is a shared install so if you should happen to have more than one of these (and thanks if you do!) then you only need to grab one and install the Instant Hijack upgrade from it.

As expected, the problem was IH's dyld mangling getting in the way of Valgrind. Ultimately we decided that it wasn't worth making them work together (and risk having it break later on) so this new version of Instant Hijack just does a very early check to see if it's running in Valgrind, and if so it bails out and doesn't touch that process.
Comment 42 Nicholas Nethercote 2009-07-02 06:32:58 UTC
Thanks Mike!
Comment 43 caulfield.holden 2009-07-07 14:26:15 UTC
*** Bug 199282 has been marked as a duplicate of this bug. ***