Summary: | valgrind fails to run on Mac OS 10.5.7, gets a SIGTRAP | ||
---|---|---|---|
Product: | [Developer tools] valgrind | Reporter: | FX <fxcoudert> |
Component: | general | Assignee: | Julian Seward <jseward> |
Status: | RESOLVED NOT A BUG | ||
Severity: | crash | CC: | caulfield.holden, gparker, mike, njn, rjwalsh, tom, tph |
Priority: | NOR | ||
Version: | unspecified | ||
Target Milestone: | --- | ||
Platform: | Compiled Sources | ||
OS: | Other | ||
Latest Commit: | Version Fixed In: | ||
Sentry Crash Report: | |||
Attachments: | Basic system profile |
Description
FX
2009-05-24 17:47:59 UTC
I tried this using revision 10130 (no actual changes in the DARWIN branch from 10129, though). The only thing I did different was use /usr/local/valgrind as my prefix. Worked fine for me - no SIGTRAP, and I was able to run memcheck on a few simple test programs. Can you give us the command line you invoked and Valgrind's output? It happens whatever arguments I use on the command-line (including when I use no arguments). It happens too early for valgrind to have output anything at all: $ /opt/valgrind-svn/bin/valgrind zsh: trace trap /opt/valgrind-svn/bin/valgrind Can you show the output of "valgrind -d -d -v -v"? You could also try running "vg-in-place" from within the valgrind directory, see if that works any better. $ /opt/valgrind-svn/bin/valgrind -d -d -v -v --80107:1:debuglog DebugLog system started by Stage 1, level 2 logging requested --80107:1:launcher no tool requested, defaulting to 'memcheck' --80107:1:launcher arch 'x86' IS installed --80107:1:launcher arch 'amd64' IS NOT installed --80107:1:launcher arch 'arm' IS NOT installed --80107:1:launcher arch 'ppc32' IS NOT installed --80107:1:launcher arch 'ppc64' IS NOT installed --80107:1:launcher no client specified, defaulting arch to 'x86' --80107:1:launcher launcher_name = /opt/valgrind-svn/bin/valgrind --80107:1:launcher launching /opt/valgrind-svn/lib/valgrind/memcheck-x86-darwin zsh: trace trap /opt/valgrind-svn/bin/valgrind -d -d -v -v And I forgot: "vg-in-place" doesn't work any better. Very mysterious. What happens if you run "memcheck/memcheck-x86-linux -d" directly? $ /opt/valgrind-svn/lib/valgrind/memcheck-x86-darwin -d zsh: trace trap /opt/valgrind-svn/lib/valgrind/memcheck-x86-darwin -d The gdb backtrace is, once again, useless: Program received signal SIGTRAP, Trace/breakpoint trap. 0xf0ad92a0 in ?? () (gdb) bt #0 0xf0ad92a0 in ?? () #1 0xfffffffe in ?? () #2 0x14c483cc in ?? () It's crashing extremely early. The startup for memcheck/memcheck-x86-linux is a bit unusual. There is no main() function, instead, AIUI, control starts at the symbol __start, in coregrind/m_start-x86-linux.S. It sets up the stack and then calls _start_in_C_darwin(), in coregrind/m_main.c, which then calls valgrind_main(). I suggest using GDB to step through those functions. Set a breakpoint at _start, and step through until the trap occurs, and let us know where the problem is. Thanks. This is all quite strange because it's worked fine on a number of Mac systems, and Mac systems tend to be very homogeneous (unlike Linux systems). For example, I'm running exactly the same kernel as you and an extremely similar GCC. It gets the SIGTRAP even before entering __start: (gdb) b __start Breakpoint 1 at 0xf012f5db: file m_start-x86-darwin.S, line 74. (gdb) r Starting program: /opt/valgrind-svn/lib/valgrind/memcheck-x86-darwin Reading symbols for shared libraries + done Re-enabling shared library breakpoint -4 Program received signal SIGTRAP, Trace/breakpoint trap. 0xf0ad92a0 in ?? () I guess it's really in the startup code. On linux, I'd try to strace it, or run ld.so in gdb, but I don't know how to do the equivalent of these on Mac. Stranger and stranger. The Mac equivalent to strace is called "dtruss". You have to run it as root. $ sudo dtruss /opt/valgrind-svn/lib/valgrind/memcheck-x86-darwin -v dtrace: failed to execute /opt/valgrind-svn/lib/valgrind/memcheck-x86-darwin: Could not create symbolicator for task It actually fails so earlier that even dyld doesn't have time to finish its own output: $ DYLD_PRINT_STATISTICS=1 /opt/valgrind-svn/lib/valgrind/memcheck-x86-darwin zsh: trace trap DYLD_PRINT_STATISTICS=1 /opt/valgrind-svn/lib/valgrind/memcheck-x86-darwin (while it should print dyld statistics). One more thing I realised is that for each time I run memcheck-x86-darwin, I get a line in dmesg: shared_region: enter(0x8211c1c,0x81e2530,0,7,0): vm_map_enter(0x90000000,0x20000000,0x754ac94) error 0x3 Google tells me it's from the vm_shared_region_enter() function of the xnu kernel, and that it means that vm_map_enter_mem_object() has failed. If the Darwin build is anything like the Linux build then memcheck-x86-darwin will be statically linked so the dynamic linker won't be present which is probably why DYLD_PRINT_STATISTICS isn't doing anything. AIUI there's no such thing as a truly static build on Darwin, the dynamic linker is always involved. But I could be wrong. I was about to say the same thing as coment #15: you can link static versions of all your libraries, but the final executable always involves dyld. Since this is still marked "unconfirmed" I thought I'd mention that I'm getting exactly the same symptoms as FX describes, all the way down the line, on a system with the same CPU and OS. I also looked in the system console in case there was anything useful there. FWIW, when I try "valgrind -d -d -v -v" I get the following console messages, in addition to the output FX previously posted: 5/28/09 11:48:40 AM kernel shared_region: enter(0x7f954d8,0x50c2594,0,7,0): vm_map_enter(0x90000000,0x20000000,0x42cc8c0) error 0x3 5/28/09 11:48:40 AM ReportCrash[33716] Failed to create VMUSymbolicator for memcheck-x86-darwin[33715] 5/28/09 11:48:40 AM ReportCrash[33716] Formulating crash report for process memcheck-x86-darwin[33715] 5/28/09 11:48:40 AM ReportCrash[33716] Saved crashreport to /Users/tph/Library/Logs/CrashReporter/memcheck-x86-darwin_2009-05-28-114840_atomic-air.crash using uid: 502 gid: 20, euid: 502 egid: 20 5/28/09 11:48:40 AM ReportCrash[33716] Saved crashreport to /Users/tph/Library/Logs/CrashReporter/memcheck-x86-darwin_2009-05-28-114840_atomic-air.crash using uid: 502 gid: 20, euid: 502 egid: 20 Digging up the crash log mentioned above, I find it contains this (which is probably not useful, but you never know): Process: memcheck-x86-darwin [33715] Path: /usr/local/lib/valgrind/memcheck-x86-darwin Identifier: memcheck-x86-darwin Version: ??? (???) Code Type: X86 (Native) Parent Process: bash [3693] Date/Time: 2009-05-28 11:48:40.728 -0600 OS Version: Mac OS X 10.5.7 (9J61) Report Version: 6 Anonymous UUID: B985D7B6-C449-4A39-AA55-4DEBB1DF8EB9 Exception Type: EXC_BREAKPOINT (SIGTRAP) Exception Codes: 0x0000000000000002, 0x0000000000000000 Crashed Thread: Unknown Backtrace not available Unknown thread crashed with X86 Thread State (32-bit): eax: 0x00000000 ebx: 0xf0a33286 ecx: 0x00000000 edx: 0xf09ef5a0 edi: 0xf0a3c9b0 esi: 0xf0a33278 ebp: 0xf007f458 esp: 0xf007f440 ss: 0x0000001f efl: 0x00000246 eip: 0xf0a332a0 cs: 0x00000017 ds: 0x0000001f es: 0x0000001f fs: 0x00000000 gs: 0x00000037 cr2: 0xf09ef5a0 Binary images description not available More info is available, just let me know what might be useful. I think we're all pretty much at sea here. It strikes me as a bit odd that it died with SIGTRAP. Usually if something really bad happens we bite out with SIGSEGV, or (much) more commonly an assert. So two possibilities come to mind. One is that it went into a syscall and xnu promptly bashed it on the head with SIGTRAP. The second is that it really ran into an instruction which causes it to SIGTRAP -- perhaps an int $3 or some such. In both cases, it would be a good first step, to know where it was when this happened. If either of you could attach GDB to the crashed-out process image, or disassemble memcheck-x86-linux at the stated EIP point when it died, kinda thing, that would be useful. Julian, I'd like to do that, but I have to admit I don't know how. If I run valgrind with gdb I get the same results FX noted above. If there's something else useful I could do in gdb, just tell me what. I googled for VMUSymbolicator and the only hit I got was this: http://ufoai.ninex.info/forum/index.php?topic=3391.0 which provides some vague evidence that it's a kernel bug? No idea. I asked Greg Parker about this, he had some comments... ---- Off the top of my head: * valgrind does run a copy of dyld for itself, but iirc the valgrind launcher hides DYLD_* environment variables from V's dyld and restores them for the inferior's dyld. I wouldn't expect DYLD_PRINT_STATISTICS to print anything for valgrind itself. * gdb is confused by valgrind; it sees only the inferior's symbols. To get valgrind's own symbols in the backtrace, run `add-symbol-file /path/to/memcheck-<arch>-darwin`. * CrashReporter can be similarly confused, but usually it doesn't fail as badly as shown. * A breakpoint on __start might not work. Historically gdb has had trouble with breakpoints set before the gdb/dyld handshake. * The 0xfffffffe frame in the backtrace might indicate the arrival of a signal, and the 0xf0ad92a0 frame is probably in valgrind's own code. Try looking for a __builtin_trap() from a valgrind signal handler. ---- With respect to the last point, there were some __builtin_trap calls in Greg's original port, but they've since been removed. I tried the add-symbol-file idea but it doesn't seem to provide any more information. My results are below-- if I've misunderstood, please let me know what I should be doing differently. Incidentally the messages below about the object files being newer than the executable occur even when starting from a clean slate. I removed valgrind stuff from /usr/local/, did a "make distclean", and then went through the build & install flow again. Still I get those messages. I don't know if it means anything. I haven't had any trouble with my system clock, and "ls -l" shows the executable as being newer than the object files. $ gdb valgrind GNU gdb 6.3.50-20050815 (Apple version gdb-962) (Sat Jul 26 08:14:40 UTC 2008) Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "i386-apple-darwin"...No symbol table is loaded. Use the "file" command. (gdb) add-symbol-file /usr/local/lib/valgrind/memcheck-x86-darwin add symbol table from file "/usr/local/lib/valgrind/memcheck-x86-darwin"? (y or n) y Reading symbols from /usr/local/lib/valgrind/memcheck-x86-darwin...warning: .o file "/private/tmp/valgrind/memcheck/../VEX/libvex-x86-darwin.a(hdefs.o)" more recent than executable timestamp warning: .o file "/private/tmp/valgrind/memcheck/../VEX/libvex-x86-darwin.a(hdefs.o)" more recent than executable timestamp warning: .o file "/private/tmp/valgrind/memcheck/../VEX/libvex-x86-darwin.a(ghelpers.o)" more recent than executable timestamp warning: .o file "/private/tmp/valgrind/memcheck/../VEX/libvex-x86-darwin.a(ghelpers.o)" more recent than executable timestamp warning: .o file "/private/tmp/valgrind/memcheck/../VEX/libvex-x86-darwin.a(toIR.o)" more recent than executable timestamp warning: .o file "/private/tmp/valgrind/memcheck/../VEX/libvex-x86-darwin.a(ghelpers.o)" more recent than executable timestamp warning: .o file "/private/tmp/valgrind/memcheck/../VEX/libvex-x86-darwin.a(toIR.o)" more recent than executable timestamp warning: .o file "/private/tmp/valgrind/memcheck/../VEX/libvex-x86-darwin.a(isel.o)" more recent than executable timestamp warning: .o file "/private/tmp/valgrind/memcheck/../VEX/libvex-x86-darwin.a(isel.o)" more recent than executable timestamp done. (gdb) run -d -d -d -v Starting program: /usr/local/bin/valgrind -d -d -d -v Reading symbols for shared libraries ++. done Breakpoint 2 at 0x9660b372 --51922:1:debuglog DebugLog system started by Stage 1, level 3 logging requested --51922:1:launcher no tool requested, defaulting to 'memcheck' --51922:1:launcher arch 'x86' IS installed --51922:1:launcher arch 'amd64' IS NOT installed --51922:1:launcher arch 'arm' IS NOT installed --51922:1:launcher arch 'ppc32' IS NOT installed --51922:1:launcher arch 'ppc64' IS NOT installed --51922:1:launcher no client specified, defaulting arch to 'x86' --51922:1:launcher launcher_name = /usr/local/bin/valgrind --51922:1:launcher launching /usr/local/lib/valgrind/memcheck-x86-darwin Program received signal SIGTRAP, Trace/breakpoint trap. 0xf0a33024 in ?? () (gdb) where #0 0xf0a33024 in ?? () Cannot access memory at address 0x4 I can't reproduce this. MacOSX 10.5.7, Mac Mini 1.83 GHz, 2 GB ram, svn trunk r10169, clean build. FX sent Julian and I his memcheck-x86-darwin binary via email. It ran fine on my machine, up to the "valgrind: You cannot run 'memcheck-x86-darwin' directly." point. Ie. the result was the same as a Valgrind built on my system. This means that the problem is with FX's system, not with the built executable. Which gives us a little more information, but not a lot. In relation to comment #14, DYLD_PRINT_STATISTICS doesn't print anything for me on my memcheck-x86-darwin before the "valgrind: You cannot run 'memcheck-x86-darwin' directly." So that's not very significant. Julian and I were just looking at the kernel code. If you look at line 920 of bsd/kern/kern_exec.c, in the xnu-1228.12.14 source code (available at http://www.opensource.apple.com/release/mac-os-x-1057/), that's the most suspicious-looking place where a SIGTRAP is raised that we could find. AFAWCT, this occurs if P_LTRACED is set, which is the case if the executable is run under ptrace() with the PT_TRACE_ME flag specified. Which doesn't seem likely -- Valgrind doesn't use ptrace() on Mac. But it's worth asking, because it's the best idea we have right now: FX and Tom Harrington, on your systems, could you have something set up which causes normal programs to be run under ptrace()? Also, this message mentioned previously: shared_region: enter(0x7f954d8,0x50c2594,0,7,0): vm_map_enter(0x90000000,0x20000000,0x42cc8c0) error 0x3 doesn't seem to be significant; I get it on my machine where everything works. I don't believe I have anything installed that would cause ptrace() to be run on every process. Am I right in thinking that this would mean the process could detect the situation by looking up information on itself via sysctl()? I ask because I have some code that looks to see if P_TRACED is set on it, and it doesn't detect that flag. Another question for FX and Tom: you are both getting this with 10.5.7. Did either of you run Valgrind with 10.5.6? If so, did you hit this problem? Thanks. Might be helpful to see your system profile, too. You can generate this by going to About This Mac (under the Apple menu), clicking on the "More Info..." button and hitting Command-S in the System Profiler app. Take a look through this file before sending it in case you're worried about any information that's in it being attached to this bug. I had the problem on another Mac running 10.5.6. I've since upgraded it to 10.5.7 but that had no impact on valgrind. I saved a system profile, but at 2.2MB of text I haven't yet figured out if it contains any information I wouldn't want posted publicly. I don't know why it would but that it's hard to imagine what's taking up that much space. Created attachment 34186 [details]
Basic system profile
Comment on attachment 34186 [details]
Basic system profile
Result of running "system_profiler -detailLevel basic" to get a reduced system profile. I'll look over the full profile and see if it looks postable.
Well, looks like is our (Rogue Amoeba's) fault. At least it is on my system, where I can reproduce this easily. It's due to our Instant Hijack extension conflicting somehow with Valgrind. We will look into a fix. Right now it seems that this may be a bug in our code, but we won't be sure until the Instant Hijack guru has a chance to look at things after WWDC. The good news is that you can easily work around it, either by uninstalling Instant Hijack from one of RA's apps, or by simply running this command to temporarily disable it: sudo /usr/local/hermes/bin/hermesctl unload And then once you're done using Valgrind for the moment you can re-enable it like so: sudo /usr/local/hermes/bin/hermesctl load My apologies for the trouble and wasted time. I'll post back when we know more, but again, that won't be until after WWDC. FX, Tom: can you confirm that Mike is right (comment #34), ie. whether this fixes the problem for you? BTW, Mike: can I ask how did you determine Instant Hijack is the problem? Thanks. Nicholas, it was basically intuition. I saw the crash, googled for it, found this bug. When I saw that Valgrind did interesting stuff with dyld, I thought it might be related because we do interesting stuff with dyld too. I disabled it and tried Valgrind again, and suddenly it worked. I wish I could have a more comprehensive investigational process for you, but it was pretty much just a wild guess that happened to be right. Aha! Yes, Valgrind and Instant Hijack will be surprised to discover each other as they mangle the process's initial state. Mike Ash's fix works for me. This resolves my problem. Thanks, Mike. Unloading hermes (Mike's workaround) works for me too. We've fixed, or at least worked around, this bug internally now. Fixes will be going out as part of the apps' normal update cycles. For those of you who'd like to get your hands on it now, you'll need to download development builds here: For Audio Hijack Pro: http://www.rogueamoeba.com/audiohijackpro/download-development.php For Airfoil: http://www.rogueamoeba.com/airfoil/mac/download-development.php For Nicecast: http://www.rogueamoeba.com/nicecast/download-development.php Note that Instant Hijack is a shared install so if you should happen to have more than one of these (and thanks if you do!) then you only need to grab one and install the Instant Hijack upgrade from it. As expected, the problem was IH's dyld mangling getting in the way of Valgrind. Ultimately we decided that it wasn't worth making them work together (and risk having it break later on) so this new version of Instant Hijack just does a very early check to see if it's running in Valgrind, and if so it bails out and doesn't touch that process. Thanks Mike! *** Bug 199282 has been marked as a duplicate of this bug. *** |