Version: (using Devel)
Compiler: gcc
OS: OS X
Installed from: Compiled sources

Valgrinding the Chrome unit tests on the Mac turns up that error message fairly often. It happens once in the base unittests; see http://build.chromium.org/buildbot/waterfall/builders/Modules Mac (valgrind)/builds/581/steps/valgrind test%3A base/logs/stdio (that should be valid for a few days). It says:

[ RUN      ] WorkerPoolTest.PostTask
--67607:0:aspacem sync_check_mapping_callback: segment mismatch: V's seg:
--67607:0:aspacem   NSegment{ , start=0xb00a2000, end=0xb042cfff, smode=SmFixed, dev=0, ino=0, offset=0, fnIdx=-1, hasR=0, hasW=0, hasX=0, hasT=0, mark=0, name="(none)"}
--67607:0:aspacem sync_check_mapping_callback: segment mismatch: kernel's seg:
--67607:0:aspacem   start=0xb00a2000 end=0xb00a2fff prot=0 dev=0 ino=0 offset=0 name="(none)"
--67607:0:aspacem sync_check_mapping_callback: segment mismatch: V's seg:
--67607:0:aspacem   NSegment{ , start=0xb00a2000, end=0xb042cfff, smode=SmFixed, dev=0, ino=0, offset=0, fnIdx=-1, hasR=0, hasW=0, hasX=0, hasT=0, mark=0, name="(none)"}
--67607:0:aspacem sync_check_mapping_callback: segment mismatch: kernel's seg:
--67607:0:aspacem   start=0xb00a3000 end=0xb0123fff prot=3 dev=0 ino=0 offset=0 name="(none)"
--67607:0:aspacem sync check at wqthread_hijack:0 (after): FAILED
--67607:0:aspacem
[       OK ] WorkerPoolTest.PostTask (230 ms)

V seems rather upset; can someone have a look?
I got these in testing my Mac OS X application:

--9657:0:aspacem sync_check_mapping_callback: segment mismatch: V's seg:
--9657:0:aspacem   NSegment{ , start=0x1dd07000, end=0x1fffffff, smode=SmFixed, dev=0, ino=0, offset=0, fnIdx=-1, hasR=0, hasW=0, hasX=0, hasT=0, mark=0, name="(none)"}
--9657:0:aspacem sync_check_mapping_callback: segment mismatch: kernel's seg:
--9657:0:aspacem   start=0x1dd07000 end=0x1dd07fff prot=3 dev=0 ino=0 offset=0 name="(none)"
--9657:0:aspacem sync check at pthread_hijack:0 (after): FAILED
--9657:0:aspacem
--9657:0:aspacem sync_check_mapping_callback: segment mismatch: V's seg:
--9657:0:aspacem   NSegment{ , start=0x1dd07000, end=0x1fffffff, smode=SmFixed, dev=0, ino=0, offset=0, fnIdx=-1, hasR=0, hasW=0, hasX=0, hasT=0, mark=0, name="(none)"}
--9657:0:aspacem sync_check_mapping_callback: segment mismatch: kernel's seg:
--9657:0:aspacem   start=0x1dd07000 end=0x1dd07fff prot=3 dev=0 ino=0 offset=0 name="(none)"
--9657:0:aspacem sync check at pthread_hijack:0 (after): FAILED
--9657:0:aspacem

Hopefully another data point will help. I'm afraid I don't know what the application is doing at this point, although if anyone has any ideas, I'm happy to try to verify it.
Created attachment 35351 [details]
A small reproducer of the aspacemgr problem

I've managed to create a small stand-alone reproducer of the problem (the code was taken from http://www.cimgf.com/2008/02/16/cocoa-tutorial-nsoperation-and-nsoperationqueue/).

Build the attached code using:

> gcc -ObjC aspacem.m -lobjc -o aspacem -framework Foundation -framework AppKit

Running `valgrind ./aspacem` will produce the following log:

--51610:0:aspacem sync_check_mapping_callback: segment mismatch: V's seg:
--51610:0:aspacem   NSegment{ , start=0x6b9f000, end=0x6bb6fff, smode=SmFixed, dev=0, ino=0, offset=0, fnIdx=-1, hasR=0, hasW=0, hasX=0, hasT=0, mark=0, name="(none)"}
--51610:0:aspacem sync_check_mapping_callback: segment mismatch: kernel's seg:
--51610:0:aspacem   start=0x6b9f000 end=0x6b9ffff prot=3 dev=0 ino=0 offset=0 name="(none)"
--51610:0:aspacem sync check at wqthread_hijack:0 (after): FAILED
--51610:0:aspacem

Valgrind SVN is synced to revision 10427.
ping. This is our #1 valgrind bug right now.
Yeah. It's not good; it means that Valgrind's view of the (page-level) mapping of memory in the process has diverged from the kernel's view thereof. V is very careful to keep what is effectively a parallel data structure to the kernel's /proc/<pid>/maps, and on Linux that works OK. But Darwin has a tendency to pull mappings out of a hat and shove them into the process, without telling anyone, which causes major difficulties. I think it's somehow related to thread creation/destruction in the Darwin kernel, but don't quote me on that.

Unfortunately I can't repro it with the test program, on a Core 2 Duo Mac Mini running kernel 9.7.0. But I have seen it from time to time over the past 6 months. AFAICT Valgrind doesn't drop dead, though; so apart from the unhelpful spewage, in what ways is this being a showstopper for you?

Nick probably knows more than I do. Isn't there a difference between a sync check (which just compares aspacem's info against the kernel's info) and a forced resync? Do we care if a sync check fails? Then could we just do a forced resync and carry on as if nothing had happened?

The more serious problem, from my point of view, is that all this sync-checking is very expensive, and for threaded programs it can cause Valgrind to run far more slowly than it would running the same thing on Linux.
If it's any kind of consolation .. this is listed as a blocker for the upcoming 3.5.0 release. So we're going to have to get it sorted somehow in short order anyway.
> AFAICT Valgrind doesn't drop dead though; so apart from the unhelpful
> spewage, in what ways is this being a showstopper for you?

We've tried porting ThreadSanitizer to Mac, and every time an "aspacem" report appears, TSAN hangs. The strange thing is that ThreadSanitizer has 5 threads (!) in the Mac task manager after it hangs. I know this may be due to some mistakes we've made while porting TSAN/C++ to Mac. But at least we have a couple of Chromium tests that reproduce this hang, and we'd be happy to check your patch when it's done.
> We've tried porting ThreadSanitizer to Mac and every time "aspacem" report
> appears, TSAN hangs.
> The strange thing is that ThreadSanitizer has 5 threads (!) in Mac task
> manager after it hangs.

So do other tools (Memcheck and, importantly, --tool=none) also hang? In other words, is this tsan-specific?
AFAIK, it's TSAN-specific. I'll check that tomorrow.
The hang may be TSAN-specific, but that's not what I was talking about. If the spewage is really benign, then I can filter it out of our logs, and move on.
OK, I've filed bug 201936 for the horrible crashes that happen after this. I thought they were related, but perhaps not.
Created attachment 35745 [details]
mismatch messages

I found a totally reliable reproducer for my machine:

  valgrind --sanity-level=3 /Applications/TextEdit.app/

Attached is the output; I tweaked the code so that the mismatch messages are more readable. The problem is that there are areas of memory that Valgrind thinks are free, but the kernel says contain READ|WRITE maps. This is bad and could be causing follow-on crashes if Valgrind mmaps the memory. (Julian, it's worse than I thought when I last spoke to you about this; I thought it was just a permissions mismatch, not a free vs. not-free mismatch.)
Here's the sequence of syscalls/traps I see just before the crash for TextEdit:

SYSCALL[98350,1](mach: 31) bootstrap_look_up("com.apple.FontObjectsServer") --> [async] ...
SYSCALL[98350,1](mach: 31) ... [async] --> Success(0x0:0x0) got port unnamed-0x3203; com.apple.FontObjectsServer
SYSCALL[98350,1](mach: 31) vm_allocate (mach_task_self(), at 0x0, size 4096, flags 0x32000001) --> [async] ...
SYSCALL[98350,1](mach: 31) ... [async] --> Success(0x0:0x0) allocated at 0x5c54000
SYSCALL[98350,1](mach: 31) mach_make_memory_entry_64(mach_task_self(), 4096, 0x5c54000, 3, ..., 0) --> [async] ...
SYSCALL[98350,1](mach: 31) ... [async] --> Success(0x0:0x0) got port unnamed-0x3303; memory-0x3303
SYSCALL[98350,1](mach: 31) UNHANDLED mach_msg [id 0, to com.apple.FontObjectsServer, reply 0x3103] --> [async] ...
SYSCALL[98350,1](mach: 31) ... [async] --> Success(0x0:0x0)
SYSCALL[98350,1](mach: 31) UNHANDLED mach_msg [id 14, to com.apple.FontObjectsServer, reply 0x3103] --> [async] ...
SYSCALL[98350,1](mach: 31) ... [async] --> Success(0x0:0x0) got ool mem 0x17319000..0x19299000;
--98350:0:aspacem segment mismatch: V's seg 1st, kernel's 2nd:
--98350:0:aspacem 335: 0005c55000-0005c55fff 4096 ----- SmFixed d=0x000 i=0 o=0 (-1) m=0 (none)
--98350:0:aspacem ...: .... 0005c55000-0005c55fff 4096 rw-.. ....... d=0x000 i=0 o=0 (.) m=. (none)

The aspacem sanity check is done as a result of the "got ool" thingy. It can't be a coincidence that the last few traps deal with address 0x5c54000 and then a mismatch occurs at 0x5c55000. The fact that the last two traps are unhandled mach_msg() traps is also suspicious. Greg, do you have any ideas about this?
(In reply to comment #12)

Right, that reproduces it for me too.

> SYSCALL[98350,1](mach: 31) ... [async] --> Success(0x0:0x0) got ool mem
> 0x17319000..0x19299000;
>
> --98350:0:aspacem segment mismatch: V's seg 1st, kernel's 2nd:
> --98350:0:aspacem 335: 0005c55000-0005c55fff 4096 ----- SmFixed d=0x000 i=0 o=0 (-1) m=0 (none)
> --98350:0:aspacem ...: .... 0005c55000-0005c55fff 4096 rw-.. .......

I'm confused, though. If I place a sync_mappings call immediately after the printf of the "got ool mem" msg but before ML_(notify_core_and_tool_of_mmap) (which, I think, as you say, is doing the sync check), viz:

   PRINT("got ool mem %p..%#lx; ", desc->out_of_line.address,
         (Addr)desc->out_of_line.address+desc->out_of_line.size);

   ////////// call sync_mappings here ////////

   ML_(notify_core_and_tool_of_mmap)(
      start, end - start, VKI_PROT_READ|VKI_PROT_WRITE,
      VKI_MAP_PRIVATE, -1, 0);

then first of all sync_mappings finds no differences, yet the sync check entailed by ML_(notify_core_and_tool_of_mmap) fails. sync_mappings and the sync checker are of equal "power", right? Is there a way the latter can detect a difference that the former doesn't?
(In reply to comment #13)
> sync_mappings and the sync checker are of equal "power", right?
> Is there a way the latter can detect a difference that the former
> doesn't?

I don't think so. Are you sure you're seeing the same failure? I tried something similar (possibly identical?) to what you did, but it ended up aborting in a sync check somewhere else instead, and I didn't reach any firm conclusions.
After some more investigation: the added sync_mappings() is itself failing... sync_mappings() calls ML_(notify_core_and_tool_of_mmap), which calls notify_core_of_mmap(), which calls VG_(am_notify_client_mmap), which does the sync check and fails. I.e. we do a sync check right after a missing segment. If there is more than one missing segment (in this case there are 73) the check will fail because there are still 72 segments that have to be added. To fix we should probably add a bool argument which indicates if we should skip the sync check. So I think that explains the TextEdit.app problems but not the original problem. Hmm.
(In reply to comment #15)

General point: we need to have sync checking working reliably, because without that we have no way to find out where we need to insert missing resync calls. So fixing the TextEdit.app problem is itself a worthy goal.

> I.e. we do a sync check right after a missing segment. If there is more than
> one missing segment (in this case there are 73) the check will fail because
> there are still 72 segments that have to be added. To fix we should probably
> add a bool argument which indicates if we should skip the sync check.

I tried doing that. And it works ... kinda ... but leads to yet another problem. Consider your 73-deltas scenario. For the first change, sync_mappings calls ML_(notify_core_and_tool_of_mmap), which calls notify_core_of_mmap, which calls VG_(am_notify_client_mmap), passing it the new boolean, telling it not to do a sync check, regardless of what --sanity-level says. And so that goes through OK.

But then ML_(notify_core_and_tool_of_mmap) calls notify_tool_of_mmap. If that can, for any reason whatsoever, itself do an mmap (presumably to allocate memory), then the sync checker kicks in and we're back at square one. So, as it stands, I have something that works with --tool=none, since that just ignores client mmaps, but fails just as before with Memcheck, presumably because it's mmapping more shadow memory in response to these (purported) client mmaps. Urrr.

So it seems like, in a resync, we need to tell the core about all the changes first, so its picture is up to date, and only then tell the tool about all the changes. I'll see if I can hack something up.
I'm also concerned that the calls to VG_(malloc) in sync_mappings() might cause problems with the whole cycle. Although maybe it's ok since VG_(malloc) is called before VG_(get_changed_segments).
(In reply to comment #17)
> I'm also concerned that the calls to VG_(malloc) in sync_mappings() might
> cause problems with the whole cycle. Although maybe it's ok since VG_(malloc)
> is called before VG_(get_changed_segments).

I also wondered about that. I think it's not OK, since in the worst case the VG_(malloc) call could cause m_mallocfree to make a new mapping before we enquire about the changed mappings with VG_(get_changed_segments), and so it might overlap with them.

---

I thought about this problem for a long time this afternoon. I am beginning to think we can't fix this properly without redesigning aspacem.

Here's a horror scenario: two threads, T1 and T2. T1 is repeatedly doing something innocuous -- a getpid syscall (for the client) -- but in between getpids is doing mmaps, perhaps to get hold of new shadow memory. Meanwhile, T2 blocks in some big, complex syscall, during which the kernel creates a new mapping.

Eventually, T2 returns from the syscall. So we'd have to resync after T2 returns, right? Yes. But we also have to resync every time T1 returns from sys_getpid, since we know that T2's syscall is going to change the mappings, but we don't have any idea exactly when -- we only know that the mappings must change before T2's syscall returns. If we don't do that, we have no guarantee that T1's shadow-memory-allocating mmaps aren't going to trash T2's new mappings. At least, not unless the shadow-mem-allocating mmaps are of the non-FIXED variety. Which is a pretty fragile thing to rely on -- any other thread could do mmap-FIXED whilst T2 is still in its syscall and inadvertently trash the mapping.

---

We don't know when exactly the kernel will add mappings. Hence, however assiduous we are at resyncing, there's always the possibility of getting out-raced by the kernel. The only proper fix I can think of is to redesign aspacem so that it doesn't need to maintain an exact picture of the address space.
However, that's a big job and certainly not a pre-3.5.0 thing.

---

For now ... I notice, from comments #0, #1 and #2, that the sync check always fails at one of these two places:

--67607:0:aspacem sync check at wqthread_hijack:0 (after): FAILED
--9657:0:aspacem sync check at pthread_hijack:0 (after): FAILED

and so I wonder if merely changing the sync check to a resync at those two points would help.

It's annoying that we still don't have a test case that doesn't require --sanity-level=3. The program in Comment #2 runs fine for me, alas.
(In reply to comment #18)
> I think it's not ok, since in the worst case,
> the VG_(malloc) call could cause m_mallocfree to make a new mapping before
> we enquire about the changed mappings with VG_(get_changed_segments), and
> so might overlap with them.

Hmm, yes.

> We don't know when exactly the kernel will add mappings. Hence, however
> assiduous we are at resyncing, there's always the possibility of getting
> out-raced by the kernel. The only proper fix I can think of is to
> redesign aspacem so that it doesn't need to maintain an exact picture
> of the address space. However, that's a big job and certainly not a
> pre-3.5.0 thing.

MAP_FIXED is a bad idea if we don't have total knowledge of the mappings. But is a big redesign really necessary? What if you just turned off MAP_FIXED in V's mmap requests to the kernel? E.g. see the comment "use advisory as a hint only" and the following code at aspacemgr-linux.c:2382. Seems like the worst that could happen is that Valgrind's attempts to keep client maps and Valgrind maps separate might be less successful. And it would only be necessary on Darwin. (Hmm, some of the mappings done in macho.c look like they must be FIXED. But only those? At least the VG_(am_*_float_*) calls could be done without FIXED on Darwin.)

> I wonder if merely changing the sync check to a resync
> at those two points would help.

Probably the best thing to do in the short term.
Created attachment 35831 [details]
resync after {wq,p}thread_hijack rather than merely doing a check
(In reply to comment #20)
> Created an attachment (id=35831) [details]
> resync after {wq,p}thread_hijack rather than merely doing a check

Dan, Peter, Alexander, can you try this patch and see if it makes any difference?
I've run the patch on the problematic Chromium tests under Memcheck and ThreadSanitizer. All of them passed without any 'aspacem' reports under Memcheck, and some of them even pass under TSAN (some still hang, but without any reports). None of the tests crashed.
(In reply to comment #22)
> I've run the patch on the problematic Chromium tests under Memcheck and
> ThreadSanitizer. All of them passed without any 'aspacem' reports under
> Memcheck, and some of them even pass under TSAN (some still hang, however,
> but without any reports). Neither of the tests crashed.

So are you saying that it's an improvement? What I really want to know is not whether it still prints the messages, but whether the associated crashes (as noted in bug 201936) have gone away.
Trying it now.
I've seen one associated crash on the chrome ui tests in four hours using your sync patch (plus the fork patch) on top of the july 15 sources. That doesn't mean much, there could be several sources of corruption, and sometimes it takes a long time before I see a burst of errors. Will continue running overnight. It definitely got rid of the aspacem warnings, though.
I've now seen four crashes in an overnight run, which is within the range I was seeing before. I'm inclined to blame the crashes on the app rather than valgrind. Chrome is quite complicated and does crash at some low rate in the real world, so there is almost certainly some memory corruption bug in it somewhere. We'll just have to keep looking for it. In the meantime, getting rid of the aspacem warnings is a definite Good Thing.
Patch in comment #20 committed as r10751. Not sure if we should close this bug or not. The patch might help, but to properly fix this will require major reengineering of m_aspacemgr.
I'll close it and open a new one regarding the remaining problems.
The follow-up bug is bug 203254. It's marked as a 3.6.0 blocker; it's too big to be done for 3.5.0.