Bug 192634 - V. reports "aspacem sync_check_mapping_callback: segment mismatch" on mac
Summary: V. reports "aspacem sync_check_mapping_callback: segment mismatch" on mac
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: general (show other bugs)
Version: unspecified
Platform: Compiled Sources Other
: NOR normal
Target Milestone: blocking3.5.0
Assignee: Nicholas Nethercote
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-05-14 06:54 UTC by Dan Kegel
Modified: 2009-08-10 04:09 UTC (History)
6 users (show)

See Also:
Latest Commit:
Version Fixed In:


Attachments
A small reproducer of the aspacemgr problem (971 bytes, text/plain)
2009-07-15 14:21 UTC, Alexander Potapenko
Details
mismatch messages (21.10 KB, text/plain)
2009-07-31 07:11 UTC, Nicholas Nethercote
Details
resync after {wq,p}thread_hijack rather than merely doing a check (2.57 KB, patch)
2009-08-04 01:52 UTC, Julian Seward
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Dan Kegel 2009-05-14 06:54:08 UTC
Version:            (using Devel)
Compiler:          gcc 
OS:                OS X
Installed from:    Compiled sources

Valgrinding the chrome unit tests on the Mac turns up
that error message fairly often.  It happens once in
the base unittests, see

http://build.chromium.org/buildbot/waterfall/builders/Modules Mac (valgrind)/builds/581/steps/valgrind test%3A base/logs/stdio

(that should be valid for a few days).  It says
[ RUN      ] WorkerPoolTest.PostTask
--67607:0:aspacem  sync_check_mapping_callback: segment mismatch: V's seg:
--67607:0:aspacem  NSegment{    , start=0xb00a2000, end=0xb042cfff, smode=SmFixed, dev=0, ino=0, offset=0, fnIdx=-1, hasR=0, hasW=0, hasX=0, hasT=0, mark=0, name="(none)"}
--67607:0:aspacem  sync_check_mapping_callback: segment mismatch: kernel's seg:
--67607:0:aspacem  start=0xb00a2000 end=0xb00a2fff prot=0 dev=0 ino=0 offset=0 name="(none)"
--67607:0:aspacem  sync_check_mapping_callback: segment mismatch: V's seg:
--67607:0:aspacem  NSegment{    , start=0xb00a2000, end=0xb042cfff, smode=SmFixed, dev=0, ino=0, offset=0, fnIdx=-1, hasR=0, hasW=0, hasX=0, hasT=0, mark=0, name="(none)"}
--67607:0:aspacem  sync_check_mapping_callback: segment mismatch: kernel's seg:
--67607:0:aspacem  start=0xb00a3000 end=0xb0123fff prot=3 dev=0 ino=0 offset=0 name="(none)"
--67607:0:aspacem  sync check at wqthread_hijack:0 (after): FAILED
--67607:0:aspacem
[       OK ] WorkerPoolTest.PostTask (230 ms)

V. seems rather upset, can someone have a look?
Comment 1 Peter N Lewis 2009-06-11 04:43:56 UTC
I got these in testing my Mac OS X application:

--9657:0:aspacem  sync_check_mapping_callback: segment mismatch: V's seg:
--9657:0:aspacem  NSegment{    , start=0x1dd07000, end=0x1fffffff, smode=SmFixed, dev=0, ino=0, offset=0, fnIdx=-1, hasR=0, hasW=0, hasX=0, hasT=0, mark=0, name="(none)"}
--9657:0:aspacem  sync_check_mapping_callback: segment mismatch: kernel's seg:
--9657:0:aspacem  start=0x1dd07000 end=0x1dd07fff prot=3 dev=0 ino=0 offset=0 name="(none)"
--9657:0:aspacem  sync check at pthread_hijack:0 (after): FAILED
--9657:0:aspacem  
--9657:0:aspacem  sync_check_mapping_callback: segment mismatch: V's seg:
--9657:0:aspacem  NSegment{    , start=0x1dd07000, end=0x1fffffff, smode=SmFixed, dev=0, ino=0, offset=0, fnIdx=-1, hasR=0, hasW=0, hasX=0, hasT=0, mark=0, name="(none)"}
--9657:0:aspacem  sync_check_mapping_callback: segment mismatch: kernel's seg:
--9657:0:aspacem  start=0x1dd07000 end=0x1dd07fff prot=3 dev=0 ino=0 offset=0 name="(none)"
--9657:0:aspacem  sync check at pthread_hijack:0 (after): FAILED
--9657:0:aspacem  

Hopefully another data point will help.  I'm afraid I don't know what the application is doing at this point, although if anyone has any ideas, I'm happy to try to verify it.
Comment 2 Alexander Potapenko 2009-07-15 14:21:44 UTC
Created attachment 35351 [details]
A small reproducer of the aspacemgr problem

I've managed to create a small stand-alone reproducer of the problem (the code was taken from http://www.cimgf.com/2008/02/16/cocoa-tutorial-nsoperation-and-nsoperationqueue/).
Build the attached code using:
> gcc -ObjC aspacem.m -lobjc  -o aspacem -framework Foundation -framework AppKit

`valgrind ./aspacem` will produce the following log:

--51610:0:aspacem  sync_check_mapping_callback: segment mismatch: V's seg:
--51610:0:aspacem  NSegment{    , start=0x6b9f000, end=0x6bb6fff, smode=SmFixed, dev=0, ino=0, offset=0, fnIdx=-1, hasR=0, hasW=0, hasX=0, hasT=0, mark=0, name="(none)"}
--51610:0:aspacem  sync_check_mapping_callback: segment mismatch: kernel's seg:
--51610:0:aspacem  start=0x6b9f000 end=0x6b9ffff prot=3 dev=0 ino=0 offset=0 name="(none)"
--51610:0:aspacem  sync check at wqthread_hijack:0 (after): FAILED
--51610:0:aspacem  

Valgrind SVN is synced to revision 10427
Comment 3 Dan Kegel 2009-07-29 20:54:48 UTC
ping.  This is our #1 valgrind bug right now.
Comment 4 Julian Seward 2009-07-29 21:55:26 UTC
Yeh.  It's not good; it means that Valgrind's view of the (page-level)
mapping of memory in the process has diverged from the kernel's view
thereof.  V is very careful to keep what effectively is a parallel data
structure to the kernel's /proc/<pid>/maps; and on Linux that works OK.
But Darwin has a tendency to pull mappings out of a hat and shove them
into the process, without telling anyone, which causes major difficulties.
I think it's somehow related to thread creation/destruction in the Darwin
kernel, but don't quote me on that.

Unfortunately I can't repro it with the test program, on a Core 2 Duo
Mac Mini running kernel 9.7.0.  But I have seen it from time to time
over the past 6 months.

AFAICT Valgrind doesn't drop dead though; so apart from the unhelpul
spewage, in what ways is this being a showstopper for you?

Nick probably knows more than I do.  Isn't there a difference between a
sync check (which just compares aspacem's info against the kernel info)
and a forced resync?  Do we care if a sync check fails?  Then could we
just do a forced resync, and carry on as if nothing had happened?

The more serious problem, from my point of view, is that all this 
sync-checking is very expensive, and for threaded programs, can cause
Valgrind to run far more slowly than it would running the same thing on
Linux.
Comment 5 Julian Seward 2009-07-29 22:01:57 UTC
If it's any kind of consolation .. this is listed as a blocker for
the upcoming 3.5.0 release.  So we're going to have to get it sorted
somehow in short order anyway.
Comment 6 Timur Iskhodzhanov 2009-07-29 22:02:15 UTC
> AFAICT Valgrind doesn't drop dead though; so apart from the unhelpul
spewage, in what ways is this being a showstopper for you?
We've tried porting ThreadSanitizer to Mac and every time "aspacem" report appears, TSAN hangs.
The strange thing is that ThreadSanitizer has 5 threads (!) in Mac task manager after it hangs.

I know this may be due to some mistakes we've done while porting TSAN/C++ to Mac. But at least, we have a couple of chromium tests that reproduce this hang and we'd be happy to check your patch when it's done.
Comment 7 Julian Seward 2009-07-29 22:05:31 UTC
> We've tried porting ThreadSanitizer to Mac and every time "aspacem" report
> appears, TSAN hangs.
> The strange thing is that ThreadSanitizer has 5 threads (!) in Mac task 
> manager
> after it hangs.

So do other tools (memcheck, and, importantly, --tool=none) also hang?
In other words, is this tsan-specific?
Comment 8 Timur Iskhodzhanov 2009-07-29 22:06:25 UTC
AFAIK, it's TSAN-specific.
I'll check that tomorrow
Comment 9 Dan Kegel 2009-07-29 22:12:24 UTC
The hang may be TSAN-specific, but that's not what I was
talking about.  If the spewage is really benign, then I
can filter it out of our logs, and move on.
Comment 10 Dan Kegel 2009-07-29 22:52:25 UTC
OK, I've filed bug 201936 for the horrible crashes that happen
after this.  I thought they were related, but perhaps not.
Comment 11 Nicholas Nethercote 2009-07-31 07:11:02 UTC
Created attachment 35745 [details]
mismatch messages

I found a totally reliable reproducer for my machine:

  valgrind --sanity-level=3 /Applications/TextEdit.app/

Attached is the output;  I tweaked the code so that the mismatch messages are more readable.

The problem is that there are areas of memory that Valgrind thinks are free, but the kernel says they contains READ|WRITE maps.  This is bad and could be causing follow-on crashes if Valgrind's mmaps the memory.  (Julian, it's worse than I though when I last spoke to you about this;  I thought it was just a permissions mismatch, not a free vs. not-free mismatch.)
Comment 12 Nicholas Nethercote 2009-07-31 08:30:56 UTC
Here's the sequence of syscalls/traps I see just before the crash for TextEdit: 


SYSCALL[98350,1](mach: 31) bootstrap_look_up("com.apple.FontObjectsServer") --> [async] ...
SYSCALL[98350,1](mach: 31) ... [async] --> Success(0x0:0x0) got port unnamed-0x3203; com.apple.FontOb
jectsServer

SYSCALL[98350,1](mach: 31) vm_allocate (mach_task_self(), at 0x0, size 4096, flags 0x32000001) --> [a
sync] ...
SYSCALL[98350,1](mach: 31) ... [async] --> Success(0x0:0x0) allocated at 0x5c54000

SYSCALL[98350,1](mach: 31) mach_make_memory_entry_64(mach_task_self(), 4096, 0x5c54000, 3, ..., 0) --
> [async] ...
SYSCALL[98350,1](mach: 31) ... [async] --> Success(0x0:0x0) got port unnamed-0x3303; memory-0x3303

SYSCALL[98350,1](mach: 31) UNHANDLED mach_msg [id 0, to com.apple.FontObjectsServer, reply 0x3103] --
> [async] ...
SYSCALL[98350,1](mach: 31) ... [async] --> Success(0x0:0x0)

SYSCALL[98350,1](mach: 31) UNHANDLED mach_msg [id 14, to com.apple.FontObjectsServer, reply 0x3103] -
-> [async] ...
SYSCALL[98350,1](mach: 31) ... [async] --> Success(0x0:0x0) got ool mem 0x17319000..0x19299000;

--98350:0:aspacem  segment mismatch: V's seg 1st, kernel's 2nd:
--98350:0:aspacem  335:      0005c55000-0005c55fff    4096 ----- SmFixed d=0x000 i=0       o=0
(-1) m=0 (none)
--98350:0:aspacem  ...: .... 0005c55000-0005c55fff    4096 rw-.. ....... d=0x000 i=0       o=0
(.) m=. (none)


The aspacem sanity check is done as a result of the "got ool" thingy.  It can't be a coincidence that the last few traps are dealing with addres 0x5c54000 and then a mismatch occurs at 0x5c55000.  The fact that the last two traps are unhandled mach_msg() traps is also suspicious.

Greg, do you have any ideas about this?
Comment 13 Julian Seward 2009-08-03 02:32:27 UTC
(In reply to comment #12)
Right, that reproduces it for me too.

> SYSCALL[98350,1](mach: 31) ... [async] --> Success(0x0:0x0) got ool mem
> 0x17319000..0x19299000;
> 
> --98350:0:aspacem  segment mismatch: V's seg 1st, kernel's 2nd:
> --98350:0:aspacem  335:      0005c55000-0005c55fff    4096 ----- SmFixed
> d=0x000 i=0       o=0
> (-1) m=0 (none)
> --98350:0:aspacem  ...: .... 0005c55000-0005c55fff    4096 rw-.. .......

I'm confused though.  If I place a sync_mappings call immediately
after the printf of the "got ool mem" msg but before
ML_(notify_core_and_tool_of_mmap) (which I think, as you say
is doing the sync check), viz:

            PRINT("got ool mem %p..%#lx; ", desc->out_of_line.address, 
                  (Addr)desc->out_of_line.address+desc->out_of_line.size);
            ////////// call sync_mappings here ////////
            ML_(notify_core_and_tool_of_mmap)(
               start, end - start, VKI_PROT_READ|VKI_PROT_WRITE, 
               VKI_MAP_PRIVATE, -1, 0);

then first of all, sync mappings finds no differences, yet the
sync check entailed by ML_(notify_core_and_tool_of_mmap) fails.

sync_mappings and the sync checker are of equal "power", right?
Is there a way the latter can detect a difference that the former
doesn't?
Comment 14 Nicholas Nethercote 2009-08-03 03:54:24 UTC
(In reply to comment #13)
> 
> sync_mappings and the sync checker are of equal "power", right?
> Is there a way the latter can detect a difference that the former
> doesn't?

I don't think so.

Are you sure you're seeing the same failure?  I tried something similar (possibly identical?) to what you did but it ended up aborting in a sync check somewhere else instead and I didn't reach any firm conclusions.
Comment 15 Nicholas Nethercote 2009-08-03 04:35:08 UTC
After some more investigation:  the added sync_mappings() is itself failing...  sync_mappings() calls ML_(notify_core_and_tool_of_mmap), which calls notify_core_of_mmap(), which calls VG_(am_notify_client_mmap), which does the sync check and fails.

I.e. we do a sync check right after a missing segment.  If there is more than one missing segment (in this case there are 73) the check will fail because there are still 72 segments that have to be added.  To fix we should probably add a bool argument which indicates if we should skip the sync check.

So I think that explains the TextEdit.app problems but not the original problem.  Hmm.
Comment 16 Julian Seward 2009-08-03 17:06:50 UTC
(In reply to comment #15)

General point: we need to have sync checking working reliably, because
without that we have no way to find out where we need to insert missing
resync calls.  So fixing the TextEdit.app problem is itself a worthy
goal.

> I.e. we do a sync check right after a missing segment.  If there is more than
> one missing segment (in this case there are 73) the check will fail because
> there are still 72 segments that have to be added.  To fix we should probably
> add a bool argument which indicates if we should skip the sync check.

I tried doing that.  And it works .. kinda .. but leads to yet another
problem.  Consider your 73 deltas scenario.  For the first change,
sync_mappings calls ML_(notify_core_and_tool_of_mmap), which calls
notify_core_of_mmap, which calls VG_(am_notify_client_mmap), passing it
the new boolean, telling it not to do a sync check, regardless of what
--sanity-level says.  And so that goes through OK.

But then ML_(notify_core_and_tool_of_mmap) calls notify_tool_of_mmap.
If that can, for any reason whatsoever, itself do a mmap (presumably
to allocate memory), then the sync checker kicks in and we're back at
square 1.  So, as it stands, I have something that works with --tool=none
since that just ignores client mmaps, but fails just as before
with Memcheck, presumably because it's mmaping more shadow memory in
response to these (purported) client mmaps.

Urrr.  So it seems like, in a resync, we need to tell the core about
all the changes first, so its picture is up to date, and only then
tell the tool about all the changes.

I'll see if I can hack something up.
Comment 17 Nicholas Nethercote 2009-08-04 00:26:43 UTC
I'm also concerned that the calls to VG_(malloc) in sync_mappings() might cause problems with the whole cycle.  Although maybe it's ok since VG_(malloc) is called before VG_(get_changed_segments).
Comment 18 Julian Seward 2009-08-04 01:00:44 UTC
(In reply to comment #17)
> I'm also concerned that the calls to VG_(malloc) in sync_mappings() might 
> cause
> problems with the whole cycle.  Although maybe it's ok since VG_(malloc) is
> called before VG_(get_changed_segments).

I also wondered about that.  I think it's not ok, since in the worst case,
the VG_(malloc) call could cause m_mallocfree to make a new mapping before
we enquire about the changed mappings with VG_(get_changed_segments), and 
so might overlap with them.

---

I thought about this problem for a long time this afternoon.  I am
beginning to  think we can't fix this properly without redesigning aspacem.
Here's a horror scenario: two threads, T1 and T2:

T1 is repeatedly doing something innocuous -- a getpid syscall (for the
client) -- but in between getpids is doing mmaps, perhaps to get hold of
new shadow memory.

Meanwhile, T2 blocks in some big, complex syscall, during which the
kernel creates a new mapping.  Eventually, T2 returns from the syscall.

So we'd have to resync after T2 returns, right?  Yes.  But we also have
to resync every time T1 returns from sys_getpid, since we know that
the T2 syscall is going to change the mappings, but we don't have any
idea exactly when -- we only know that the mappings must change before
T2's syscall returns.  If we don't do that, we have no guarantee that
the T1's shadow memory-allocating mmaps aren't going to trash T2's new
mappings.  At least, not unless the shadow mem-allocating mmaps are
of the non-FIXED variety.  Which is a pretty fragile thing to rely on
-- any other thread could do mmap-FIXED whilst T2 is still in its
syscall and inadvertantly trash the mapping.

---

We don't know when exactly the kernel will add mappings.  Hence, however
assiduous we are at resyncing, there's always the possibility of getting
out-raced by the kernel.  The only proper fix I can think of is to 
redesign aspacem so that it doesn't need to maintain an exact picture
of the address space.  However, that's a big job and certainly not a
pre-3.5.0 thing.

---

For now .. I notice, from comments #0, #1 and #2, that the sync
check always fails at one of these two places

--67607:0:aspacem  sync check at wqthread_hijack:0 (after): FAILED
--9657:0:aspacem  sync check at pthread_hijack:0 (after): FAILED

and so I wonder if merely changing the sync check to a resync
at those two points would help.  It's annoying that we still don't
have a test case that doesn't require --sanity-level=3.  The program
in Comment #2 runs fine for me, alas.
Comment 19 Nicholas Nethercote 2009-08-04 01:26:39 UTC
(In reply to comment #18)
> 
> I think it's not ok, since in the worst case,
> the VG_(malloc) call could cause m_mallocfree to make a new mapping before
> we enquire about the changed mappings with VG_(get_changed_segments), and 
> so might overlap with them.

Hmm, yes.

> We don't know when exactly the kernel will add mappings.  Hence, however
> assiduous we are at resyncing, there's always the possibility of getting
> out-raced by the kernel.  The only proper fix I can think of is to 
> redesign aspacem so that it doesn't need to maintain an exact picture
> of the address space.  However, that's a big job and certainly not a
> pre-3.5.0 thing.

MAP_FIXED is a bad idea if we don't have total knowledge of the mappings.  But is a big redesign really necessary?  What if you just turned off MAP_FIXED in V's mmap requests to the kernel?  Eg. see the comment "use advisory as a hint only" and following code at aspacemgr-linux.c:2382.  Seems like the worst that could happen is that Valgrind's attempts to keep client maps and Valgrind maps separate might be less successful.  And it would only be necessary on Darwin.  

(Hmm, some of the mappings done in macho.c look like they must be FIXED.  But only those?  At least the VG_(am_*_float_*) calls could be done without FIXED on Darwin.)

> I wonder if merely changing the sync check to a resync
> at those two points would help.

Probably the best thing to do in the short-term.
Comment 20 Julian Seward 2009-08-04 01:52:20 UTC
Created attachment 35831 [details]
resync after {wq,p}thread_hijack rather than merely doing a check
Comment 21 Julian Seward 2009-08-04 01:56:07 UTC
(In reply to comment #20)
> Created an attachment (id=35831) [details]
> resync after {wq,p}thread_hijack rather than merely doing a check

Dan, Peter, Alexander, can you try this patch and see if makes any
difference?
Comment 22 Alexander Potapenko 2009-08-04 12:03:02 UTC
I've run the patch on the problematic Chromium tests under Memcheck and ThreadSanitizer. All of them passed without any 'aspacem' reports under Memcheck, and some of them even pass under TSAN (some still hang, however, but without any reports). Neither of the tests crashed.
Comment 23 Julian Seward 2009-08-04 17:52:48 UTC
(In reply to comment #22)
> I've run the patch on the problematic Chromium tests under Memcheck and
> ThreadSanitizer. All of them passed without any 'aspacem' reports under
> Memcheck, and some of them even pass under TSAN (some still hang, however, but
> without any reports). Neither of the tests crashed.

So are you saying that it's an improvement?  What I really want to know
is, not whether it still prints the messages, but whether the associated
crashes (as noted in bug 201936) have gone away.
Comment 24 Dan Kegel 2009-08-04 19:18:26 UTC
Trying it now.
Comment 25 Dan Kegel 2009-08-05 00:25:54 UTC
I've seen one associated crash on the chrome ui tests in four hours
using your sync patch (plus the fork patch) on top of the july 15 sources.
That doesn't mean much, there could be several sources of corruption,
and sometimes it takes a long time before I see a burst of errors.
Will continue running overnight. 

It definitely got rid of the aspacem warnings, though.
Comment 26 Dan Kegel 2009-08-05 15:49:02 UTC
I've now seen four crashes in an overnight run,
which is within the range I was seeing before.

I'm inclined to blame the crashes on the app
rather than valgrind.  Chrome is quite complicated
and does crash at some low rate in the real world,
so there is almost certainly some memory corruption
bug in it somewhere.  We'll just have to keep looking
for it.  In the meantime, getting rid of the aspacem
warnings is a definite Good Thing.
Comment 27 Julian Seward 2009-08-07 22:30:27 UTC
Patch in comment #20 committed as r10751.  Not sure if we should
close this bug or not.  The patch might help, but to properly fix
this will require major reengineering of m_aspacemgr.
Comment 28 Nicholas Nethercote 2009-08-10 03:45:43 UTC
I'll close it and open a new one regarding the remaining problems.
Comment 29 Nicholas Nethercote 2009-08-10 04:09:06 UTC
The follow-up bug is bug 203254.  It's marked as a 3.6.0 blocker, it's too big to be done for 3.5.0.