Bug 359871 - Incorrect mask handling in ppoll
Summary: Incorrect mask handling in ppoll
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: general (show other bugs)
Version: unspecified
Platform: Other Linux
: NOR normal
Target Milestone: ---
Assignee: Ivo Raisr
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-02-27 21:52 UTC by Steven Smith
Modified: 2016-03-08 09:05 UTC (History)
1 user (show)

See Also:
Latest Commit:
Version Fixed In:


Attachments
Test program (631 bytes, text/x-c++src)
2016-02-27 21:52 UTC, Steven Smith
Details
Kind of a fix, if you squint (751 bytes, patch)
2016-02-27 21:54 UTC, Steven Smith
Details
Another variant of the test program (461 bytes, text/x-c++src)
2016-03-01 15:45 UTC, Steven Smith
Details
patch for ppoll on Solaris (4.40 KB, patch)
2016-03-01 22:12 UTC, Ivo Raisr
Details
patch for ppoll on Solaris #2 (6.59 KB, patch)
2016-03-02 22:34 UTC, Ivo Raisr
Details
Port the ppoll patch to Linux, and extend to cover pselect6 (14.62 KB, patch)
2016-03-03 06:45 UTC, Steven Smith
Details
Second attempt at a linux fix (14.10 KB, patch)
2016-03-04 05:43 UTC, Steven Smith
Details
Third attempt at fixing it on Linux. Still doesn't quite work. (14.23 KB, patch)
2016-03-04 16:45 UTC, Steven Smith
Details
Forth attempt at a Linux fix. I think this one is actually correct. (21.05 KB, patch)
2016-03-05 09:02 UTC, Steven Smith
Details
combined patch for ppoll and pselect on Linux and Solaris (20.07 KB, patch)
2016-03-05 22:00 UTC, Ivo Raisr
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Steven Smith 2016-02-27 21:52:47 UTC
Created attachment 97583 [details]
Test program

The ppoll() syscall wrapper applies precisely the signal mask which the guest asks for, which leads to deadlocks if the guest asks to block SIGVGKILL. The attached test program shows the issue (exits after 1 second with SIGALRM if run natively; sits until ctrl-C'd if run under Valgrind), and the attached patch seems to fix it (against r15817).

I'm not entirely convinced that the patch is the right approach, though (modifying the guest's copy of the sigmask when the kernel doesn't doesn't seem like a great idea). The obvious answer, of doing the entire call in the PRE() method and setting status to SsComplete, works even worse, though, because if you get a signal there then fixup_guest_state_after_syscall_interrupted() doesn't know how to deal with it.
Comment 1 Steven Smith 2016-02-27 21:54:19 UTC
Created attachment 97584 [details]
Kind of a fix, if you squint
Comment 2 Ivo Raisr 2016-02-29 20:17:36 UTC
Thank you for the test case provided.
It should not hardcode "64" but instead rely on at least _VKI_NSIG. Eventually block a range of signals.

I thought we were already doing some work on signals in sanitize_client_sigmask(). Seems like I was mistaken.
Comment 3 Steven Smith 2016-03-01 15:45:15 UTC
Created attachment 97622 [details]
Another variant of the test program

I didn't use VKI_NSIG or VKI_SIGVGKILL in the test program because they're not exposed in any of the headers in the Valgrind packages I have locally, and it was easier to make it not need any VG-internal bits. Just blocking every signal you can get at seems to reproduce it as well, so how about this one?
Comment 4 Steven Smith 2016-03-01 15:48:51 UTC
The reason that sanitize_client_sigmask() doesn't help here is that ppoll() sets its own temporary signal mask, independent of the normal thread one, and sanitization is only applied to the thread's mask. The special mask used by ppoll() is passed through verbatim from the client to the kernel, so if the client says ``block SIGVGKILL'', that's exactly what happens.

I don't know a great deal (or anything, really) about the Valgrind internals, but I imagine the correct fix would involve applying sanitization to the ppoll() mask as well. My first attempt at a fix was to have the PRE() hook copy the necessary bits into monitor memory and then VG_(do_syscall) itself to actually run the call itself, except that then you've unblocked signals outside of the blksys_setup/blksys_finished range, which seems to confuse VG_(fixup_guest_state_after_syscall_interrupted). Extending the core signal handling logic enough to handle that seemed like it was going to be hairy; hence giving up and just modifying the client's sigmask in-place. Which seems to work, and was certainly enough that the program I was working on started running under Valgrind, but would perhaps cause problems for some other clients?
Comment 5 Ivo Raisr 2016-03-01 22:12:08 UTC
Created attachment 97632 [details]
patch for ppoll on Solaris

This fixes the problematic functionality for ppoll on Solaris.
On Linux needs some investigation if pselect and ppoll use the same syscall wrapper or different ones.
Comment 6 Ivo Raisr 2016-03-01 22:13:26 UTC
Steven, thank you for the test case provided and a suggested fix.
Indeed, simple things are very often also effective.
I will prepare patch for Linux as well and have that reviewed.
Comment 7 Ivo Raisr 2016-03-02 06:12:33 UTC
The attached patch has two problems:
- it modifies a client memory which may be read-only (see for example bugs for modifying env)
- it modifies a client memory and does not restore the changes after syscall
Comment 8 Ivo Raisr 2016-03-02 22:34:49 UTC
Created attachment 97643 [details]
patch for ppoll on Solaris #2

This patch fixes shortcomings listed previously, that is client memory is not modified (as it could be read-only).
Comment 9 Ivo Raisr 2016-03-02 22:38:22 UTC
Steven, would you be able to prepare a patch for Linux using similar technique?
I cannot currently devote to developing it for a while.
Beware that Linux has ppoll and pselect6 syscalls. And pselect6 encodes the last argument
as a pointer to a structure of the form [1]:

struct {
    const sigset_t *ss;     /* Pointer to signal set */
    size_t          ss_len;    /* Size (in bytes) of object pointed to by 'ss' */
};

Linux syscall wrapper is completely missing any checking of 'ss' contents. You could also add that.

[1] http://linux.die.net/man/2/pselect6
Comment 10 Steven Smith 2016-03-03 06:45:12 UTC
Created attachment 97645 [details]
Port the ppoll patch to Linux, and extend to cover pselect6

Sure, here's a patch which does something similar for Linux ppoll and pselect, and also adds a new test case for pselect().

I'm not sure what the usual workflow is for Valgrind development, so I've just attached a single flattened patch. There's also a slightly more decomposed variant on my github page at https://github.com/sos22/vgrind-359871, if you'd prefer that.

Caveat: the only machine I have to test with at the moment is amd64 Linux, so I've not checked it on any other platforms
Comment 11 Ivo Raisr 2016-03-03 21:19:32 UTC
Thank you for the patch covering Linux.

I have only three minor comments:
- Why assert ARG7 == 0 in pselect6 wrapper? On some architectures arg7 would be on stack,
  containing some random values. On others arg7 could refer to a register containing garbage
  as well.
- The first argument (cost center) to VG_(malloc)() should reflect the function/wrapper name. On Solaris it was pollsys, on Linux it is pselect6 or ppoll.
- Please use 'vgopts: -q' in pselect_alarm.vgtest so the stack trace is not a part of stderr.exp. It differs between operating systems. If you insist on having there some stack information then for example a filter can be used to filter anything before main().
Comment 12 Steven Smith 2016-03-04 05:43:10 UTC
Created attachment 97670 [details]
Second attempt at a linux fix

Okay, here's another variant which sets more sensible cost centers for VG_(malloc), uses -q in the test case, and avoids playing with ARG7.

The ARG7 business was there to let the PRE hook communicate to the POST one whether it had actually done the substitution, so that it knew whether it had to call VG_(free). I didn't want to call ML_(safe_to_deref) again because I was worried about racing with the guess calling mmap(), and I didn't want to do it unconditionally because I wasn't sure how to build the substitution if some of the calls to ML_(safe_to_deref) said no. Looking at getSyscallArgsFromState(), it seemed like ARG7 was always initialised to zero on Linux, so it'd be a safe place for an extra flag (and I figured that if I was wrong an assertion failure would be easier to track down than a bad VG_(free)()), but I see now that mips leaves it uninitialised, so that isn't going to work.

The new patch always allocates and releases the substitution whenever ARG6 is non-NULL, and just VKI_EFAULTs any calls where it can't get the guest's desired mask. I also converted the ppoll wrapper to the same model, just for symmetry.

Thank you for the review.
Comment 13 Ivo Raisr 2016-03-04 11:01:47 UTC
Thank you for the fixes. Everything seems to be reasonable, except for the error paths.
If an error path is taken (setting status failure), allocated memory is not assigned to ARG6/ARG4.
This means the memory is leaked and VG_(free) will attempt to free arbitrary client memory.
Comment 14 Steven Smith 2016-03-04 16:45:08 UTC
Created attachment 97682 [details]
Third attempt at fixing it on Linux. Still doesn't quite work.

Oh, crud, that was embarrassing. Here's another one which avoids that particular error, and even extends the test case to include a failing pselect...

... which just brings us to the next problem:

valgrind: m_syswrap/syswrap-main.c:1950 (vgPlain_client_syscall): Assertion 'eq_SyscallArgs(&sci->args, &sci->orig_args)' failed.

host stacktrace:
==12189==    at 0x38084978: show_sched_status_wrk (m_libcassert.c:343)
==12189==    by 0x38084A94: report_and_quit (m_libcassert.c:415)
==12189==    by 0x38084C21: vgPlain_assert_fail (m_libcassert.c:481)
==12189==    by 0x380D7A44: vgPlain_client_syscall (syswrap-main.c:1950)
==12189==    by 0x380D414A: handle_syscall (scheduler.c:1118)

From here:

   if (sci->status.what == SsComplete && sr_isError(sci->status.sres)) {
      /* The pre-handler decided to fail syscall itself. */
      PRINT(" --> [pre-fail] %s", VG_(sr_as_string)(sci->status.sres));
      /* In this case, the pre-handler is also allowed to ask for the
         post-handler to be run anyway.  Changing the args is not
         allowed. */
      vg_assert(0 == (sci->flags & ~(SfMayBlock | SfPostOnFail | SfPollAfter)));
      vg_assert(eq_SyscallArgs(&sci->args, &sci->orig_args));
   }

So I can't use the syscall arguments to communicate from the PRE to POST handlers if the PRE call fails the syscall. Do you know if there any other ways of coordinating between them? Or at least some way for the POST handler to tell whether an error came from the PRE handler or from the actual kernel syscall?

I'm sorry to have needed quite this many iterations on this one; it turns out to be a little more involved than I'd expected.
Comment 15 Ivo Raisr 2016-03-04 16:54:45 UTC
Thank you for quick fix and analysis. Yes, indeed, changing args for failed syscall wrapper is not allowed. I wonder why you need to explicitly set failure status if kernel is going to do anyway.
Could you just check if all conditions are met and only then allocate memory and assign it to ARG6/ARG4?
Comment 16 Ivo Raisr 2016-03-04 17:21:02 UTC
Please could you also move the test for failed pselect into memcheck/tests/x86-linux/scalar.c
where it truly belongs. See its header.
Thanks a lot!
Comment 17 Steven Smith 2016-03-05 09:02:29 UTC
Created attachment 97688 [details]
Forth attempt at a Linux fix. I think this one is actually correct.

My main concern was that, because we drop the global lock while making the syscall, we might be racing with some other guest thread calling mmap/munmap, so the system call might happen even if ML_(safe_to_deref) says no, and if we've skipped the sanitisation the bug I was trying to fix will come back. Similarly, if POST calls ML_(safe_to_deref) to figure out whether to VG_(free) things then it might either free something it shouldn't or allow the memory to leak. Force-failing guest calls where we can't do the sanitisation seemed like the easiest way of avoiding the issues, except that I still need some way for the PRE call to tell the POST one what/whether it needs to VG_(free), which brought me to the assertion failures in my last attempt.

But I suppose it's possible to make syscalls fail without using SET_STATUS_Failure, so that's what this variant does. I think it covers all the relevant cases, or at least all of the ones I've thought of, even if it is a bit more dependent on kernel implementation details than I would have liked.

Perhaps I was just overthinking things and the right answer would have been to just ignore the races? The guest would have to be doing something pretty crazy for them to matter, after all. On the other hand, I think this one'll work, and it'd be nice to get the edge cases right, even if it is perhaps a little pedantic.

This patch also includes some more bits in scalar.c, but I'm not convinced they represent a sensible way of testing this:

1) I can reproduce the issue with --tool=none, so it clearly isn't a problem with memcheck, and it seems a bit odd to put the test case there.
2) The scalar.c regtest fails on my machine even without any of my changes, seemingly because it only works in 32 bit more and I have a 64 bit system, and it doesn't seem to like the debug symbols on the libc in my cross-compile setup. That makes it tricky for me to do any sensible testing of the test case.

I defer to your judgement on whether this ends up being a net positive for the codebase.
Comment 18 Ivo Raisr 2016-03-05 21:06:28 UTC
Thank you for a quick turnaround. I think this should work.
As regards the scalar test, if you are not able to get it working with high confidence then we should probably leave it alone.
I will combine patches for Linux and Solaris together and commit them.
Comment 19 Ivo Raisr 2016-03-05 22:00:54 UTC
Created attachment 97701 [details]
combined patch for ppoll and pselect on Linux and Solaris
Comment 20 Steven Smith 2016-03-07 04:16:39 UTC
Looks good to me, thank you.
Comment 21 Ivo Raisr 2016-03-08 09:05:06 UTC
Committed in SVN revision 15823.