342040 – Valgrind mishandles clone with CLONE_VFORK | CLONE_VM that clones to a different stack

Bug 342040 - Valgrind mishandles clone with CLONE_VFORK | CLONE_VM that clones to a different stack

Summary: Valgrind mishandles clone with CLONE_VFORK | CLONE_VM that clones to a differ...

Status:	RESOLVED FIXED

Alias:	None

Product:	valgrind
Classification:	Developer tools
Component:	general (show other bugs)
Version:	3.10 SVN
Platform:	Other Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	Julian Seward

URL:
Keywords:

Depends on:
Blocks:

Reported:	2014-12-19 21:19 UTC by Steven Stewart-Gallus
Modified:	2016-12-13 19:59 UTC (History)
CC List:	2 users (show)

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:

Attachments
Add an attachment

Note You need to log in before you can comment on or make changes to this bug.

Description Steven Stewart-Gallus 2014-12-19 21:19:05 UTC

Typically, clone is used with the CLONE_VFORK | CLONE_VM options to emulate the vfork system call but with additional options. However, it is extremely awkward and dangerous to use vfork functionality in such a way that stacks are shared. To make things safer and avoid getting into trouble with compilers like Clang misoptimizing things I plan to use clone with CLONE_VFORK | CLONE_VM and not share stacks but jump to a newly allocated one. CLONE_VM alone without CLONE_VFORK is explicitly unsupported by Valgrind.

Reproducible: Always

Steps to Reproduce:
1. Compile attached program
2. Run under Valgrind
3. See weird crash

Actual Results:  
==26098== Memcheck, a memory error detector
==26098== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==26098== Using Valgrind-3.11.0.SVN and LibVEX; rerun with -h for copyright info
==26098== Command: ./a.out
==26098== 
==26099== Invalid write of size 4
==26099==    at 0x4009B0: main (valgrind-vfork.c:48)
==26099==  Address 0xffffffffffffffd4 is not stack'd, malloc'd or (recently) free'd
==26099== 
==26099== 
==26099== Process terminating with default action of signal 11 (SIGSEGV)
==26099==  Access not within mapped region at address 0xFFFFFFFFFFFFFFD4
==26099==    at 0x4009B0: main (valgrind-vfork.c:48)
==26099==  If you believe this happened as a result of a stack
==26099==  overflow in your program's main thread (unlikely but
==26099==  possible), you can try to increase the size of the
==26099==  main thread stack using the --main-stacksize= flag.
==26099==  The main thread stack size used in this run was 8720384.
==26099== 
==26099== HEAP SUMMARY:
==26099==     in use at exit: 0 bytes in 0 blocks
==26099==   total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==26099== 
==26099== All heap blocks were freed -- no leaks are possible
==26099== 
==26099== For counts of detected and suppressed errors, rerun with: -v
==26099== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
==26098== 
==26098== HEAP SUMMARY:
==26098==     in use at exit: 0 bytes in 0 blocks
==26098==   total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==26098== 
==26098== All heap blocks were freed -- no leaks are possible
==26098== 
==26098== For counts of detected and suppressed errors, rerun with: -v
==26098== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)


Expected Results:  
Run normally.

My program that needs this behaviour: https://gitorious.org/linted/linted

The program to reproduce the crash:

#define _GNU_SOURCE

#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int fork_routine(void *arg)
{
	_Exit(EXIT_SUCCESS);
}

int main(void)
{
	long page_size = sysconf(_SC_PAGE_SIZE);
	assert(page_size != -1);

	/* We need an extra page for signals */
	long stack_size = sysconf(_SC_THREAD_STACK_MIN) + page_size;
	assert(stack_size != -1);

	size_t stack_and_guard_size = page_size + stack_size + page_size;
	void *child_stack = mmap(
	    NULL, stack_and_guard_size, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANONYMOUS | MAP_GROWSDOWN | MAP_STACK, -1, 0);
	if (NULL == child_stack) {
		perror("mmap");
		return EXIT_FAILURE;
	}

	/* Guard pages are shared between the stacks */
	if (-1 == mprotect((char *)child_stack, page_size, PROT_NONE)) {
		perror("mprotect");
		return EXIT_FAILURE;
	}

	if (-1 == mprotect((char *)child_stack + page_size + stack_size,
	                   page_size, PROT_NONE)) {
		perror("mprotect");
		return EXIT_FAILURE;
	}

	void *stack_start = (char *)child_stack + page_size + stack_size;
	pid_t child =
	    clone(fork_routine, stack_start,
	          SIGCHLD | CLONE_VFORK | CLONE_VM, NULL);
	if (-1 == child) {
		perror("clone");
		return EXIT_FAILURE;
	}

	for (;;) {
		int xx;
		switch (waitpid(child, &xx, 0)) {
		case -1:
			switch (errno) {
			case EINTR:
				continue;
			default:
				perror("waitpid");
				return EXIT_FAILURE;
			}

		default:
			return EXIT_SUCCESS;
		}
	}
}

Comment 1 Nach 2016-08-28 18:32:51 UTC

While testing my own implementation of posix_spawn() (http://pubs.opengroup.org/onlinepubs/9699919799/functions/posix_spawn.html) using the following snippet:

char stack[4096];
pid_t pid = clone(child, stack+sizeof(stack), CLONE_VM|CLONE_VFORK|SIGCHLD, args);

I also noticed this being mishandled. Running valgrind (valgrind-3.12.0.SVN) through strace, I see valgrind is running this code as:
clone(child_stack=0, flags=SIGCHLD)     = 4070

While dropping off some of the flags is annoying, and it really should NOT be doing that, it's setting the child_stack to 0! This is a garaunteed segmentation fault or other nasty behavior occuring. Notice the address in above message  "Access not within mapped region at address 0xFFFFFFFFFFFFFFD4", that's stack growing downwards on a 64-bit platform from 0. 0 only seems to be allowed as a child_stack with very specific flags.

The popular musl (http://www.musl-libc.org/) C library's posix_spawn() is also affected as it uses similar code internally (http://git.musl-libc.org/cgit/musl/tree/src/process/posix_spawn.c?id=8f7bc690f07e90177b176b6e19736ad7c1d49840#n168).

Valgrind should honor the specified child_stack or instead replace with its own managed stack instead of just setting it to 0 with these flags, which makes it impossible to debug programs using these implementations of posix_spawn().

Comment 2 Philippe Waroquiers 2016-12-07 21:55:16 UTC

(In reply to Steven Stewart-Gallus from comment #0)
> Typically, clone is used with the CLONE_VFORK | CLONE_VM options to emulate
> the vfork system call but with additional options. However, it is extremely
> awkward and dangerous to use vfork functionality in such a way that stacks
> are shared. To make things safer and avoid getting into trouble with
> compilers like Clang misoptimizing things I plan to use clone with
> CLONE_VFORK | CLONE_VM and not share stacks but jump to a newly allocated
> one. CLONE_VM alone without CLONE_VFORK is explicitly unsupported by
> Valgrind.
CLONE_VM without CLONE_VFORK is effectively unsupported.
However, the way valgrind "supports" CLONE_VM|CLONE_VFORK is to just
remove these 2 flags: valgrind does not support vfork semantic, and
so transforms a vfork into a fork, hoping the application does not
depends on the vfork semantic.

See the following code in e.g. syswrap-amd64-linux.c:
   case VKI_CLONE_VFORK | VKI_CLONE_VM: /* vfork */
      /* FALLTHROUGH - assume vfork == fork */
      cloneflags &= ~(VKI_CLONE_VFORK | VKI_CLONE_VM);

   case 0: /* plain fork */
      ...

That being said, it is unclear why SEGV is raised.
A simple fork test (none/tests/fork) gives the following strace -f valgrind:
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f15d396a9d0) = 18741

With your test program, strace -f valgrind gives:
clone(child_stack=0, flags=SIGCHLD) = 18774

So, I would guess that what causes the SEGV is not the child stack being 0,
as this looks 'classical' when vfork-ing or clone-ing under Valgrind,
but rather the CLEARTID/SETTID and child_tidptr aspects.

Comment 3 Philippe Waroquiers 2016-12-08 20:24:40 UTC

Can you please try the patch attached to bug 373192 ?
Thanks

Comment 4 Nach 2016-12-08 23:21:04 UTC

Hello Philippe,

I double checked with Valgrind 3.12.0 to ensure the bug was still occuring (it was), and then applied your patch.
The basic code I was testing no longer crashes and seems to be mostly working as it should with your patch!

Strace now shows: clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f8e5944d9d0) = 16514

Which I'm not sure what to make of.


Anyways, I'm setting a seperate stack to use, and some basic testing shows things are behaving sanely, although I'm not certain the stack I'm specifying is even being used.

Of the flags, CLONE_VM|CLONE_VFORK|SIGCHLD, SIGCHLD is indeed being delivered to the parent, and the child appears to be scheduled before the parent (at least when tested across a bunch of runs), although the memory is not being shared, so it would appear CLONE_VM is being ignored.

For the needs of a well implemented posix_spawn() such as in musl or mine, CLONE_VM and CLONE_VFORK are really just optimizations, so it matters little if either was not respected. The main issue is to ensure a seperate stack is in operation in the child in order to not mangle the parent's, and preliminary testing shows it now appears to be fine.

I'll do further testing with more complicated cases (heavier stack use in child) to ensure whatever clone mangling is done seems to work for these scenerios. However, if a library/application is actually expecting the shared memory semantics, it seems a little more work is required.

Comment 5 Philippe Waroquiers 2016-12-11 21:41:14 UTC

Fixed in revision 16186.
Retesting welcome.

Thanks for the test case, added as none/tests/linux/clonev

Comment 6 Philippe Waroquiers 2016-12-11 22:25:10 UTC

(In reply to Nach from comment #4)
> Strace now shows: clone(child_stack=NULL,
> flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> child_tidptr=0x7f8e5944d9d0) = 16514
> 
> Which I'm not sure what to make of.
The above is normal. Valgrind does not keep the CLONE_VM flag, as there
is no way to support this: valgrind has to do a lot of work in the child
and with CLONE_VM, any work done on any Valgrind data structure would
corrupt the parent Valgrind data structures.


> 
> 
> Anyways, I'm setting a seperate stack to use, and some basic testing shows
> things are behaving sanely, although I'm not certain the stack I'm
> specifying is even being used.
The fact that you see a NULL stack is also normal: this is the 'valgrind host
stack'. The stack that you provide in the clone syscall is set as the
guest stack (not visible in the syscall, this is only visible in the simulated
CPU SP register).


> Of the flags, CLONE_VM|CLONE_VFORK|SIGCHLD, SIGCHLD is indeed being
> delivered to the parent, and the child appears to be scheduled before the
> parent (at least when tested across a bunch of runs), although the memory is
> not being shared, so it would appear CLONE_VM is being ignored.
In the final version of the commited patch, CLONE_VFORK is kept.
There is no way to support CLONE_VM.
> 
> For the needs of a well implemented posix_spawn() such as in musl or mine,
> CLONE_VM and CLONE_VFORK are really just optimizations, so it matters little
> if either was not respected. The main issue is to ensure a seperate stack is
> in operation in the child in order to not mangle the parent's, and
> preliminary testing shows it now appears to be fine.
> 
> I'll do further testing with more complicated cases (heavier stack use in
> child) to ensure whatever clone mangling is done seems to work for these
> scenerios. However, if a library/application is actually expecting the
> shared memory semantics, it seems a little more work is required.
Thanks for any additional testing or verification you could do.

Comment 7 Nach 2016-12-13 19:59:15 UTC

I ran a bunch more tests to ensure stability with the stack in the child.

It's not using the exact stack the parent is specifying to use (beyond setting some a few bytes at the top of it), which may or may not be a problem for some (application which save child state?). However I can confirm that the stack in the child is as big as it needs to be. The larger the stack I tell clone() to use, the larger the stack appears to be in the child (which without would stack overflow). So it would seem any sane application using clone() with a stack which doesn't care about actually getting the stack data directly in the parent seems to work as expected.

On very large stacks I did see this warning:
==8967== Warning: client switching stacks?  SP change: 0xa5fd030 --> 0x65fd020
==8967==          to suppress, use: --max-stackframe=67108880 or greater

But again, things work as expected. So all my testing shows this patch is good except it does not add support for more advanced cases (CLONE_VM, child stack access in parent).

I cannot speak for the original bug reporter, but for the cases I'm dealing with, I consider this bug fixed.