Bug 373192 - Calling posix_spawn in glibc 2.24 completely broken
Summary: Calling posix_spawn in glibc 2.24 completely broken
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: general (other bugs)
Version First Reported In: 3.12.0
Platform: Compiled Sources Linux
: NOR normal
Target Milestone: ---
Assignee: Julian Seward
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-12-02 16:23 UTC by Fredrik Hallenberg
Modified: 2016-12-12 10:00 UTC (History)
1 user (show)

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:


Attachments
Trace with fc25 (7.06 KB, text/plain)
2016-12-06 08:58 UTC, Fredrik Hallenberg
Details
Trace with fc24 (17.81 KB, text/plain)
2016-12-06 09:00 UTC, Fredrik Hallenberg
Details
tentative patch (ugly, only for amd64) patch to better support clone(vfork) and stack (3.25 KB, text/plain)
2016-12-08 20:23 UTC, Philippe Waroquiers
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Fredrik Hallenberg 2016-12-02 16:23:32 UTC
Executing the program below with valgrind 3.12 built from sources on Fedora 25 gives the following output:

pid 21786 child 21787 r 0
pid 21787 child 83036382 r 0

Without valgrind:

pid 21805 child 21806 r 0
<output from ls>

Valgrind completely breaks the posix_spawn call by causing some kind of fork and also not executing the ls.


#include <unistd.h>
#include <spawn.h>
#include <stdio.h>

int main()
{
  pid_t pid;
  char cmd[] = "ls";
  char* arg[] = { cmd, NULL };
  int r = posix_spawnp(&pid, cmd, NULL, NULL, arg, NULL);
  fprintf(stderr, "pid %d child %d r %d\n", getpid(), pid, r);
}
Comment 1 Fredrik Hallenberg 2016-12-02 16:35:15 UTC
The same example worked on fedora 24 (glibc 2.23, linux 4.7.9), fedora 25 uses glibc 2.24 and linux 4.8.8.
Comment 2 Philippe Waroquiers 2016-12-03 13:12:51 UTC
Looking at the differences between 2.23 and 2.24, I suspect the problem
is created because 2.24 is implementing posix_spawn(p) using
clone (CLONE_VM| CLONE_VFORK), while 2.23 uses fork or vfork.
Valgrind does not implement vfork, it just considers vfork is a fork,
hoping for the best.

It looks like the new glibc posix_spawn implementation depends more
on the precise semantic of vfork (or more exactly, of clone(CLONE_VM| CLONE_VFORK),
such as sharing address space and/or having the calling thread suspended
waiting for the child to exec or exit.
If the dependency is only on the 'being suspended" aspect, that might 
be relatively easy to fix.

Hwoever, from the trace, it looks like both the parent and the child
have executed the fprintf. Not very clear how that can happen.

Can you run with --trace-syscalls=yes --trace-children=yes --time-stamp=yes
and attach the resulting trace ?


Thanks
Comment 3 Fredrik Hallenberg 2016-12-06 08:58:55 UTC
Created attachment 102643 [details]
Trace with fc25
Comment 4 Fredrik Hallenberg 2016-12-06 09:00:53 UTC
Created attachment 102644 [details]
Trace with fc24
Comment 5 Fredrik Hallenberg 2016-12-06 09:07:26 UTC
Also, as has been noted in bug 342040, it seems the clone call is not handled correctly when looking with strace:

strace valgrind ~/spawn|& grep clone
=>
clone(child_stack=NULL, flags=SIGCHLD)

strace ~/spawn|& grep clone
=>
clone(child_stack=0x7f13dd5d6ff0, flags=CLONE_VM|CLONE_VFORK|SIGCHLD)
Comment 6 Philippe Waroquiers 2016-12-07 21:26:24 UTC
(In reply to megahallon from comment #5)
> Also, as has been noted in bug 342040, it seems the clone call is not
> handled correctly when looking with strace:
> 
> strace valgrind ~/spawn|& grep clone
> =>
> clone(child_stack=NULL, flags=SIGCHLD)
> 
> strace ~/spawn|& grep clone
> =>
> clone(child_stack=0x7f13dd5d6ff0, flags=CLONE_VM|CLONE_VFORK|SIGCHLD)

This is normal. Valgrind transforms a vfork into a fork, as it does
not support vfork semantic.
See the following code in e.g. syswrap-amd64-linux.c:
   case VKI_CLONE_VFORK | VKI_CLONE_VM: /* vfork */
      /* FALLTHROUGH - assume vfork == fork */
      cloneflags &= ~(VKI_CLONE_VFORK | VKI_CLONE_VM);

   case 0: /* plain fork */
      ...
Comment 7 Philippe Waroquiers 2016-12-07 22:48:20 UTC
(In reply to megahallon from comment #0)
> Executing the program below with valgrind 3.12 built from sources on Fedora
> 25 gives the following output:
> 
> pid 21786 child 21787 r 0
> pid 21787 child 83036382 r 0

As far as I can see, the above happens because Valgrind transforms a
  clone (SIGCHLD | CLONE_VFORK | CLONE_VM)
into a simple
  clone(SIGCHLD)
but wrongly assumes that both parent and child will check for clone
return code, and executes 'their' part of the code.

I have not (yet) understood where/how Valgrind gives the 
guest thread function pointer for a 'thread clone'.
If/when I understand that, it might be possible to ensure that the cloned
child 'jumps' to the correct function for a vfork clone transformed
into a 'normal fork'.
Comment 8 Philippe Waroquiers 2016-12-08 20:23:48 UTC
Created attachment 102691 [details]
tentative patch (ugly, only for amd64) patch to better support clone(vfork) and stack

Can you please try the attached patch ? (patch on latest SVN version)

It seems to improve the clone test case provided in bug 342040,
and I suspect fc25 posix_spawn might fail for similar reason, i.e.
that the parent has prepared a stack for the child, but the child was just
using the copy of the parent stack.
Comment 9 Fredrik Hallenberg 2016-12-09 09:03:46 UTC
(In reply to Philippe Waroquiers from comment #8)
> Created attachment 102691 [details]
> tentative patch (ugly, only for amd64) patch to better support clone(vfork)
> and stack
> 
> Can you please try the attached patch ? (patch on latest SVN version)
> 
> It seems to improve the clone test case provided in bug 342040,
> and I suspect fc25 posix_spawn might fail for similar reason, i.e.
> that the parent has prepared a stack for the child, but the child was just
> using the copy of the parent stack.

Yes as far as I can see it works fine with this patch.
Comment 10 Philippe Waroquiers 2016-12-11 21:40:20 UTC
Fixed in revision 16186.
Retesting welcome.
Comment 11 Fredrik Hallenberg 2016-12-12 10:00:43 UTC
(In reply to Philippe Waroquiers from comment #10)
> Fixed in revision 16186.
> Retesting welcome.

I have tested with the svn version and it is still fine, thanks for the fix.