Bug 427433 - Unable to run QProcess: invalid file descriptor in syscall clone()
Summary: Unable to run QProcess: invalid file descriptor in syscall clone()
Status: RESOLVED WORKSFORME
Alias: None
Product: valgrind
Classification: Developer tools
Component: memcheck (other bugs)
Version First Reported In: unspecified
Platform: Arch Linux Linux
: NOR critical
Target Milestone: ---
Assignee: Julian Seward
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-10-08 05:57 UTC by Matthias Klein
Modified: 2021-10-13 21:06 UTC (History)
3 users (show)

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:


Attachments
Source code of example to reproduce and valgrind logfile (1.81 KB, application/x-zip-compressed)
2020-10-08 05:57 UTC, Matthias Klein
Details
Simplified (but not working) testcase (4.13 KB, text/x-csrc)
2020-10-08 19:18 UTC, Thiago Macieira
Details
Simplified (but not working) testcase with 4 pipes (5.55 KB, text/x-csrc)
2020-10-08 20:43 UTC, Thiago Macieira
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Matthias Klein 2020-10-08 05:57:34 UTC
Created attachment 132203 [details]
Source code of example to reproduce and valgrind logfile

A simple Qt program with QProcess running under valgrind/memcheck will cause the child process not to run, and valgrind will write errors/stacktraces. See attachment. A test with Qt 5.12.9 ran without problems.

- Distribution: Arch Linux
- GCC 10.2.0
- Valgrind: 3.16.1
- Kernel: 5.8.13-arch1-1
- Qt 5.15.1

I have previously reported the problem to Qt, and they think it is a valgring bug: https://bugreports.qt.io/browse/QTBUG-87230

> The clone syscall doesn't get a file descriptor as a parameter, only as output, which means Valgrind is failing to interpret correctly what the kernel returned.
> This looks like a valgrind bug.
Comment 1 Thiago Macieira 2020-10-08 19:18:52 UTC
Created attachment 132222 [details]
Simplified (but not working) testcase

The attached is a simplified testcase I had used to submit https://sourceware.org/bugzilla/show_bug.cgi?id=26562. It uses clone with CLONE_PIDFD directly. It does produce the first warning (the waitid one) in Valgrind.

But it is not causing the assertion failure for me. So this is not a proper testcase, just something to show what Qt is attempting to do.
Comment 2 Thiago Macieira 2020-10-08 19:27:52 UTC
Relevant section of strace of attachment 132222 [details] (parent process):
eventfd2(0, 0)                          = 3
waitid(P_PIDFD, 2147483647, NULL, WNOHANG|WEXITED, NULL) = -1 EBADF (Bad file descriptor)
pipe([4, 5])                            = 0
clone(child_stack=NULL, flags=CLONE_PIDFD|SIGCHLD, parent_tid=[6]) = 601608
write(3, "\1\0\0\0\0\0\0\0", 8)         = 8
close(5)                                = 0
waitid(P_PIDFD, 6, {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=601608, si_uid=1000, si_status=0, si_utime=0, si_stime=0}, WEXITED|__WALL, NULL) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=601608, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---

Child process:
read(3, "\1\0\0\0\0\0\0\0", 8)          = 8
close(3)                                = 0
close(4)                                = 0
dup2(5, 1)                              = 1
execve("/bin/true", ["/bin/true"], 0x7ffc69cbd8d8 /* 109 vars */) = 0


Relevant sections of strace of attachment 132203 [details] (parent):
pipe2([3, 4], O_CLOEXEC)                = 0
pipe2([5, 6], O_CLOEXEC)                = 0
pipe2([7, 8], O_CLOEXEC)                = 0
pipe2([9, 11], O_CLOEXEC)               = 0
getcwd("/tmp/qtbug-87230", 4096)        = 17
statx(AT_FDCWD, "/usr/bin/uname", AT_STATX_SYNC_AS_STAT, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0755, stx_size=31464, ...}) = 0
access("/usr/bin/uname", X_OK)          = 0
waitid(P_PIDFD, 2147483647, NULL, WNOHANG|WEXITED, NULL) = -1 EBADF (Bad file descriptor)
clone(child_stack=NULL, flags=CLONE_PIDFD|SIGCHLD, parent_tid=[12]) = 601703
close(11)                               = 0
close(3)                                = 0
fcntl(4, F_GETFL)                       = 0x1 (flags O_WRONLY)
fcntl(4, F_SETFL, O_WRONLY|O_NONBLOCK)  = 0
close(6)                                = 0
fcntl(5, F_GETFL)                       = 0 (flags O_RDONLY)
fcntl(5, F_SETFL, O_RDONLY|O_NONBLOCK)  = 0
close(8)                                = 0
fcntl(7, F_GETFL)                       = 0 (flags O_RDONLY)
fcntl(7, F_SETFL, O_RDONLY|O_NONBLOCK)  = 0
ppoll([{fd=9, events=POLLIN}], 1, {tv_sec=10, tv_nsec=0}, NULL, 8) = 1 ([{fd=9, revents=POLLHUP}], left {tv_sec=9, tv_nsec=999333222})
read(9, "", 12)                         = 0
close(9)                                = 0
ppoll([{fd=-1}, {fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=12, events=POLLIN}], 4, {tv_sec=10, tv_nsec=0}, NULL, 8) = 1 ([{fd=5, revents=POLLIN}], left {tv_sec=9, tv_nsec=999564841})
ioctl(5, FIONREAD, [115])               = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=601703, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---

Child process:
rt_sigaction(SIGPIPE, {sa_handler=SIG_DFL, sa_mask=[PIPE], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f8434879530}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
dup3(3, 0, 0)                           = 0
dup3(6, 1, 0)                           = 1
dup3(8, 2, 0)                           = 2
close(9)                                = 0
execve("/usr/bin/uname", ["/usr/bin/uname", "-a"], 0x7ffceab80858 /* 109 vars */) = 0


The QProcess example has 4 pipes instead of 1 and it stops on poll() instead of waitid(). Otherwise, the code is supposed to be the same.
Comment 3 Thiago Macieira 2020-10-08 20:43:28 UTC
Created attachment 132224 [details]
Simplified (but not working) testcase with 4 pipes

This more closely resembles what QProcess does, but still does not trigger the assertion failure in Valgrind.
Comment 4 Paul Floyd 2020-10-23 05:53:41 UTC
I'm not able to reproduce this issue on Fedora 32.

gcc (GCC) 10.2.1 20201005 (Red Hat 10.2.1-5)
valgrind-3.16.1
vmlinuz-5.8.15-201.fc32.x86_64
/usr/lib64/libc-2.31.so
Qt 5.15.1 from qt.io

I do get
==11044== Syscall param waitid(infop) points to unaddressable byte(s)

Is this specific to Arch, or can you reproduce with other distros?
Comment 5 Thiago Macieira 2020-10-23 19:37:42 UTC
(In reply to Paul Floyd from comment #4)
> Is this specific to Arch, or can you reproduce with other distros?

I can reproduce it on OpenSUSE Tumbleweed.

Please note it has to be the original testcase from Matthias.
Comment 6 Paul Floyd 2020-10-23 20:36:07 UTC
This is with the original testcase.
Comment 7 Paul Floyd 2020-11-19 22:31:34 UTC
I have tried repeatedly to install Tumbleweed in VirtualBox. Epic fail, truly awful installer (though possible VB is partly to blame). Then I tried JeOS, but it's a major battle to get a desktop installed.
Comment 8 Paul Floyd 2020-11-20 09:02:25 UTC
Finally managed to get it installed, switching to VB under Windows.

SYSCALL[1740,1](293) sys_pipe2 ( 0x85e6050, 0x80000 )[sync] --> Success(0x0) 
SYSCALL[1740,1](293) sys_pipe2 ( 0x85e6078, 0x80000 )[sync] --> Success(0x0) 
SYSCALL[1740,1](293) sys_pipe2 ( 0x85e60a0, 0x80000 )[sync] --> Success(0x0) 
SYSCALL[1740,1](293) sys_pipe2 ( 0x85e60c8, 0x80000 )[sync] --> Success(0x0) 
SYSCALL[1740,1](79) sys_getcwd ( 0x1ffeffe7e0, 4096 )[sync] --> Success(0x16) 
SYSCALL[1740,1](4) sys_newstat ( 0x85e6ef8(/home/paulf/bin/uname), 0x1ffefff780 )[sync] --> Failure(0x2) 
SYSCALL[1740,1](4) sys_newstat ( 0x85e7158(/usr/local/bin/uname), 0x1ffefff780 )[sync] --> Failure(0x2) 
SYSCALL[1740,1](4) sys_newstat ( 0x85e73a8(/usr/bin/uname), 0x1ffefff780 )[sync] --> Success(0x0) 
SYSCALL[1740,1](21) sys_access ( 0x85e73a8(/usr/bin/uname), 1 )[sync] --> Success(0x0) 
SYSCALL[1740,1](247) sys_waitid( 3, 2147483647, 0x0, 5, 0x0 )==1740== Syscall param waitid(infop) points to unaddressable byte(s)
==1740==    at 0x5DE234D: syscall (in /lib64/libc-2.32.so)
==1740==    by 0x53B1996: sys_waitid (forkfd_linux.c:65)
==1740==    by 0x53B1996: detect_clone_pidfd_support (forkfd_linux.c:125)
==1740==    by 0x53B1996: system_forkfd (forkfd_linux.c:141)
==1740==    by 0x53B1996: forkfd (forkfd.c:651)
==1740==    by 0x53954EC: QProcessPrivate::startProcess() (qprocess_unix.cpp:465)
==1740==    by 0x538FABD: QProcessPrivate::start(QFlags<QIODevice::OpenModeFlag>) (qprocess.cpp:2247)
==1740==    by 0x538FCF2: QProcess::start(QString const&, QStringList const&, QFlags<QIODevice::OpenModeFlag>) (qprocess.cpp:2095)
==1740==    by 0x4012A4: main (in /home/paulf/Downloads/test)
==1740==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==1740== 
 --> [async] ... 
SYSCALL[1740,1](247) ... [async] --> Failure(0x9) 
SYSCALL[1740,1](56) sys_clone ( 1011, 0x0, 0x1ffefff8f8, 0x0, 0x0 )   clone(fork): process 1740 created child 1741
 --> [pre-success] Success(0x6cd) 
SYSCALL[1740,1](3) sys_close ( 11 )[sync] --> Success(0x0) 
SYSCALL[1740,1](3) sys_close ( 4 )[sync] --> Success(0x0) 
SYSCALL[1740,1](72) ==1741== Warning: invalid file descriptor 1476427248 in syscall clone()
 --> [pre-fail] Failure(0x18)
valgrind: m_syswrap/syswrap-main.c:1957 (vgPlain_client_syscall): Assertion '0 == (sci->flags & ~(SfMayBlock | SfPostOnFail | SfPollAfter))' failed.

host stacktrace:
==1741==    at 0x580450A1: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==1741==    by 0x580450D7: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==1741==    by 0x580430D5: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==1741==    by 0x581F2DC0: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==1741==    by 0x5809DA5E: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==1741==    by 0x580D57E6: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)

sched status:
  running_tid=1
Comment 9 Paul Floyd 2020-11-20 10:12:29 UTC
The 1st error that Valgrind is seeing comes from this (trimmed)

static int detect_clone_pidfd_support()
{
    sys_waitid(P_PIDFD, INT_MAX, NULL, WEXITED|WNOHANG, NULL);
    return errno == EBADF ? 1 : -1;
}

Passing a dummmy siginfo_t for arg3 would fix the error.

However the detection function is still doing its job. Valgrind says

SYSCALL[1740,1](247) ... [async] --> Failure(0x9) 

where 9 is EBADF.

After that, the clone is succeeding. I need to do more analysis and debugging to see what is causing the fd warning and assert.
Comment 10 Thiago Macieira 2020-11-20 16:02:21 UTC
(In reply to Paul Floyd from comment #9)
> The 1st error that Valgrind is seeing comes from this (trimmed)
> 
> static int detect_clone_pidfd_support()
> {
>     sys_waitid(P_PIDFD, INT_MAX, NULL, WEXITED|WNOHANG, NULL);
>     return errno == EBADF ? 1 : -1;
> }
> 
> Passing a dummmy siginfo_t for arg3 would fix the error.

I know, but this error is innocuous and the kernel doesn't complain, so I didn't feel the need to pass something there.

> However the detection function is still doing its job. Valgrind says
> 
> SYSCALL[1740,1](247) ... [async] --> Failure(0x9) 
> 
> where 9 is EBADF.
> 
> After that, the clone is succeeding. I need to do more analysis and
> debugging to see what is causing the fd warning and assert.

Thank you for the effort.
Comment 11 Paul Floyd 2020-11-22 17:14:10 UTC
Before my relating my debug efforts, I did notice that there was one change in git - 22aa8640e6c44c78c228ffa726cfacf918455343 - that might affect this problem, adding PIDFDs, but that was back in Feb 2020, so it should be in 3.16.1. 

So, debugging the problem. I just tried with Valgrind built from git HEAD and there was no problem.

paulf@localhost:~/Downloads> ../valgrind/vg-in-place -q ./test
Qt Version: 5.15.2
==16771== Syscall param waitid(infop) points to unaddressable byte(s)
==16771==    at 0x5DE334D: syscall (in /lib64/libc-2.32.so)
==16771==    by 0x53B1996: sys_waitid (forkfd_linux.c:65)
==16771==    by 0x53B1996: detect_clone_pidfd_support (forkfd_linux.c:125)
==16771==    by 0x53B1996: system_forkfd (forkfd_linux.c:141)
==16771==    by 0x53B1996: forkfd (forkfd.c:651)
==16771==    by 0x53954EC: QProcessPrivate::startProcess() (qprocess_unix.cpp:465)
==16771==    by 0x538FABD: QProcessPrivate::start(QFlags<QIODevice::OpenModeFlag>) (qprocess.cpp:2247)
==16771==    by 0x538FCF2: QProcess::start(QString const&, QStringList const&, QFlags<QIODevice::OpenModeFlag>) (qprocess.cpp:2095)
==16771==    by 0x4012A4: main (in /home/paulf/Downloads/test)
==16771==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==16771== 
"Linux localhost.localdomain 5.9.8-2-default #1 SMP Thu Nov 12 07:43:32 UTC 2020 (ea93937) x86_64 x86_64 x86_64 GNU/Linux\n"

Then I checked out the 3.16.1 code, same thing.

Next I tried debugging the package installed Valgrind. No joy, it's been stripped and has no debug info.

paulf@localhost:~/Downloads> ll /usr/lib64/valgrind/memcheck-amd64-linux ../valgrind/memcheck/memcheck-amd64-linux
-rwxr-xr-x 1 root  root   2743648 Sep 30 20:10 /usr/lib64/valgrind/memcheck-amd64-linux
-rwxr-xr-x 1 paulf users 16555896 Nov 22 16:42 ../valgrind/memcheck/memcheck-amd64-linux
paulf@localhost:~/Downloads> file /usr/lib64/valgrind/memcheck-amd64-linux ../valgrind/memcheck/memcheck-amd64-linux
/usr/lib64/valgrind/memcheck-amd64-linux:  ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, BuildID[sha1]=fb2fe27acb715941050c3806c632d51fe2c03455, stripped
../valgrind/memcheck/memcheck-amd64-linux: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, BuildID[sha1]=8192d34835865874742478127c0276f5fc75b277, with debug_info, not stripped

I had a look here

https://build.opensuse.org/package/view_file/devel:tools/valgrind/valgrind.spec?expand=1

I don't know what their optflags are. I tried adding the LTO patch and configuring with --enable-lto=yes, and still could not reproduce the issue.

That's about all that I can try for the moment.
Comment 12 Thiago Macieira 2020-11-22 18:28:05 UTC
You can get debug symbols by installing package valgrind-debuginfo (after enabling  "repo-debug": zypper mr -e repo-debug).

But if it works on the current Git master, then I'll be satisfied with the bug being closed.
Comment 13 Paul Floyd 2020-11-22 20:02:10 UTC
Debuginfo works. I'll look more tomorrow.
Comment 14 Paul Floyd 2020-11-24 08:42:20 UTC
Looking more didn't help that much. I wasn't able to get GDB to stop where the error occurs. My conclusion is that this is due to some compiler or linker optimization.

Closing this for now.