427433 – Unable to run QProcess: invalid file descriptor in syscall clone()

Bug 427433 - Unable to run QProcess: invalid file descriptor in syscall clone()

Summary: Unable to run QProcess: invalid file descriptor in syscall clone()

Status:	RESOLVED WORKSFORME

Alias:	None

Product:	valgrind
Classification:	Developer tools
Component:	memcheck (other bugs)
Version First Reported In:	unspecified
Platform:	Arch Linux Linux

Importance:	NOR critical
Target Milestone:	---
Assignee:	Julian Seward

URL:
Keywords:

Depends on:
Blocks:

Reported:	2020-10-08 05:57 UTC by Matthias Klein
Modified:	2021-10-13 21:06 UTC (History)
CC List:	3 users (show)

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:

Attachments
Source code of example to reproduce and valgrind logfile (1.81 KB, application/x-zip-compressed) 2020-10-08 05:57 UTC, Matthias Klein	Details
Simplified (but not working) testcase (4.13 KB, text/x-csrc) 2020-10-08 19:18 UTC, Thiago Macieira	Details
Simplified (but not working) testcase with 4 pipes (5.55 KB, text/x-csrc) 2020-10-08 20:43 UTC, Thiago Macieira	Details
Show Obsolete (1) View All Add an attachment

Note You need to log in before you can comment on or make changes to this bug.

Description Matthias Klein 2020-10-08 05:57:34 UTC

Created attachment 132203 [details]
Source code of example to reproduce and valgrind logfile

A simple Qt program with QProcess running under valgrind/memcheck will cause the child process not to run, and valgrind will write errors/stacktraces. See attachment. A test with Qt 5.12.9 ran without problems.

- Distribution: Arch Linux
- GCC 10.2.0
- Valgrind: 3.16.1
- Kernel: 5.8.13-arch1-1
- Qt 5.15.1

I have previously reported the problem to Qt, and they think it is a valgring bug: https://bugreports.qt.io/browse/QTBUG-87230

> The clone syscall doesn't get a file descriptor as a parameter, only as output, which means Valgrind is failing to interpret correctly what the kernel returned.
> This looks like a valgrind bug.

Comment 1 Thiago Macieira 2020-10-08 19:18:52 UTC

Created attachment 132222 [details]
Simplified (but not working) testcase

The attached is a simplified testcase I had used to submit https://sourceware.org/bugzilla/show_bug.cgi?id=26562. It uses clone with CLONE_PIDFD directly. It does produce the first warning (the waitid one) in Valgrind.

But it is not causing the assertion failure for me. So this is not a proper testcase, just something to show what Qt is attempting to do.

Comment 2 Thiago Macieira 2020-10-08 19:27:52 UTC

Relevant section of strace of attachment 132222 [details] (parent process):
eventfd2(0, 0)                          = 3
waitid(P_PIDFD, 2147483647, NULL, WNOHANG|WEXITED, NULL) = -1 EBADF (Bad file descriptor)
pipe([4, 5])                            = 0
clone(child_stack=NULL, flags=CLONE_PIDFD|SIGCHLD, parent_tid=[6]) = 601608
write(3, "\1\0\0\0\0\0\0\0", 8)         = 8
close(5)                                = 0
waitid(P_PIDFD, 6, {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=601608, si_uid=1000, si_status=0, si_utime=0, si_stime=0}, WEXITED|__WALL, NULL) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=601608, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---

Child process:
read(3, "\1\0\0\0\0\0\0\0", 8)          = 8
close(3)                                = 0
close(4)                                = 0
dup2(5, 1)                              = 1
execve("/bin/true", ["/bin/true"], 0x7ffc69cbd8d8 /* 109 vars */) = 0


Relevant sections of strace of attachment 132203 [details] (parent):
pipe2([3, 4], O_CLOEXEC)                = 0
pipe2([5, 6], O_CLOEXEC)                = 0
pipe2([7, 8], O_CLOEXEC)                = 0
pipe2([9, 11], O_CLOEXEC)               = 0
getcwd("/tmp/qtbug-87230", 4096)        = 17
statx(AT_FDCWD, "/usr/bin/uname", AT_STATX_SYNC_AS_STAT, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0755, stx_size=31464, ...}) = 0
access("/usr/bin/uname", X_OK)          = 0
waitid(P_PIDFD, 2147483647, NULL, WNOHANG|WEXITED, NULL) = -1 EBADF (Bad file descriptor)
clone(child_stack=NULL, flags=CLONE_PIDFD|SIGCHLD, parent_tid=[12]) = 601703
close(11)                               = 0
close(3)                                = 0
fcntl(4, F_GETFL)                       = 0x1 (flags O_WRONLY)
fcntl(4, F_SETFL, O_WRONLY|O_NONBLOCK)  = 0
close(6)                                = 0
fcntl(5, F_GETFL)                       = 0 (flags O_RDONLY)
fcntl(5, F_SETFL, O_RDONLY|O_NONBLOCK)  = 0
close(8)                                = 0
fcntl(7, F_GETFL)                       = 0 (flags O_RDONLY)
fcntl(7, F_SETFL, O_RDONLY|O_NONBLOCK)  = 0
ppoll([{fd=9, events=POLLIN}], 1, {tv_sec=10, tv_nsec=0}, NULL, 8) = 1 ([{fd=9, revents=POLLHUP}], left {tv_sec=9, tv_nsec=999333222})
read(9, "", 12)                         = 0
close(9)                                = 0
ppoll([{fd=-1}, {fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=12, events=POLLIN}], 4, {tv_sec=10, tv_nsec=0}, NULL, 8) = 1 ([{fd=5, revents=POLLIN}], left {tv_sec=9, tv_nsec=999564841})
ioctl(5, FIONREAD, [115])               = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=601703, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---

Child process:
rt_sigaction(SIGPIPE, {sa_handler=SIG_DFL, sa_mask=[PIPE], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f8434879530}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
dup3(3, 0, 0)                           = 0
dup3(6, 1, 0)                           = 1
dup3(8, 2, 0)                           = 2
close(9)                                = 0
execve("/usr/bin/uname", ["/usr/bin/uname", "-a"], 0x7ffceab80858 /* 109 vars */) = 0


The QProcess example has 4 pipes instead of 1 and it stops on poll() instead of waitid(). Otherwise, the code is supposed to be the same.

Comment 3 Thiago Macieira 2020-10-08 20:43:28 UTC

Created attachment 132224 [details]
Simplified (but not working) testcase with 4 pipes

This more closely resembles what QProcess does, but still does not trigger the assertion failure in Valgrind.

Comment 4 Paul Floyd 2020-10-23 05:53:41 UTC

I'm not able to reproduce this issue on Fedora 32.

gcc (GCC) 10.2.1 20201005 (Red Hat 10.2.1-5)
valgrind-3.16.1
vmlinuz-5.8.15-201.fc32.x86_64
/usr/lib64/libc-2.31.so
Qt 5.15.1 from qt.io

I do get
==11044== Syscall param waitid(infop) points to unaddressable byte(s)

Is this specific to Arch, or can you reproduce with other distros?

Comment 5 Thiago Macieira 2020-10-23 19:37:42 UTC

(In reply to Paul Floyd from comment #4)
> Is this specific to Arch, or can you reproduce with other distros?

I can reproduce it on OpenSUSE Tumbleweed.

Please note it has to be the original testcase from Matthias.

Comment 6 Paul Floyd 2020-10-23 20:36:07 UTC

This is with the original testcase.

Comment 7 Paul Floyd 2020-11-19 22:31:34 UTC

I have tried repeatedly to install Tumbleweed in VirtualBox. Epic fail, truly awful installer (though possible VB is partly to blame). Then I tried JeOS, but it's a major battle to get a desktop installed.

Comment 8 Paul Floyd 2020-11-20 09:02:25 UTC

Finally managed to get it installed, switching to VB under Windows.

SYSCALL[1740,1](293) sys_pipe2 ( 0x85e6050, 0x80000 )[sync] --> Success(0x0) 
SYSCALL[1740,1](293) sys_pipe2 ( 0x85e6078, 0x80000 )[sync] --> Success(0x0) 
SYSCALL[1740,1](293) sys_pipe2 ( 0x85e60a0, 0x80000 )[sync] --> Success(0x0) 
SYSCALL[1740,1](293) sys_pipe2 ( 0x85e60c8, 0x80000 )[sync] --> Success(0x0) 
SYSCALL[1740,1](79) sys_getcwd ( 0x1ffeffe7e0, 4096 )[sync] --> Success(0x16) 
SYSCALL[1740,1](4) sys_newstat ( 0x85e6ef8(/home/paulf/bin/uname), 0x1ffefff780 )[sync] --> Failure(0x2) 
SYSCALL[1740,1](4) sys_newstat ( 0x85e7158(/usr/local/bin/uname), 0x1ffefff780 )[sync] --> Failure(0x2) 
SYSCALL[1740,1](4) sys_newstat ( 0x85e73a8(/usr/bin/uname), 0x1ffefff780 )[sync] --> Success(0x0) 
SYSCALL[1740,1](21) sys_access ( 0x85e73a8(/usr/bin/uname), 1 )[sync] --> Success(0x0) 
SYSCALL[1740,1](247) sys_waitid( 3, 2147483647, 0x0, 5, 0x0 )==1740== Syscall param waitid(infop) points to unaddressable byte(s)
==1740==    at 0x5DE234D: syscall (in /lib64/libc-2.32.so)
==1740==    by 0x53B1996: sys_waitid (forkfd_linux.c:65)
==1740==    by 0x53B1996: detect_clone_pidfd_support (forkfd_linux.c:125)
==1740==    by 0x53B1996: system_forkfd (forkfd_linux.c:141)
==1740==    by 0x53B1996: forkfd (forkfd.c:651)
==1740==    by 0x53954EC: QProcessPrivate::startProcess() (qprocess_unix.cpp:465)
==1740==    by 0x538FABD: QProcessPrivate::start(QFlags<QIODevice::OpenModeFlag>) (qprocess.cpp:2247)
==1740==    by 0x538FCF2: QProcess::start(QString const&, QStringList const&, QFlags<QIODevice::OpenModeFlag>) (qprocess.cpp:2095)
==1740==    by 0x4012A4: main (in /home/paulf/Downloads/test)
==1740==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==1740== 
 --> [async] ... 
SYSCALL[1740,1](247) ... [async] --> Failure(0x9) 
SYSCALL[1740,1](56) sys_clone ( 1011, 0x0, 0x1ffefff8f8, 0x0, 0x0 )   clone(fork): process 1740 created child 1741
 --> [pre-success] Success(0x6cd) 
SYSCALL[1740,1](3) sys_close ( 11 )[sync] --> Success(0x0) 
SYSCALL[1740,1](3) sys_close ( 4 )[sync] --> Success(0x0) 
SYSCALL[1740,1](72) ==1741== Warning: invalid file descriptor 1476427248 in syscall clone()
 --> [pre-fail] Failure(0x18)
valgrind: m_syswrap/syswrap-main.c:1957 (vgPlain_client_syscall): Assertion '0 == (sci->flags & ~(SfMayBlock | SfPostOnFail | SfPollAfter))' failed.

host stacktrace:
==1741==    at 0x580450A1: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==1741==    by 0x580450D7: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==1741==    by 0x580430D5: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==1741==    by 0x581F2DC0: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==1741==    by 0x5809DA5E: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==1741==    by 0x580D57E6: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)

sched status:
  running_tid=1

Comment 9 Paul Floyd 2020-11-20 10:12:29 UTC

The 1st error that Valgrind is seeing comes from this (trimmed)

static int detect_clone_pidfd_support()
{
    sys_waitid(P_PIDFD, INT_MAX, NULL, WEXITED|WNOHANG, NULL);
    return errno == EBADF ? 1 : -1;
}

Passing a dummmy siginfo_t for arg3 would fix the error.

However the detection function is still doing its job. Valgrind says

SYSCALL[1740,1](247) ... [async] --> Failure(0x9) 

where 9 is EBADF.

After that, the clone is succeeding. I need to do more analysis and debugging to see what is causing the fd warning and assert.

Comment 10 Thiago Macieira 2020-11-20 16:02:21 UTC

(In reply to Paul Floyd from comment #9)
> The 1st error that Valgrind is seeing comes from this (trimmed)
> 
> static int detect_clone_pidfd_support()
> {
>     sys_waitid(P_PIDFD, INT_MAX, NULL, WEXITED|WNOHANG, NULL);
>     return errno == EBADF ? 1 : -1;
> }
> 
> Passing a dummmy siginfo_t for arg3 would fix the error.

I know, but this error is innocuous and the kernel doesn't complain, so I didn't feel the need to pass something there.

> However the detection function is still doing its job. Valgrind says
> 
> SYSCALL[1740,1](247) ... [async] --> Failure(0x9) 
> 
> where 9 is EBADF.
> 
> After that, the clone is succeeding. I need to do more analysis and
> debugging to see what is causing the fd warning and assert.

Thank you for the effort.

Comment 11 Paul Floyd 2020-11-22 17:14:10 UTC

Before my relating my debug efforts, I did notice that there was one change in git - 22aa8640e6c44c78c228ffa726cfacf918455343 - that might affect this problem, adding PIDFDs, but that was back in Feb 2020, so it should be in 3.16.1. 

So, debugging the problem. I just tried with Valgrind built from git HEAD and there was no problem.

paulf@localhost:~/Downloads> ../valgrind/vg-in-place -q ./test
Qt Version: 5.15.2
==16771== Syscall param waitid(infop) points to unaddressable byte(s)
==16771==    at 0x5DE334D: syscall (in /lib64/libc-2.32.so)
==16771==    by 0x53B1996: sys_waitid (forkfd_linux.c:65)
==16771==    by 0x53B1996: detect_clone_pidfd_support (forkfd_linux.c:125)
==16771==    by 0x53B1996: system_forkfd (forkfd_linux.c:141)
==16771==    by 0x53B1996: forkfd (forkfd.c:651)
==16771==    by 0x53954EC: QProcessPrivate::startProcess() (qprocess_unix.cpp:465)
==16771==    by 0x538FABD: QProcessPrivate::start(QFlags<QIODevice::OpenModeFlag>) (qprocess.cpp:2247)
==16771==    by 0x538FCF2: QProcess::start(QString const&, QStringList const&, QFlags<QIODevice::OpenModeFlag>) (qprocess.cpp:2095)
==16771==    by 0x4012A4: main (in /home/paulf/Downloads/test)
==16771==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==16771== 
"Linux localhost.localdomain 5.9.8-2-default #1 SMP Thu Nov 12 07:43:32 UTC 2020 (ea93937) x86_64 x86_64 x86_64 GNU/Linux\n"

Then I checked out the 3.16.1 code, same thing.

Next I tried debugging the package installed Valgrind. No joy, it's been stripped and has no debug info.

paulf@localhost:~/Downloads> ll /usr/lib64/valgrind/memcheck-amd64-linux ../valgrind/memcheck/memcheck-amd64-linux
-rwxr-xr-x 1 root  root   2743648 Sep 30 20:10 /usr/lib64/valgrind/memcheck-amd64-linux
-rwxr-xr-x 1 paulf users 16555896 Nov 22 16:42 ../valgrind/memcheck/memcheck-amd64-linux
paulf@localhost:~/Downloads> file /usr/lib64/valgrind/memcheck-amd64-linux ../valgrind/memcheck/memcheck-amd64-linux
/usr/lib64/valgrind/memcheck-amd64-linux:  ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, BuildID[sha1]=fb2fe27acb715941050c3806c632d51fe2c03455, stripped
../valgrind/memcheck/memcheck-amd64-linux: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, BuildID[sha1]=8192d34835865874742478127c0276f5fc75b277, with debug_info, not stripped

I had a look here

https://build.opensuse.org/package/view_file/devel:tools/valgrind/valgrind.spec?expand=1

I don't know what their optflags are. I tried adding the LTO patch and configuring with --enable-lto=yes, and still could not reproduce the issue.

That's about all that I can try for the moment.

Comment 12 Thiago Macieira 2020-11-22 18:28:05 UTC

You can get debug symbols by installing package valgrind-debuginfo (after enabling  "repo-debug": zypper mr -e repo-debug).

But if it works on the current Git master, then I'll be satisfied with the bug being closed.

Comment 13 Paul Floyd 2020-11-22 20:02:10 UTC

Debuginfo works. I'll look more tomorrow.

Comment 14 Paul Floyd 2020-11-24 08:42:20 UTC

Looking more didn't help that much. I wasn't able to get GDB to stop where the error occurs. My conclusion is that this is due to some compiler or linker optimization.

Closing this for now.