Created attachment 132203 [details] Source code of example to reproduce and valgrind logfile A simple Qt program with QProcess running under valgrind/memcheck will cause the child process not to run, and valgrind will write errors/stacktraces. See attachment. A test with Qt 5.12.9 ran without problems. - Distribution: Arch Linux - GCC 10.2.0 - Valgrind: 3.16.1 - Kernel: 5.8.13-arch1-1 - Qt 5.15.1 I have previously reported the problem to Qt, and they think it is a valgring bug: https://bugreports.qt.io/browse/QTBUG-87230 > The clone syscall doesn't get a file descriptor as a parameter, only as output, which means Valgrind is failing to interpret correctly what the kernel returned. > This looks like a valgrind bug.
Created attachment 132222 [details] Simplified (but not working) testcase The attached is a simplified testcase I had used to submit https://sourceware.org/bugzilla/show_bug.cgi?id=26562. It uses clone with CLONE_PIDFD directly. It does produce the first warning (the waitid one) in Valgrind. But it is not causing the assertion failure for me. So this is not a proper testcase, just something to show what Qt is attempting to do.
Relevant section of strace of attachment 132222 [details] (parent process): eventfd2(0, 0) = 3 waitid(P_PIDFD, 2147483647, NULL, WNOHANG|WEXITED, NULL) = -1 EBADF (Bad file descriptor) pipe([4, 5]) = 0 clone(child_stack=NULL, flags=CLONE_PIDFD|SIGCHLD, parent_tid=[6]) = 601608 write(3, "\1\0\0\0\0\0\0\0", 8) = 8 close(5) = 0 waitid(P_PIDFD, 6, {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=601608, si_uid=1000, si_status=0, si_utime=0, si_stime=0}, WEXITED|__WALL, NULL) = 0 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=601608, si_uid=1000, si_status=0, si_utime=0, si_stime=0} --- Child process: read(3, "\1\0\0\0\0\0\0\0", 8) = 8 close(3) = 0 close(4) = 0 dup2(5, 1) = 1 execve("/bin/true", ["/bin/true"], 0x7ffc69cbd8d8 /* 109 vars */) = 0 Relevant sections of strace of attachment 132203 [details] (parent): pipe2([3, 4], O_CLOEXEC) = 0 pipe2([5, 6], O_CLOEXEC) = 0 pipe2([7, 8], O_CLOEXEC) = 0 pipe2([9, 11], O_CLOEXEC) = 0 getcwd("/tmp/qtbug-87230", 4096) = 17 statx(AT_FDCWD, "/usr/bin/uname", AT_STATX_SYNC_AS_STAT, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0755, stx_size=31464, ...}) = 0 access("/usr/bin/uname", X_OK) = 0 waitid(P_PIDFD, 2147483647, NULL, WNOHANG|WEXITED, NULL) = -1 EBADF (Bad file descriptor) clone(child_stack=NULL, flags=CLONE_PIDFD|SIGCHLD, parent_tid=[12]) = 601703 close(11) = 0 close(3) = 0 fcntl(4, F_GETFL) = 0x1 (flags O_WRONLY) fcntl(4, F_SETFL, O_WRONLY|O_NONBLOCK) = 0 close(6) = 0 fcntl(5, F_GETFL) = 0 (flags O_RDONLY) fcntl(5, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 close(8) = 0 fcntl(7, F_GETFL) = 0 (flags O_RDONLY) fcntl(7, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 ppoll([{fd=9, events=POLLIN}], 1, {tv_sec=10, tv_nsec=0}, NULL, 8) = 1 ([{fd=9, revents=POLLHUP}], left {tv_sec=9, tv_nsec=999333222}) read(9, "", 12) = 0 close(9) = 0 ppoll([{fd=-1}, {fd=5, events=POLLIN}, {fd=7, events=POLLIN}, {fd=12, events=POLLIN}], 4, {tv_sec=10, tv_nsec=0}, NULL, 8) = 1 ([{fd=5, revents=POLLIN}], left {tv_sec=9, tv_nsec=999564841}) ioctl(5, FIONREAD, [115]) = 0 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=601703, si_uid=1000, si_status=0, si_utime=0, si_stime=0} --- Child process: rt_sigaction(SIGPIPE, {sa_handler=SIG_DFL, sa_mask=[PIPE], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f8434879530}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0 dup3(3, 0, 0) = 0 dup3(6, 1, 0) = 1 dup3(8, 2, 0) = 2 close(9) = 0 execve("/usr/bin/uname", ["/usr/bin/uname", "-a"], 0x7ffceab80858 /* 109 vars */) = 0 The QProcess example has 4 pipes instead of 1 and it stops on poll() instead of waitid(). Otherwise, the code is supposed to be the same.
Created attachment 132224 [details] Simplified (but not working) testcase with 4 pipes This more closely resembles what QProcess does, but still does not trigger the assertion failure in Valgrind.
I'm not able to reproduce this issue on Fedora 32. gcc (GCC) 10.2.1 20201005 (Red Hat 10.2.1-5) valgrind-3.16.1 vmlinuz-5.8.15-201.fc32.x86_64 /usr/lib64/libc-2.31.so Qt 5.15.1 from qt.io I do get ==11044== Syscall param waitid(infop) points to unaddressable byte(s) Is this specific to Arch, or can you reproduce with other distros?
(In reply to Paul Floyd from comment #4) > Is this specific to Arch, or can you reproduce with other distros? I can reproduce it on OpenSUSE Tumbleweed. Please note it has to be the original testcase from Matthias.
This is with the original testcase.
I have tried repeatedly to install Tumbleweed in VirtualBox. Epic fail, truly awful installer (though possible VB is partly to blame). Then I tried JeOS, but it's a major battle to get a desktop installed.
Finally managed to get it installed, switching to VB under Windows. SYSCALL[1740,1](293) sys_pipe2 ( 0x85e6050, 0x80000 )[sync] --> Success(0x0) SYSCALL[1740,1](293) sys_pipe2 ( 0x85e6078, 0x80000 )[sync] --> Success(0x0) SYSCALL[1740,1](293) sys_pipe2 ( 0x85e60a0, 0x80000 )[sync] --> Success(0x0) SYSCALL[1740,1](293) sys_pipe2 ( 0x85e60c8, 0x80000 )[sync] --> Success(0x0) SYSCALL[1740,1](79) sys_getcwd ( 0x1ffeffe7e0, 4096 )[sync] --> Success(0x16) SYSCALL[1740,1](4) sys_newstat ( 0x85e6ef8(/home/paulf/bin/uname), 0x1ffefff780 )[sync] --> Failure(0x2) SYSCALL[1740,1](4) sys_newstat ( 0x85e7158(/usr/local/bin/uname), 0x1ffefff780 )[sync] --> Failure(0x2) SYSCALL[1740,1](4) sys_newstat ( 0x85e73a8(/usr/bin/uname), 0x1ffefff780 )[sync] --> Success(0x0) SYSCALL[1740,1](21) sys_access ( 0x85e73a8(/usr/bin/uname), 1 )[sync] --> Success(0x0) SYSCALL[1740,1](247) sys_waitid( 3, 2147483647, 0x0, 5, 0x0 )==1740== Syscall param waitid(infop) points to unaddressable byte(s) ==1740== at 0x5DE234D: syscall (in /lib64/libc-2.32.so) ==1740== by 0x53B1996: sys_waitid (forkfd_linux.c:65) ==1740== by 0x53B1996: detect_clone_pidfd_support (forkfd_linux.c:125) ==1740== by 0x53B1996: system_forkfd (forkfd_linux.c:141) ==1740== by 0x53B1996: forkfd (forkfd.c:651) ==1740== by 0x53954EC: QProcessPrivate::startProcess() (qprocess_unix.cpp:465) ==1740== by 0x538FABD: QProcessPrivate::start(QFlags<QIODevice::OpenModeFlag>) (qprocess.cpp:2247) ==1740== by 0x538FCF2: QProcess::start(QString const&, QStringList const&, QFlags<QIODevice::OpenModeFlag>) (qprocess.cpp:2095) ==1740== by 0x4012A4: main (in /home/paulf/Downloads/test) ==1740== Address 0x0 is not stack'd, malloc'd or (recently) free'd ==1740== --> [async] ... SYSCALL[1740,1](247) ... [async] --> Failure(0x9) SYSCALL[1740,1](56) sys_clone ( 1011, 0x0, 0x1ffefff8f8, 0x0, 0x0 ) clone(fork): process 1740 created child 1741 --> [pre-success] Success(0x6cd) SYSCALL[1740,1](3) sys_close ( 11 )[sync] --> Success(0x0) SYSCALL[1740,1](3) sys_close ( 4 )[sync] --> Success(0x0) SYSCALL[1740,1](72) ==1741== Warning: invalid file descriptor 1476427248 in syscall clone() --> [pre-fail] Failure(0x18) valgrind: m_syswrap/syswrap-main.c:1957 (vgPlain_client_syscall): Assertion '0 == (sci->flags & ~(SfMayBlock | SfPostOnFail | SfPollAfter))' failed. host stacktrace: ==1741== at 0x580450A1: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux) ==1741== by 0x580450D7: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux) ==1741== by 0x580430D5: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux) ==1741== by 0x581F2DC0: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux) ==1741== by 0x5809DA5E: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux) ==1741== by 0x580D57E6: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux) sched status: running_tid=1
The 1st error that Valgrind is seeing comes from this (trimmed) static int detect_clone_pidfd_support() { sys_waitid(P_PIDFD, INT_MAX, NULL, WEXITED|WNOHANG, NULL); return errno == EBADF ? 1 : -1; } Passing a dummmy siginfo_t for arg3 would fix the error. However the detection function is still doing its job. Valgrind says SYSCALL[1740,1](247) ... [async] --> Failure(0x9) where 9 is EBADF. After that, the clone is succeeding. I need to do more analysis and debugging to see what is causing the fd warning and assert.
(In reply to Paul Floyd from comment #9) > The 1st error that Valgrind is seeing comes from this (trimmed) > > static int detect_clone_pidfd_support() > { > sys_waitid(P_PIDFD, INT_MAX, NULL, WEXITED|WNOHANG, NULL); > return errno == EBADF ? 1 : -1; > } > > Passing a dummmy siginfo_t for arg3 would fix the error. I know, but this error is innocuous and the kernel doesn't complain, so I didn't feel the need to pass something there. > However the detection function is still doing its job. Valgrind says > > SYSCALL[1740,1](247) ... [async] --> Failure(0x9) > > where 9 is EBADF. > > After that, the clone is succeeding. I need to do more analysis and > debugging to see what is causing the fd warning and assert. Thank you for the effort.
Before my relating my debug efforts, I did notice that there was one change in git - 22aa8640e6c44c78c228ffa726cfacf918455343 - that might affect this problem, adding PIDFDs, but that was back in Feb 2020, so it should be in 3.16.1. So, debugging the problem. I just tried with Valgrind built from git HEAD and there was no problem. paulf@localhost:~/Downloads> ../valgrind/vg-in-place -q ./test Qt Version: 5.15.2 ==16771== Syscall param waitid(infop) points to unaddressable byte(s) ==16771== at 0x5DE334D: syscall (in /lib64/libc-2.32.so) ==16771== by 0x53B1996: sys_waitid (forkfd_linux.c:65) ==16771== by 0x53B1996: detect_clone_pidfd_support (forkfd_linux.c:125) ==16771== by 0x53B1996: system_forkfd (forkfd_linux.c:141) ==16771== by 0x53B1996: forkfd (forkfd.c:651) ==16771== by 0x53954EC: QProcessPrivate::startProcess() (qprocess_unix.cpp:465) ==16771== by 0x538FABD: QProcessPrivate::start(QFlags<QIODevice::OpenModeFlag>) (qprocess.cpp:2247) ==16771== by 0x538FCF2: QProcess::start(QString const&, QStringList const&, QFlags<QIODevice::OpenModeFlag>) (qprocess.cpp:2095) ==16771== by 0x4012A4: main (in /home/paulf/Downloads/test) ==16771== Address 0x0 is not stack'd, malloc'd or (recently) free'd ==16771== "Linux localhost.localdomain 5.9.8-2-default #1 SMP Thu Nov 12 07:43:32 UTC 2020 (ea93937) x86_64 x86_64 x86_64 GNU/Linux\n" Then I checked out the 3.16.1 code, same thing. Next I tried debugging the package installed Valgrind. No joy, it's been stripped and has no debug info. paulf@localhost:~/Downloads> ll /usr/lib64/valgrind/memcheck-amd64-linux ../valgrind/memcheck/memcheck-amd64-linux -rwxr-xr-x 1 root root 2743648 Sep 30 20:10 /usr/lib64/valgrind/memcheck-amd64-linux -rwxr-xr-x 1 paulf users 16555896 Nov 22 16:42 ../valgrind/memcheck/memcheck-amd64-linux paulf@localhost:~/Downloads> file /usr/lib64/valgrind/memcheck-amd64-linux ../valgrind/memcheck/memcheck-amd64-linux /usr/lib64/valgrind/memcheck-amd64-linux: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, BuildID[sha1]=fb2fe27acb715941050c3806c632d51fe2c03455, stripped ../valgrind/memcheck/memcheck-amd64-linux: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, BuildID[sha1]=8192d34835865874742478127c0276f5fc75b277, with debug_info, not stripped I had a look here https://build.opensuse.org/package/view_file/devel:tools/valgrind/valgrind.spec?expand=1 I don't know what their optflags are. I tried adding the LTO patch and configuring with --enable-lto=yes, and still could not reproduce the issue. That's about all that I can try for the moment.
You can get debug symbols by installing package valgrind-debuginfo (after enabling "repo-debug": zypper mr -e repo-debug). But if it works on the current Git master, then I'll be satisfied with the bug being closed.
Debuginfo works. I'll look more tomorrow.
Looking more didn't help that much. I wasn't able to get GDB to stop where the error occurs. My conclusion is that this is due to some compiler or linker optimization. Closing this for now.