SUMMARY
nlcontrolc hangs. First seen on the `yocto poky` master build after updating to glibc 2.33, and confirmed with an archlinux build for x86-64. Seen on both valgrind master and the 3.16.1 release.

STEPS TO REPRODUCE
Get the most recent archlinux packages and the valgrind source repo, then:
1. ./configure --without-mpicc --enable-tls
2. make
3. make regtest

OBSERVED RESULT
nlcontrolc hangs for more than 15 min

EXPECTED RESULT
The test passes in under a minute

SOFTWARE/OS VERSIONS
Seen on the most recent archlinux and on the yocto poky build

ADDITIONAL INFORMATION
Might be related to https://bugs.kde.org/show_bug.cgi?id=338633; the differences are glibc, the gdb version, and x86-64.
This glibc commit (present in 2.33 but not in 2.32) is causing the test to fail. When I rebuilt glibc with a patch that reverts this commit, the test passes.
```
commit 2433d39b69743f100f972e7886f91a2e21795ef0
Author: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Date:   Mon Jul 6 16:06:51 2020 -0300

    linux: Add time64 select support
```
Thanks for tracking this down to that specific glibc commit. It changes which select system call is called depending on the architecture and kernel version. This might be related to https://bugs.kde.org/show_bug.cgi?id=338633 where we disable the nlcontrolc.vgtest on arm64 because it hangs (speculated to be because arm64 doesn't provide a traditional select system call).
Here is a working call to sleepers *without* the select update patch from 2.33. Before that patch, glibc passed the timeout directly to the underlying syscall; with 2.33, it no longer does.
```
root@qemux86-64:/usr/lib/valgrind/ptest/gdbserver_tests# valgrind --trace-syscalls=yes ./sleepers 1000000000 1000000000 1000000000 BSBSBSBS 2>&1 | grep select
SYSCALL[6021,2](23) sys_select ( 0, 0x0, 0x0, 0x0, 0x4041b0 ) --> [async] ...
SYSCALL[6021,3](23) sys_select ( 0, 0x0, 0x0, 0x0, 0x4041c0 ) --> [async] ...
SYSCALL[6021,1](23) sys_select ( 0, 0x0, 0x0, 0x0, 0x4041a0 ) --> [async] ...
SYSCALL[6021,4](23) sys_select ( 0, 0x0, 0x0, 0x0, 0x4041d0 ) --> [async] ...
```
glibc used to pass the timeout directly to `select`; now it calls `pselect6`:
```
-__select (int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds,
-          struct timeval *timeout)
+__select64 (int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds,
+            struct __timeval64 *timeout)
 {
-#ifdef __NR_select
-  return SYSCALL_CANCEL (select, nfds, readfds, writefds, exceptfds,
-                         timeout);
```
Here is how it makes the call now:
```
+  r = SYSCALL_CANCEL (pselect6, nfds, readfds, writefds, exceptfds, pts32,
+                      NULL);
```
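To make the consequence of that change concrete, here is a minimal sketch of a `select`-style wrapper built on a `pselect`-style call. It is not glibc's code: `select_via_pselect` is a hypothetical name, it uses the public `pselect()` rather than the raw `pselect6` syscall, and unlike a real libc it makes no attempt to write the remaining time back. The only point it illustrates is that the kernel is handed a converted local copy, never the caller's `struct timeval`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <time.h>

/* Hypothetical stand-in for a select() wrapper layered on pselect().
   Illustration only; this is NOT the glibc implementation. */
static int select_via_pselect(int nfds, fd_set *readfds, fd_set *writefds,
                              fd_set *exceptfds, struct timeval *timeout)
{
    struct timespec ts;          /* local copy, lives on this stack frame */
    struct timespec *pts = NULL; /* NULL means "block indefinitely" */

    if (timeout != NULL) {
        /* Sketch precondition: a normalized timeval. */
        assert(timeout->tv_usec >= 0 && timeout->tv_usec < 1000000);
        /* Copy and convert microseconds to nanoseconds. */
        ts.tv_sec = timeout->tv_sec;
        ts.tv_nsec = timeout->tv_usec * 1000L;
        pts = &ts;
    }

    /* From here on only `ts` is visible to the call below. A write to
       *timeout from another thread or a debugger is never read. */
    return pselect(nfds, readfds, writefds, exceptfds, pts, NULL);
}
```

Calling `select_via_pselect(0, NULL, NULL, NULL, &tv)` with a short timeout simply sleeps for that long. With the old direct `select` path, the pointer seen in the trace above (for example `0x4041b0`) is the address the syscall itself receives; with a wrapper shaped like this sketch, it would be the address of the local copy instead.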
Here are my findings, as summarized on the yocto bugboard: https://bugzilla.yoctoproject.org/show_bug.cgi?id=14223

## what changed in this patch to cause it to fail?
The timeout argument the user passes no longer makes its way to the syscall. Glibc copies it and converts it into a different format in order to call a different syscall, `pselect6`. The failing test tries to modify that timeout argument to make the syscall end faster. Unfortunately, that no longer works.

## what about actually fixing this bug though?
Talk to valgrind about the purpose of the nlcontrolc test:
- can we use a different syscall to sleep for a duration?
- what is the exact purpose of this test?

## other questions that arise
Is glibc setting a new standard? What is the expected libc implementation of select? Grey area... According to `man select`:
```
On Linux, select() modifies timeout to reflect the amount of time not slept; most other implementations do not do this. (POSIX.1-2001 permits either behavior.)
```
Did glibc violate the "linux standard"?
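One way to probe the man-page statement quoted above is a small experiment: call `select()` on a descriptor that is already readable, so it returns at once, and then inspect what was left in the timeout. The sketch below is an assumption-laden illustration, not part of the test suite; `select_on_ready_pipe` is a hypothetical helper name. Note that it only observes the value *after* `select()` returns, which is all the man page talks about; it says nothing about writes made to the struct while the call is still blocked.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: run select() on a pipe that already holds one byte,
   so the call returns immediately. Whatever select() stored in *tv is left
   there for the caller to inspect. Returns select()'s result, or -1 if the
   pipe could not be set up. */
static int select_on_ready_pipe(struct timeval *tv)
{
    int fds[2];
    fd_set rfds;
    int rc;

    assert(tv != NULL);
    if (pipe(fds) != 0)
        return -1;
    if (write(fds[1], "x", 1) != 1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    FD_ZERO(&rfds);
    FD_SET(fds[0], &rfds);
    /* The read end is readable, so this should not block. */
    rc = select(fds[0] + 1, &rfds, NULL, NULL, tv);

    close(fds[0]);
    close(fds[1]);
    return rc;
}
```

Passing in a timeout of, say, 5 seconds and printing `tv_sec` and `tv_usec` afterwards shows whether the libc and kernel in use report the time not slept or leave the value untouched.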
*** Bug 427931 has been marked as a duplicate of this bug. ***
Currently nlcontrolc works because gdbserver_tests/nlcontrolc.stdinB.gdb does:

    # Here, all tasks should be blocked in a loooonnnng select, all in WaitSys
    info threads
    # We will unblock them by changing their timeout argument
    # To avoid going into the frame where the timeval arg is,
    # it has been defined as global variables, as the nr
    # of calls on the stack differs between 32bits and 64bits,
    # and/or between OS.
    # ensure select finishes in a few milliseconds max:
    p t[0].tv_sec = 0
    p t[1].tv_sec = 0
    p t[2].tv_sec = 0
    p t[3].tv_sec = 0

First, I am surprised this works. Once the thread is stuck in the select system call, it seems that changing the user-space tv_sec shouldn't have any effect on the select call in progress. Also, each new select call will reset the tv_sec:

    t[s->t].tv_sec = sleepms / 1000;
    t[s->t].tv_usec = (sleepms % 1000) * 1000;

And sleepms won't change. So it seems it only worked by accident. It doesn't seem to work on other kernels, as stated in nlcontrolc.vgtest:

    # This test is disabled on Solaris because modifying select/poll/ppoll timeout
    # has no effect if a thread is already blocked in that syscall.

Now that glibc always seems to call pselect6, for which it has to copy and translate the given timeval to a timespec, the GNU/Linux implementation also won't work anymore with this testcase. This also explains why it never worked on arm64: arm64 doesn't have a plain select syscall, so glibc was already translating the given timeval there.

Ideally we fix this by interrupting the select syscalls some other way. But I don't know how to do that.
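For reference, one general way to get a thread out of a blocked `select()` without touching its timeout is to deliver a signal that has a handler installed; the call then fails with `EINTR`. The sketch below shows only that generic technique in a single-threaded program. It is not taken from the valgrind sources and is not a claim about how the test should or will be fixed; `select_interrupted_by_signal` and `on_alarm` are hypothetical names, and the use of `SIGALRM` with `setitimer` is just a convenient way to make the example self-contained.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/select.h>
#include <sys/time.h>

/* Empty handler: having a handler installed is what makes signal delivery
   interrupt select() instead of terminating the process. */
static void on_alarm(int signo)
{
    (void)signo;
}

/* Hypothetical helper: block in select() for up to max_sec seconds, with a
   one-shot SIGALRM armed to fire after wake_ms milliseconds. Returns
   select()'s result and stores its errno (or 0) in *err. Returns -2 if the
   handler or timer could not be set up. */
static int select_interrupted_by_signal(long max_sec, long wake_ms, int *err)
{
    struct sigaction sa;
    struct itimerval it;
    struct timeval tv;
    int rc;

    assert(err != NULL);
    assert(wake_ms > 0); /* a zero it_value would disarm the timer */

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) != 0) {
        *err = errno;
        return -2;
    }

    memset(&it, 0, sizeof it); /* it_interval = 0: fire once only */
    it.it_value.tv_sec = wake_ms / 1000;
    it.it_value.tv_usec = (wake_ms % 1000) * 1000;
    if (setitimer(ITIMER_REAL, &it, NULL) != 0) {
        *err = errno;
        return -2;
    }

    tv.tv_sec = max_sec;
    tv.tv_usec = 0;
    /* No descriptors: this is a pure sleep, like the one in sleepers. */
    rc = select(0, NULL, NULL, NULL, &tv);
    *err = (rc == -1) ? errno : 0;
    return rc;
}
```

With `max_sec` set to 60 and `wake_ms` set to 50, the call comes back after roughly 50 ms with `-1` and `EINTR` rather than sleeping for a minute. Whether something along these lines is practical for threads running under the valgrind gdbserver is a separate question that this sketch does not answer.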
Created attachment 136473 [details]
fix nlcontrolc.vgtest blocking on arm64 or newer glibc

The attached patch should fix the blockage. Tested on debian 10/amd64 and on an arm64 platform.
(In reply to Philippe Waroquiers from comment #7)
> Created attachment 136473 [details]
> fix nlcontrolc.vgtest blocking on arm64 or newer glibc
>
> Attach patch should fix the blockage. Tested on debian 10/amd64 and on an
> arm64 platform.

Tested on x86_64 against glibc 2.17, 2.32, and 2.33.9000, and on arm64 against glibc 2.33. Passed on all.
Fixed in c79180a3
Tested on qemuarm64 and qemux86-64 with glibc 2.33. Thanks.