Bug 432870 - gdbserver_tests:nlcontrolc hangs with newest glibc2.33 x86-64
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: general
Version: unspecified
Platform: Arch Linux Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Julian Seward
URL:
Keywords:
Duplicates: 427931
Depends on:
Blocks:
 
Reported: 2021-02-12 20:52 UTC by Yi Fan Yu
Modified: 2021-03-08 21:39 UTC (History)

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:


Attachments
fix nlcontrolc.vgtest blocking on arm64 or newer glibc (8.18 KB, text/plain)
2021-03-07 21:40 UTC, Philippe Waroquiers

Description Yi Fan Yu 2021-02-12 20:52:11 UTC
SUMMARY

nlcontrolc hangs. First seen on the `yocto poky` master build after
updating to glibc 2.33, and confirmed with an Arch Linux build for x86-64.

Seen on both valgrind master and the 3.16.1 release.

STEPS TO REPRODUCE
Get the most recent Arch Linux packages and the valgrind source repo, then:
1. ./configure  --without-mpicc --enable-tls
2. make 
3. make regtest

OBSERVED RESULT

nlcontrolc hangs for more than 15 min

EXPECTED RESULT

The test passes in under a minute.

SOFTWARE/OS VERSIONS
seen on most recent archlinux and yocto poky build

ADDITIONAL INFORMATION
Might be related to https://bugs.kde.org/show_bug.cgi?id=338633
The differences are the glibc version, the gdb version, and the architecture (x86-64).
Comment 1 Yi Fan Yu 2021-02-16 17:48:38 UTC
This glibc commit (present in 2.33 but not in 2.32) causes the test to fail.
When I rebuilt glibc with a patch reverting this commit, the test passed.

```
commit 2433d39b69743f100f972e7886f91a2e21795ef0
Author: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Date:   Mon Jul 6 16:06:51 2020 -0300

    linux: Add time64 select support

```
Comment 2 Mark Wielaard 2021-02-17 11:17:17 UTC
Thanks for tracking this down to that specific glibc commit.

It changes which select system call is called depending on the architecture and kernel version. This might be related to https://bugs.kde.org/show_bug.cgi?id=338633 where we disable the nlcontrolc.vgtest on arm64 because it hangs (speculated to be because arm64 doesn't provide a traditional select system call).
Comment 3 Yi Fan Yu 2021-02-17 17:41:48 UTC
Here is a working call to sleepers *without* the select update patch in 2.33.
glibc used to pass the timeout directly to the underlying syscall.
With 2.33, it no longer does.


```
root@qemux86-64:/usr/lib/valgrind/ptest/gdbserver_tests# valgrind --trace-syscalls=yes  ./sleepers 1000000000 1000000000 1000000000 BSBSBSBS 2>&1 | grep select
SYSCALL[6021,2](23) sys_select ( 0, 0x0, 0x0, 0x0, 0x4041b0 ) --> [async] ...
SYSCALL[6021,3](23) sys_select ( 0, 0x0, 0x0, 0x0, 0x4041c0 ) --> [async] ...
SYSCALL[6021,1](23) sys_select ( 0, 0x0, 0x0, 0x0, 0x4041a0 ) --> [async] ...
SYSCALL[6021,4](23) sys_select ( 0, 0x0, 0x0, 0x0, 0x4041d0 ) --> [async] ...
```

It used to pass the timeout directly to `select`; now it calls `pselect6`:
```
-__select (int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds,
-         struct timeval *timeout)
+__select64 (int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds,
+           struct __timeval64 *timeout)
 {
-#ifdef __NR_select
-  return SYSCALL_CANCEL (select, nfds, readfds, writefds, exceptfds,
-                        timeout);
```

Here is how it calls it now:
```
+  r = SYSCALL_CANCEL (pselect6, nfds, readfds, writefds, exceptfds, pts32,
+                     NULL);
```
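
To make the difference concrete, here is a minimal sketch (not glibc's actual code) of a select-style wrapper that sleeps through `pselect`. The names `timeval_to_timespec` and `select_via_pselect` are hypothetical. The point is that the syscall only ever receives a pointer to the local `timespec` copy, so a later write to the caller's `timeval`, as the test's gdb script does, is not visible to it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <time.h>

/* Hypothetical helper: convert a timeval (microseconds) into a
   timespec (nanoseconds). */
static struct timespec timeval_to_timespec(const struct timeval *tv)
{
    struct timespec ts;

    assert(tv != NULL);
    ts.tv_sec = tv->tv_sec;
    ts.tv_nsec = tv->tv_usec * 1000L;
    return ts;
}

/* Hypothetical select()-like wrapper: the caller's timeout is copied
   into a local timespec, and only that copy is handed to pselect(). */
static int select_via_pselect(int nfds, fd_set *readfds, fd_set *writefds,
                              fd_set *exceptfds, struct timeval *timeout)
{
    struct timespec ts;
    struct timespec *pts = NULL;

    if (timeout != NULL) {
        ts = timeval_to_timespec(timeout);
        pts = &ts;
    }
    return pselect(nfds, readfds, writefds, exceptfds, pts, NULL);
}
```

Calling `select_via_pselect(0, NULL, NULL, NULL, &tv)` sleeps for the duration that `tv` held at the time of the call; this sketch never reads `tv` again afterwards.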
Comment 4 Yi Fan Yu 2021-02-17 21:24:23 UTC
Here are my findings, as I summarized them on the Yocto bug tracker:
https://bugzilla.yoctoproject.org/show_bug.cgi?id=14223


## what changed in this patch to cause it to fail?

The timeout argument the user passes no longer makes its way to the syscall.
glibc copies it and converts it into a different format in order to call a different syscall, `pselect6`.

The failing test tries to modify that timeout argument to make the syscall end sooner. Unfortunately, this no longer works.

## what about actually fixing this bug though?

Talk to the valgrind developers about the purpose of the nlcontrolc test:
    - Can we use a different syscall to sleep for a duration?
    - What is the exact purpose of this test?

## other questions that arise
Is glibc setting a new standard? What is the expected libc implementation of select?
Grey area...

According to `man select`:
```
On Linux, select() modifies timeout to reflect the amount of time not slept; most other implementations do not do this. (POSIX.1-2001 permits either behavior.) 
```
Did glibc violate the "linux standard"?
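
The practical consequence of that man page note can be sketched as follows. This is an illustrative example, not part of the valgrind test suite, and `timeval_to_us` and `remaining_after_select` are hypothetical helpers. On Linux the value left in the timeout after a full sleep is typically zero, while other implementations may leave it untouched, so the sketch only assumes the result lies between zero and the requested duration.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical helper: flatten a timeval into microseconds. */
static long timeval_to_us(const struct timeval *tv)
{
    return (long)tv->tv_sec * 1000000L + (long)tv->tv_usec;
}

/* Hypothetical helper: sleep with select() on no file descriptors and
   report what is left in the timeout afterwards. The result depends on
   the implementation, since POSIX permits either behavior. */
static long remaining_after_select(long sleep_us)
{
    struct timeval tv;
    int r;

    tv.tv_sec = sleep_us / 1000000L;
    tv.tv_usec = sleep_us % 1000000L;

    r = select(0, NULL, NULL, NULL, &tv);
    assert(r == 0); /* no descriptors were watched, so this is a timeout */

    return timeval_to_us(&tv);
}
```

Portable callers therefore reinitialize the timeout before every call instead of relying on its contents after `select` returns.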
Comment 5 Mark Wielaard 2021-02-21 00:31:31 UTC
*** Bug 427931 has been marked as a duplicate of this bug. ***
Comment 6 Mark Wielaard 2021-02-27 23:09:36 UTC
Currently nlcontrolc works because gdbserver_tests/nlcontrolc.stdinB.gdb does:

# Here, all tasks should be blocked in a loooonnnng select, all in WaitSys
info threads
# We will unblock them by changing their timeout argument
# To avoid going into the frame where the timeval arg is,
# it has been defined as global variables, as the nr
# of calls on the stack differs between 32bits and 64bits,
# and/or between OS.
# ensure select finishes in a few milliseconds max:
p t[0].tv_sec = 0
p t[1].tv_sec = 0
p t[2].tv_sec = 0
p t[3].tv_sec = 0

First, I am surprised this works. Once the thread is stuck in the select system call, changing the user-space tv_sec should not have any effect on the select call in progress. Also, each new select call will reset tv_sec:

   t[s->t].tv_sec = sleepms / 1000;
   t[s->t].tv_usec = (sleepms % 1000) * 1000;

And sleepms won't change. So it seems it only worked by accident. It doesn't seem to work on other kernels, as stated in nlcontrolc.vgtest:

# This test is disabled on Solaris because modifying select/poll/ppoll timeout
# has no effect if a thread is already blocked in that syscall.

Now that glibc always seems to call pselect6 for which it has to copy and translate the given timeval to a timespec, the GNU/Linux implementation also won't work anymore with this testcase.

This also explains why it never worked on arm64, because that doesn't have a plain select syscall. So glibc was always translating the given timeval already.

Ideally we fix this by interrupting the select syscalls some other way. But I don't know how to do that.
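
One generic way to end a blocking `select` early is to deliver a signal whose handler returns, which makes the call fail with `EINTR`. The sketch below only illustrates that mechanism with an interval timer in a single thread. It is not taken from any patch attached to this bug, and `on_alarm` and `sleep_interruptibly` are hypothetical names.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/select.h>
#include <sys/time.h>

/* The handler does nothing; its only job is to make select() return. */
static void on_alarm(int signo)
{
    (void)signo;
}

/* Hypothetical helper: start a select() that would sleep for sleep_sec
   seconds and arrange for SIGALRM after wake_after_ms milliseconds.
   Returns 1 if select() was interrupted (EINTR), 0 if it timed out,
   and -1 on any other error. */
static int sleep_interruptibly(long sleep_sec, long wake_after_ms)
{
    struct sigaction sa;
    struct itimerval it;
    struct timeval tv;
    int r;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;

    memset(&it, 0, sizeof it); /* one-shot timer, no repeat interval */
    it.it_value.tv_sec = wake_after_ms / 1000;
    it.it_value.tv_usec = (wake_after_ms % 1000) * 1000;
    if (setitimer(ITIMER_REAL, &it, NULL) != 0)
        return -1;

    tv.tv_sec = sleep_sec;
    tv.tv_usec = 0;
    r = select(0, NULL, NULL, NULL, &tv);
    if (r == -1 && errno == EINTR)
        return 1;
    return r == 0 ? 0 : -1;
}
```

For example, `sleep_interruptibly(10, 50)` starts a ten-second sleep that is expected to be cut short after roughly 50 milliseconds. Whether a comparable approach would suit the multi-threaded sleepers program under gdbserver is a separate question.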
Comment 7 Philippe Waroquiers 2021-03-07 21:40:11 UTC
Created attachment 136473 [details]
fix nlcontrolc.vgtest blocking on arm64 or newer glibc

The attached patch should fix the blockage. Tested on Debian 10/amd64 and on an arm64 platform.
Comment 8 Mark Wielaard 2021-03-08 10:13:25 UTC
(In reply to Philippe Waroquiers from comment #7)
> Created attachment 136473 [details]
> fix nlcontrolc.vgtest blocking on arm64 or newer glibc
> 
> Attach patch should fix the blockage. Tested on debian 10/amd64 and on an
> arm64 platform.

Tested on x86_64 against glibc 2.17, 2.32, and 2.33.9000, on arm64 against glibc 2.33. Passed on all.
Comment 9 Philippe Waroquiers 2021-03-08 19:23:12 UTC
Fixed in c79180a3
Comment 10 Yi Fan Yu 2021-03-08 21:39:29 UTC
Tested on qemuarm64 and qemux86-64 with glibc 2.33.

Thanks.