501365 – unhandled amd64-linux syscall: 323 (userfaultfd)

Bug 501365 - unhandled amd64-linux syscall: 323 (userfaultfd)

Summary: unhandled amd64-linux syscall: 323 (userfaultfd)

Status:	RESOLVED FIXED

Alias:	None

Product:	valgrind
Classification:	Developer tools
Component:	general (show other bugs)
Version:	3.24.0
Platform:	RedHat Enterprise Linux Linux

Importance:	NOR minor
Target Milestone:	---
Assignee:	mcermak

URL:
Keywords:

Depends on:
Blocks:

Reported:	2025-03-11 15:33 UTC by mcermak
Modified:	2025-03-18 17:28 UTC (History)
CC List:	2 users (show)

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:

Attachments
possible patch (2.79 KB, patch) 2025-03-13 09:33 UTC, mcermak	Details
possible patch (2.38 KB, patch) 2025-03-17 09:28 UTC, mcermak	Details
possible patch (4.76 KB, patch) 2025-03-18 09:40 UTC, mcermak	Details
possible patch (8.78 KB, patch) 2025-03-18 15:23 UTC, mcermak	Details
View All Add an attachment

Note You need to log in before you can comment on or make changes to this bug.

Description mcermak 2025-03-11 15:33:10 UTC

Running LTP on RHEL-10 kernel: 6.12.0-55.2.1.el10_0.x86_64 shows unhandled amd64-linux syscall: 323.

$ valgrind -q --tool=none --log-file=xxx accept03
tst_test.c:1900: TINFO: LTP version: 20250130
tst_test.c:1904: TINFO: Tested kernel: 6.12.0-55.2.1.el10_0.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Mar  3 04:14:37 EST 2025 x86_64
tst_kconfig.c:88: TINFO: Parsing kernel config '/lib/modules/6.12.0-55.2.1.el10_0.x86_64/config'
tst_test.c:1722: TINFO: Overall timeout per run is 0h 00m 30s
accept03.c:48: TPASS: accept() on file : ENOTSOCK (88)
accept03.c:48: TPASS: accept() on O_PATH file : EBADF (9)
accept03.c:48: TPASS: accept() on directory : ENOTSOCK (88)
accept03.c:48: TPASS: accept() on /dev/zero : ENOTSOCK (88)
accept03.c:48: TPASS: accept() on /proc/self/maps : ENOTSOCK (88)
accept03.c:48: TPASS: accept() on pipe read end : ENOTSOCK (88)
accept03.c:48: TPASS: accept() on pipe write end : ENOTSOCK (88)
accept03.c:48: TPASS: accept() on epoll : ENOTSOCK (88)
accept03.c:48: TPASS: accept() on eventfd : ENOTSOCK (88)
accept03.c:48: TPASS: accept() on signalfd : ENOTSOCK (88)
accept03.c:48: TPASS: accept() on timerfd : ENOTSOCK (88)
accept03.c:48: TPASS: accept() on pidfd : ENOTSOCK (88)
accept03.c:48: TPASS: accept() on fanotify : ENOTSOCK (88)
accept03.c:48: TPASS: accept() on inotify : ENOTSOCK (88)
tst_fd.c:170: TCONF: Skipping userfaultfd: ENOSYS (38)
accept03.c:48: TPASS: accept() on perf event : ENOTSOCK (88)
tst_fd.c:199: TCONF: Skipping io uring: EPERM (1)
accept03.c:48: TPASS: accept() on bpf map : ENOTSOCK (88)
accept03.c:48: TPASS: accept() on fsopen : ENOTSOCK (88)
accept03.c:48: TPASS: accept() on fspick : ENOTSOCK (88)
accept03.c:48: TPASS: accept() on open_tree : EBADF (9)
accept03.c:48: TPASS: accept() on memfd : ENOTSOCK (88)
accept03.c:48: TPASS: accept() on memfd secret : ENOTSOCK (88)

Summary:
passed   21
failed   0
broken   0
skipped  2
warnings 0
$ cat xxx
--19712-- WARNING: unhandled amd64-linux syscall: 323
--19712-- You may be able to write your own handler.
--19712-- Read the file README_MISSING_SYSCALL_OR_IOCTL.
--19712-- Nevertheless we consider this a bug.  Please report
--19712-- it at http://valgrind.org/support/bug_reports.html.
$ rpm -q valgrind
valgrind-3.24.0-3.el10.x86_64
$

Comment 1 mcermak 2025-03-11 15:58:57 UTC

Same test results with valgrind-3.25.0.GIT.

Comment 2 mcermak 2025-03-13 09:33:05 UTC

Created attachment 179365 [details]
possible patch

Comment 3 Paul Floyd 2025-03-14 05:31:22 UTC

I don’t think that this should be generic as it’s Linux only.

It looks to me as if this creates a file descriptor. If so it needs to use the fd tracking mechanism.

Is it OK to leave everything to the kernel? Or are there cases where VEX needs to write to the fd?

Comment 4 Mark Wielaard 2025-03-14 14:51:17 UTC

(In reply to Paul Floyd from comment #3)
> I don’t think that this should be generic as it’s Linux only.

Yes, this should be moved from syswrap-generic.c to syswrap-linux.c
And it should then use LINX_ instead of GENX_.

> It looks to me as if this creates a file descriptor. If so it needs to use
> the fd tracking mechanism.

Yes, so this needs a POST (called after the syscall succeeds)
that does the file descriptor tracking, something like:

POST(sys_userfaultfd)
{
   if (!ML_(fd_allowed)(RES, "userfaultfd", tid, True)) {
      VG_(close)(RES);
      SET_STATUS_Failure( VKI_EMFILE );
   } else {
      if (VG_(clo_track_fds))
         ML_(record_fd_open_nameless)(tid, RES);
   }
}

And when a syscall wrapper has a POST it needs to be registered with XY (X has PRE, Y has POST).
So LINXY in the syswrap-<arch>-linux.c syscall_main_table array.

BTW. Although this isn't a syscall defined in include/vki/vki-scnums-shared-linux.h, __NR_userfaultfd is defined for amd64, arm, arm64, mips32, mips64, nanomips, ppc32, ppc64, riscv64 and x86. And it is called similarly on each linux arch, so it really should/can be hooked up in all  coregrind/m_syswrap/syswrap-*-linux.c files.

> Is it OK to leave everything to the kernel? Or are there cases where VEX
> needs to write to the fd?

It is ok for this syscall just creating the fd. But when it get configured we might have to intercept/discard reads from it if they trigger for pages allocated for Valgrind itself. See https://www.man7.org/linux/man-pages/man2/ioctl_userfaultfd.2.html but we can do that when we encounter programs that actually do that.

BTW. I don't think the *flags |= SfMayBlock; in the PRE handler is necessary in this case. It is if the syscall may block and might depend on another thread (or some other event) calling back into the kernel. It might in the case of FUSE file systems for example for file descriptor creation, but I don't think userfaultfd file description creation is, it should return immediately.

Also you want to use PRE_REG_READ1 not PRE_REG_READ0 (which is for syscalls without arguments).
The first two arguments of the PRE_REG_READ? macros are the return time and syscall name, the rest are the argument types and names.
You can check if you got those right using ./vg-in-place --trace-syscalls ...

Comment 5 mcermak 2025-03-17 09:28:21 UTC

Created attachment 179492 [details]
possible patch

Hello guys, thank you for all the comments.  I've found existing wrapper for sys_epoll_create that seems to be close to this new sys_userfaultfd, and modelled the new wrapper after it.  Does the new patch look fine?

Comment 6 Mark Wielaard 2025-03-17 13:14:03 UTC

(In reply to mcermak from comment #5)
> Created attachment 179492 [details]
> possible patch
> 
> Hello guys, thank you for all the comments.  I've found existing wrapper for
> sys_epoll_create that seems to be close to this new sys_userfaultfd, and
> modelled the new wrapper after it.  Does the new patch look fine?

Yes, that looks good. Thanks.

You are only hooking this up for amd64-linux, but __NR_userfaultfd is defined for all linux arches in:
include/vki/vki-scnums-amd64-linux.h
include/vki/vki-scnums-arm-linux.h
include/vki/vki-scnums-arm64-linux.h
include/vki/vki-scnums-mips32-linux.h
include/vki/vki-scnums-mips64-linux.h
include/vki/vki-scnums-nanomips-linux.h
include/vki/vki-scnums-ppc32-linux.h
include/vki/vki-scnums-ppc64-linux.h
include/vki/vki-scnums-riscv64-linux.h
include/vki/vki-scnums-x86-linux.h

So you really can/should hook this up also in:
coregrind/m_syswrap/syswrap-arm-linux.c
coregrind/m_syswrap/syswrap-arm64-linux.c
coregrind/m_syswrap/syswrap-mips32-linux.c
coregrind/m_syswrap/syswrap-mips64-linux.c
coregrind/m_syswrap/syswrap-nanomips-linux.c
coregrind/m_syswrap/syswrap-ppc32-linux.c
coregrind/m_syswrap/syswrap-ppc64-linux.c
coregrind/m_syswrap/syswrap-riscv64-linux.c
coregrind/m_syswrap/syswrap-s390x-linux.c
coregrind/m_syswrap/syswrap-x86-linux.c

Comment 7 mcermak 2025-03-18 09:40:39 UTC

Created attachment 179531 [details]
possible patch

I've grabbed aarch64, ppc64le and s390x test systems, and tested that this updated patch works on those arches.  Does the patch look OK?  Should I add hook up the remaining arches (I don't have access to those) without true testing?

Comment 8 Mark Wielaard 2025-03-18 10:07:10 UTC

(In reply to mcermak from comment #7)
> I've grabbed aarch64, ppc64le and s390x test systems, and tested that this
> updated patch works on those arches.  Does the patch look OK?

Yes, thanks.
And you saw that I lied (sorry, not intentionally!) and __NR_userfaultfd wasn't defined for linux-s390x.
Thanks for looking up the correct number.

>  Should I add hook up the remaining arches (I don't have access to those) without true
> testing?

Yes please, this code is simple enough that it should be fine on any arch that has the userfaultfd define already.
Note that most code is build for all arches, so you should at least see any compile errors (like the missing __NR_userfaultfd on s390x).
And that we have nightly testers and the sourceware buildbot for various arches:
https://builder.sourceware.org/buildbot/#/builders?tags=valgrind

Comment 9 mcermak 2025-03-18 12:59:44 UTC

> Yes please, this code is simple enough

Mark, is it really only about "hooking" the existing PRE() and POST() up?  I've noticed that for instance i386 passes sycall params via stack (as opposed to amd64 where this gets passed via regs).  The existing PRE() relies on register parameter passing.  Not sure about other arches.  Thoughts?

Comment 10 Paul Floyd 2025-03-18 13:32:34 UTC

(In reply to mcermak from comment #9)
> > Yes please, this code is simple enough
> 
> Mark, is it really only about "hooking" the existing PRE() and POST() up? 
> I've noticed that for instance i386 passes sycall params via stack (as
> opposed to amd64 where this gets passed via regs).  The existing PRE()
> relies on register parameter passing.  Not sure about other arches. 
> Thoughts?

Marshalling syscall args to/from registers/stack from/to the syscall is handled elsewhere.

The only issue the sometimes arises is the case where a 32bit OS syscall has a 64bit argument. In that case a separate syscall wrapper is needed because the 64bit arguemnt will get split into two 32 bit ones. As an example take a look at fallocate.

Comment 11 mcermak 2025-03-18 15:23:12 UTC

Created attachment 179542 [details]
possible patch

Gotcha, thank you Paul.  I've added the hooks for the remaining arches too, and carefully eyeballed it.  I've requested gcc compile farm account so that next time I can hopefully test my updates on all needed arches.  Does the attached patch look good now?

Comment 12 Mark Wielaard 2025-03-18 17:28:22 UTC

Patch does look good, pushed as:

commit f697142e15c44c2d3e3910ae436a6597e8782aea
Author: Martin Cermak <mcermak@redhat.com>
Date:   Tue Mar 18 18:17:30 2025 +0100

    Wrap linux specific userfaultfd syscall
    
    userfaultfd takes a flags argument and returns a file descriptor.
    It shows up in the Linux Test Project syscalls tests as unhandled.
    
    Declare a sys_userfaultfd wrapper in priv_syswrap-linux.h and hook it
    for {amd64,arm,arm64,mips64,nanomips,ppc32,ppc64,riscv64,s390x,x86}-linux
    using LINXY with PRE/POST handlers in syswrap-linux.c.
    
    Define __NR_userfaultfd in vki-scnums-s390x-linux.h. It was already
    defined for all other arches.
    
    https://bugs.kde.org/show_bug.cgi?id=501365