Bug 498936

Summary:	POSIX timer signal not being delivered
Product:	[Developer tools] valgrind	Reporter:	Tavian Barnes <tavianator>
Component:	general	Assignee:	Paul Floyd <pjfloyd>
Status:	RESOLVED NOT A BUG
Severity:	normal	CC:	pjfloyd
Priority:	NOR
Version First Reported In:	3.24 GIT
Target Milestone:	---
Platform:	Arch Linux
OS:	Linux
Latest Commit:		Version Fixed/Implemented In:
Sentry Crash Report:
Attachments:	Test case

Description Tavian Barnes 2025-01-20 20:16:12 UTC

Created attachment 177557
Test case

The attached test case creates a POSIX timer that should deliver SIGALRM every 100 microseconds, creates a background thread, blocks SIGALRM in the main thread, and then waits for the SIGALRM handler to fire 100 times. It runs fine natively, but under Valgrind it hangs forever:

$ gcc -g -Wall vgalrm.c -o vgalrm
$ ./vgalrm
$ valgrind ./vgalrm
==2019420== Memcheck, a memory error detector
==2019420== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==2019420== Using Valgrind-3.24.0 and LibVEX; rerun with -h for copyright info
==2019420== Command: ./vgalrm
==2019420==
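
The attachment itself is not reproduced in this report. As a rough illustration of the steps described above, here is a minimal sketch; it is not the attached vgalrm.c. The names `wait_for_ticks`, `on_alarm` and `spin`, the choice of CLOCK_MONOTONIC, and the busy-wait loops are all assumptions made for the example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <stddef.h>
#include <time.h>

static atomic_int ticks;   /* number of times the SIGALRM handler has run */
static atomic_bool done;   /* tells the background thread to exit */

/* Signal handler: only touches a lock-free atomic. */
static void on_alarm(int sig) {
	(void)sig;
	atomic_fetch_add(&ticks, 1);
}

/* Background thread: spins with SIGALRM unblocked (assumed behaviour). */
static void *spin(void *arg) {
	(void)arg;
	while (!atomic_load(&done));
	return NULL;
}

/* Hypothetical helper: returns the tick count reached, or -1 on setup failure. */
int wait_for_ticks(int target) {
	struct sigaction sa = {0};
	sa.sa_handler = on_alarm;
	sa.sa_flags = SA_RESTART;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGALRM, &sa, NULL) != 0)
		return -1;

	/* Periodic timer: SIGALRM every 100 microseconds. */
	struct sigevent sev = {0};
	sev.sigev_notify = SIGEV_SIGNAL;
	sev.sigev_signo = SIGALRM;
	timer_t timer;
	if (timer_create(CLOCK_MONOTONIC, &sev, &timer) != 0)
		return -1;
	struct itimerspec its = {
		.it_interval = { .tv_sec = 0, .tv_nsec = 100000 },
		.it_value = { .tv_sec = 0, .tv_nsec = 100000 },
	};
	if (timer_settime(timer, 0, &its, NULL) != 0)
		return -1;

	/* The background thread inherits a mask with SIGALRM unblocked. */
	pthread_t thread;
	if (pthread_create(&thread, NULL, spin, NULL) != 0)
		return -1;

	/* From here on, only the background thread can take the signal. */
	sigset_t set;
	sigemptyset(&set);
	sigaddset(&set, SIGALRM);
	pthread_sigmask(SIG_BLOCK, &set, NULL);

	/* Wait for the handler to fire `target` times. */
	while (atomic_load(&ticks) < target);

	timer_delete(timer);
	atomic_store(&done, 1);
	pthread_join(thread, NULL);
	return atomic_load(&ticks);
}
```

Calling `wait_for_ticks(100)` from `main` matches the description. On older glibc versions the program may need `-pthread -lrt` at link time.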

Comment 1 Tavian Barnes 2025-01-21 15:41:19 UTC

Okay, here's a further reduced test case:

```
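#include <pthread.h>
#include <stdatomic.h>

// Flags shared between the main thread and the worker.
atomic_bool start, stop;

void *work(void *ptr) {
	start = 1;       // tell main that the worker is running
	while (!stop);   // spin until main asks the worker to stop
	return NULL;
}

int main(void) {
	pthread_t thread;
	pthread_create(&thread, NULL, work, NULL);
	while (!start);  // spin until the worker has started
	stop = 1;
	pthread_join(thread, NULL);
	return 0;
}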
```

When I run this under Valgrind, it hangs. Attaching GDB to Valgrind itself, it looks like the new thread (thread 2) is blocked trying to acquire the scheduler lock:

(gdb) bt
#0  0x000000005801e839 in vgMemCheck_helperc_LOADV8 (a=1097769) at /usr/src/debug/valgrind/valgrind-3.24.0/memcheck/mc_main.c:5637
#1  0x0000001002e8d7d9 in ?? ()
#2  0x0000001002da9e80 in ?? ()
#3  0x0000000059a98040 in ?? ()
#4  0x0000000059a98038 in vgPlain_brk_limit ()
#5  0x0000000000000000 in ?? ()
(gdb) thread 2
[Switching to thread 2 (LWP 2026357)]
#0  0x00000000580010e2 in do_syscall_WRK ()
(gdb) bt
#0  0x00000000580010e2 in do_syscall_WRK ()
#1  0x000000005804d1b8 in vgPlain_do_syscall (a7=0, a8=0, sysno=0, a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=0, a5=0, a6=0) at ../coregrind/m_syscall.c:1183
#2  vgPlain_read (fd=<optimized out>, buf=<optimized out>, count=<optimized out>) at ../coregrind/m_libcfile.c:338
#3  0x00000000580d3914 in vgModuleLocal_sema_down (as_LL=0 '\000', sema=0x10020063f0) at ../coregrind/m_scheduler/sema.c:107
#4  acquire_sched_lock (p=0x10020063f0) at ../coregrind/m_scheduler/sched-lock-generic.c:69
#5  0x0000000058099d77 in vgModuleLocal_acquire_sched_lock (p=<optimized out>) at ../coregrind/m_scheduler/sched-lock.c:88
#6  vgPlain_acquire_BigLock_LL (who=0x0) at ../coregrind/m_scheduler/scheduler.c:422
#7  vgPlain_acquire_BigLock (tid=2, who=0x5827ec58 "thread_wrapper(starting new thread)") at ../coregrind/m_scheduler/scheduler.c:346
#8  0x00000000580db229 in thread_wrapper (tidW=2) at ../coregrind/m_syswrap/syswrap-linux.c:83
#9  run_a_thread_NORETURN (tidW=2) at ../coregrind/m_syswrap/syswrap-linux.c:155
#10 0x00000000580db6bf in vgModuleLocal_start_thread_NORETURN (arg=<optimized out>) at ../coregrind/m_syswrap/syswrap-linux.c:329
#11 0x0000000058001181 in do_syscall_clone_amd64_linux ()
#12 0xdeadbeefdeadbeef in ?? ()
#13 0xdeadbeefdeadbeef in ?? ()
#14 0xdeadbeefdeadbeef in ?? ()
#15 0xdeadbeefdeadbeef in ?? ()
#16 0x0000000000000000 in ?? ()

Comment 2 Tavian Barnes 2025-01-21 19:16:21 UTC

--fair-sched=yes seems to work around it
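
A source-level alternative, assuming the hang comes from the busy-wait loops in the reduced test case: replace the spinning with blocking waits, so that a waiting thread sleeps in the kernel instead of running instrumented code. This is only a sketch. The names `set_flag`, `wait_flag` and `run_handshake` are made up for the example, and it has not been checked against the original attachment.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static bool start_flag, stop_flag;

/* Set a flag and wake every thread waiting on the condition variable. */
static void set_flag(bool *flag) {
	pthread_mutex_lock(&mu);
	*flag = true;
	pthread_cond_broadcast(&cv);
	pthread_mutex_unlock(&mu);
}

/* Block (rather than spin) until the flag has been set. */
static void wait_flag(bool *flag) {
	pthread_mutex_lock(&mu);
	while (!*flag)
		pthread_cond_wait(&cv, &mu);
	pthread_mutex_unlock(&mu);
}

static void *work(void *ptr) {
	(void)ptr;
	set_flag(&start_flag);   /* was: start = 1; */
	wait_flag(&stop_flag);   /* was: while (!stop); */
	return NULL;
}

/* Same handshake as the reduced test case; returns 0 on success. */
int run_handshake(void) {
	pthread_t thread;
	if (pthread_create(&thread, NULL, work, NULL) != 0)
		return -1;
	wait_flag(&start_flag);  /* was: while (!start); */
	set_flag(&stop_flag);    /* was: stop = 1; */
	return pthread_join(thread, NULL) == 0 ? 0 : -1;
}
```

This changes what the test exercises (it no longer spins), so it is a workaround for real programs rather than a replacement for the reproducer.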

Comment 3 Paul Floyd 2025-01-24 09:39:37 UTC

(In reply to Tavian Barnes from comment #2)
> --fair-sched=yes seems to work around it

OK, can we close this item?

Comment 4 Tavian Barnes 2025-01-24 14:56:49 UTC

I guess?  I'm surprised that valgrind will totally starve a thread by default.

This may be a dupe of https://bugs.kde.org/show_bug.cgi?id=343357 anyway.

Comment 5 Paul Floyd 2025-01-24 20:40:01 UTC

(In reply to Tavian Barnes from comment #4)
> I guess?  I'm surprised that valgrind will totally starve a thread by
> default.
> 
> This may be a dupe of https://bugs.kde.org/show_bug.cgi?id=343357 anyway.

It can happen. The default Valgrind scheduler is, well, no scheduling: there's just a global lock (implemented with reads and writes on a FIFO). When the running thread releases the lock, it's pot luck as to which thread gets to take it next. I suspect there are cases where the running thread is hot in the cache and so keeps retaking the lock.
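
To illustrate the kind of lock described here, below is a minimal sketch of a pipe-based lock: one token byte sits in a pipe, acquiring the lock reads it, and releasing writes it back. This is not Valgrind's code. `pipe_lock`, `pipe_lock_demo` and the other names are hypothetical, and the only link to the real implementation is that the backtrace in comment 1 shows `sema_down` ending in a `read`.

```c
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

typedef struct { int rd, wr; } pipe_lock;

/* Create the pipe and put the single token into it (lock starts free). */
static int pipe_lock_init(pipe_lock *l) {
	int fds[2];
	char token = 'T';
	if (pipe(fds) != 0)
		return -1;
	l->rd = fds[0];
	l->wr = fds[1];
	if (write(l->wr, &token, 1) != 1) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	return 0;
}

/* Take the token; blocks in read() while another thread holds it. */
static int pipe_lock_acquire(pipe_lock *l) {
	char token;
	ssize_t n;
	do {
		n = read(l->rd, &token, 1);
	} while (n < 0 && errno == EINTR);
	return n == 1 ? 0 : -1;
}

/* Put the token back; the kernel decides which blocked reader gets it. */
static int pipe_lock_release(pipe_lock *l) {
	char token = 'T';
	ssize_t n;
	do {
		n = write(l->wr, &token, 1);
	} while (n < 0 && errno == EINTR);
	return n == 1 ? 0 : -1;
}

struct demo {
	pipe_lock lock;
	atomic_int inside;    /* threads currently holding the lock */
	atomic_int problems;  /* mutual-exclusion violations or I/O errors */
	int iters;
};

static void *demo_worker(void *p) {
	struct demo *d = p;
	for (int i = 0; i < d->iters; i++) {
		if (pipe_lock_acquire(&d->lock) != 0) {
			atomic_fetch_add(&d->problems, 1);
			break;
		}
		/* If anyone else is inside, the lock failed to exclude them. */
		if (atomic_fetch_add(&d->inside, 1) != 0)
			atomic_fetch_add(&d->problems, 1);
		atomic_fetch_sub(&d->inside, 1);
		if (pipe_lock_release(&d->lock) != 0) {
			atomic_fetch_add(&d->problems, 1);
			break;
		}
	}
	return NULL;
}

/* Returns the number of problems seen (0 is expected), or -1 on setup failure. */
int pipe_lock_demo(int nthreads, int iters) {
	enum { MAX_THREADS = 8 };
	pthread_t threads[MAX_THREADS];
	struct demo d = { .iters = iters };
	if (nthreads < 1 || nthreads > MAX_THREADS || pipe_lock_init(&d.lock) != 0)
		return -1;
	int started = 0;
	for (; started < nthreads; started++)
		if (pthread_create(&threads[started], NULL, demo_worker, &d) != 0)
			break;
	for (int i = 0; i < started; i++)
		pthread_join(threads[i], NULL);
	close(d.lock.rd);
	close(d.lock.wr);
	return started == nthreads ? atomic_load(&d.problems) : -1;
}
```

The lock gives mutual exclusion but nothing in it orders the waiters. A thread that releases and immediately re-acquires can read its own token back before a blocked waiter gets to run, which is the starvation pattern discussed above.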