492678 – SIGALRM race condition (sighandler called after timer disarmed)

Bug 492678 - SIGALRM race condition (sighandler called after timer disarmed)

Summary: SIGALRM race condition (sighandler called after timer disarmed)

Status:	RESOLVED FIXED

Alias:	None

Product:	valgrind
Classification:	Developer tools
Component:	general (other bugs)
Version First Reported In:	3.18.1
Platform:	Other Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	Paul Floyd

URL:
Keywords:

Depends on:
Blocks:

Reported:	2024-09-05 14:10 UTC by Clément
Modified:	2025-01-26 00:05 UTC (History)
CC List:	2 users (show)

See Also:
Latest Commit:
Version Fixed/Implemented In:
Sentry Crash Report:

Attachments
bug reproduction example (1.97 KB, text/x-csrc) 2024-09-05 14:10 UTC, Clément	Details
View All Add an attachment

Note You need to log in before you can comment on or make changes to this bug.

Description Clément 2024-09-05 14:10:22 UTC

Created attachment 173345 [details]
bug reproduction example

SUMMARY
There is a small chance that a SIGALRM is received AFTER that a timer has been disarmed.

STEPS TO REPRODUCE
1. setup a SIGALRM sighandler
2. setup a SIGALRM timer with setitimer (works better with a small clock period)
3. disarm the SIGALRM timer
4. disable the SIGALRM sighangler to use SIG_DFL
5. run it with Valgrind
or
1.  compile the attached example
2.  run it with Valgrind

OBSERVED RESULT
There is a small chance that the SIGALRM default handler is called after the timer has been disarmed.

EXPECTED RESULT
SIGALRM handler should not be called after the timer has been disarmed (which is what happen when not using Valgrind).

SOFTWARE/OS VERSIONS
Linux/Ubuntu: 6.8.0-40-generic #40~22.04.3-Ubuntu
GCC: 11.4.0

Comment 1 Paul Floyd 2024-09-09 06:15:18 UTC

Sounds like this is due to the way that Valgrind handles asynchronous signals. From memory, signals are masked in the host and get polled. So there may be a gap between the signal being sent and the host polling for it.

Does lowering --vex-guest-max-insns help with this?

Comment 2 Clément 2024-09-09 08:43:39 UTC

This indeed seems to be related to how Valgrind handles interruptions.
If there is a gap between the signal being sent and the handler being called, it's totally normal to observe this behavior.
https://valgrind.org/docs/manual/mc-tech-docs.html#mc-tech-docs.signals

I don't see any correlation with vex-guest-max-insns thought:

| vex-guest-max-insns | failures out of 1.000.000 runs |
|---------------------|--------------------------------|
| 1                   | 5                              |
| 2                   | 10                             |
| 3                   | 23                             |
| 4                   | 5                              |
| 5                   | 119                            |
| 6                   | 15                             |
| 7                   | 83                             |
| 8                   | 98                             |
| 9                   | 89                             |
| 10                  | 71                             |
| 25                  | 21                             |
| 50                  | 35                             |
| 75                  | 2                              |
| 100                 | 12                             |

Comment 3 Tavian Barnes 2025-01-21 16:15:43 UTC

I think this should fix it:

diff --git a/coregrind/m_syswrap/syswrap-linux.c b/coregrind/m_syswrap/syswrap-linux.c
index e8978b5bc..83af91344 100644
--- a/coregrind/m_syswrap/syswrap-linux.c
+++ b/coregrind/m_syswrap/syswrap-linux.c
@@ -3244,6 +3244,7 @@ PRE(sys_timer_delete)
 {
    PRINT("sys_timer_delete( %#" FMT_REGWORD "x )", ARG1);
    PRE_REG_READ1(long, "timer_delete", vki_timer_t, timerid);
+   *flags |= SfPollAfter;
 }
 
 /* ---------------------------------------------------------------------

Comment 4 Paul Floyd 2025-01-21 19:34:12 UTC

This works for me on FreeBSD

diff --git a/coregrind/m_syswrap/syswrap-freebsd.c b/coregrind/m_syswrap/syswrap-freebsd.c
index d358c90eb..339724938 100644
--- a/coregrind/m_syswrap/syswrap-freebsd.c
+++ b/coregrind/m_syswrap/syswrap-freebsd.c
@@ -2411,6 +2411,7 @@ PRE(sys_timer_delete)
 {
    PRINT("sys_timer_delete( %#" FMT_REGWORD "x )", ARG1);
    PRE_REG_READ1(long, "timer_delete", vki_timer_t, timerid);
+   *flags |= SfPollAfter;
 }
 
 // SYS_ktimer_settime   237
diff --git a/coregrind/m_syswrap/syswrap-generic.c b/coregrind/m_syswrap/syswrap-generic.c
index bfe4c6fe0..b76013dc8 100644
--- a/coregrind/m_syswrap/syswrap-generic.c
+++ b/coregrind/m_syswrap/syswrap-generic.c
@@ -2967,6 +2967,7 @@ PRE(sys_setitimer)
       PRE_timeval_WRITE( "setitimer(&ovalue->it_value)",
                          &(ovalue->it_value));
    }
+   *flags |= SfPollAfter;
 }
 
 POST(sys_setitimer)

For this testcase, it's the setitimer that seems to matter. Not sure if that should be for all values or just for a new time of 0 to turn off the itimer.

Comment 5 Paul Floyd 2025-01-21 20:38:30 UTC

And on Fedora (same PC) I had to increase the timer period to reproduce the problem. The same change  PRE(sys_setitimer) also seems to fix it.

Comment 6 Paul Floyd 2025-01-21 20:40:07 UTC

(In reply to Tavian Barnes from comment #3)
> I think this should fix it:
> 
> diff --git a/coregrind/m_syswrap/syswrap-linux.c
> b/coregrind/m_syswrap/syswrap-linux.c
> index e8978b5bc..83af91344 100644
> --- a/coregrind/m_syswrap/syswrap-linux.c
> +++ b/coregrind/m_syswrap/syswrap-linux.c
> @@ -3244,6 +3244,7 @@ PRE(sys_timer_delete)
>  {
>     PRINT("sys_timer_delete( %#" FMT_REGWORD "x )", ARG1);
>     PRE_REG_READ1(long, "timer_delete", vki_timer_t, timerid);
> +   *flags |= SfPollAfter;
>  }
>  
>  /* ---------------------------------------------------------------------

Hmm. If you run the testcase with strace do you see calls to timer_delete rather than setitimer?

Comment 7 Tavian Barnes 2025-01-22 18:36:03 UTC

(In reply to Paul Floyd from comment #6)
> Hmm. If you run the testcase with strace do you see calls to timer_delete
> rather than setitimer?

No, actually I did not run the attached test case, just saw that the description matched my own test case that happens to use timer_*() rather than *itimer().  Presumably the same fix should be applied to setitimer() and timer_settime() as well.

Comment 8 Paul Floyd 2025-01-25 12:10:02 UTC

OK, I modified the testcase to use timer_create/timer_delete and reproduced both the error and the fix on Linux. I couldn't reproduce the error with FreeBSD. I'll try to push a fix and 2 testcases this weekend.

Comment 9 Paul Floyd 2025-01-25 16:56:17 UTC

Thanks for the suggestion to use SfPollAfter.

commit 3a1498f13f1676a20ab80fc553fd4f6b94b3a347 (HEAD -> master, origin/users/paulf/try-bug492678, origin/master, origin/HEAD, bug492678)
Author: Paul Floyd <pjfloyd@wanadoo.fr>
Date:   Sat Jan 25 14:54:08 2025 +0100

    Bug 492678 - SIGALRM race condition (sighandler called after timer disarmed

Comment 10 Tavian Barnes 2025-01-26 00:05:04 UTC

Awesome, thanks for the quick fix