| Summary: | SIGALRM race condition (sighandler called after timer disarmed) | ||
|---|---|---|---|
| Product: | [Developer tools] valgrind | Reporter: | Clément <clement.saccoccio> |
| Component: | general | Assignee: | Paul Floyd <pjfloyd> |
| Status: | RESOLVED FIXED | ||
| Severity: | normal | CC: | pjfloyd, tavianator |
| Priority: | NOR | ||
| Version First Reported In: | 3.18.1 | ||
| Target Milestone: | --- | ||
| Platform: | Other | ||
| OS: | Linux | ||
| Latest Commit: | Version Fixed/Implemented In: | ||
| Sentry Crash Report: | |||
| Attachments: | bug reproduction example | ||
Sounds like this is due to the way that Valgrind handles asynchronous signals. From memory, signals are masked in the host and get polled. So there may be a gap between the signal being sent and the host polling for it. Does lowering --vex-guest-max-insns help with this? This indeed seems to be related to how Valgrind handles interruptions. If there is a gap between the signal being sent and the handler being called, it's totally normal to observe this behavior. https://valgrind.org/docs/manual/mc-tech-docs.html#mc-tech-docs.signals I don't see any correlation with vex-guest-max-insns thought: | vex-guest-max-insns | failures out of 1.000.000 runs | |---------------------|--------------------------------| | 1 | 5 | | 2 | 10 | | 3 | 23 | | 4 | 5 | | 5 | 119 | | 6 | 15 | | 7 | 83 | | 8 | 98 | | 9 | 89 | | 10 | 71 | | 25 | 21 | | 50 | 35 | | 75 | 2 | | 100 | 12 | I think this should fix it:
diff --git a/coregrind/m_syswrap/syswrap-linux.c b/coregrind/m_syswrap/syswrap-linux.c
index e8978b5bc..83af91344 100644
--- a/coregrind/m_syswrap/syswrap-linux.c
+++ b/coregrind/m_syswrap/syswrap-linux.c
@@ -3244,6 +3244,7 @@ PRE(sys_timer_delete)
{
PRINT("sys_timer_delete( %#" FMT_REGWORD "x )", ARG1);
PRE_REG_READ1(long, "timer_delete", vki_timer_t, timerid);
+ *flags |= SfPollAfter;
}
/* ---------------------------------------------------------------------
This works for me on FreeBSD
diff --git a/coregrind/m_syswrap/syswrap-freebsd.c b/coregrind/m_syswrap/syswrap-freebsd.c
index d358c90eb..339724938 100644
--- a/coregrind/m_syswrap/syswrap-freebsd.c
+++ b/coregrind/m_syswrap/syswrap-freebsd.c
@@ -2411,6 +2411,7 @@ PRE(sys_timer_delete)
{
PRINT("sys_timer_delete( %#" FMT_REGWORD "x )", ARG1);
PRE_REG_READ1(long, "timer_delete", vki_timer_t, timerid);
+ *flags |= SfPollAfter;
}
// SYS_ktimer_settime 237
diff --git a/coregrind/m_syswrap/syswrap-generic.c b/coregrind/m_syswrap/syswrap-generic.c
index bfe4c6fe0..b76013dc8 100644
--- a/coregrind/m_syswrap/syswrap-generic.c
+++ b/coregrind/m_syswrap/syswrap-generic.c
@@ -2967,6 +2967,7 @@ PRE(sys_setitimer)
PRE_timeval_WRITE( "setitimer(&ovalue->it_value)",
&(ovalue->it_value));
}
+ *flags |= SfPollAfter;
}
POST(sys_setitimer)
For this testcase, it's the setitimer that seems to matter. Not sure if that should be for all values or just for a new time of 0 to turn off the itimer.
And on Fedora (same PC) I had to increase the timer period to reproduce the problem. The same change PRE(sys_setitimer) also seems to fix it. (In reply to Tavian Barnes from comment #3) > I think this should fix it: > > diff --git a/coregrind/m_syswrap/syswrap-linux.c > b/coregrind/m_syswrap/syswrap-linux.c > index e8978b5bc..83af91344 100644 > --- a/coregrind/m_syswrap/syswrap-linux.c > +++ b/coregrind/m_syswrap/syswrap-linux.c > @@ -3244,6 +3244,7 @@ PRE(sys_timer_delete) > { > PRINT("sys_timer_delete( %#" FMT_REGWORD "x )", ARG1); > PRE_REG_READ1(long, "timer_delete", vki_timer_t, timerid); > + *flags |= SfPollAfter; > } > > /* --------------------------------------------------------------------- Hmm. If you run the testcase with strace do you see calls to timer_delete rather than setitimer? (In reply to Paul Floyd from comment #6) > Hmm. If you run the testcase with strace do you see calls to timer_delete > rather than setitimer? No, actually I did not run the attached test case, just saw that the description matched my own test case that happens to use timer_*() rather than *itimer(). Presumably the same fix should be applied to setitimer() and timer_settime() as well. OK, I modified the testcase to use timer_create/timer_delete and reproduced both the error and the fix on Linux. I couldn't reproduce the error with FreeBSD. I'll try to push a fix and 2 testcases this weekend. Thanks for the suggestion to use SfPollAfter. commit 3a1498f13f1676a20ab80fc553fd4f6b94b3a347 (HEAD -> master, origin/users/paulf/try-bug492678, origin/master, origin/HEAD, bug492678) Author: Paul Floyd <pjfloyd@wanadoo.fr> Date: Sat Jan 25 14:54:08 2025 +0100 Bug 492678 - SIGALRM race condition (sighandler called after timer disarmed Awesome, thanks for the quick fix |
Created attachment 173345 [details] bug reproduction example SUMMARY There is a small chance that a SIGALRM is received AFTER that a timer has been disarmed. STEPS TO REPRODUCE 1. setup a SIGALRM sighandler 2. setup a SIGALRM timer with setitimer (works better with a small clock period) 3. disarm the SIGALRM timer 4. disable the SIGALRM sighangler to use SIG_DFL 5. run it with Valgrind or 1. compile the attached example 2. run it with Valgrind OBSERVED RESULT There is a small chance that the SIGALRM default handler is called after the timer has been disarmed. EXPECTED RESULT SIGALRM handler should not be called after the timer has been disarmed (which is what happen when not using Valgrind). SOFTWARE/OS VERSIONS Linux/Ubuntu: 6.8.0-40-generic #40~22.04.3-Ubuntu GCC: 11.4.0