Bug 339160 - Running signal handler with alternate stack allocated on current stack crashes callgrind
Summary: Running signal handler with alternate stack allocated on current stack crashe...
Status: CONFIRMED
Alias: None
Product: valgrind
Classification: Developer tools
Component: callgrind (other bugs)
Version First Reported In: unspecified
Platform: Other Linux
: NOR crash
Target Milestone: ---
Assignee: Josef Weidendorfer
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-09-17 23:05 UTC by Josef Weidendorfer
Modified: 2025-03-06 21:02 UTC (History)
6 users (show)

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:


Attachments
test case (708 bytes, text/x-csrc)
2014-09-17 23:05 UTC, Josef Weidendorfer
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Josef Weidendorfer 2014-09-17 23:05:05 UTC
Created attachment 88733 [details]
test case

This may be the same reason as in bug 249435.

Compile test case with "cc t.c -o t".
Running "valgrind --tool=callgrind ./t" results in

==29303== Callgrind, a call-graph generating cache profiler
==29303== Copyright (C) 2002-2013, and GNU GPL'd, by Josef Weidendorfer et al.
==29303== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info
==29303== Command: ./t
==29303== 
==29303== For interactive control, run 'callgrind_control -h'.
Got signal 10 with SP=0xffeffe46c
0x0000000000036c30
BB# 28778

Callgrind: threads.c:246 (vgCallgrind_post_signal): Assertion 'sigNum == vgCallgrind_current_state.sig' failed.

host stacktrace:
==29303==    at 0x3803C11E: show_sched_status_wrk (m_libcassert.c:317)
==29303==    by 0x3803C234: report_and_quit (m_libcassert.c:376)
==29303==    by 0x3803C3B6: vgPlain_assert_fail (m_libcassert.c:441)
==29303==    by 0x38039CF2: vgCallgrind_post_signal (threads.c:246)
==29303==    by 0x380BC105: vgSysWrap_amd64_linux_sys_rt_sigreturn_before (syswrap-amd64-linux.c:501)
==29303==    by 0x3808CC64: vgPlain_client_syscall (syswrap-main.c:1586)
==29303==    by 0x380896A2: handle_syscall (scheduler.c:1086)
==29303==    by 0x3808ABD6: vgPlain_scheduler (scheduler.c:1392)
==29303==    by 0x3809A11C: run_a_thread_NORETURN (syswrap-linux.c:103)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable
==29303==    at 0x4C5DBB9: raise (raise.c:56)
==29303==    by 0x40084B: main (in /home/weidendo/tmp/vgaltstack/t)
Comment 1 Josef Weidendorfer 2014-09-17 23:19:56 UTC
This seems to screw up callgrind's internal shadow stack synchronization mechanism. When doing a return from the signal handler, we still are on the alternate stack which actually is within an old stack frame. Callgrind seems to confuse this with a longjmp, doing an automatic internal unwinding of its shadow stack down to the frame where the alternate stack was allocated from. Afterwards, Valgrind notifies Callgrind about leaving the signal handler, which does not match with callgrind's internal shadow stack (the signal frame got already unwound), resulting in the failed assertion.

Hmm. It seems that there needs to be a special case for this in shadow stack maintanance...
Comment 2 Julian Seward 2017-11-15 13:56:49 UTC
Josef, what is the possibility to get this fixed?  I have been running
firefox builds for 64-bit Windows, on Wine, on Callgrind, because I want
to profile them, and it fails for me in the same way, alas.

If it can't be easily fixed, is there any possible workaround?  I don't
care if I get slightly inaccurate profile results for a while, if that's
the short-term cost to keep Callgrind alive on such a workload.
Comment 3 Johannes Jordan 2019-09-20 18:14:47 UTC
It appears that because of this problem, we are still not able to run callgrind on mingw32-w64 builds with Wine.
Comment 4 Anatolii 2025-03-06 00:01:17 UTC
Hello.
I encountered to the same problem while analyzing Nomad application via Callgrind tool.

## Bug description 

Valgrind versions tested: 
- valgrind-3.24.0 (Arch Linux, pacman)
  commit `2d09ef48e` from the master branch.

Reproduction command and the resulting log:

```
# valgrind --tool=callgrind /tmp/nomad -h

...

Callgrind: threads.c:247 (vgCallgrind_post_signal): Assertion 'sigNum == vgCallgrind_current_state.sig' failed.

host stacktrace:
==486603==    at 0x5801F79F: show_sched_status_wrk.lto_priv.0 (m_libcassert.c:407)
==486603==    by 0x5801F7FF: report_and_quit (m_libcassert.c:478)
==486603==    by 0x5801F992: vgPlain_assert_fail (m_libcassert.c:544)
==486603==    by 0x58023021: vgCallgrind_post_signal (threads.c:247)
==486603==    by 0x5807A256: vgSysWrap_amd64_linux_sys_rt_sigreturn_before (sigframe-amd64-linux.c:599)
==486603==    by 0x581DFD54: handle_syscall.isra.0 (syswrap-main.c:2358)
==486603==    by 0x5806956E: vgPlain_scheduler (scheduler.c:1582)
==486603==    by 0x580B30A9: run_a_thread_NORETURN (syswrap-linux.c:102)
```

## Localization

I attempted to isolate the location of the bug and to determine the root cause. In my opinion, I achieved some noticeable results.

I have inserted additional debug prints (using `VG_(printf)`) into the functions listed below. Therefore, I am able to capture more necessary context. See attached file [debug-prints-339160-commit-2d09ef48e.patch](https://gist.github.com/Toliak/dc9d49424f2ce27cbc7a00b7de3dc2eb#file-debug-prints-339160-commit-2d09ef48e-patch).
- [unwind_thread inside the loop](https://sourceware.org/git/?p=valgrind.git;a=blob;f=callgrind/main.c;h=39aa230ed2716a41f1c49bf2f7cc44167bdd18ff;hb=2d09ef48e22061b143cd4765e8defb63f5d5024a#l1432)
- [CLG_(pop_call_stack)](https://sourceware.org/git/?p=valgrind.git;a=blob;f=callgrind/callstack.c;h=fc1d6eeab130f94c382062c7071e15311cb953f9;hb=2d09ef48e22061b143cd4765e8defb63f5d5024a#l321)
- [CLG_(pre_signal)](https://sourceware.org/git/?p=valgrind.git;a=blob;f=callgrind/threads.c;h=624588a2694927409feeba832335043d181b31b9;hb=2d09ef48e22061b143cd4765e8defb63f5d5024a#l201)
- [CLG_(run_post_signal_on_call_stack_bottom)](https://sourceware.org/git/?p=valgrind.git;a=blob;f=callgrind/threads.c;h=624588a2694927409feeba832335043d181b31b9;hb=2d09ef48e22061b143cd4765e8defb63f5d5024a#l234)
- [CLG_(post_signal)](https://sourceware.org/git/?p=valgrind.git;a=blob;f=callgrind/threads.c;h=624588a2694927409feeba832335043d181b31b9;hb=2d09ef48e22061b143cd4765e8defb63f5d5024a#l247)
- [VG_(sigframe_destroy)](https://sourceware.org/git/?p=valgrind.git;a=blob;f=coregrind/m_sigframe/sigframe-amd64-linux.c;h=0e24306096c7577f959b5c38c4af143e0f77a93d;hb=2d09ef48e22061b143cd4765e8defb63f5d5024a#l598) 

The callgrind output with additional debug print:

```
# ./dist/bin/valgrind --tool=callgrind /tmp/nomad -h
==575331== Callgrind, a call-graph generating cache profiler
==575331== Copyright (C) 2002-2017, and GNU GPL'd, by Josef Weidendorfer et al.
==575331== Using Valgrind-3.25.0.GIT and LibVEX; rerun with -h for copyright info
==575331== Command: /tmp/nomad -h
==575331== 
==575331== For interactive control, run 'callgrind_control -h'.
BB# 4793874
-- pre_signal(TID 1, sig 23, alt_st yes)
=== pop_call_stack current_tid=1  CLG_(current_state).sig=23
=== run_post_signal_on_call_stack_bottom
-- post_signal(TID 1, sig 23)
=== post_signal after CLG_(switch_thread)(tid); current_tid=1 tid=1 sigNum=23 CLG_(current_state).sig=23
=== post_signal before exec_state_restore es->sig=0
=== sigframe_destroy before VG_TRACK( post_deliver_signal, tid, sigNo )
BB# 4793961
-- post_signal(TID 1, sig 23)
=== post_signal after CLG_(switch_thread)(tid); current_tid=1 tid=1 sigNum=23 CLG_(current_state).sig=0
0x000000000003dcd0

Callgrind: threads.c:262 (vgCallgrind_post_signal): Assertion 'sigNum == vgCallgrind_current_state.sig' failed.
```

Two points to consider:
1. There is the second `post_signal` call before the assertion fail. During the second call variable `CLG_(current_state)` appears reset, but the `sigNum` (post_signal argument) still retains old value "23".
2. `post_signal` is called twice. The first trace is `pop_call_stack -> run_post_signal_on_call_stack_bottom -> post_signal`. And the second: `sigframe_destroy -> track_post_deliver_signal -> post_signal`.

I have not investigated this bug further, but based on the information above, I suppose that the signal is processed (called `post_signal`) twice, which is incorrect behavior. The first processing may be valid, but the second processing causes Callgrind to crash with an assertion error.

## Workaround

I wrote a workaround in order to bypass the second `post_signal` (wrong one), that is being called from the `sigframe_destroy`. I added a new function, `CLG_(post_signal_delivered)`, which wraps `CLG_(post_signal)` and checks is `CLG_(current_state).sig` equal to zero.

I have prepared a patch-file (tested only on commit `2d09ef48e`).
See file [bug-339160-workaround-commit-2d09ef48e] (пер. https://gist.github.com/Toliak/dc9d49424f2ce27cbc7a00b7de3dc2eb#file-bug-339160-workaround-commit-2d09ef48e-patch).

The results:

```
# ./dist/bin/valgrind --tool=callgrind /tmp/nomad -h
...

Triggered workaround to bypass CLG_(post_signal), see https://bugs.kde.org/show_bug.cgi?id=339160
Triggered workaround to bypass CLG_(post_signal), see https://bugs.kde.org/show_bug.cgi?id=339160
Triggered workaround to bypass CLG_(post_signal), see https://bugs.kde.org/show_bug.cgi?id=339160
Usage: nomad [-version] [-help] [-autocomplete-(un)install] <command> [args]

Common commands:
    run         Run a new job or update an existing job
    ......
    agent       Runs a Nomad agent

Other commands:
    acl                 Interact with ACL policies and tokens
    action              Run a pre-defined command from a given context
    ......
    version             Prints the Nomad version
    volume              Interact with volumes
==639712== 
==639712== Events    : Ir
==639712== Collected : 619290724
==639712== 
==639712== I   refs:      619,290,724
```

## Validation and discussion

I think my workaround cannot be treated as a bugfix. So, I provided patch-file here without making Merge Request.
However, if you encounter the same problem, you can apply the patch and test the Callgrind behavior.

To fully resolve the issue, the next steps should focus on identifying why `post_signal` is being called twice.