443605 – Don't call final_tidyup (__libc_freeres) on FatalSignal

Bug 443605 - Don't call final_tidyup (__libc_freeres) on FatalSignal

Summary: Don't call final_tidyup (__libc_freeres) on FatalSignal

Status:	RESOLVED FIXED

Alias:	None

Product:	valgrind
Classification:	Developer tools
Component:	general (show other bugs)
Version:	unspecified
Platform:	Other Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	Julian Seward

URL:
Keywords:

Depends on:
Blocks:

Reported:	2021-10-11 16:37 UTC by Mark Wielaard
Modified:	2021-10-12 20:59 UTC (History)
CC List:	0 users

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:

Attachments
Add an attachment

Note You need to log in before you can comment on or make changes to this bug.

Description Mark Wielaard 2021-10-11 16:37:26 UTC

When a program gets a fatal signal (one it doesn't handle) valgrind terminates the program. Before termination it will try to call final_tidyup which tries to run __libc_freeres and __gnu_cxx::__freeres to get rid of some memory glibc or libstdc++ don't normally release.

But when the program got the fatal signal in a critical section inside glibc it might leave the datastructures in a bad state and cause __libc_freeres to crash.  This makes valgrind itself crash just before producing its own error summary, making the valgrind run unusable.

A reproducer can found at https://bugzilla.redhat.com/show_bug.cgi?id=1952836 and https://bugzilla.redhat.com/show_bug.cgi?id=1225994#c7

This reproducer is really a worse case scenario with multiple threads racing to get into the critical section that when interrupted will make __libc_freeres unable to cleanup. But it seems a good policy in general. If a program is terminated by a fatal signal instead of normal termination, it seems not having some of the glibc/libstdc++ resource cleaned up is an expected thing.

The following patch implements this:

diff --git a/coregrind/m_main.c b/coregrind/m_main.c
index 56f9c6cbf..70b6c0549 100644
--- a/coregrind/m_main.c
+++ b/coregrind/m_main.c
@@ -2168,6 +2168,7 @@ void shutdown_actions_NORETURN( ThreadId tid,
             || tids_schedretcode == VgSrc_ExitProcess
              || tids_schedretcode == VgSrc_FatalSig );
 
+   /* Try to do final tidyup on "normal" exit, not on FatalSig.  */
    if (tids_schedretcode == VgSrc_ExitThread) {
 
       // We are the last surviving thread.  Right?
@@ -2185,7 +2186,7 @@ void shutdown_actions_NORETURN( ThreadId tid,
       vg_assert(VG_(is_running_thread)(tid));
       vg_assert(VG_(count_living_threads)() == 1);
 
-   } else {
+   } else if (tids_schedretcode == VgSrc_ExitProcess) {
 
       // We may not be the last surviving thread.  However, we
       // want to shut down the entire process.  We hold the lock

Comment 1 Mark Wielaard 2021-10-12 20:59:56 UTC

commit cf9ebf8313952caed53394498fe849251f477c97
Author: Mark Wielaard <mark@klomp.org>
Date:   Tue Oct 12 22:47:57 2021 +0200

    coregrind: Don't call final_tidyup (__libc_freeres) on FatalSignal
    
    When a program gets a fatal signal (one it doesn't handle) valgrind
    terminates the program. Before termination it will try to call
    final_tidyup which tries to run __libc_freeres and
    __gnu_cxx::__freeres to get rid of some memory glibc or libstdc++
    don't normally release.
    
    But when the program got the fatal signal in a critical section inside
    glibc it might leave the datastructures in a bad state and cause
    __libc_freeres to crash.  This makes valgrind itself crash just before
    producing its own error summary, making the valgrind run unusable.
    
    A reproducer can found at
    https://bugzilla.redhat.com/show_bug.cgi?id=1952836 and
    https://bugzilla.redhat.com/show_bug.cgi?id=1225994#c7
    
    This reproducer is really a worse case scenario with multiple threads
    racing to get into the critical section that when interrupted will
    make __libc_freeres unable to cleanup. But it seems a good policy in
    general. If a program is terminated by a fatal signal instead of
    normal termination, it seems not having some of the glibc/libstdc++
    resource cleaned up is an expected thing.
    
    https://bugs.kde.org/show_bug.cgi?id=443605