507188 – memcheck with track-fds=yes on x86 with popen: Assertion 'n_ips >= 1 && n_ips <= VG_(clo_backtrace_size)' failed.


Status:	RESOLVED FIXED

Alias:	None

Product:	valgrind
Classification:	Developer tools
Component:	memcheck (other bugs)
Version First Reported In:	3.25 GIT
Platform:	Debian stable Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	Julian Seward

Reported:	2025-07-18 10:57 UTC by Mike Crowe
Modified:	2025-10-20 10:59 UTC
CC List:	1 user

Attachments
Verbose debug output when reproducing the problem (33.28 KB, application/gzip) 2025-07-18 10:57 UTC, Mike Crowe
Test program to reproduce (540 bytes, application/gzip) 2025-07-18 10:57 UTC, Mike Crowe

Description Mike Crowe 2025-07-18 10:57:12 UTC

Created attachment 183319
Verbose debug output when reproducing the problem

I'm seeing an assertion failure inside Memcheck with both v3.25.1 and master at 4ecf8d2832530de0904803c772126aabcf8fb075 on Debian 12 i686:

 valgrind: m_execontext.c:471 (record_ExeContext_wrk2): Assertion 'n_ips >= 1 && n_ips <= VG_(clo_backtrace_size)' failed.

when running a test program that uses `popen`:

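 #include <assert.h>
 #include <inttypes.h>
 #include <stdint.h>
 #include <stdio.h>

 int main(void)
 {
     /* Read the first number printed by du from the pipe. */
     FILE *fp = popen("du -s .\n", "r");
     assert(fp);
     uint64_t result;
     /* SCNu64 is the scanf-family counterpart of PRIu64. */
     assert(fscanf(fp, "%" SCNu64, &result) == 1);
     pclose(fp);
 }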

with:
 valgrind --tool=memcheck --track-fds=yes ./reproduce

Tweaking the assert showed that n_ips == 0.

After the assertion failure, execution continues and the assert in the test program fails too, because fscanf returns -1. This doesn't happen when the program is run outside Valgrind, so I think that the failing Valgrind assert has lasting effects.
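
To see why fscanf fails rather than just asserting on it, a variant of the reproducer can report the stream state. This is only a sketch: the helper name `read_u64_from_command` is hypothetical and not part of the attached test program, and it assumes a POSIX system where `popen` is available.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: run cmd through popen and parse one unsigned
   integer from its output.  Returns 0 on success, -1 on failure after
   printing a diagnostic to stderr. */
static int read_u64_from_command(const char *cmd, uint64_t *out)
{
    assert(cmd != NULL && out != NULL);

    FILE *fp = popen(cmd, "r");
    if (fp == NULL) {
        fprintf(stderr, "popen failed: %s\n", strerror(errno));
        return -1;
    }

    errno = 0;
    int matched = fscanf(fp, "%" SCNu64, out);
    int saved_errno = errno;
    if (matched != 1) {
        /* EOF (typically -1) means input ended or failed before any
           conversion; feof/ferror tell the two cases apart. */
        fprintf(stderr, "fscanf returned %d (eof=%d, error=%d, errno: %s)\n",
                matched, feof(fp) != 0, ferror(fp) != 0,
                strerror(saved_errno));
    }

    if (pclose(fp) == -1)
        fprintf(stderr, "pclose failed: %s\n", strerror(errno));

    return matched == 1 ? 0 : -1;
}
```

Calling `read_u64_from_command("du -s .", &result)` under Valgrind should then show whether the pipe hit end-of-file or a read error at the point where the original assert fires.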

The similar bug https://bugs.kde.org/show_bug.cgi?id=391861 suggests that I should run with lots of verbosity and debugging. The result of that is attached, along with the full reproduction case.

Debian 12's Valgrind 3.19.0 runs the test case successfully. I can try to bisect if that would be useful.

Comment 1 Mike Crowe 2025-07-18 10:57:59 UTC

Created attachment 183320
Test program to reproduce

Comment 2 Mark Wielaard 2025-07-19 23:38:30 UTC

Could reproduce on Debian i686 (but not on any other arch).
This issue seems to have been triggered by this commit:

commit 41441379baa63b5471385361d08c8df317705b69
Author: Mark Wielaard <mark@klomp.org>
Date:   Sun Mar 30 17:38:21 2025 +0200

    Handle top __syscall_cancel frames when getting stack traces
    
    Since glibc 2.41 there are extra frames inserted before doing a
    syscall to support proper thread cancellation.  This breaks various
    suppressions and regtests involving checking syscall arguments.
    
    Solve this by removing those extra frames from the top of the call
    stack when we are processing a linux system call.
    
    https://bugs.kde.org/show_bug.cgi?id=502126

That commit also removes the _dl_sysinfo_int80 frame from the stack trace.
It looks like, for some reason, nothing is left after that, so n_ips == 0, which triggers the assert.
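
A simplified model of that failure mode, as a sketch only. It is not the Valgrind source: the function names are hypothetical, frames are represented by their symbol names, and it assumes the unwinder recovered nothing except the syscall stub.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for Valgrind's symbol lookup: names treated as
   syscall wrapper frames that get peeled from the top of the trace. */
static int is_syscall_wrapper(const char *fn)
{
    return strcmp(fn, "__syscall_cancel") == 0
        || strcmp(fn, "_dl_sysinfo_int80") == 0;
}

/* Peel leading wrapper frames with no lower bound on what remains.
   Returns the number of frames left in the trace. */
static int peel_unbounded(const char **frames, int found)
{
    assert(frames != NULL && found >= 0);

    int start = 0;
    for (int i = 0; i < found; i++) {
        if (!is_syscall_wrapper(frames[i]))
            break;
        start++;
    }

    /* Shift the surviving frames to the front of the array. */
    memmove(frames, frames + start,
            (size_t)(found - start) * sizeof *frames);
    return found - start;
}

/* Mirrors the condition checked by the failing assertion. */
static int n_ips_is_valid(int n_ips, int backtrace_size)
{
    return n_ips >= 1 && n_ips <= backtrace_size;
}
```

With a one-element trace containing only `"_dl_sysinfo_int80"`, `peel_unbounded` returns 0, and `n_ips_is_valid(0, size)` is false for any positive `size`, which matches the n_ips == 0 observed by the reporter.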

Comment 3 Mike Crowe 2025-07-20 10:21:19 UTC

Thank you for investigating.

I can confirm that reverting 4ecf8d2832530de0904803c772126aabcf8fb075 resolves the problem on Debian 12 (glibc 2.36), and the OpenEmbedded-based system (glibc 2.39) that I'm using.

Comment 4 Mike Crowe 2025-07-20 10:22:57 UTC

(In reply to Mike Crowe from comment #3)
> I can confirm that reverting 4ecf8d2832530de0904803c772126aabcf8fb075
> resolves the problem on Debian 12 (glibc 2.36), and the OpenEmbedded-based
> system (glibc 2.39) that I'm using.

I copied and pasted the wrong commit hash. :( Of course I meant 41441379baa63b5471385361d08c8df317705b69.

Comment 5 Mark Wielaard 2025-10-17 16:35:03 UTC

In the end this turned out to be a very simple fix:

-      for (i = 0; i < found; i++) {
+      /* We want to keep at least one frame.  */
+      for (i = 0; i < found - 1; i++) {

Sorry this took so long to resolve.

commit a4593438d9fb95bae841531bd70a9217818c482b
Author: Mark Wielaard <mark@klomp.org>
Date:   Fri Oct 17 18:23:58 2025 +0200

    Keep at least one frame while peeling syscall frames
    
    VG_(get_StackTrace_with_deltas) might peel extra glibc syscall
    (cancel) frames. But if the backtrace failed, or only contains such
    syscall frames then we should keep at least one (the initial frame will
    always be there). Various routines expect n_ips of a Stacktrace to be
    at least 1.
    
    https://bugs.kde.org/show_bug.cgi?id=507188
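
The effect of the `found - 1` bound can be illustrated with a small standalone model. This is a sketch under assumptions, not the Valgrind code: `is_peelable_frame` and `peel_keep_one` are hypothetical names, and frames are represented by symbol names instead of instruction pointers.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical predicate: frames that the peeling step would drop. */
static int is_peelable_frame(const char *fn)
{
    return strcmp(fn, "__syscall_cancel") == 0
        || strcmp(fn, "_dl_sysinfo_int80") == 0;
}

/* Peel leading syscall frames but never inspect the last one, so the
   result always has at least one frame when found >= 1. */
static int peel_keep_one(const char **frames, int found)
{
    assert(frames != NULL && found >= 1);

    int start = 0;
    /* We want to keep at least one frame. */
    for (int i = 0; i < found - 1; i++) {
        if (!is_peelable_frame(frames[i]))
            break;
        start++;
    }

    /* Shift the surviving frames to the front of the array. */
    memmove(frames, frames + start,
            (size_t)(found - start) * sizeof *frames);
    return found - start;
}
```

In this model a trace consisting only of peelable frames keeps its last frame, so the returned count is never below 1, while a trace such as `__syscall_cancel`, `read`, `main` still loses only its leading wrapper frame.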

Comment 6 Mike Crowe 2025-10-20 10:59:42 UTC

a4593438d9fb95bae841531bd70a9217818c482b on top of 3.25.1 fixes the problem for me. Thanks!