Version: 3.2.3 (using KDE 3.5.1)
Installed from: Fedora RPMs
Compiler: gcc version 4.1.0 20060304 (Red Hat 4.1.0-3)
OS: Linux

While running callgrind I received this error:

valgrind: m_scheduler/scheduler.c:996 (vgPlain_scheduler): the 'impossible' happened.
valgrind: VG_(scheduler), phase 3: run_innerloop detected host state invariant failure
==2159==    at 0x38018B07: report_and_quit (m_libcassert.c:136)
==2159==    by 0x38018E6A: vgPlain_assert_fail (m_libcassert.c:200)
==2159==    by 0x38037666: vgPlain_scheduler (scheduler.c:994)
==2159==    by 0x38052A6E: run_a_thread_NORETURN (syswrap-linux.c:87)

I'll post the full log output as an attachment.
Created attachment 20194 [details]
valgrind log output

Here is the full output from the valgrind run that crashed. The callgrind output ended up as an empty file, so there is nothing helpful there.
This is believed to be a bug in the Linux kernel for amd64, in that it does not save/restore the FPU state exactly right. Are you able to reproduce the problem? I'm sure this was filed as a kernel bug report, but I now can't find it, despite some digging.
The kernel bug is: http://bugzilla.kernel.org/show_bug.cgi?id=7223

It is fixed in 2.6.19 and some of the later 2.6.18 point releases, IIRC.
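For context on what "host state invariant failure" means here: before running a block of translated code, Valgrind records parts of the host FPU/SSE control state and asserts afterwards that they are unchanged. If the kernel mis-restores FPU state on a context switch, that assertion fires. A minimal conceptual sketch of the idea (not Valgrind's actual code; the function names are illustrative):

#include <assert.h>

/* Read the SSE control/status register (MXCSR). */
static unsigned int get_mxcsr(void)
{
   unsigned int mxcsr;
   __asm__ __volatile__("stmxcsr %0" : "=m"(mxcsr));
   return mxcsr;
}

/* Hypothetical wrapper around one trip through the dispatch loop. */
static void run_translated_block(void (*block)(void))
{
   unsigned int before = get_mxcsr();
   block();   /* run a block of generated code */
   /* The generated code must leave the host FPU/SSE control state
    * untouched; a kernel that corrupts FPU state across a context
    * switch occurring inside block() violates this invariant. */
   assert(get_mxcsr() == before);
}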
This is a kernel bug, not a Valgrind bug.
*** Bug 169693 has been marked as a duplicate of this bug. ***
*** Bug 176641 has been marked as a duplicate of this bug. ***
FWIW, something very similar to this just happened to me on a 2.6.28 kernel.

valgrind: m_scheduler/scheduler.c:1144 (vgPlain_scheduler): the 'impossible' happened.
valgrind: VG_(scheduler), phase 3: run_innerloop detected host state invariant failure
==11229==    at 0x3802A7AC: report_and_quit (m_libcassert.c:140)
==11229==    by 0x3802AABA: vgPlain_assert_fail (m_libcassert.c:205)
==11229==    by 0x3804E283: vgPlain_scheduler (scheduler.c:1165)
==11229==    by 0x38060CB0: run_a_thread_NORETURN (syswrap-linux.c:89)

So perhaps it was fixed in the kernel but then re-appeared. Or perhaps this is a different bug.

valgrind --version: valgrind-3.4.1-Debian
uname -srm: Linux 2.6.28-11-generic x86_64

This is running under VMWare Fusion, in case that might be relevant.
Update: on July 17, 2009 this issue reoccurred on a non-virtual system with a 2.6.28 Debian kernel (Ubuntu 9.04). See also http://sourceforge.net/mailarchive/forum.php?thread_name=20090716162641.GA7107%40ocean&forum_name=valgrind-developers.
(In reply to comment #8)

Bart, is this repeatable? I think (not sure) that one of the symptoms of the previous version of this problem is that the failure is not reliably repeatable, because it depends on how the process is scheduled/descheduled.

Considering comments #7 and #8, it looks like this bug is back from the dead in kernel 2.6.28. Maybe http://bugzilla.kernel.org/show_bug.cgi?id=7223 should be reopened?
(In reply to comment #9)
> (In reply to comment #8)
> Bart, is this repeatable? I think (not sure) that one of the symptoms of
> the previous version of this problem is that the failure is not reliably
> repeatable, because it depends on how the process is scheduled/descheduled.
>
> Considering comments #7 and #8, it looks like this bug is back from the
> dead in kernel 2.6.28. Maybe http://bugzilla.kernel.org/show_bug.cgi?id=7223
> should be reopened?

I have tried to reproduce kernel bug #7223 on a vanilla 2.6.28.10 kernel with the testfpu program attached to that bug report, but without success so far. We need a way to reproduce the CPU state corruption before reopening the kernel bug.
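For readers without access to the kernel bug attachment, the general shape of such a reproducer (an illustrative sketch, not the actual testfpu program) is to park a known bit pattern in an SSE register and verify it in a tight loop while the machine is kept busy with interrupts and context switches:

#include <emmintrin.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
   const double pattern[2] = { 3.141592653589793, 2.718281828459045 };
   double out[2];
   __m128d reg = _mm_loadu_pd(pattern);

   for (;;) {
      /* Empty asm keeps 'reg' live in an XMM register, so the kernel
       * must save/restore it across every context switch. */
      __asm__ __volatile__("" : "+x"(reg));
      _mm_storeu_pd(out, reg);
      if (memcmp(out, pattern, sizeof out) != 0) {
         fprintf(stderr, "FPU state corruption detected\n");
         exit(1);
      }
   }
}

Several copies of this would be run in parallel with a heavy background load, so that a single bad save/restore eventually trips the comparison.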
By the way, the test I ran in parallel to the testfpu program (ten processes performing network I/O over an InfiniBand connection) triggered about 170,000 interrupts per second and about 200,000 context switches per second.
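For reference, those rates come from the cumulative "intr" and "ctxt" counters in /proc/stat, sampled one second apart (vmstat 1 reports the same figures); a minimal sketch of the measurement:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void read_counters(unsigned long long *intr, unsigned long long *ctxt)
{
   char line[8192];
   FILE *f = fopen("/proc/stat", "r");
   if (!f)
      return;
   while (fgets(line, sizeof line, f)) {
      /* "intr <total> <per-irq...>" and "ctxt <total>" are cumulative
       * counts since boot. */
      if (strncmp(line, "intr ", 5) == 0)
         sscanf(line + 5, "%llu", intr);
      else if (strncmp(line, "ctxt ", 5) == 0)
         sscanf(line + 5, "%llu", ctxt);
   }
   fclose(f);
}

int main(void)
{
   unsigned long long i0 = 0, c0 = 0, i1 = 0, c1 = 0;
   read_counters(&i0, &c0);
   sleep(1);
   read_counters(&i1, &c1);
   printf("interrupts/s: %llu, context switches/s: %llu\n",
          i1 - i0, c1 - c0);
   return 0;
}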
The system described in comment #8 is mine. The crash occurs very occasionally on my machine, but it's not at all repeatable :(
(In reply to comment #7)
> This is running under VMWare Fusion, in case that might be relevant.

I have now reproduced this on a non-virtual system. It is not 100% reproducible on either virtual or physical machines, but it can be reproduced in a day or so.

Is there anything I can do to help debug this? E.g., this happens about half the time in our nightly buildbot testing, so I would be glad to run a specially instrumented valgrind build on our Ubuntu 9.04/64 buildbot slave and send any logs or core dumps it produces.
(In reply to comment #13)
> I have now reproduced this on a non-virtual system. It is not 100%
> reproducible on either virtual or physical machines, but it can be
> reproduced in a day or so.
>
> Is there anything I can do to help debug this? E.g., this happens about half
> the time in our nightly buildbot testing, so I would be glad to run a
> specially instrumented valgrind build on our Ubuntu 9.04/64 buildbot slave
> and send any logs or core dumps it produces.

It is great news that you found a way to reproduce this issue systematically. Before reopening the kernel bugzilla entry, can you please verify whether the issue also occurs with a vanilla 2.6.30.2 kernel? You can find instructions to install the 2.6.30.2 kernel image provided by Ubuntu here: https://wiki.ubuntu.com/KernelTeam/MainlineBuilds.
Occurred again on my machine (see comment 8) last night, for none/tests/async-sigs.c:

testing: blocking=0 caught=11 fatal=7...
valgrind: m_scheduler/scheduler.c:1199 (vgPlain_scheduler): the 'impossible' happened.
valgrind: VG_(scheduler), phase 3: run_innerloop detected host state invariant failure
   at 0x........: report_and_quit (m_libcassert.c:145)
   by 0x........: vgPlain_assert_fail (m_libcassert.c:217)
   by 0x........: vgPlain_scheduler (scheduler.c:1224)
   by 0x........: run_a_thread_NORETURN (syswrap-linux.c:91)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable
   at 0x........: test (async-sigs.c:94)
   by 0x........: main (async-sigs.c:129)

Note: see also the FAQ in the source distribution. It contains workarounds to several common problems. In particular, if Valgrind aborted or crashed after identifying problems in your program, there's a good chance that fixing those problems will prevent Valgrind aborting or crashing, especially if it happened in m_mallocfree.c.

If that doesn't help, please report this bug to: www.valgrind.org
In the bug report, send all the above text, the valgrind version, and what OS and version you are using. Thanks.

FAILED: child exited with unexpected status exit 1
testing: blocking=0 caught=11 fatal=1... PASSED
testing: blocking=0 caught=10 fatal=7... PASSED
testing: blocking=0 caught=10 fatal=1... PASSED
testing: blocking=1 caught=11 fatal=7... PASSED
testing: blocking=1 caught=11 fatal=1... PASSED
testing: blocking=1 caught=10 fatal=7... PASSED
testing: blocking=1 caught=10 fatal=1... PASSED
(In reply to comment #14)
> (In reply to comment #13)
> Before reopening the kernel bugzilla entry, can you please verify whether the
> issue also occurs with a vanilla 2.6.30.2 kernel?

Just to let you know, I don't have any results yet but am trying to reproduce this on the 2.6.30.2 kernel as you suggest.
(In reply to comment #16)
> Just to let you know, I don't have any results yet but am trying to reproduce
> this on the 2.6.30.2 kernel as you suggest.

Julian, is it necessary to continue these tests after the fix that was committed on July 22, 2009 (VEX r1912, http://sourceforge.net/mailarchive/forum.php?thread_name=20090722110618.3D904108852%40jail0086.vps.exonetric.net&forum_name=valgrind-developers)?
Yeah, I was wondering whether r1912 fixed this problem or something else. But I saw the failure on July 30 (comment 15), which was after the July 22 commit, so I figure it's not fixed.
(In reply to comment #16)
> (In reply to comment #14)
> > (In reply to comment #13)
> > Before reopening the kernel bugzilla entry, can you please verify whether
> > the issue also occurs with a vanilla 2.6.30.2 kernel?
>
> Just to let you know, I don't have any results yet but am trying to reproduce
> this on the 2.6.30.2 kernel as you suggest.

Sorry to have taken so many months to return to this. It takes a lot of testing to convince oneself that a problem like this is gone. It appears to be gone in the 2.6.30.2 kernel. Once we upgraded all of our Ubuntu 9.04 machines to the 2.6.30.2 kernel, we never saw it again, and we used to see it with pretty regular frequency.
(In reply to comment #19)
> Sorry to have taken so many months to return to this. It takes a lot of
> testing to convince oneself that a problem like this is gone. It appears to
> be gone in the 2.6.30.2 kernel. Once we upgraded all of our Ubuntu 9.04
> machines to the 2.6.30.2 kernel, we never saw it again, and we used to see
> it with pretty regular frequency.

Thanks for the update.