143888 – m_scheduler/scheduler.c:996 (vgPlain_scheduler): the 'impossible' happened.

Status:	RESOLVED NOT A BUG

Alias:	None

Product:	valgrind
Classification:	Developer tools
Component:	general (other bugs)
Version First Reported In:	3.2.3
Platform:	Fedora RPMs Linux

Importance:	NOR crash
Target Milestone:	---
Assignee:	Julian Seward

Duplicates (2):	169693 176641

Reported:	2007-04-05 22:13 UTC by Jack Lloyd
Modified:	2009-11-04 08:07 UTC

Attachments
valgrind log output (16.56 KB, text/plain) 2007-04-05 22:14 UTC, Jack Lloyd

Description Jack Lloyd 2007-04-05 22:13:17 UTC

Version:           3.2.3 (using KDE 3.5.1)
Installed from:    Fedora RPMs
Compiler:          gcc version 4.1.0 20060304 (Red Hat 4.1.0-3) 
OS:                Linux

While running callgrind I received this error:

valgrind: m_scheduler/scheduler.c:996 (vgPlain_scheduler): the 'impossible' happened.
valgrind: VG_(scheduler), phase 3: run_innerloop detected host state invariant failure
==2159==    at 0x38018B07: report_and_quit (m_libcassert.c:136)
==2159==    by 0x38018E6A: vgPlain_assert_fail (m_libcassert.c:200)
==2159==    by 0x38037666: vgPlain_scheduler (scheduler.c:994)
==2159==    by 0x38052A6E: run_a_thread_NORETURN (syswrap-linux.c:87)

I'll post the full log output as an attachment.

Comment 1 Jack Lloyd 2007-04-05 22:14:38 UTC

Created attachment 20194
valgrind log output

Here is the full output from the valgrind run that crashed.

The callgrind output ended up as an empty file so nothing helpful there.

Comment 2 Julian Seward 2007-04-07 15:22:37 UTC

This is believed to be a bug in the Linux kernel for amd64, in that it does
not save/restore the FPU state exactly right.  Are you able to reproduce
the problem?  I'm sure this was filed as a kernel bug report, but I now
can't find it, despite some digging.

Comment 3 Tom Hughes 2007-04-09 14:04:38 UTC

The kernel bug is:

  http://bugzilla.kernel.org/show_bug.cgi?id=7223

It is fixed in 2.6.19 and some of the later 2.6.18 point releases, IIRC.
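
For anyone wanting to check a machine against the version mentioned above, here is a minimal Python sketch. It only compares the numeric prefix of a kernel release string with 2.6.19. The 2.6.18 point releases that carry the fix are not named in this thread, so they are not handled, and distribution kernels may carry backported patches that a version comparison cannot see. The names `FIXED_IN`, `parse_release` and `at_or_after_fix` are made up for this illustration.

```python
import re

# Assumption: 2.6.19 is the first release named in this thread as fixed.
FIXED_IN = (2, 6, 19)


def parse_release(release):
    """Return the leading dotted numbers of a release string as a tuple.

    For example, '2.6.28-11-generic' becomes (2, 6, 28).
    """
    match = re.match(r"(\d+(?:\.\d+)*)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def at_or_after_fix(release, fixed_in=FIXED_IN):
    """True if the release compares at or after `fixed_in`."""
    return parse_release(release) >= fixed_in
```

Passing the output of `uname -r` (or `platform.release()`) to `at_or_after_fix` gives a first rough answer; it says nothing about whether a later regression exists.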

Comment 4 Julian Seward 2007-05-01 11:04:28 UTC

This is a kernel bug, not a Valgrind bug.

Comment 5 Julian Seward 2008-08-24 01:42:54 UTC

*** Bug 169693 has been marked as a duplicate of this bug. ***

Comment 6 M Welinder 2008-12-23 19:30:19 UTC

*** Bug 176641 has been marked as a duplicate of this bug. ***

Comment 7 Ben Denckla 2009-05-29 22:11:14 UTC

FWIW, something very similar to this just happened to me on a 2.6.28 kernel.

valgrind: m_scheduler/scheduler.c:1144 (vgPlain_scheduler): the 'impossible' happened.
valgrind: VG_(scheduler), phase 3: run_innerloop detected host state invariant failure
==11229==    at 0x3802A7AC: report_and_quit (m_libcassert.c:140)
==11229==    by 0x3802AABA: vgPlain_assert_fail (m_libcassert.c:205)
==11229==    by 0x3804E283: vgPlain_scheduler (scheduler.c:1165)
==11229==    by 0x38060CB0: run_a_thread_NORETURN (syswrap-linux.c:89)

So perhaps it was fixed in the kernel but then re-appeared.  Or perhaps this is a different bug.

valgrind --version: valgrind-3.4.1-Debian
uname -srm: Linux 2.6.28-11-generic x86_64

This is running under VMWare Fusion, in case that might be relevant.

Comment 8 Bart Van Assche 2009-07-17 07:53:19 UTC

Update: on July 17, 2009 this issue reoccurred on a non-virtual system with a 2.6.28 Debian kernel (Ubuntu 9.04). See also http://sourceforge.net/mailarchive/forum.php?thread_name=20090716162641.GA7107%40ocean&forum_name=valgrind-developers.

Comment 9 Julian Seward 2009-07-17 13:55:31 UTC

(In reply to comment #8)
Bart, is this repeatable?  I think (not sure) that one of the symptoms of
the previous version of this problem is that the failure is not reliably
repeatable, because it depends on how the process is scheduled/descheduled.

Considering comments #7 and #8, it looks like this bug is back from the
dead in kernel 2.6.28.  Maybe http://bugzilla.kernel.org/show_bug.cgi?id=7223
should be reopened?

Comment 10 Bart Van Assche 2009-07-17 21:04:10 UTC

(In reply to comment #9)
> (In reply to comment #8)
> Bart, is this repeatable?  I think (not sure) that one of the symptoms of
> the previous version of this problem is that the failure is not reliably
> repeatable, because it depends on how the process is scheduled/descheduled.
> 
> Considering comments #7 and #8, it looks like this bug is back from the
> dead in kernel 2.6.28.  Maybe http://bugzilla.kernel.org/show_bug.cgi?id=7223
> should be reopened?

I have tried to reproduce kernel bug #7223 on a vanilla 2.6.28.10 kernel with the testfpu program attached to that bug report but without success so far. We need a way to reproduce the CPU state corruption before reopening the kernel bug.

Comment 11 Bart Van Assche 2009-07-17 21:24:52 UTC

By the way, the test I ran in parallel with the testfpu program (ten processes performing network I/O over an InfiniBand connection) triggered about 170,000 interrupts per second and about 200,000 context switches per second.
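
The comment above does not say how those rates were measured. One way to get comparable numbers on Linux, sketched here in Python, is to sample the `intr` and `ctxt` counters in /proc/stat twice and divide the difference by the interval. The function names are made up for this illustration.

```python
import time


def read_counters(stat_text):
    """Extract total interrupts and context switches from /proc/stat text."""
    counters = {}
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] in ("intr", "ctxt"):
            # The first number on the 'intr' line is the total over all interrupts.
            counters[fields[0]] = int(fields[1])
    return counters


def rates(before, after, seconds):
    """Per-second rates between two samples taken `seconds` apart."""
    return {name: (after[name] - before[name]) / seconds for name in before}


def sample_rates(seconds=1.0, path="/proc/stat"):
    """Read the counters twice, `seconds` apart, and return the rates."""
    with open(path) as f:
        before = read_counters(f.read())
    time.sleep(seconds)
    with open(path) as f:
        after = read_counters(f.read())
    return rates(before, after, seconds)
```

Calling `sample_rates(5.0)` while the load is running returns a dictionary with `intr` and `ctxt` rates per second, averaged over five seconds.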

Comment 12 Nicholas Nethercote 2009-07-20 05:49:52 UTC

The system described in comment #8 is mine.  The crash occurs very occasionally on my machine, but it's not at all repeatable :(

Comment 13 Ben Denckla 2009-07-20 22:48:40 UTC

(In reply to comment #7)
> This is running under VMWare Fusion, in case that might be relevant.

I have now reproduced this on a non-virtual system.

It is not 100% reproducible on VMs or PMs, but it can be reproduced in a day or so.

Is there anything I can do to help debug this?  E.g., this happens about half the time in our nightly buildbot testing, so I would be glad to run a specially instrumented valgrind build on our Ubuntu 9.04/64 buildbot slave and send any logs or core dumps it produces.
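
Since the failure is intermittent, a repeat-and-count loop gives a rough failure rate to compare between kernels. The following Python sketch runs a command a fixed number of times and counts the runs whose output contains the assertion text quoted in this report. It is not the buildbot setup described above; `count_failures` and the command shown in the usage note are hypothetical.

```python
import subprocess

# Text taken from the Valgrind output quoted in this bug report.
SIGNATURE = "run_innerloop detected host state invariant failure"


def count_failures(command, runs, signature=SIGNATURE):
    """Run `command` `runs` times; count runs whose output contains `signature`."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr, where Valgrind reports by default
            text=True,
            errors="replace",
        )
        if signature in result.stdout:
            failures += 1
    return failures
```

For example, `count_failures(["valgrind", "./your_program"], runs=50)`, where `./your_program` is a placeholder, reports how many of 50 runs hit the assertion. A count of zero only shows the failure did not occur in those runs, not that it cannot.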

Comment 14 Bart Van Assche 2009-07-21 12:53:35 UTC

(In reply to comment #13)
> I have now reproduced this on a non-virtual system.
> It is not 100% reproducible on VMs or PMs, but it can be reproduced in a day or
> so.
> 
> Is there anything I can do to help debug this?  E.g., this happens about half
> the time in our nightly buildbot testing, so I would be glad to run a specially
> instrumented valgrind build on our Ubuntu 9.04/64 buildbot slave and send any
> logs or core dumps it produces.

It is great news that you found a way to reproduce this issue systematically. Before reopening the kernel bugzilla entry, can you please verify whether the issue also occurs with a vanilla 2.6.30.2 kernel? You can find instructions to install the 2.6.30.2 kernel image provided by Ubuntu here: https://wiki.ubuntu.com/KernelTeam/MainlineBuilds.

Comment 15 Nicholas Nethercote 2009-07-30 00:36:23 UTC

Occurred again on my machine (see comment 8) last night, for none/tests/async-sigs.c:


testing: blocking=0 caught=11 fatal=7... 
valgrind: m_scheduler/scheduler.c:1199 (vgPlain_scheduler): the 'impossible' happened.
valgrind: VG_(scheduler), phase 3: run_innerloop detected host state invariant failure
   at 0x........: report_and_quit (m_libcassert.c:145)
   by 0x........: vgPlain_assert_fail (m_libcassert.c:217)
   by 0x........: vgPlain_scheduler (scheduler.c:1224)
   by 0x........: run_a_thread_NORETURN (syswrap-linux.c:91)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable
   at 0x........: test (async-sigs.c:94)
   by 0x........: main (async-sigs.c:129)


Note: see also the FAQ in the source distribution.
It contains workarounds to several common problems.
In particular, if Valgrind aborted or crashed after
identifying problems in your program, there's a good chance
that fixing those problems will prevent Valgrind aborting or
crashing, especially if it happened in m_mallocfree.c.

If that doesn't help, please report this bug to: www.valgrind.org

In the bug report, send all the above text, the valgrind
version, and what OS and version you are using.  Thanks.

FAILED: child exited with unexpected status exit 1
testing: blocking=0 caught=11 fatal=1... PASSED
testing: blocking=0 caught=10 fatal=7... PASSED
testing: blocking=0 caught=10 fatal=1... PASSED
testing: blocking=1 caught=11 fatal=7... PASSED
testing: blocking=1 caught=11 fatal=1... PASSED
testing: blocking=1 caught=10 fatal=7... PASSED
testing: blocking=1 caught=10 fatal=1... PASSED
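
The reported source line differs between the outputs in this report (scheduler.c:996 with 3.2.3, 1144 with 3.4.1, 1199 here), so when searching test logs it is safer to match the message than the line number. A minimal Python sketch, with a made-up function name `find_impossible`:

```python
import re

# Matches lines such as:
# valgrind: m_scheduler/scheduler.c:996 (vgPlain_scheduler): the 'impossible' happened.
ASSERT_RE = re.compile(
    r"^valgrind: (?P<file>\S+):(?P<line>\d+) \((?P<func>\w+)\): "
    r"the 'impossible' happened\.",
    re.MULTILINE,
)


def find_impossible(log_text):
    """Return (file, line, function) for each 'impossible' report in a log."""
    return [
        (m.group("file"), int(m.group("line")), m.group("func"))
        for m in ASSERT_RE.finditer(log_text)
    ]
```

Run over a saved regression-test log, this returns one tuple per report, which makes it easy to see which assertion site was hit regardless of the Valgrind version.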

Comment 16 Ben Denckla 2009-08-11 22:28:36 UTC

(In reply to comment #14)
> (In reply to comment #13)
> Before reopening the kernel bugzilla entry, can you please verify whether the
> issue also occurs with a vanilla 2.6.30.2 kernel?

Just to let you know, I don't have any results yet but am trying to reproduce this on the 2.6.30.2 kernel as you suggest.

Comment 17 Bart Van Assche 2009-08-11 22:36:12 UTC

(In reply to comment #16)
> Just to let you know, I don't have any results yet but am trying to reproduce
> this on the 2.6.30.2 kernel as you suggest.

Julian, is it necessary to continue these tests after the fix that was committed on July 22, 2009 (VEX r1912, http://sourceforge.net/mailarchive/forum.php?thread_name=20090722110618.3D904108852%40jail0086.vps.exonetric.net&forum_name=valgrind-developers)?

Comment 18 Nicholas Nethercote 2009-08-12 01:09:34 UTC

Yeah, I was wondering if r1912 was fixing this problem or something else.  But I saw it on July 30 (comment 15) which was after the July 22 commit so I figure it's not fixed.

Comment 19 Ben Denckla 2009-11-04 02:14:06 UTC

(In reply to comment #16)
> (In reply to comment #14)
> > (In reply to comment #13)
> > Before reopening the kernel bugzilla entry, can you please verify whether the
> > issue also occurs with a vanilla 2.6.30.2 kernel?
> 
> Just to let you know, I don't have any results yet but am trying to reproduce
> this on the 2.6.30.2 kernel as you suggest.

Sorry to have taken so many months to return to this.  It takes a lot of testing to convince oneself that a problem like this is gone.  It appears to be gone in the 2.6.30.2 kernel.  Once we upgraded all of our Ubuntu 9.04 machines to the 2.6.30.2 kernel, we never saw it again, and we used to see it with pretty regular frequency.

Comment 20 Bart Van Assche 2009-11-04 08:07:43 UTC

(In reply to comment #19)
> Sorry to have taken so many months to return to this.  It takes a lot of
> testing to convince oneself that a problem like this is gone.  It appears to be
> gone in the 2.6.30.2 kernel.  Once we upgraded all of our Ubuntu 9.04 machines
> to the 2.6.30.2 kernel, we never saw it again, and we used to see it with
> pretty regular frequency.

Thanks for the update.