Bug 375171

Summary: VG_(scheduler), phase 3: run_innerloop detected host state invariant failure
Product: [Developer tools] valgrind Reporter: Patrik Nyberg <pnyberg>
Component: general Assignee: Julian Seward <jseward>
Status: REPORTED ---    
Severity: crash CC: caibbor, rick.ramstetter+kde, sergiodj
Priority: NOR    
Version: 3.12.0   
Target Milestone: ---   
Platform: Gentoo Packages   
OS: Linux   
Latest Commit: Version Fixed In:
Sentry Crash Report:

Description Patrik Nyberg 2017-01-17 09:12:37 UTC
Valgrind reports the error "m_scheduler/scheduler.c:1592 (vgPlain_scheduler): the 'impossible' happened." when running a "hello world" program compiled as 32-bit. Callgrind produces the same error.

Reproducible: Always

Steps to reproduce:
1. Compile hello world -> gcc -m32 hello.c
2. Run valgrind -> valgrind ./a.out

Output:
==18138== Memcheck, a memory error detector
==18138== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==18138== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==18138== Command: ./a.out
==18138== 
--18138-- Valgrind options:
--18138--    -v
--18138-- Contents of /proc/version:
--18138--   Linux version 4.8.15 (root@gentoo64.transmode.se) (gcc version 4.9.4 (Gentoo 4.9.4 p1.0, pie-0.6.4) ) #2 SMP PREEMPT Sat Dec 17 10:14:28 CET 2016
--18138-- Arch and hwcaps: X86, LittleEndian, x86-mmxext-sse1-sse2-lzcnt
--18138-- Page sizes: currently 4096, max supported 4096
--18138-- Valgrind library directory: /usr/lib64/valgrind
--18138-- Reading syms from /lib32/ld-2.23.so
--18138--   Considering /usr/lib/debug/lib32/ld-2.23.so.debug ..
--18138--   .. CRC is valid
--18138-- Reading syms from /home/pnyberg/tmp/a.out
--18138-- Reading syms from /usr/lib64/valgrind/memcheck-x86-linux
--18138--    object doesn't have a symbol table
--18138--    object doesn't have a dynamic symbol table
--18138-- Scheduler: using generic scheduler lock implementation.
--18138-- Reading suppressions file: /usr/lib64/valgrind/default.supp
==18138== embedded gdbserver: reading from /tmp/vgdb-pipe-from-vgdb-to-18138-by-pnyberg-on-???
==18138== embedded gdbserver: writing to   /tmp/vgdb-pipe-to-vgdb-from-18138-by-pnyberg-on-???
==18138== embedded gdbserver: shared mem   /tmp/vgdb-pipe-shared-mem-vgdb-18138-by-pnyberg-on-???
==18138== 
==18138== TO CONTROL THIS PROCESS USING vgdb (which you probably
==18138== don't want to do, unless you know exactly what you're doing,
==18138== or are doing some strange experiment):
==18138==   /usr/lib64/valgrind/../../bin/vgdb --pid=18138 ...command...
==18138== 
==18138== TO DEBUG THIS PROCESS USING GDB: start GDB like this
==18138==   /path/to/gdb ./a.out
==18138== and then give GDB the following command
==18138==   target remote | /usr/lib64/valgrind/../../bin/vgdb --pid=18138
==18138== --pid is optional if only one valgrind process is running
==18138== 
--18138-- REDIR: 0x4018610 (ld-linux.so.2:strlen) redirected to 0x38075552 (???)

valgrind: m_scheduler/scheduler.c:1592 (vgPlain_scheduler): the 'impossible' happened.
valgrind: VG_(scheduler), phase 3: run_innerloop detected host state invariant failure

host stacktrace:
==18138==    at 0x3805A4A4: ??? (in /usr/lib64/valgrind/memcheck-x86-linux)
==18138==    by 0x3805A5F6: ??? (in /usr/lib64/valgrind/memcheck-x86-linux)
==18138==    by 0x3805A759: ??? (in /usr/lib64/valgrind/memcheck-x86-linux)
==18138==    by 0x380B4BC3: ??? (in /usr/lib64/valgrind/memcheck-x86-linux)
==18138==    by 0x380C6F47: ??? (in /usr/lib64/valgrind/memcheck-x86-linux)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable
==18138==    at 0x4007DD8: _dl_map_object (dl-load.c:1941)
==18138==    by 0x4000C64: map_doit (rtld.c:483)
==18138==    by 0x400EE34: _dl_catch_error (dl-error.c:187)
==18138==    by 0x4000870: do_preload (rtld.c:666)
==18138==    by 0x4003700: dl_main (rtld.c:1499)
==18138==    by 0x4015D31: _dl_sysdep_start (dl-sysdep.c:249)
==18138==    by 0x40047F0: _dl_start_final (rtld.c:307)
==18138==    by 0x40047F0: _dl_start (rtld.c:413)
==18138==    by 0x4000A76: ??? (in /lib32/ld-2.23.so)


Note: see also the FAQ in the source distribution.
It contains workarounds to several common problems.
In particular, if Valgrind aborted or crashed after
identifying problems in your program, there's a good chance
that fixing those problems will prevent Valgrind aborting or
crashing, especially if it happened in m_mallocfree.c.

If that doesn't help, please report this bug to: www.valgrind.org

In the bug report, send all the above text, the valgrind
version, and what OS and version you are using.  Thanks.


System:

uname -a

Linux hostname 4.8.15 #2 SMP PREEMPT Sat Dec 17 10:14:28 CET 2016 x86_64 Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz GenuineIntel GNU/Linux

gcc -v

Using built-in specs.
COLLECT_GCC=/usr/x86_64-pc-linux-gnu/gcc-bin/4.9.4/gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-pc-linux-gnu/4.9.4/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /var/tmp/portage/sys-devel/gcc-4.9.4/work/gcc-4.9.4/configure --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/4.9.4 --includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.9.4/include --datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.9.4 --mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.9.4/man --infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.9.4/info --with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.9.4/include/g++-v4 --with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/4.9.4/python --enable-languages=c,c++,fortran --enable-obsolete --enable-secureplt --disable-werror --with-system-zlib --enable-nls --without-included-gettext --enable-checking=release --with-bugurl=https://bugs.gentoo.org/ --with-pkgversion='Gentoo 4.9.4 p1.0, pie-0.6.4' --enable-libstdcxx-time --enable-shared --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu --enable-multilib --with-multilib-list=m32,m64 --disable-altivec --disable-fixed-point --enable-targets=all --disable-libgcj --enable-libgomp --disable-libmudflap --disable-libssp --disable-libcilkrts --enable-vtable-verify --enable-libvtv --enable-lto --without-cloog --enable-libsanitizer
Thread model: posix
gcc version 4.9.4 (Gentoo 4.9.4 p1.0, pie-0.6.4)
Comment 1 Patrik Nyberg 2017-01-17 10:48:14 UTC
After downgrading the kernel to version 4.4.39-gentoo, the error is no longer present.
Comment 2 Julian Seward 2017-01-17 11:03:55 UTC
(In reply to Patrik Nyberg from comment #1)
> Downgrading kernel to version 4.4.39-gentoo and the error is no longer
> present.

Yeah, I did wonder if this is kernel specific.  The same failure
happened some years ago and it turned out to be, if I remember
correctly, a kernel bug, in which the kernel did not correctly
maintain the FPU state across context switches.  I wonder if 
4.9.x has some related problem.
Comment 3 Patrik Nyberg 2017-01-17 12:29:16 UTC
Okay, sounds like you might be on to something. I just upgraded to 4.9.4 and the error occurs again.
Comment 4 Julian Seward 2017-01-17 12:33:30 UTC
I don't have any such problems with the 4.9.3 kernel that comes
with Fedora 25.  Is it possible that this is a Gentoo-specific
problem?
Comment 5 Patrik Nyberg 2017-01-17 13:21:28 UTC
It might be, but it also seems to be hardware-specific. Several of my colleagues have now tried, and everyone running the same CPU as me (i5-6200U) hits the issue, while it works fine for others (on an i7-4770, for example).
Comment 6 Patrik Nyberg 2017-01-17 13:27:45 UTC
This patch will solve the problem (at least for the simple hello world case).

--- valgrind-3.12.0/coregrind/m_dispatch/dispatch-x86-linux.S.org	2017-01-17 13:52:58.290661172 +0100
+++ valgrind-3.12.0/coregrind/m_dispatch/dispatch-x86-linux.S	2017-01-17 13:53:09.399596888 +0100
@@ -126,6 +126,7 @@
            or %fpucw.  We can't mess with %eax or %edx here as they
 	   holds the tentative return value, but any others are OK. */
 #if !defined(ENABLE_INNER)
+	jmp	remove_frame
         /* This check fails for self-hosting, so skip in that case */
 	pushl	$0
 	fstcw	(%esp)
Comment 7 Julian Seward 2017-01-17 13:32:19 UTC
(In reply to Patrik Nyberg from comment #6)
> This patch will solve the problem (at least for the simple hello world case).

Sure.  That just disables the assertion, though.  It doesn't resolve the
underlying issue.
Comment 8 Julian Seward 2017-01-17 13:34:01 UTC
(In reply to Patrik Nyberg from comment #5)

Are you sure that the i5-6200U connection is the only thing in
common?  That processor is a mid-range Skylake, and I am sure we
would have heard by now if there were problems with Valgrind on
Skylake.  That's why I ask.
Comment 9 Patrik Nyberg 2017-01-17 13:36:49 UTC
We just found this on the kernel bugzilla. It seems to be related: https://bugzilla.kernel.org/show_bug.cgi?id=190061
Comment 10 Patrik Nyberg 2017-01-17 13:52:42 UTC
(In reply to Julian Seward from comment #8)
> (In reply to Patrik Nyberg from comment #5)
> 
> Are you sure that the i5-6200U connection is the only thing in
> common?  That processor is a mid-range Skylake, and I am sure we
> would have heard by now if there were problems with Valgrind on
> Skylake.  That's why I ask.

Yes, that is the only thing in common we can find. I agree with you that it seems unlikely that this issue affects all Skylake processors.
Comment 11 Tom Hughes 2017-01-19 16:20:24 UTC
*** Bug 374850 has been marked as a duplicate of this bug. ***
Comment 12 Brandon 2017-01-19 16:34:52 UTC
I am coming over from this closed bug: https://bugs.kde.org/show_bug.cgi?id=375171

In response to whether my processor is Skylake, I believe it is.

laptop: https://www.amazon.com/Dell-Inspiron-i7559-2512BLK-Generation-GeForce/dp/B015PYZ0J6

CPU: Intel Quad Core i7-6700HQ 2.6 GHz 
https://ark.intel.com/products/88967/Intel-Core-i7-6700HQ-Processor-6M-Cache-up-to-3_50-GHz

The i7-6700HQ is listed in this Wikipedia list of Skylake processors: https://en.wikipedia.org/wiki/Skylake_(microarchitecture)#List_of_Skylake_processors

The Intel page doesn't specify the microarchitecture, but Wikipedia says it is Skylake. So I presume it is, yes.
Comment 13 Julian Seward 2017-03-06 16:40:25 UTC
*** Bug 374482 has been marked as a duplicate of this bug. ***
Comment 14 Sergio Durigan Junior 2019-05-30 15:43:33 UTC
Hi,

I stumbled upon this issue while working to prepare a new GDB release on a Fedora Rawhide VM.  I talked to Mark, he pointed me to this bug, and we decided it would be a good idea to provide some information about how to reproduce it.

I found this bug while running the GDB testsuite.  I had run the whole testsuite earlier and did not notice any failures related to valgrind.  Then, I had to upgrade my VM and make sure it was running the latest software available on Rawhide.  The upgrade installed the following packages:

  https://people.redhat.com/sdurigan/valgrind-375171/dnf-upgrade

As you can see, the Linux kernel was upgraded (from kernel-5.1.0-1.fc31.x86_64 to kernel-5.2.0-0.rc1.git3.1.fc31.x86_64).  After that, I ran the full testsuite again, and noticed two valgrind-related tests that started failing when they are compiled using -m32:

  gdb.base/valgrind-disp-step.exp
  gdb.base/valgrind-infcall.exp

They fail on upstream GDB as well, by the way.  If you would like to run the tests on your machine, you can:

1) Build GDB: https://sourceware.org/gdb/wiki/BuildingNatively

2) Run (from the build directory):

  $ make check-gdb TESTS='gdb.base/valgrind-disp-step.exp gdb.base/valgrind-infcall.exp' RUNTESTFLAGS='--target_board unix/-m32'

The full log generated when I run this command on my machine is available here:

  https://people.redhat.com/sdurigan/valgrind-375171/gdb.log

As I said, I'm using a VM running Fedora Rawhide.  The list of packages I have installed can be found here:

  https://people.redhat.com/sdurigan/valgrind-375171/packages

I am able to reproduce the problem every time I run the tests.

I can provide more information if needed.  Thanks!