Summary: | VG_(scheduler), phase 3: run_innerloop detected host state invariant failure | | |
---|---|---|---|
Product: | [Developer tools] valgrind | Reporter: | Patrik Nyberg <pnyberg> |
Component: | general | Assignee: | Julian Seward <jseward> |
Status: | REPORTED --- | | |
Severity: | crash | CC: | caibbor, rick.ramstetter+kde, sergiodj |
Priority: | NOR | | |
Version: | 3.12.0 | | |
Target Milestone: | --- | | |
Platform: | Gentoo Packages | | |
OS: | Linux | | |
Latest Commit: | | Version Fixed In: | |
Sentry Crash Report: | | | |
Description
Patrik Nyberg
2017-01-17 09:12:37 UTC
Downgrading kernel to version 4.4.39-gentoo and the error is no longer present.

(In reply to Patrik Nyberg from comment #1)
> Downgrading kernel to version 4.4.39-gentoo and the error is no longer
> present.

Yeah, I did wonder if this is kernel specific. The same failure happened some years ago and it turned out to be, if I remember correctly, a kernel bug, in which the kernel did not correctly maintain the FPU state across context switches. I wonder if 4.9.x has some related problem.

Okay, sounds like you might be on to something. I just upgraded to 4.9.4 and the error occurs again.

I don't have any such problems with the 4.9.3 kernel that comes with Fedora 25. Is it possible that this is a Gentoo-specific problem?

It might be, but it also seems to be a problem specific to the hardware. Several of my colleagues have now tried and it seems that all running the same kind of CPU as me (i5-6200U) are having the issue, while it is working fine for others (running on i7-4770 for example).

This patch will solve the problem (at least for the simple hello world case).

--- valgrind-3.12.0/coregrind/m_dispatch/dispatch-x86-linux.S.org 2017-01-17 13:52:58.290661172 +0100
+++ valgrind-3.12.0/coregrind/m_dispatch/dispatch-x86-linux.S 2017-01-17 13:53:09.399596888 +0100
@@ -126,6 +126,7 @@
 or %fpucw. We can't mess with %eax or %edx here as they
 holds the tentative return value, but any others are OK. */
 #if !defined(ENABLE_INNER)
+ jmp remove_frame
 /* This check fails for self-hosting, so skip in that case */
 pushl $0
 fstcw (%esp)

(In reply to Patrik Nyberg from comment #6)
> This patch will solve the problem (at least for the simple hello world case).

Sure. That just disables the assertion, though. It doesn't resolve the underlying issue.

(In reply to Patrik Nyberg from comment #5)

Are you sure that the i5-6200U connection is the only thing in common? That processor is a mid-range Skylake, and I am sure we would have heard by now if there were problems with Valgrind on Skylake.
That's why I ask.

We just found this on the kernel bugzilla. Seems to be related:
https://bugzilla.kernel.org/show_bug.cgi?id=190061

(In reply to Julian Seward from comment #8)
> (In reply to Patrik Nyberg from comment #5)
>
> Are you sure that the i5-6200U connection is the only thing in
> common? That processor is a mid-range Skylake, and I am sure we
> would have heard by now if there were problems with Valgrind on
> Skylake. That's why I ask.

Yes, that is the only thing in common we can find. I agree with you that it seems unlikely that this issue affects all Skylake processors.

*** Bug 374850 has been marked as a duplicate of this bug. ***

I am coming over from this closed bug: https://bugs.kde.org/show_bug.cgi?id=375171

In response to whether my processor is Skylake, I believe it is.

laptop: https://www.amazon.com/Dell-Inspiron-i7559-2512BLK-Generation-GeForce/dp/B015PYZ0J6
CPU: Intel Quad Core i7-6700HQ 2.6 GHz
https://ark.intel.com/products/88967/Intel-Core-i7-6700HQ-Processor-6M-Cache-up-to-3_50-GHz

The i7-6700HQ is listed under this Wikipedia list of Skylake processors:
https://en.wikipedia.org/wiki/Skylake_(microarchitecture)#List_of_Skylake_processors

The Intel page doesn't specify the microarchitecture, but Wikipedia says it is Skylake. So I presume I am, yes.

*** Bug 374482 has been marked as a duplicate of this bug. ***

Hi,

I stumbled upon this issue while working to prepare a new GDB release on a Fedora Rawhide VM. I talked to Mark, he pointed me to this bug, and we decided it would be a good idea to provide some information about how to reproduce it.

I found this bug while running the GDB testsuite. I had executed the whole testsuite at one moment, and did not notice any failures related to valgrind. Then, I had to upgrade my VM and make sure it was running the latest software available on Rawhide.
The upgrade installed the following packages:

https://people.redhat.com/sdurigan/valgrind-375171/dnf-upgrade

As you can see, the Linux kernel was upgraded (from kernel-5.1.0-1.fc31.x86_64 to kernel-5.2.0-0.rc1.git3.1.fc31.x86_64). After that, I ran the full testsuite again, and noticed two valgrind-related tests that started failing when they are compiled using -m32:

gdb.base/valgrind-disp-step.exp
gdb.base/valgrind-infcall.exp

They fail on upstream GDB as well, by the way.

If you would like to run the tests on your machine, you can:

1) Build GDB: https://sourceware.org/gdb/wiki/BuildingNatively

2) Run (from the build directory):

$ make check-gdb TESTS='gdb.base/valgrind-disp-step.exp gdb.base/valgrind-infcall.exp' RUNTESTFLAGS='--target_board unix/-m32'

If you would like to see the full log that is generated when you run this command here, you can find it here:

https://people.redhat.com/sdurigan/valgrind-375171/gdb.log

As I said, I'm using a VM running Fedora Rawhide. The list of packages I have installed can be found here:

https://people.redhat.com/sdurigan/valgrind-375171/packages

I am able to reproduce the problem every time I run the tests. I can provide more information if needed.

Thanks!
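
Since the reports above hinge on which kernel release and which CPU model each machine has, a small script can help collect both in a comparable form. The following is only a sketch, not part of Valgrind or GDB: the function names (`cpu_model_names`, `kernel_version_tuple`, `environment_summary`) are made up for illustration, and it assumes a Linux system where /proc/cpuinfo contains "model name" lines.

```python
import platform
import re


def cpu_model_names(cpuinfo_text):
    """Return the distinct 'model name' values from /proc/cpuinfo-style text."""
    names = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            name = value.strip()
            if name not in names:
                names.append(name)
    return names


def kernel_version_tuple(release):
    """Parse the leading numeric part of a kernel release string.

    Returns a (major, minor, patch) tuple of ints, or None if the string
    does not start with 'major.minor'. A missing patch level counts as 0.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups(default="0"))


def environment_summary(cpuinfo_path="/proc/cpuinfo"):
    """Collect kernel release and CPU model names for a bug report."""
    release = platform.release()
    try:
        with open(cpuinfo_path, encoding="utf-8", errors="replace") as handle:
            models = cpu_model_names(handle.read())
    except OSError:
        # Not Linux, or /proc is unavailable: report no CPU models.
        models = []
    return {
        "kernel_release": release,
        "kernel_version": kernel_version_tuple(release),
        "cpu_models": models,
    }


if __name__ == "__main__":
    for field, value in environment_summary().items():
        print(f"{field}: {value}")
```

Running the script on an affected and an unaffected machine gives two summaries that can be pasted into a comment side by side. The parsed version tuples compare in the usual way, so a release such as 4.9.4 sorts after 4.4.39-gentoo.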