Bug 339596 - AMD64 xop instructions unsupported. vex amd64->IR: unhandled instruction bytes: 0x8F 0xE8 0x78 0xCD 0xC1 0x4 0xC5 0xF9
Summary: AMD64 xop instructions unsupported. vex amd64->IR: unhandled instruction byte...
Status: CONFIRMED
Alias: None
Product: valgrind
Classification: Developer tools
Component: vex (show other bugs)
Version: 3.9.0
Platform: Other Linux
: NOR normal
Target Milestone: ---
Assignee: Julian Seward
URL:
Keywords:
: 356138 424034 446669 (view as bug list)
Depends on: 323431 369053 381819 369000
Blocks:
  Show dependency treegraph
 
Reported: 2014-10-02 06:45 UTC by p4plus2
Modified: 2022-08-22 15:06 UTC (History)
9 users (show)

See Also:
Latest Commit:
Version Fixed In:


Attachments
First attempt at an XOP/FMA4 patch (23.10 KB, patch)
2014-10-14 01:25 UTC, p4plus2
Details
Test cases for XOP/FMA4 instructions (7.30 KB, text/x-csrc)
2014-10-14 01:26 UTC, p4plus2
Details
XOP/FMA with CPUID support (33.46 KB, patch)
2016-08-16 20:12 UTC, p4plus2
Details
Testcases for fma4 instructions (264.52 KB, patch)
2016-09-06 15:23 UTC, Mark Wielaard
Details
Current stdout.diff file for new fma4.vgtest and proposed patch (72.34 KB, text/plain)
2016-09-06 15:25 UTC, Mark Wielaard
Details

Note You need to log in before you can comment on or make changes to this bug.
Description p4plus2 2014-10-02 06:45:41 UTC
When running valgrind upon program startup I immediately run into an illegal instruction.  My first action was to try using the latest source from SVN, this however, did not help.  Below are a few extra details on the instruction in question, my system, and anything else I can think of.

vex amd64->IR: unhandled instruction bytes: 0x8F 0xE8 0x78 0xCD 0xC1 0x4 0xC5 0xF9
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
==16432== valgrind: Unrecognised instruction at address 0x5adb623.
==16432==    at 0x5ADB623: findChar(QChar const*, int, QChar, int, Qt::CaseSensitivity) (in /usr/lib64/libQt5Core.so.5.4.0)
==16432==    by 0x5AEC0C9: QString::split(QChar, QString::SplitBehavior, Qt::CaseSensitivity) const (in /usr/lib64/libQt5Core.so.5.4.0)
==16432==    by 0x5BC8E61: QStandardPaths::standardLocations(QStandardPaths::StandardLocation) (in /usr/lib64/libQt5Core.so.5.4.0)
==16432==    by 0x5B7D9B7: QStandardPaths::locate(QStandardPaths::StandardLocation, QString const&, QFlags<QStandardPaths::LocateOption>) (in /usr/lib64/libQt5Core.so.5.4.0)
==16432==    by 0x5BB6028: QLoggingRegistry::init() (in /usr/lib64/libQt5Core.so.5.4.0)
==16432==    by 0x5C2BC8A: QCoreApplication::init() (in /usr/lib64/libQt5Core.so.5.4.0)
==16432==    by 0x5C2BEF5: QCoreApplication::QCoreApplication(QCoreApplicationPrivate&) (in /usr/lib64/libQt5Core.so.5.4.0)
==16432==    by 0x5560718: QGuiApplication::QGuiApplication(QGuiApplicationPrivate&) (in /usr/lib64/libQt5Gui.so.5.4.0)
==16432==    by 0x4F8F23C: QApplication::QApplication(int&, char**, int) (in /usr/lib64/libQt5Widgets.so.5.4.0)
==16432==    by 0x40D62D: main (main.cpp:37)

I've run it through GDB and grabbed a disassembly output (More can be provided if requested):
   0x0000000005adb619 <+329>:	mov    %r8,%rax
   0x0000000005adb61c <+332>:	vmovups (%rax),%xmm0
   0x0000000005adb620 <+336>:	mov    %rcx,%r8
=> 0x0000000005adb623 <+339>:	vpcomw $0x4,%xmm1,%xmm0,%xmm0
   0x0000000005adb629 <+345>:	vpmovmskb %xmm0,%esi
   0x0000000005adb62d <+349>:	test   %si,%si
   0x0000000005adb630 <+352>:	je     0x5adb610 <_ZL8findCharPK5QChariS_iN2Qt15CaseSensitivityE+320>
   0x0000000005adb632 <+354>:	bsf    %esi,%esi
   0x0000000005adb635 <+357>:	sub    %rdi,%rax

My system CPU is an AMD FX-8150, and Qt (where the instruction seems to originate) is compiled with GCC 4.8.  I am running a Gentoo based system and have used -march=native, -O2, and -fomit-frame-pointer as my only three default CFLAGS.   

If it helps I can setup a VM with SSH access temporarily to aid in testing/debugging of the problem.  However, for sanity sake I will only do this Valgrind developers.

My final observersations seem to be that the XOP and FM4 instructions introduced in the bulldozer generation AMD processors seem to cause the most trouble.  But that may be beyond the scope of this bug report.

Reproducible: Always
Comment 1 Tom Hughes 2014-10-02 07:43:37 UTC
The 0x8F suggests it is a XOP encoded AMD extension.
Comment 2 p4plus2 2014-10-02 08:25:29 UTC
One more bit of useful information this seems to be the location of failure more specifically: https://qt.gitorious.org/qt/qtbase/source/538248d930a7c62212a18f81c31d381752beff13:src/corelib/tools/qstring.cpp#L659

After I get home from class tomorrow I'll try deciphering how guest_amd64_toIR works and see where I can get from there.
Comment 3 Tom Hughes 2014-10-02 08:36:15 UTC
I doubt you will get very far as I don't think we have support for the AMD XOP encoded instructions yet, so it's not just a question of adding one instruction, you would likely need to add a whole load of framework first.

There's more information at http://en.wikipedia.org/wiki/XOP_instruction_set although the fact that it's largely the same as VEX will probably make things easier as we do support that.

In any case the easy fix is not to compile with -march=native (I assume that's what you're doing?) as if you do that with a modern machine, and especially an AMD one, you are likely to run into more than this one instruction that valgrind won't handle.
Comment 4 p4plus2 2014-10-02 17:29:55 UTC
I wasn't expecting it to be easy, I think that would have been a real surprise if it was.  While a recompile is nice in theory I'd have to rebuild my entire system which isn't really something I want to do.  Besides, I could use a new spare time project if nobody else has plans to do it.
Comment 5 p4plus2 2014-10-03 20:45:30 UTC
Well I have a starting implementation going VPCOMW, it's about half implemented.  I just need to add cases 0, 1, 6 and 7.  But with what I have handled so far it covers the scope of my problem, which is nice.  It seems I am running into a problem with some FMA4 instructions now.  I'll start working on that for the moment probably.  Once I have implemented the instructions I need I'll clean this up a bit more see where I can go with it from there.  So far so good at least.
Comment 6 p4plus2 2014-10-06 00:50:54 UTC
Alright, well I've made some major progress and implemented 19 missing instructions (Only the XMM portion, didn't do YMM yet).  This has solved my initial problem and I no longer run into any illegal instructions.  I've written some side tests in addition to my original program to verify the instructions I implemented are correct.  At this point I'll probably write some more tests for instructions which don't exist so I can continue working.  Plus i need to clean some bits up, particularly with VPHADDD, that particular instruction didn't turn out quite as nice as I'd like.  I also need to investigate rounding modes for floating point operations.
Comment 7 p4plus2 2014-10-08 06:43:53 UTC
So interesting question, what is the tolerance for floating point error?  My test case returns -451.367767 for example but my valgrind implementation of the same instruction results in -451.367737.  Trying other rounding modes didn't seem to help anything but I'll take another look in the morning to see what else I can turn up.
Comment 8 Tom Hughes 2014-10-08 07:04:32 UTC
There shouldn't really be any error with an SSE type instruction that is always working at 64 bits - the only "errors" are normally with x87 instructions and caused by us emulating them in 64 bits instead of 80 bits.
Comment 9 p4plus2 2014-10-09 02:30:34 UTC
Turns out I had overlooked the existance of Iop_MAddF32/64.  Making this correction has solved this problem.  With that out of the way all FMA4 instructions are operational now.  I have a few XOP operations to go still.  Still working out a relatively decent way to handle the horizontal instructions.
Comment 10 p4plus2 2014-10-14 01:25:20 UTC
Created attachment 89121 [details]
First attempt at an XOP/FMA4 patch

This patch implements all of the FMA4 instructions and a bulk of the XOP instructions.  In just a moment I'll also upload my test case for the implemented instructions.
While I'd like to keep working on it I won't have time for a couple weeks,  so, since it works for everything implemented I figured I'd at least try throw it out there.  My understanding of valgrinds internals is fairly new, so I would appreciate any criticisms to do things closer to the "valgrind way" so to speak.  I tried to adapt from observations, but the code base is sizable -- I'm sure I've missed a thing or ten.  If there are any reasons why the patch is inadequate I am all ears,  I will correct it when I have time if it is within my ability.
Comment 11 p4plus2 2014-10-14 01:26:36 UTC
Created attachment 89122 [details]
Test cases for XOP/FMA4 instructions
Comment 12 dragor 2015-10-29 06:01:39 UTC
Voting for this bug as I am hitting it as well. This is blocking me from using valgrind on any Qt5 program on this machine.
Comment 13 dragor 2015-12-08 04:34:48 UTC
This bug is getting critical for me.  I took the patch and applied it to the valgrind trunk.  It applied cleanly.  Once I ran the new valgrind, there were many more instructions that were processed.  I got to a point where it ran into a new instruction that this patch does not cover.  The error for the new instruction is:
vex amd64->IR: unhandled instruction bytes: 0x8F 0x68 0x20 0x95 0xE4 0xC0 0x8F 0xC8
vex amd64->IR:   REX=0 REX.W=0 REX.R=1 REX.X=0 REX.B=0
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0xB ESC=???
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
==21476== valgrind: Unrecognised instruction at address 0x65cff64.
Comment 14 Beau V.C. Bellamy 2016-02-10 09:19:49 UTC
Any progress on this?
Comment 15 p4plus2 2016-08-16 20:12:37 UTC
Created attachment 100633 [details]
XOP/FMA with CPUID support

I had a bit of time this week, (a brisk two years later...) and took the time to rebase to the latest svn copy of valgrind.  In addition to updating, I added the missing CPUID support.  Users running FX-8150-like processors will now be detected as such.  Is there anything else needed before this patch can be merged?
Comment 16 Stefano Bonicatti 2016-08-19 20:29:57 UTC
I have tested with the test case attached in this bug report and i can confirm the issue. 
I can also confirm that running the same test case with the patch works, valgrind doesn't crash/complains about illegal instructions.

Tested with an AMD FX 6300.
Comment 17 Mark Wielaard 2016-09-01 19:40:45 UTC
About the testcase. The __builtin_ia32_xxx are really just internal compiler details. They might or might not be there between versions of the same compiler or between different compilers. They might not even implement the same thing or take different arguments. For example the testcase uses __builtin_ia32_vpcomew with an extra argument that says which kind of operation/memonic is really meant, while gcc provides 8 separate __builtin_ia32_vpcom[xxx] to provide the same functionality. The testcase uses separate __builtin_ia32_vfnmaddxx and __builtin_ia32_vfnmsubxx functions, while gcc just provides one which takes a positive or negative argument. There is also no guarantee that these functions translate to some exact instruction (sequence).

So it would be better to rewrite the testcase to use explicit inline assembly using the exact instructions that we want to test.
Comment 18 Mark Wielaard 2016-09-06 15:23:52 UTC
Created attachment 100955 [details]
Testcases for fma4 instructions

Here are some testcases for the FMA4 instructions. I haven't looked yet at the XOP instructions.
Maybe it is an idea to do FMA4 and XOP as separate patches?

The testcase is based on the idea from the avx-1.c testcase. It creates a a block of values loads some of the values into xmm/ymm registers and creates a memory reference to another. Then it prints the contents of the block before and after the various instructions. The expected output comes from running the program on an actual processor having fma4 instructions. This is then compared with the same program running under valgrind.

With the current patch there are some differences.

First the 256bit ymm operations aren't supported, so they have been disabled in the testcase for now. But I am not sure we really should enable the fma4 cpuid bit in valgrind before we really support them.

Secondly some 128bit xmm operations should clear the upper 128 bits
of the corresponding YMM register to zeros and don't do so at the moment.

Lastly the "full 0xFF" testcases do show some differences (but the zeros, ones and random cases all look fine).

I'll also attach the stdout.diff next.

If you think doing full bit-indentical tests are not fair/achievable for these instructions then let me know and we can see if we can create some testcases that explicitly work on float/double results.
Comment 19 Mark Wielaard 2016-09-06 15:25:20 UTC
Created attachment 100956 [details]
Current stdout.diff file for new fma4.vgtest and proposed patch
Comment 20 Julian Seward 2016-09-07 07:58:56 UTC
(In reply to Mark Wielaard from comment #18)
> Here are some testcases for the FMA4 instructions.

Excellent.

> I haven't looked yet at the XOP instructions.
> Maybe it is an idea to do FMA4 and XOP as separate patches?

Hmm, no particular opinion on that from here.  If you find it more convenient,
then yes.

> First the 256bit ymm operations aren't supported, so they have been disabled
> in the testcase for now. But I am not sure we really should enable the fma4
> cpuid bit in valgrind before we really support them.

I agree.  Enabling cpuid bits before the all the associated instructions have
been implemented has caused us trouble before.

> Secondly some 128bit xmm operations should clear the upper 128 bits
> of the corresponding YMM register to zeros and don't do so at the moment.

In guest_amd64_toIR.c there's a function putYMMRegLoAndZU (ZU = "Zero Upper")
which does exactly that.  Maybe that would be helpful here?

> Lastly the "full 0xFF" testcases do show some differences (but the zeros,
> ones and random cases all look fine).

It may be that such values are infinities, NaNs etc and we don't handle those
quite right.  It would be much preferable if we did.  I think we should
indeed try to achieve bit-identical results to the hardware, and review the
cases where that seems to be problematic.

At least for initial verification, you could copy the scheme used in
randV128() in none/tests/arm64/fp_and_simd.c.  This guarantees that
all F32 and F64 values embedded within vectors are normalised numbers,
and so you can at least differentiate easily, failures caused by incorrect
handling of denorms (NaNs, Infs etc) vs failures for other reasons.

In the long run though I would prefer bit-identical results.
Comment 21 Julian Seward 2016-09-16 13:34:03 UTC
*** Bug 356138 has been marked as a duplicate of this bug. ***
Comment 22 Mark Wielaard 2016-09-18 17:09:41 UTC
I split this bug into a xop (this) one and fma4 (bug #369000) one.
I had some time to look at the fma4 instructions, but not yet at the xop one.
So I split the patch in two.
Comment 23 Mark Wielaard 2021-02-20 21:09:32 UTC
*** Bug 424034 has been marked as a duplicate of this bug. ***
Comment 24 Mark Wielaard 2021-02-20 21:19:58 UTC
Note there is also a bug (with patch) for XOP vpcmov here:
https://bugs.kde.org/show_bug.cgi?id=323431

And a TBM bextr (which also uses the XOP prefix) bug (with patch) here:
https://bugs.kde.org/show_bug.cgi?id=381819
Comment 25 p4plus2 2021-02-25 14:29:56 UTC
Truth be told I had forgotten about this patch and bug, sorry for that, I should have checked back long ago. That said, my understanding is FMA4 support is now committed and XOP is remaining?  I can work on fixing up some of the remaining bits and making sure it patches against the current valgrind assuming nobody else had plans to do so.  It also seems 256 bit support was not yet added to FMA4, I'll take some time and fix that up as well (I'll toss progress on that into #369053 once ready).
Comment 26 Mark Wielaard 2021-02-28 18:45:24 UTC
(In reply to p4plus2 from comment #25)
> Truth be told I had forgotten about this patch and bug, sorry for that, I
> should have checked back long ago.

We haven't been very good at following up ourselves and had several somewhat related bug for this. The main issue is that the XOP encodings used for these issues seem AMD specific and are not supported on all AMD processors to begin with. So it is somewhat hard to test these things unless you happen to have the right hardware.

> That said, my understanding is FMA4
> support is now committed and XOP is remaining?

FMA4 (128 bit only as you note below) has been added based on your patch from this bug. Which was split into a FMA4 128bit patch/bug #369000 and a 256bit FMA4 bug #369053.

Generic XOP instruction parsing and support for certain subsets like BMI (bextr) and vpcmov is spread over some other bugs (this one, bug #323431 and bug #381819)

>  I can work on fixing up some
> of the remaining bits and making sure it patches against the current
> valgrind assuming nobody else had plans to do so.  It also seems 256 bit
> support was not yet added to FMA4, I'll take some time and fix that up as
> well (I'll toss progress on that into #369053 once ready).

Thanks. That would be awesome.
Comment 27 Tom Hughes 2021-12-08 07:07:57 UTC
*** Bug 446669 has been marked as a duplicate of this bug. ***
Comment 28 Leandro Nini 2022-08-22 15:06:28 UTC
Been hit by another unhandled instruction using valgrind-3.19.0 with the bextr patch from bug #381819 applied:

vex amd64->IR: unhandled instruction bytes: 0x8F 0xC8 0x70 0xA3 0xCD 0x30 0xC4 0x41 0x79 0xD4
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=1
vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x1 ESC=0F
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0