I know it's scheduled for 3.7.0, but since it was low priority, I'd like to "vote" for this feature. I use Gentoo, and my system is built with '-march=native', which means I can't use valgrind because AVX instructions are used everywhere. I think I'll have to fall back to '-march=nocona' until valgrind supports this new instruction set. Thanks,
Created attachment 60079 [details] analyze-x86.py
FYI: running analyse-x86.py on Xorg gives:

cpuid: 0
nop: 15938
call: 17409
i86 280079
i386 619
i686 1023
MMX 2066
AVX 6258

So anyone who builds with -march=native on a Sandy Bridge CPU will get a *lot* of AVX instructions. That will be far more common than SSE4.
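For context on what a scan like this keys on: in 64-bit code, AVX instructions are identified by a VEX prefix, which begins with byte 0xC4 (three-byte form) or 0xC5 (two-byte form). The following is not the attached script, just a minimal illustrative sketch of the classification step such a scanner might perform:

```python
# Sketch: classify an instruction's leading bytes as VEX-encoded (AVX) or not.
# In 64-bit mode, 0xC4 (3-byte VEX) and 0xC5 (2-byte VEX) always introduce a
# VEX prefix; in 32-bit mode those bytes can also be LDS/LES, so real tools
# must inspect the following byte. Illustrative approximation only.
def is_vex_encoded(insn_bytes, mode64=True):
    if not insn_bytes:
        return False
    b0 = insn_bytes[0]
    if b0 not in (0xC4, 0xC5):
        return False
    if mode64:
        return True
    # 32-bit mode: treat as VEX only if the next byte's top bits look like
    # a register-form modrm (the convention that disambiguates from LDS/LES).
    return len(insn_bytes) > 1 and (insn_bytes[1] & 0xC0) == 0xC0

# The failing vpunpcklbw reported later in this thread is VEX-encoded:
assert is_vex_encoded(bytes([0xC5, 0xF9, 0x60, 0xD1]))
# Its legacy-SSE counterpart (0x66 0x0F 0x60 ...) is not:
assert not is_vex_encoded(bytes([0x0F, 0x60, 0xD1]))
```

A scanner then only needs to count how often this predicate fires across the decoded instructions of a binary.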
*** Bug 273230 has been marked as a duplicate of this bug. ***
*** Bug 268314 has been marked as a duplicate of this bug. ***
Let's use this as the canonical add-AVX-support bug. There are various duplicates. Last weekend I made a branch for AVX support, and did a bunch of preliminary refactoring necessary to make it possible to add AVX instructions in a sane way. Right now there are none actually supported, however I hope that support will begin to appear in the next three or four weeks. It won't make 3.7.0 unfortunately. The branch can be checked out with

svn co svn://svn.valgrind.org/valgrind/branches/AVX
*** Bug 280835 has been marked as a duplicate of this bug. ***
*** Bug 284864 has been marked as a duplicate of this bug. ***
*** Bug 285725 has been marked as a duplicate of this bug. ***
*** Bug 286596 has been marked as a duplicate of this bug. ***
*** Bug 287307 has been marked as a duplicate of this bug. ***
*** This bug has been confirmed by popular vote. ***
*** Bug 289656 has been marked as a duplicate of this bug. ***
Attaching a link for a downstream bug on this issue: https://bugs.gentoo.org/show_bug.cgi?id=398447
*** Bug 292300 has been marked as a duplicate of this bug. ***
*** Bug 286497 has been marked as a duplicate of this bug. ***
*** Bug 288995 has been marked as a duplicate of this bug. ***
*** Bug 292493 has been marked as a duplicate of this bug. ***
*** Bug 292841 has been marked as a duplicate of this bug. ***
This is really starting to become critical, as most recent machines have this feature. Can we get an ETA for this, and/or an estimate of what has been done and what remains to be done?
It's hardly critical - even if your machine supports these instructions, nobody is forcing you to compile with -march=native. Just compile the code you want to valgrind for a more basic CPU and you'll be fine.
(In reply to comment #20)
> It's hardly critical - even if your machine supports these instructions
> nobody is forcing you to compile with -march=native.
>
> Just compile code you want to valgrind for a more basic CPU and you'll be
> fine.

And all the related libraries too (that means Qt and kdelibs for any KDE application)... It also prevents any developer using Gentoo from using -march=native system-wide, which is really bad. For me, it's critical.
Doing complete AVX support is going to be a pretty big task. I suspect a minimal implementation that just provides coverage for the instructions emitted by gcc-4.7.0 -march=sandybridge (or whatever the relevant magic flag is) would keep folks happy in the short term whilst avoiding having to implement the majority of the instructions.
Just want to add my voice to those who think this is closer to "critical" than "minor nuisance". One great feature of Valgrind is that we do NOT have to recompile our applications; we can run real production binaries under memcheck. Note that not all the world uses GCC (we use the Intel C compiler). More to the point, performance-critical code is likely to be hand-optimized using the AVX intrinsics: http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/index.htm#intref_cls/common/intref_bk_advectorext.htm ...by which the entire AVX instruction set comes into play. I know this is a big task. But I bet there is also a big community who would contribute to the effort if they (a) knew exactly what was needed, (b) knew their work was likely to be accepted, and (c) knew that they were not duplicating work already in progress.
(In reply to comment #23)
> Just want to add my voice to those who think this is closer to "critical"
> than "minor nuisance".
>
> One great feature of Valgrind is that we do NOT have to recompile our
> applications; we can run real production binaries under memcheck.
>
> Note that not all the world uses GCC (we use the Intel C compiler). More to
> the point, performance-critical code is likely to be hand-optimized using
> the AVX intrinsics:
>
> http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/index.htm#intref_cls/common/intref_bk_advectorext.htm
>
> ...by which the entire AVX instruction set comes into play.
>
> I know this is a big task. But I bet there is also a big community who
> would contribute to the effort if they (a) knew exactly what was needed, (b)
> knew their work was likely to be accepted, and (c) knew that they were not
> duplicating work already in progress.

That is _exactly_ my point. That's why I asked what has been done so far and what still needs to be done. Implementing the logic for everything maybe isn't that critical, but at least not aborting on instructions generated by gcc -march=corei7-avx or the Intel compiler is. Recompiling the whole system, in case you're not using a binary distribution, is _not_ an acceptable workaround.
I need to finish off and merge branches/TCHAIN to trunk. I expect to do that in the next two days, after which I will attend to the AVX stuff. It's clear it's time to make a start on it. What needs to happen, and what I could use help with to get started:

* a list of insns generated by gcc -march=corei7-avx that need to be implemented. It doesn't have to be exact, but we have to start somewhere.

* a test program for these instructions, in the style (comprehensiveness, mostly) of none/tests/amd64/sse4-64.c. It isn't exciting stuff, but getting new instructions working reliably without such coverage is essentially impossible.

I need to mess with the IR infrastructure and the x86_64 back end (instruction selector, insn emitters) to handle 256-bit vectors. Also the front end (guest_amd64_toIR.c) will need work to get it to parse the 2 kinds of VEX prefixes.
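As a starting point for the first bullet, one could mechanically harvest the VEX-encoded mnemonics from a disassembly of a gcc -march=corei7-avx build. A rough sketch (not an official valgrind tool; the sample input mimics `objdump -d` output):

```python
# Sketch: collect the set of VEX-encoded mnemonics appearing in objdump -d
# output, as a starting point for the requested instruction list. The input
# is an inlined text fragment; in practice you would feed in the captured
# output of `objdump -d ./your-binary`.
import re

def vex_mnemonics(objdump_text):
    mnems = set()
    for line in objdump_text.splitlines():
        # Typical line: "  dab80: c5 f1 ef c9    vpxor  %xmm1,%xmm1,%xmm1"
        m = re.match(r'\s*[0-9a-f]+:\s+((?:[0-9a-f]{2}\s)+)\s*(\S+)', line)
        if not m:
            continue
        first_byte = m.group(1).split()[0]
        if first_byte in ('c4', 'c5'):   # VEX prefix bytes in 64-bit code
            mnems.add(m.group(2))
    return sorted(mnems)

sample = """\
  dab80: c5 f1 ef c9           vpxor  %xmm1,%xmm1,%xmm1
  dab90: c5 fa 6f 03           vmovdqu (%rbx),%xmm0
  dab94: 83 c1 01              add    $0x1,%ecx
"""
print(vex_mnemonics(sample))  # prints ['vmovdqu', 'vpxor']
```

Running that over a few large -mavx binaries would give a deduplicated worklist of mnemonics to implement.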
*** Bug 299104 has been marked as a duplicate of this bug. ***
This is in progress now. A few insns have been implemented, and infrastructure (256 bit IR extensions, VEX prefix decoding/encoding, 256 bit instruction selection) is in place. Early next week I'll put up a patch which will cover at least some of the instructions emitted by gcc-4.7.0 -mavx.
(In reply to comment #27)
> This is in progress now. A few insns have been implemented, and
> infrastructure (256 bit IR extensions, VEX prefix decoding/encoding, 256 bit
> instruction selection) is in place. Early next week I'll put up a patch
> which will cover at least some of the instructions emitted by gcc-4.7.0
> -mavx.

This is good news! If you need any information or people to test stuff, feel free to ask.
(In reply to comment #27)
> This is in progress now. A few insns have been implemented, and
> infrastructure (256 bit IR extensions, VEX prefix decoding/encoding, 256 bit
> instruction selection) is in place. Early next week I'll put up a patch
> which will cover at least some of the instructions emitted by gcc-4.7.0
> -mavx.

That's great! Are there any public patches or source-tree branches to try out? I have the same problem with AVX instructions, and I can help with testing the patches.
*** Bug 299803 has been marked as a duplicate of this bug. ***
*** Bug 299805 has been marked as a duplicate of this bug. ***
*** Bug 299804 has been marked as a duplicate of this bug. ***
Created attachment 71072 [details] AVX support -- 13 May 2011 -- WIP -- Valgrind changes
Created attachment 71073 [details] AVX support -- 13 May 2011 -- WIP -- Vex changes

Very, very incomplete. Expect breakage. Implemented insns are:

VMOVDQA ymm2/m256, ymm1
VMOVDQU ymm2/m256, ymm1
VMOVDQA xmm2/m128, xmm1
VMOVDQU xmm2/m128, xmm1
VPSLLD imm8, xmm2, xmm1
VPCMPEQD r/m, rV, r
VZEROUPPER
VMOVDQA ymm1, ymm2/m256
VMOVDQA xmm1, xmm2/m128
VMOVDQU xmm1, xmm2/m128
VPOR r/m, rV, r
VPXOR r/m, rV, r
VPSUBB r/m, rV, r
VPADDD r/m, rV, r
VPSHUFB r/m, rV, r
VINSERTF128 r/m, rV, rD
VEXTRACTF128 rS, r/m
VPBLENDVB xmmG, xmmE/memE, xmmV, xmmIS4
A comment on how those instructions got chosen for implementation: I'm working through the failing instructions encountered when running the executable created by

gcc-4.7.0 -O3 -mavx -o bz2-64 perf/bz2-64.c -I.

This appears to have quite a lot of VEX-encoded XMM instructions resulting from auto-vectorisation. It's all integer stuff; I haven't started on the FP stuff yet. I am also not yet at the point where the above executable will run.
Created attachment 71170 [details] AVX support -- 18 May 2011 -- WIP -- Valgrind changes
Created attachment 71171 [details] AVX support -- 18 May 2011 -- WIP -- Vex changes Majorly expands the set of supported instructions. Is now able to run large amounts of integer and FP code generated by gcc-4.7.0 -mavx -O2. This is the first version of the patch that is able to run useful amounts of AVX code. Unfortunately only works with --tool=none so far (massif too, maybe). In particular Memcheck doesn't work, pending rework of the back end code generator to deal with Memcheck's instrumentation code for 256 bit vectors. I hope to get it fixed this weekend.
Initial support has been committed now, as revs 2330/12569. These do not provide anything close to complete AVX coverage, but they do provide support for code created by "gcc-4.7.0 -mavx -O2", which I think should cover the majority of duplicates of this bug report. The only tool currently supported is Memcheck, although it appears that Massif, Helgrind and DRD also work. Give it a try. There may be breakage, but I will be working to tidy everything up, make the other tools work again, etc, this week.
On Gentoo, with gcc (Gentoo Hardened 4.5.3-r2 p1.1, pie-0.4.7) 4.5.3, a system built with CFLAGS/CXXFLAGS="-O2 -pipe -march=native -mavx -maes -ggdb" (built on an i5-2520M CPU), and using recent SVN of valgrind and VEX at the revision numbers you specified, a sample program using Qt fails immediately inside one of QString's helper functions. Not sure if it's due to -maes or -mavx.

jkt@svist ~/work/prog/_trojita-build/desktop-debug $ valgrind ./src/Gui/trojita
==51259== Memcheck, a memory error detector
==51259== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
==51259== Using Valgrind-3.8.0.SVN and LibVEX; rerun with -h for copyright info
==51259== Command: ./src/Gui/trojita
==51259==
vex amd64->IR: unhandled instruction bytes: 0xC5 0xF9 0x60 0xD1 0xC5 0xF9 0x68 0xC1
vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR: PFX.66=1 PFX.F2=0 PFX.F3=0
==51259== valgrind: Unrecognised instruction at address 0x824eb9b.
==51259==    at 0x824EB9B: QString::fromLatin1_helper(char const*, int) (emmintrin.h:703)
==51259==    by 0x158D2B: QString::QString(QLatin1String const&) (qstring.h:694)
==51259==    by 0x26587B: __static_initialization_and_destruction_0(int, int) (SettingsNames.cpp:26)
==51259==    by 0x266690: global constructors keyed to SettingsNames.cpp (SettingsNames.cpp:73)
==51259==    by 0x266895: ??? (in /home/jkt/work/prog/_trojita-build/desktop-debug/src/Gui/trojita)
==51259==    by 0x148DB2: ??? (in /home/jkt/work/prog/_trojita-build/desktop-debug/src/Gui/trojita)
==51259==    by 0x573F62F: ??? (in /usr/lib64/qt4/libQtWebKit.so.4.9.0)
==51259==    by 0x266814: __libc_csu_init (in /home/jkt/work/prog/_trojita-build/desktop-debug/src/Gui/trojita)
==51259==    by 0x97DA1FF: (below main) (libc-start.c:193)
I suspect that is some variant of PUNPCKLBW. It's not easy to deduce which one from the failure output, though. Can you use objdump -d to find the insn? You may find it useful to rerun V with --demangle=no --sym-offsets=yes, so as to get the function name and offset inside the function that the insn lives at.
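To spell out the objdump suggestion: once you have a disassembly, you look up the line whose address matches the one valgrind printed. A small sketch, where the disassembly snippet and the address are placeholders taken from this report (for a shared library you would first subtract the load base shown in /proc/&lt;pid&gt;/maps from valgrind's address):

```python
# Sketch: pick the faulting instruction out of a saved disassembly, given the
# address valgrind printed. The disassembly is inlined for illustration; in
# practice you would scan the output of `objdump -d ./your-binary`.
disasm = """\
  dab9b: c5 f9 60 d1            vpunpcklbw %xmm1,%xmm0,%xmm2
  dab9f: c5 f9 68 c1            vpunpckhbw %xmm1,%xmm0,%xmm0
"""
fault_addr = 0xdab9b  # hypothetical address, for illustration

for line in disasm.splitlines():
    if line.strip().startswith('%x:' % fault_addr):
        print(line.strip())  # prints the vpunpcklbw line
```

With --demangle=no --sym-offsets=yes, valgrind's "symbol+offset" output makes it easy to compute the same address relative to the function start in objdump's listing.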
This is what valgrind gives me now (on a new build of that binary; this is a project I keep working on, but in a completely unrelated place; I've nonetheless copied *this* one, and all further reports will use *this* version of this binary):

vex amd64->IR: unhandled instruction bytes: 0xC5 0xF9 0x60 0xD1 0xC5 0xF9 0x68 0xC1
vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR: PFX.66=1 PFX.F2=0 PFX.F3=0
==52919== valgrind: Unrecognised instruction at address 0x824eb9b.
==52919==    at 0x824EB9B: _ZN7QString17fromLatin1_helperEPKci+235 (emmintrin.h:703)
==52919==    by 0x158D2B: _ZN7QStringC1ERK13QLatin1String+55 (qstring.h:694)
==52919==    by 0x26587B: _Z41__static_initialization_and_destruction_0ii+107 (SettingsNames.cpp:26)
==52919==    by 0x266690: _GLOBAL__I_SettingsNames.cpp+37 (SettingsNames.cpp:73)
==52919==    by 0x266895: ??? (in /home/jkt/work/prog/_trojita-build/desktop-debug/src/Gui/trojita)
==52919==    by 0x148DB2: ??? (in /home/jkt/work/prog/_trojita-build/desktop-debug/src/Gui/trojita)
==52919==    by 0x573F62F: ??? (in /usr/lib64/qt4/libQtWebKit.so.4.9.0)
==52919==    by 0x266814: __libc_csu_init+68 (in /home/jkt/work/prog/_trojita-build/desktop-debug/src/Gui/trojita)
==52919==    by 0x97DA1FF: (below main)+143 (libc-start.c:193)

When I look at the output of objdump, I see the following:

00000000000455b8 <_ZN7QString17fromLatin1_helperEPKci@plt>:
   455b8: ff 25 aa 9c 39 00       jmpq   *0x399caa(%rip)        # 3df268 <_GLOBAL_OFFSET_TABLE_+0x2410>
   455be: 68 7f 04 00 00          pushq  $0x47f
   455c3: e9 f0 b7 ff ff          jmpq   40db8 <_init+0x18>

That looks like just a jump-table entry, but I won't pretend that I'm familiar with assembly. I'm sorry for not doing my homework properly, but I don't know how to get to the offending instructions from here (and is it even in the application itself? Or should I instead be looking at libQtCore.so?).
Anyway, the objdump output is at http://dev.gentoo.org/~jkt/tmp/objdump-trojita.bz2 , the binary in question at http://dev.gentoo.org/~jkt/tmp/trojita.for.valgrind.bz2 and the QtCore library at http://dev.gentoo.org/~jkt/tmp/libQtCore.so.4.8.1.bz2 . As a *very* blind shot, here is a piece of the objdump of libQtCore.so.4.8.1:

00000000000daab0 <_ZN7QString17fromLatin1_helperEPKci>:
   daab0: 55                      push   %rbp
   daab1: 53                      push   %rbx
   daab2: 48 89 fb                mov    %rdi,%rbx
   daab5: 48 83 ec 28             sub    $0x28,%rsp
   daab9: 64 48 8b 04 25 28 00    mov    %fs:0x28,%rax
   daac0: 00 00
   daac2: 48 89 44 24 18          mov    %rax,0x18(%rsp)
   daac7: 31 c0                   xor    %eax,%eax
   daac9: 48 85 ff                test   %rdi,%rdi
   daacc: 0f 84 ee 00 00 00       je     dabc0 <_ZN7QString17fromLatin1_helperEPKci+0x110>
   daad2: 85 f6                   test   %esi,%esi
   daad4: 0f 84 7e 00 00 00       je     dab58 <_ZN7QString17fromLatin1_helperEPKci+0xa8>
   daada: 89 f0                   mov    %esi,%eax
   daadc: c1 e8 1f                shr    $0x1f,%eax
   daadf: 84 c0                   test   %al,%al
   daae1: 75 6d                   jne    dab50 <_ZN7QString17fromLatin1_helperEPKci+0xa0>
   daae3: 48 63 ee                movslq %esi,%rbp
   daae6: 89 34 24                mov    %esi,(%rsp)
   daae9: 48 8d 7c 2d 20          lea    0x20(%rbp,%rbp,1),%rdi
   daaee: e8 1d 51 fa ff          callq  7fc10 <_Z7qMallocm>
   daaf3: 48 85 c0                test   %rax,%rax
   daaf6: 8b 34 24                mov    (%rsp),%esi
   daaf9: 0f 84 e1 00 00 00       je     dabe0 <_ZN7QString17fromLatin1_helperEPKci+0x130>
   daaff: 48 8d 50 1a             lea    0x1a(%rax),%rdx
   dab03: 80 60 18 e0             andb   $0xe0,0x18(%rax)
   dab07: 83 fe 0f                cmp    $0xf,%esi
   dab0a: c7 00 01 00 00 00       movl   $0x1,(%rax)
   dab10: 89 70 08                mov    %esi,0x8(%rax)
   dab13: 89 70 04                mov    %esi,0x4(%rax)
   dab16: 48 89 50 10             mov    %rdx,0x10(%rax)
   dab1a: 66 c7 44 68 1a 00 00    movw   $0x0,0x1a(%rax,%rbp,2)
   dab21: 7f 5d                   jg     dab80 <_ZN7QString17fromLatin1_helperEPKci+0xd0>
   dab23: 85 f6                   test   %esi,%esi
   dab25: 74 3e                   je     dab65 <_ZN7QString17fromLatin1_helperEPKci+0xb5>
   dab27: 48 8d 4b 01             lea    0x1(%rbx),%rcx
   dab2b: 83 ee 01                sub    $0x1,%esi
   dab2e: 48 8d 34 31             lea    (%rcx,%rsi,1),%rsi
   dab32: eb 08                   jmp    dab3c <_ZN7QString17fromLatin1_helperEPKci+0x8c>
   dab34: 0f 1f 40 00             nopl   0x0(%rax)
   dab38: 48 83 c1 01             add    $0x1,%rcx
   dab3c: 0f b6 1b                movzbl (%rbx),%ebx
   dab3f: 66 89 1a                mov    %bx,(%rdx)
   dab42: 48 83 c2 02             add    $0x2,%rdx
   dab46: 48 39 f1                cmp    %rsi,%rcx
   dab49: 48 89 cb                mov    %rcx,%rbx
   dab4c: 75 ea                   jne    dab38 <_ZN7QString17fromLatin1_helperEPKci+0x88>
   dab4e: eb 15                   jmp    dab65 <_ZN7QString17fromLatin1_helperEPKci+0xb5>
   dab50: 80 3f 00                cmpb   $0x0,(%rdi)
   dab53: 75 7a                   jne    dabcf <_ZN7QString17fromLatin1_helperEPKci+0x11f>
   dab55: 0f 1f 00                nopl   (%rax)
   dab58: 48 8b 05 61 32 44 00    mov    0x443261(%rip),%rax        # 51ddc0 <_ZTI16QEventTransition+0xdc0>
   dab5f: f0 ff 00                lock incl (%rax)
   dab62: 0f 95 c2                setne  %dl
   dab65: 48 8b 54 24 18          mov    0x18(%rsp),%rdx
   dab6a: 64 48 33 14 25 28 00    xor    %fs:0x28,%rdx
   dab71: 00 00
   dab73: 0f 85 7e 00 00 00       jne    dabf7 <_ZN7QString17fromLatin1_helperEPKci+0x147>
   dab79: 48 83 c4 28             add    $0x28,%rsp
   dab7d: 5b                      pop    %rbx
   dab7e: 5d                      pop    %rbp
   dab7f: c3                      retq
   dab80: c5 f1 ef c9             vpxor  %xmm1,%xmm1,%xmm1
   dab84: 89 f7                   mov    %esi,%edi
   dab86: c1 ff 04                sar    $0x4,%edi
   dab89: 31 c9                   xor    %ecx,%ecx
   dab8b: 0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
   dab90: c5 fa 6f 03             vmovdqu (%rbx),%xmm0
   dab94: 83 c1 01                add    $0x1,%ecx
   dab97: 48 83 c3 10             add    $0x10,%rbx
   dab9b: c5 f9 60 d1             vpunpcklbw %xmm1,%xmm0,%xmm2
   dab9f: c5 f9 68 c1             vpunpckhbw %xmm1,%xmm0,%xmm0
   daba3: c5 fa 7f 12             vmovdqu %xmm2,(%rdx)
   daba7: c5 fa 7f 42 10          vmovdqu %xmm0,0x10(%rdx)
   dabac: 48 83 c2 20             add    $0x20,%rdx
   dabb0: 39 cf                   cmp    %ecx,%edi
   dabb2: 7f dc                   jg     dab90 <_ZN7QString17fromLatin1_helperEPKci+0xe0>
   dabb4: 83 e6 0f                and    $0xf,%esi
   dabb7: e9 67 ff ff ff          jmpq   dab23 <_ZN7QString17fromLatin1_helperEPKci+0x73>
   dabbc: 0f 1f 40 00             nopl   0x0(%rax)
   dabc0: 48 8b 05 69 2f 44 00    mov    0x442f69(%rip),%rax        # 51db30 <_ZTI16QEventTransition+0xb30>
   dabc7: f0 ff 00                lock incl (%rax)
   dabca: 0f 95 c2                setne  %dl
   dabcd: eb 96                   jmp    dab65 <_ZN7QString17fromLatin1_helperEPKci+0xb5>
   dabcf: e8 14 6b f8 ff          callq  616e8 <strlen@plt>
   dabd4: 89 c6                   mov    %eax,%esi
   dabd6: e9 08 ff ff ff          jmpq   daae3 <_ZN7QString17fromLatin1_helperEPKci+0x33>
   dabdb: 0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
   dabe0: 48 89 44 24 08          mov    %rax,0x8(%rsp)
   dabe5: e8 e6 07 fa ff          callq  7b3d0 <_Z9qBadAllocv>
   dabea: 8b 34 24                mov    (%rsp),%esi
   dabed: 48 8b 44 24 08          mov    0x8(%rsp),%rax
   dabf2: e9 08 ff ff ff          jmpq   daaff <_ZN7QString17fromLatin1_helperEPKci+0x4f>
   dabf7: e8 4c 70 f8 ff          callq  61c48 <__stack_chk_fail@plt>
   dabfc: 0f 1f 40 00             nopl   0x0(%rax)

...and there indeed are a couple of vpunpcklbw instructions.
Another one:

==14865== Memcheck, a memory error detector
==14865== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
==14865== Using Valgrind-3.8.0.SVN and LibVEX; rerun with -h for copyright info
==14865== Command: ./hayabusa go infinite
==14865==
vex amd64->IR: unhandled instruction bytes: 0xC4 0xE2 0x79 0x1E 0xC9 0xC4 0xE2 0x79
vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F38
vex amd64->IR: PFX.66=1 PFX.F2=0 PFX.F3=0
==14865== valgrind: Unrecognised instruction at address 0x47dc30.
==14865==    at 0x47DC30: Eval::initTables() (eval.cpp:98)
==14865==    by 0x427140: Console::init(int&, char**) (console.cpp:72)
==14865==    by 0x4244C9: main (main.cpp:28)

The relevant part of the code:

for (int x0=0; x0<8; ++x0)
    for (int y0=0; y0<8; ++y0)
        for (int x1=0; x1<8; ++x1)
            for (int y1=0; y1<8; ++y1)
                distance[x0+8*y0][x1+8*y1] = std::max(abs(x0-x1), abs(y0-y1));

   47dc26: c4 c1 79 fa cb          vpsubd %xmm11,%xmm0,%xmm1
   47dc2b: c4 c1 79 fa c2          vpsubd %xmm10,%xmm0,%xmm0
   47dc30: c4 e2 79 1e c9          vpabsd %xmm1,%xmm1      <----- *** crash ***
   47dc35: c4 e2 79 1e c0          vpabsd %xmm0,%xmm0
(In reply to comment #39) > vex amd64->IR: unhandled instruction bytes: 0xC5 0xF9 0x60 0xD1 0xC5 I implemented PUNPCKLBW and HBW in r2337; svn up and try again.
Thanks for the update. Now it dies at the following place:

vex amd64->IR: unhandled instruction bytes: 0xC5 0xFA 0x2C 0xC1 0x89 0x43 0x3C 0xC5
vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR: PFX.66=0 PFX.F2=0 PFX.F3=1
==86082== valgrind: Unrecognised instruction at address 0xa3af004.
==86082==    at 0xA3AF004: _uhash_allocate (uhash.c:241)
==86082==    by 0xA3AF4EB: _uhash_create (uhash.c:265)
==86082==    by 0xA3DC7AA: entryOpen (uresbund.c:274)
==86082==    by 0xA3DDA38: ures_open_48 (uresbund.c:2051)
==86082==    by 0xDFEF926: ucol_open_internal_48 (ucol_res.cpp:186)
==86082==    by 0x82639F5: qt_initIcu(QString const&) (qlocale_icu.cpp:147)
==86082==    by 0x822103E: QLocalePrivate::updateSystemPrivate() (qlocale.cpp:529)
==86082==    by 0x82212AA: systemPrivate() (qlocale.cpp:540)
==86082==    by 0x822130C: defaultPrivate() (qlocale.cpp:551)
==86082==    by 0x82214FD: QLocale::QLocale() (qlocale.cpp:673)
==86082==    by 0x82BF3DC: QResourceFileEngine::QResourceFileEngine(QString const&) (qresource.cpp:1200)
==86082==    by 0x82ED16E: _q_resolveEntryAndCreateLegacyEngine_recursive(QFileSystemEntry&, QFileSystemMetaData&, QAbstractFileEngine*&, bool) (qfilesystemengine.cpp:162)
==86082==    by 0x82ED2EC: QFileSystemEngine::resolveEntryAndCreateLegacyEngine(QFileSystemEntry&, QFileSystemMetaData&) (qfilesystemengine.cpp:208)
==86082==    by 0x829D75C: QFileInfo::QFileInfo(QString const&) (qfileinfo_p.h:103)
==86082==    by 0x8298498: QFile::exists(QString const&) (qfile.cpp:615)
==86082==    by 0x81F193D: QLibraryInfoPrivate::findConfiguration() (qlibraryinfo.cpp:116)
==86082==    by 0x81F1B3C: QLibrarySettings::QLibrarySettings() (qlibraryinfo.cpp:102)
==86082==    by 0x81F1BF5: qt_library_settings() (qlibraryinfo.cpp:82)
==86082==    by 0x81F1EFF: QLibraryInfo::location(QLibraryInfo::LibraryLocation) (qlibraryinfo.cpp:96)
==86082==    by 0x8322682: QCoreApplication::libraryPaths() (qcoreapplication.cpp:2389)
==86082==    by 0x8322E67: QCoreApplication::init() (qcoreapplication.cpp:707)
==86082==    by 0x83230A4: QCoreApplication::QCoreApplication(QCoreApplicationPrivate&) (qcoreapplication.cpp:596)
==86082==    by 0x6FF5E06: QApplication::QApplication(int&, char**, int) (qapplication.cpp:740)
==86082==    by 0x155C65: main (main.cpp:28)

I wasn't able to find the _uhash_allocate symbol in the output of objdump -d of any of the files which I can see in `pmap $pidOfMyApp`, so I cannot provide assembly output at this point.
I think, if I'm reading things right, that 0xC5 0xFA 0x2C is vcvttss2si.
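For readers wondering how one deduces a mnemonic from raw bytes like 0xC5 0xFA 0x2C: the byte after 0xC5 packs the VEX fields that the "vex amd64->IR" diagnostics print. A small sketch of decoding it (field layout per the Intel SDM; this is illustrative, not the actual VEX-library code):

```python
# Sketch: decode the fields of a 2-byte VEX prefix (0xC5 xx), the way the
# "vex amd64->IR" diagnostics report them. Layout of the second byte:
#   bit 7    = ~R (stored inverted)
#   bits 6:3 = ~vvvv (also stored inverted)
#   bit 2    = L (vector length: 0 = 128-bit, 1 = 256-bit)
#   bits 1:0 = pp, the implied legacy prefix: 0=none, 1=0x66, 2=0xF3, 3=0xF2
def decode_vex2(b1):
    return {
        'R':    1 - (b1 >> 7),
        'vvvv': (~(b1 >> 3)) & 0xF,
        'L':    (b1 >> 2) & 1,
        'pp':   {0: 'none', 1: '66', 2: 'F3', 3: 'F2'}[b1 & 3],
    }

# 0xC5 0xFA ... is the report above: matches VEX.L=0, nVVVV=0x0, PFX.F3=1
print(decode_vex2(0xFA))  # {'R': 0, 'vvvv': 0, 'L': 0, 'pp': 'F3'}
```

The implied 0xF3 prefix plus opcode 0x2C is what points at the scalar-convert family, hence vcvttss2si.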
(In reply to comment #45) > I think, if I'm reading things right, that 0xC5 0xFA 0x2C is vcvttss2si. Done, r2338. svn up and try again ..
vex amd64->IR: unhandled instruction bytes: 0xC4 0xE3 0x79 0x60 0xCA 0x45 0xC4 0xE3
vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F3A
vex amd64->IR: PFX.66=1 PFX.F2=0 PFX.F3=0
==98468== valgrind: Unrecognised instruction at address 0x824928e.
==98468==    at 0x824928E: _ZL15toLatin1_helperPK5QChari+238 (smmintrin.h:182)

gdb shows it as the following (and I'm including the rest of the function code as well):

   0x00007ffff464f288 <+232>: vmovdqu (%r12),%xmm2
=> 0x00007ffff464f28e <+238>: vpcmpestrm $0x45,%xmm2,%xmm1
   0x00007ffff464f294 <+244>: vpblendvb %xmm0,%xmm3,%xmm2,%xmm4
   0x00007ffff464f29a <+250>: vmovdqu 0x10(%r12),%xmm2
   0x00007ffff464f2a1 <+257>: vpcmpestrm $0x45,%xmm2,%xmm1
   0x00007ffff464f2a7 <+263>: add    $0x1,%esi
   0x00007ffff464f2aa <+266>: add    $0x20,%r12
   0x00007ffff464f2ae <+270>: vpblendvb %xmm0,%xmm3,%xmm2,%xmm2
   0x00007ffff464f2b4 <+276>: vpackuswb %xmm2,%xmm4,%xmm4
   0x00007ffff464f2b8 <+280>: vmovdqu %xmm4,(%rcx)
   0x00007ffff464f2bc <+284>: add    $0x10,%rcx
   0x00007ffff464f2c0 <+288>: cmp    %esi,%edi
   0x00007ffff464f2c2 <+290>: jg     0x7ffff464f288 <toLatin1_helper(QChar const*, int)+232>
   0x00007ffff464f2c4 <+292>: and    $0xf,%ebp
   0x00007ffff464f2c7 <+295>: jne    0x7ffff464f224 <toLatin1_helper(QChar const*, int)+132>
   0x00007ffff464f2cd <+301>: jmpq   0x7ffff464f1d4 <toLatin1_helper(QChar const*, int)+52>
   0x00007ffff464f2d2 <+306>: nopw   0x0(%rax,%rax,1)
   0x00007ffff464f2d8 <+312>: mov    0x10(%rax),%rcx
   0x00007ffff464f2dc <+316>: lea    0x18(%rax),%rdx
   0x00007ffff464f2e0 <+320>: cmp    %rdx,%rcx
   0x00007ffff464f2e3 <+323>: jne    0x7ffff464f20d <toLatin1_helper(QChar const*, int)+109>
   0x00007ffff464f2e9 <+329>: jmpq   0x7ffff464f21f <toLatin1_helper(QChar const*, int)+127>
   0x00007ffff464f2ee <+334>: callq  0x7ffff45dbc48 <__stack_chk_fail@plt>
   0x00007ffff464f2f3 <+339>: mov    %rax,%rbp
   0x00007ffff464f2f6 <+342>: mov    %rbx,%rdi
   0x00007ffff464f2f9 <+345>: callq  0x7ffff45df610 <QByteArray::~QByteArray()>
   0x00007ffff464f2fe <+350>: mov    %rbp,%rdi
   0x00007ffff464f301 <+353>: callq  0x7ffff45dc188 <_Unwind_Resume@plt>
This is another one. I'm not sure if it helps to post every unrecognised insn I encounter; if it does, I will continue testing. If the analyse-x86.py script were up to date, one could gather all the unknown insns at once?

Sascha

vex amd64->IR: unhandled instruction bytes: 0xC4 0xE1 0xF9 0x7E 0xC2 0x48 0x89 0xD1
vex amd64->IR: REX=0 REX.W=1 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR: PFX.66=1 PFX.F2=0 PFX.F3=0
==3945== valgrind: Unrecognised instruction at address 0x368ec3d080.
==3945==    at 0x368EC3D080: __floor_c (in /lib64/libm-2.15.so)

=> 368ec3d080: c4 e1 f9 7e c2          vmovq  %xmm0,%rdx
   368ec3d085: 48 89 d1                mov    %rdx,%rcx
   368ec3d088: 48 c1 f9 34             sar    $0x34,%rcx
   368ec3d08c: 81 e1 ff 07 00 00       and    $0x7ff,%ecx
   368ec3d092: 81 e9 ff 03 00 00       sub    $0x3ff,%ecx
   368ec3d098: 83 f9 33                cmp    $0x33,%ecx
(In reply to comment #48) > This is another one. I'm not sure, if it helps to post every unrecognised > insn i encounter. I think it is an effective way to target limited available development effort to getting this working asap, so keep going. I don't have time to do any more this afternoon, but will continue tonight. IME this will go on for a couple of days, but we will quickly reach the point where we have covered the entire subset of AVX insns that gcc can produce.
Now it dies a little later for me:

vex amd64->IR: unhandled instruction bytes: 0xC5 0xF9 0xE7 0x1 0xC5 0xF9 0xE7 0x41
vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR: PFX.66=1 PFX.F2=0 PFX.F3=0

which is

   44480c: c5 f9 e7 01             vmovntdq %xmm0,(%rcx)
   444810: c5 f9 e7 41 10          vmovntdq %xmm0,0x10(%rcx)
   444815: 4c 8d 84 07 c0 00 00    lea    0xc0(%rdi,%rax,1),%r8
   44481c: 00
   44481d: c5 f9 e7 41 20          vmovntdq %xmm0,0x20(%rcx)
   444822: c5 f9 e7 41 30          vmovntdq %xmm0,0x30(%rcx)
   444827: 4c 8d 8c 07 00 01 00    lea    0x100(%rdi,%rax,1),%r9

This is from a fast memset library which presumably uses "_mm_stream_si128". I can work around this, so it is not high priority.
After that it dies at code generated from standard C floating point:

vex amd64->IR: unhandled instruction bytes: 0xC5 0x30 0x16 0xD7 0xC5 0x78 0x11 0x10
vex amd64->IR: REX=0 REX.W=0 REX.R=1 REX.X=0 REX.B=0
vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x9 ESC=0F
vex amd64->IR: PFX.66=0 PFX.F2=0 PFX.F3=0
==23405== valgrind: Unrecognised instruction at address 0x4694c7.
==23405==    at 0x4694C7: Parameters::add(std::string, float) (parameters.cpp:72)

This is the "vmovlhps %xmm7,%xmm9,%xmm10" instruction below:

        if (value >= 0.0) {
   4694ab: c5 f8 2e d4             vucomiss %xmm4,%xmm2
   4694af: 72 24                   jb     4694d5 <_ZN10Parameters3addESsf+0xd5>
            min = 0.0; max = 2.0*value; }
   4694b1: c5 ea 58 f2             vaddss %xmm2,%xmm2,%xmm6
        Parm(T value): value(value), var(value/8) { if (value >= 0.0) { min = 0.0;
   4694b5: c5 f8 28 ec             vmovaps %xmm4,%xmm5
   4694b9: c5 7a 10 44 24 0c       vmovss 0xc(%rsp),%xmm8
   4694bf: c5 c8 14 f9             vunpcklps %xmm1,%xmm6,%xmm7
   4694c3: c5 38 14 cd             vunpcklps %xmm5,%xmm8,%xmm9
*** dies here:
   4694c7: c5 30 16 d7             vmovlhps %xmm7,%xmm9,%xmm10
***
   4694cb: c5 78 11 10             vmovups %xmm10,(%rax)
   4694cf: 48 83 c4 30             add    $0x30,%rsp
   4694d3: 5b                      pop    %rbx
   4694d4: c3                      retq
Just hit another one:

vex amd64->IR: unhandled instruction bytes: 0xC5 0xF8 0xAE 0x5C 0x24 0xFC 0x81 0x4C
vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR: PFX.66=0 PFX.F2=0 PFX.F3=0
==28881== valgrind: Unrecognised instruction at address 0x402080.
==28881==    at 0x402080: set_fast_math (crtfastmath.c:127)

0000000000402080 <set_fast_math>:
   402080: c5 f8 ae 5c 24 fc       vstmxcsr -0x4(%rsp)
   402086: 81 4c 24 fc 40 80 00    orl    $0x8040,-0x4(%rsp)
   40208d: 00
   40208e: c5 f8 ae 54 24 fc       vldmxcsr -0x4(%rsp)
   402094: c3                      retq
(In reply to comment #48) > vex amd64->IR: unhandled instruction bytes: 0xC4 0xE1 0xF9 0x7E 0xC2 0x48 > => 368ec3d080: c4 e1 f9 7e c2 vmovq %xmm0,%rdx Done, r2339. That was fun -- I couldn't find any description of it in the Intel manuals. Maybe I was looking in the wrong place.
The description is under MOVD (not Q), page 737 in the reference manual.
(In reply to comment #51) > vex amd64->IR: unhandled instruction bytes: 0xC5 0x30 0x16 0xD7 0xC5 0x78 > 4694c7: c5 30 16 d7 vmovlhps %xmm7,%xmm9,%xmm10 Done, r2340.
(In reply to comment #55)
> vex amd64->IR: unhandled instruction bytes: 0xC5 0x30 0x16 0xD7 0xC5 0x78
> 4694c7: c5 30 16 d7 vmovlhps %xmm7,%xmm9,%xmm10
>
> Done, r2340.

Updated to r2340, and for me it is still crashing at exactly this instruction, at the very same position.
(In reply to comment #56) > > > 4694c7: c5 30 16 d7 vmovlhps %xmm7,%xmm9,%xmm10 > Updated to r2340, and for me it is still crashing at exatcly this instruction at the very same position. Urr, I implemented vmovhlps by mistake, not vmovlhps. Will try again in the morning.
(In reply to comment #42) > 47dc30: c4 e2 79 1e c9 vpabsd %xmm1,%xmm1 Done, r2341.
(In reply to comment #57) > > > > 4694c7: c5 30 16 d7 vmovlhps %xmm7,%xmm9,%xmm10 Done (second try); r2342.
Now it dies one instruction later, at "vmovups".

vex amd64->IR: unhandled instruction bytes: 0xC5 0x78 0x11 0x10 0x48 0x83 0xC4 0x30
vex amd64->IR: REX=0 REX.W=0 REX.R=1 REX.X=0 REX.B=0
vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR: PFX.66=0 PFX.F2=0 PFX.F3=0
==2694== valgrind: Unrecognised instruction at address 0x4694cb.
==2694==    at 0x4694CB: Parameters::add(std::string, float) (parameters.cpp:72)
==2694==    by 0x46549A: Parameters::init() (parameters.cpp:100)
==2694==    by 0x46BD44: Console::init(int&, char**) (console.cpp:74)
==2694==    by 0x427429: main (main.cpp:28)

   4694c7: c5 30 16 d7             vmovlhps %xmm7,%xmm9,%xmm10
   4694cb: c5 78 11 10             vmovups %xmm10,(%rax)      <----- *** crash here ***
   4694cf: 48 83 c4 30             add    $0x30,%rsp
   4694d3: 5b                      pop    %rbx
   4694d4: c3                      retq

Possibly the road ahead is somewhat longer than you expected :-) But keep up the good work :-)
Is there a way to generate a list of all instructions handled by valgrind? If so, we could also generate a list of unsupported instructions from a pool of binaries.
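One low-tech approximation of the inverse: run valgrind over a pool of binaries and deduplicate the "unhandled instruction bytes" reports it emits. A rough sketch over an inlined log fragment (truncating to the first four bytes is only a heuristic, since x86 instruction lengths vary):

```python
# Sketch: harvest the distinct "unhandled instruction bytes" reports from one
# or more valgrind logs, so each missing opcode pattern is listed only once.
# The log text is inlined here for illustration; in practice you would
# concatenate the captured stderr of many valgrind runs.
log = """\
vex amd64->IR: unhandled instruction bytes: 0xC5 0xF9 0x60 0xD1 0xC5 0xF9 0x68 0xC1
vex amd64->IR: unhandled instruction bytes: 0xC5 0xFA 0x2C 0xC1 0x89 0x43 0x3C 0xC5
vex amd64->IR: unhandled instruction bytes: 0xC5 0xF9 0x60 0xD1 0xC5 0xF9 0x68 0xC1
"""
marker = 'unhandled instruction bytes: '
seen = set()
for line in log.splitlines():
    if marker in line:
        # Keep only the first few bytes; the tail of the report is just
        # whatever followed the faulting instruction in memory.
        opcode = ' '.join(line.split(marker)[1].split()[:4])
        seen.add(opcode)
for opcode in sorted(seen):
    print(opcode)
```

This would shrink a large pile of duplicate reports down to a worklist of distinct byte patterns for whoever is implementing the decoder.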
Here is another one (from a small matrix-multiply code), at VEX revision r2342:

vex amd64->IR: unhandled instruction bytes: 0xC5 0xFA 0xE6 0xD9 0xC5 0xF1 0xFE 0xE2
vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR: PFX.66=0 PFX.F2=0 PFX.F3=1
==88909== valgrind: Unrecognised instruction at address 0x4004d0.

This is in

   4004cc: c5 f9 6f cc             vmovdqa %xmm4,%xmm1
   4004d0: c5 fa e6 d9             vcvtdq2pd %xmm1,%xmm3
   4004d4: c5 f1 fe e2             vpaddd %xmm2,%xmm1,%xmm4
   4004d8: c5 f9 70 c9 ee          vpshufd $0xee,%xmm1,%xmm1
   4004dd: c5 e1 58 d8             vaddpd %xmm0,%xmm3,%xmm3
   4004e1: c5 fa e6 c9             vcvtdq2pd %xmm1,%xmm1
   4004e5: c5 f1 58 c8             vaddpd %xmm0,%xmm1,%xmm1
   4004e9: c5 f9 29 19             vmovapd %xmm3,(%rcx)
   4004ed: c5 f9 29 49 10          vmovapd %xmm1,0x10(%rcx)
This is the same vmovq instruction, but going the other way:

vex amd64->IR: unhandled instruction bytes: 0xC4 0xE1 0xF9 0x6E 0xC0 0xC3 0x81 0xF9
vex amd64->IR: REX=0 REX.W=1 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR: PFX.66=1 PFX.F2=0 PFX.F3=0
==6258== valgrind: Unrecognised instruction at address 0x368ec3d0c6.
==6258==    at 0x368EC3D0C6: __floor_c+70 (in /lib64/libm-2.15.so)

   368ec3d0c6: c4 e1 f9 6e c0          vmovq  %rax,%xmm0
   368ec3d0cb: c3                      retq
   368ec3d0cc: 81 f9 00 04 00 00       cmp    $0x400,%ecx
   368ec3d0d2: 75 0c                   jne    368ec3d0e0 <__signbitl+0x51a0>
   368ec3d0d4: c5 fb 58 c0             vaddsd %xmm0,%xmm0,%xmm0
   368ec3d0d8: 0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
   368ec3d0df: 00
   368ec3d0e0: f3 c3                   repz retq
   368ec3d0e2: 66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
   368ec3d0e8: c5 fb 58 05 78 8c 03    vaddsd 0x38c78(%rip),%xmm0,%xmm0        # 368ec75d68 <__signbitl+0x3de28>
(In reply to comment #62)
> Here is another one (from a small matrix-multiply code), at VEX revision
> r2342:
>   4004e1:  c5 fa e6 c9    vcvtdq2pd %xmm1,%xmm1
>   4004e5:  c5 f1 58 c8    vaddpd %xmm0,%xmm1,%xmm1

These are vector floating point (xxxPD), so they are either from handwritten assembly or not created by gcc at -O2 or below (yes?), since gcc only does vectorization at -O3, IIUC. Anyway, I am concentrating on getting complete coverage for gcc-4.7.0 -O2, so I won't do these yet, since getting the vector FP support working is a whole bunch of work and obviously the scalar FP support is still incomplete.
(In reply to comment #50)
>   44480c:  c5 f9 e7 01    vmovntdq %xmm0,(%rcx)

(In reply to comment #60)
>   4694cb:  c5 78 11 10    vmovups %xmm10,(%rax)

(In reply to comment #63)
>   368ec3d0c6:  c4 e1 f9 6e c0    vmovq %rax,%xmm0

These 3 are fixed in r2343 now.
(In reply to comment #64)
> >   4004e1:  c5 fa e6 c9    vcvtdq2pd %xmm1,%xmm1
> These are vector floating point (xxxPD), so they are either from
> handwritten assembly or not created by gcc at -O2 or below (yes?)

Oops. You are right. This is with -O3. With -O2, it runs (even with cachegrind). Cool.
(In reply to comment #47)
> => 0x00007ffff464f28e <+238>:  vpcmpestrm $0x45,%xmm2,%xmm1

Done, r2344.

AFAICS, the only not-implemented ones reported so far (apart from Josef's vector ones) are vstmxcsr/vldmxcsr, in comment #52. I don't have a good way to test those. I'll get to them though. Let me know if I missed anything.
Using VEX r2344:

vex amd64->IR: unhandled instruction bytes: 0xC5 0xD9 0x67 0xE2 0xC5 0xFA 0x7F 0x21
vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x4 ESC=0F
vex amd64->IR: PFX.66=1 PFX.F2=0 PFX.F3=0
==128337== valgrind: Unrecognised instruction at address 0x82492b4.
==128337==    at 0x82492B4: _ZL15toLatin1_helperPK5QChari+276 (emmintrin.h:703)
==128337==    by 0x824E946: _ZNK7QString8toLatin1Ev+38 (qstring.cpp:3709)
==128337==    by 0x82D75A5: _ZN16QSettingsPrivate15stringToVariantERK7QString+213 (qsettings.cpp:556)
==128337==    by 0x82D81B5:

The exact offset which valgrind reports is not available in gdb's backtrace, though, so I'm not sure how relevant the disassembly is:

   0x00007ffff4654b94 <+228>:  add    $0x1,%ecx
   0x00007ffff4654b97 <+231>:  add    $0x10,%rbx
=> 0x00007ffff4654b9b <+235>:  vpunpcklbw %xmm1,%xmm0,%xmm2
   0x00007ffff4654b9f <+239>:  vpunpckhbw %xmm1,%xmm0,%xmm0
   0x00007ffff4654ba3 <+243>:  vmovdqu %xmm2,(%rdx)
   0x00007ffff4654ba7 <+247>:  vmovdqu %xmm0,0x10(%rdx)
   0x00007ffff4654bac <+252>:  add    $0x20,%rdx
   0x00007ffff4654bb0 <+256>:  cmp    %ecx,%edi
   0x00007ffff4654bb2 <+258>:  jg     0x7ffff4654b90 <QString::fromLatin1_helper(char const*, int)+224>
   0x00007ffff4654bb4 <+260>:  and    $0xf,%esi
   0x00007ffff4654bb7 <+263>:  jmpq   0x7ffff4654b23 <QString::fromLatin1_helper(char const*, int)+115>
   0x00007ffff4654bbc <+268>:  nopl   0x0(%rax)
   0x00007ffff4654bc0 <+272>:  mov    0x442f69(%rip),%rax   # 0x7ffff4a97b30
   0x00007ffff4654bc7 <+279>:  lock incl (%rax)
   0x00007ffff4654bca <+282>:  setne  %dl
   0x00007ffff4654bcd <+285>:  jmp    0x7ffff4654b65 <QString::fromLatin1_helper(char const*, int)+181>
   0x00007ffff4654bcf <+287>:  callq  0x7ffff45db6e8 <strlen@plt>
   0x00007ffff4654bd4 <+292>:  mov    %eax,%esi
   0x00007ffff4654bd6 <+294>:  jmpq   0x7ffff4654ae3 <QString::fromLatin1_helper(char const*, int)+51>
   0x00007ffff4654bdb <+299>:  nopl   0x0(%rax,%rax,1)
   0x00007ffff4654be0 <+304>:  mov    %rax,0x8(%rsp)
   0x00007ffff4654be5 <+309>:  callq  0x7ffff45f53d0 <qBadAlloc()>
   0x00007ffff4654bea <+314>:  mov    (%rsp),%esi
   0x00007ffff4654bed <+317>:  mov    0x8(%rsp),%rax
   0x00007ffff4654bf2 <+322>:  jmpq   0x7ffff4654aff <QString::fromLatin1_helper(char const*, int)+79>
   0x00007ffff4654bf7 <+327>:  callq  0x7ffff45dbc48 <__stack_chk_fail@plt>
If you want good testing coverage, gcc's gcc/testsuite/gcc.target/i386/ should cover lots of insns, both for AVX and for older and newer ISAs (XOP, AVX2, HLE/RTM, BMI, BMI2, ...). Pick the dg-do run tests, compile/link them with the corresponding gcc (please use 4.7 or newer; for HLE/RTM you need trunk) using the options mentioned in the dg-options comment, and run them both without valgrind and with valgrind.
A new one, "vmovd xmm0,edx", this time in glibc-2.15:

vex amd64->IR: unhandled instruction bytes: 0xC5 0xF9 0x7E 0xC2 0x81 0xFA 0xFF 0xFF
vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR: PFX.66=1 PFX.F2=0 PFX.F3=0
==24740== valgrind: Unrecognised instruction at address 0x3001c273c0.
==24740==    at 0x3001C273C0: __logf_finite (e_logf.c:39)
==24740==    by 0x44A58F: Parameters::mutate(float) (cmath:361)
==24740==    by 0x44ACFD: Evolution::initFixed(int) (evolution.cpp:54)
==24740==    by 0x469D63: Console::init(int&, char**) (console.cpp:77)
==24740==    by 0x4271E9: main (main.cpp:28)

(gdb) disas 0x3001C273C0,+100
Dump of assembler code from 0x3001c273c0 to 0x3001c27424:
   0x0000003001c273c0:  vmovd  %xmm0,%edx
   0x0000003001c273c4:  cmp    $0x7fffff,%edx
   0x0000003001c273ca:  jg     0x3001c273f8
   0x0000003001c273cc:  test   $0x7fffffff,%edx
   0x0000003001c273d2:  je     0x3001c275ed
   0x0000003001c273d8:  test   %edx,%edx
Insn vcvtsd2ss is not implemented. vcvtss2sd occurs a few insns later and seems to be missing, too.

vex amd64->IR: unhandled instruction bytes: 0xC5 0xFB 0x5A 0xC0 0xC5 0xF8 0x28 0xC8
vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR: PFX.66=0 PFX.F2=1 PFX.F3=0
==22268== valgrind: Unrecognised instruction at address 0x368ec22498.
==22268==    at 0x368EC22498: exp+88 (in /lib64/libm-2.15.so)

  368ec2248c:  48 8b 05 0d cb 2c 00    mov    0x2ccb0d(%rip),%rax   # 368eeeefa0 <__signbitl+0x2b7060>
  368ec22493:  83 38 ff                cmpl   $0xffffffff,(%rax)
  368ec22496:  74 c4                   je     368ec2245c <exp+0x1c>
  368ec22498:  c5 fb 5a c0             vcvtsd2ss %xmm0,%xmm0,%xmm0
  368ec2249c:  c5 f8 28 c8             vmovaps %xmm0,%xmm1
  368ec224a0:  bf 07 00 00 00          mov    $0x7,%edi
  368ec224a5:  e8 86 6b fe ff          callq  368ec09030 <matherr@plt+0x28d0>
  368ec224aa:  c5 fa 5a c0             vcvtss2sd %xmm0,%xmm0,%xmm0
  368ec224ae:  eb d7                   jmp    368ec22487 <exp+0x47>
(In reply to comment #68)
> ==128337== valgrind: Unrecognised instruction at address 0x82492b4.
> The exact offset which valgrind reports is not available in gdb's backtrace,
> though, so I'm not sure how relevant the disassembly is:
> => 0x00007ffff4654b9b <+235>:  vpunpcklbw %xmm1,%xmm0,%xmm2

Yes, the page offset (b9b) is completely wrong (vs 2b4). Anyway, by looking at the other output, I'd guess the insn is some VPACKUSWB variant. Can you try again to find it?
You are right, C5 xx 67 (where xx is any byte) is the 128-bit VEX-encoded VPACKUSWB instruction. If you don't know it already, you can use www.sandpile.org for opcode lookups.
(In reply to comment #70)
> 0x0000003001c273c0:  vmovd  %xmm0,%edx

(In reply to comment #71)
>   368ec22498:  c5 fb 5a c0    vcvtsd2ss %xmm0,%xmm0,%xmm0
>   368ec224aa:  c5 fa 5a c0    vcvtss2sd %xmm0,%xmm0,%xmm0

Done, r2345.
(In reply to comment #73)
> You are right, C5 xx 67 (where xx is any byte) is the 128-bit VEX-encoded
> VPACKUSWB instruction.

Done, r2346.
vex amd64->IR: unhandled instruction bytes: 0xC5 0xC0 0xC6 0xFF 0x0 0x45 0x19 0xC0
vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x7 ESC=0F
vex amd64->IR: PFX.66=0 PFX.F2=0 PFX.F3=0
==38183== valgrind: Unrecognised instruction at address 0x6fe1ca7.
==38183==    at 0x6FE1CA7: _ZN16QRadialFetchSimdI9QSimdSse2E5fetchEPjS2_PK8OperatorPK9QSpanDataddddd+167 (xmmintrin.h:866)
==38183==    by 0x6FE2361: _Z33qt_fetch_radial_gradient_templateI16QRadialFetchSimdI9QSimdSse2EEPKjPjPK8OperatorPK9QSpanDataiii+945 (qdrawhelper_p.h:439)

Looks like it's:

=> 0x00007ffff51f6ca7 <+167>:  vshufps $0x0,%xmm7,%xmm7,%xmm7
Next is VCVTTSS2SI, float-to-integer conversion.

vex amd64->IR: unhandled instruction bytes: 0xC4 0x61 0xFA 0x2C 0x9C 0x24 0xE0 0xB
vex amd64->IR: REX=0 REX.W=1 REX.R=1 REX.X=0 REX.B=0
vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR: PFX.66=0 PFX.F2=0 PFX.F3=1
==28050== valgrind: Unrecognised instruction at address 0x45699d.

  45699d:  c4 61 fa 2c 9c 24 e0    vcvttss2si 0xbe0(%rsp),%r11    *** crash here ***
  4569a4:  0b 00 00
  4569a7:  c5 f9 d6 4c 24 68       vmovq  %xmm1,0x68(%rsp)
Actually, (V)CVTTSS2SI is float-to-integer conversion with truncation, i.e. rounding toward zero.
Hey, it seems that all instructions for my current test setup are implemented. Thanks for your great work, Julian. I'll report back as soon as I find new ones :-) Regards, Sascha
(In reply to comment #77)
>   45699d:  c4 61 fa 2c 9c 24 e0    vcvttss2si 0xbe0(%rsp),%r11

Done (+ 3 others) in r2349.
Another one:

vex amd64->IR: unhandled instruction bytes: 0xC5 0xF9 0x6C 0xC0 0xC5 0xFA 0x6F 0xE
vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR: PFX.66=1 PFX.F2=0 PFX.F3=0
==16615== valgrind: Unrecognised instruction at address 0x54420cc.
==16615==    at 0x54420CC: read_alias_file+1164 (in /usr/lib64/libc-2.15.so)

  340cc:  c5 f9 6c c0             vpunpcklqdq %xmm0,%xmm0,%xmm0
  340d0:  c5 fa 6f 0e             vmovdqu (%rsi),%xmm1
  340d4:  48 83 c7 01             add    $0x1,%rdi
  340d8:  c5 f1 d4 c8             vpaddq %xmm0,%xmm1,%xmm1
  340dc:  c5 fa 7f 0e             vmovdqu %xmm1,(%rsi)
  340e0:  48 83 c6 10             add    $0x10,%rsi
  340e4:  48 3b bd 28 fe ff ff    cmp    -0x1d8(%rbp),%rdi
  340eb:  75 e3                   jne    340d0 <read_alias_file+0x490>
  340ed:  e9 0c ff ff ff          jmpq   33ffe <read_alias_file+0x3be>
  340f2:  66 66 66 66 66 2e 0f    data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
  340f9:  1f 84 00 00 00 00 00
And yet another:

vex amd64->IR: unhandled instruction bytes: 0xC4 0xE2 0x79 0x20 0xC8 0xC5 0xF9 0x73
vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F38
vex amd64->IR: PFX.66=1 PFX.F2=0 PFX.F3=0
==16936== valgrind: Unrecognised instruction at address 0x5c6d1ed.
==16936==    at 0x5C6D1ED: _ZNSt8numpunctIwE22_M_initialize_numpunctEP15__locale_struct+317 (in /usr/lib64/libstdc++.so.6.0.17)

  881ed:  c4 e2 79 20 c8          vpmovsxbw %xmm0,%xmm1
  881f2:  c5 f9 73 d8 08          vpsrldq $0x8,%xmm0,%xmm0
  881f7:  c4 e2 79 23 d1          vpmovsxwd %xmm1,%xmm2
  881fc:  c5 f1 73 d9 08          vpsrldq $0x8,%xmm1,%xmm1
  88201:  c4 e2 79 20 c0          vpmovsxbw %xmm0,%xmm0
  88206:  c4 e2 79 23 c9          vpmovsxwd %xmm1,%xmm1
  8820b:  c5 fa 7f 51 50          vmovdqu %xmm2,0x50(%rcx)
  88210:  c5 fa 7f 49 60          vmovdqu %xmm1,0x60(%rcx)
  88215:  c4 e2 79 23 c8          vpmovsxwd %xmm0,%xmm1
  8821a:  c5 f9 73 d8 08          vpsrldq $0x8,%xmm0,%xmm0
  8821f:  c4 e2 79 23 c0          vpmovsxwd %xmm0,%xmm0
  88224:  c5 fa 7f 49 70          vmovdqu %xmm1,0x70(%rcx)
  88229:  c5 fa 7f 81 80 00 00    vmovdqu %xmm0,0x80(%rcx)
  88230:  00
  88231:  c5 fa 6f 46 10          vmovdqu 0x10(%rsi),%xmm0
  88236:  c4 e2 79 20 c8          vpmovsxbw %xmm0,%xmm1
  8823b:  c5 f9 73 d8 08          vpsrldq $0x8,%xmm0,%xmm0
  88240:  c4 e2 79 23 d1          vpmovsxwd %xmm1,%xmm2
  88245:  c5 f1 73 d9 08          vpsrldq $0x8,%xmm1,%xmm1
  8824a:  c4 e2 79 20 c0          vpmovsxbw %xmm0,%xmm0
  8824f:  c4 e2 79 23 c9          vpmovsxwd %xmm1,%xmm1
  88254:  c5 fa 7f 91 90 00 00    vmovdqu %xmm2,0x90(%rcx)
  8825b:  00
  8825c:  c5 fa 7f 89 a0 00 00    vmovdqu %xmm1,0xa0(%rcx)
  88263:  00
  88264:  c4 e2 79 23 c8          vpmovsxwd %xmm0,%xmm1
  88269:  c5 f9 73 d8 08          vpsrldq $0x8,%xmm0,%xmm0
  8826e:  c4 e2 79 23 c0          vpmovsxwd %xmm0,%xmm0
  88273:  c5 fa 7f 89 b0 00 00    vmovdqu %xmm1,0xb0(%rcx)
  8827a:  00
  8827b:  c5 fa 7f 81 c0 00 00    vmovdqu %xmm0,0xc0(%rcx)
  88282:  00
callgrind on dolphin:

vex amd64->IR: unhandled instruction bytes: 0xC5 0xF9 0x71 0xD1 0x8 0xC5 0xF1 0xDB
vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR: PFX.66=1 PFX.F2=0 PFX.F3=0
==13668== valgrind: Unrecognised instruction at address 0x874f086.
==13668==    at 0x874F086: _Z31comp_func_solid_SourceOver_sse2Pjijj+326 (emmintrin.h:1167)
==13668==    by 0x893CAB2: _ZL19blend_color_genericiPK11QT_FT_Span_Pv+306 (qdrawhelper.cpp:3316)
==13668==    by 0x894A4BF: gray_convert_glyph+1631 (qgrayraster.c:1756)
==13668==    by 0x891491A: _ZN25QRasterPaintEnginePrivate9rasterizeEP14QT_FT_Outline_PFviPK11QT_FT_Span_PvES5_P13QRasterBuffer.part.114+474 (qpaintengine_raster.cpp:3834)
==13668==    by 0x892041D: _ZN18QRasterPaintEngine4fillERK11QVectorPathRK6QBrush+509 (qpaintengine_raster.cpp:1753)
==13668==    by 0x8892D2C: _ZN14QPaintEngineEx4drawERK11QVectorPath+124 (qpaintengineex.cpp:599)
==13668==    by 0x889462D: _ZN14QPaintEngineEx15drawRoundedRectERK6QRectFddN2Qt8SizeModeE+477 (qpaintengineex.cpp:779)
==13668==    by 0x88A9099: _ZN8QPainter15drawRoundedRectERK6QRectFddN2Qt8SizeModeE+57 (qpainter.cpp:4238)
==13668==    by 0x10286343: _ZN6Bespin8Elements12sunkenShadowEib+403 (elements.cpp:135)
==13668==    by 0x10037B75: _ZN6Bespin5Style15generatePixmapsEv+869 (genpixmaps.cpp:75)
==13668==    by 0x10043C69: _ZN6Bespin5Style4initEPK9QSettings+249 (init.cpp:688)
==13668==    by 0x1002A4D0: _ZN6Bespin5StyleC1Ev+96 (bespin.cpp:300)
==13668==    by 0x10030661: _ZN17BespinStylePlugin6createERK7QString+97 (bespin.cpp:79)
==13668==    by 0x8A54178: _ZN13QStyleFactory6createERK7QString+408 (qstylefactory.cpp:193)
==13668==    by 0x875C728: _ZN12QApplication5styleEv+184 (qapplication.cpp:1462)
==13668==    by 0x87CC0C3: _ZL20qt_set_x11_resourcesPKcS0_S0_S0_+499 (qapplication_x11.cpp:1289)
==13668==    by 0x87D0350: _Z7qt_initP19QApplicationPrivateiP9_XDisplaymm+5552 (qapplication_x11.cpp:2397)
==13668==    by 0x875D1B3: _ZN19QApplicationPrivate9constructEP9_XDisplaymm+211 (qapplication.cpp:839)
==13668==    by 0x875D8E9: _ZN12QApplicationC1ERiPPcbi+121 (qapplication.cpp:772)
==13668==    by 0x750EE06: _ZN12KApplicationC1Eb+54 (kapplication.cpp:346)
==13668==    by 0x4C5867D: _ZN18DolphinApplicationC1Ev+29 (dolphinapplication.cpp:31)
==13668==    by 0x4C6F829: kdemain+3529 (main.cpp:85)
==13668==    by 0x4ED138C: (below main)+236 (libc-start.c:226)

The whole system is compiled using gcc-4.6.3 with -march=corei7-avx -O2.
The next one is a VPINSRQ, which resulted from a call to

  vector = _mm_insert_epi64(vector, k, 1);

I guess this counts as integer vector code. Anyway, here is the relevant error message:

vex amd64->IR: unhandled instruction bytes: 0xC4 0xE3 0xE9 0x22 0xDB 0x1 0xC5 0xF9
vex amd64->IR: REX=0 REX.W=1 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x2 ESC=0F3A
vex amd64->IR: PFX.66=1 PFX.F2=0 PFX.F3=0
==30797== valgrind: Unrecognised instruction at address 0x443068.
...
  443068:  c4 e3 e9 22 db 01    vpinsrq $0x1,%rbx,%xmm2,%xmm3
...
Vex r2350 adds support for the following:

VADDPS xmm3/m128, xmm2, xmm1 = VEX.NDS.128.0F.WIG 58 /r
VMULPS xmm3/m128, xmm2, xmm1 = VEX.NDS.128.0F.WIG 59 /r
VCVTPS2DQ xmm2/m128, xmm1 = VEX.128.66.0F.WIG 5B /r
VSUBPS xmm3/m128, xmm2, xmm1 = VEX.NDS.128.0F.WIG 5C /r
VMINPS xmm3/m128, xmm2, xmm1 = VEX.NDS.128.0F.WIG 5D /r
VMAXPS xmm3/m128, xmm2, xmm1 = VEX.NDS.128.0F.WIG 5F /r
VPUNPCKLWD r/m, rV, r ::: r = interleave-lo-words(rV, r/m)
VPUNPCKHWD r/m, rV, r ::: r = interleave-hi-words(rV, r/m)
VPSHUFLW imm8, xmm2/m128, xmm1 = VEX.128.F2.0F.WIG 70 /r ib
VPSHUFHW imm8, xmm2/m128, xmm1 = VEX.128.F3.0F.WIG 70 /r ib
VPSRLD imm8, xmm2, xmm1 = VEX.NDD.128.66.0F.WIG 72 /2 ib
VPSLLDQ imm8, xmm2, xmm1 = VEX.NDD.128.66.0F.WIG 73 /7 ib
VSHUFPS imm8, xmm3/m128, xmm2, xmm1, xmm2
VPMULLW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG D5 /r
VPSUBUSB xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG D8 /r
VPANDN r/m, rV, r ::: r = rV & ~r/m (is that correct, re the ~ ?)
VPADDUSB xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG DC /r
VPADDUSW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG DD /r
VPMULHUW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG E4 /r
Vex r2351 adds support for the following:

VPUNPCKLDQ r/m, rV, r ::: r = interleave-lo-dwords(rV, r/m)
VPACKSSDW r/m, rV, r ::: r = QNarrowBin32Sto16Sx8(rV, r/m)
VPUNPCKLQDQ r/m, rV, r ::: r = interleave-lo-64bitses(rV, r/m)
VPSRLW imm8, xmm2, xmm1 = VEX.NDD.128.66.0F.WIG 71 /2 ib
VPADDW r/m, rV, r ::: r = rV + r/m
VPINSRD r32/m32, xmm2, xmm1 = VEX.NDS.128.66.0F3A.W0 22 /r ib
Using VEX r2358, my Qt style (Oxygen) fails with this inside QRadialFetchSimd<QSimdSse2>::fetch(unsigned int*, unsigned int*, Operator const*, QSpanData const*, double, double, double, double, double) (xmmintrin.h:352):

vex amd64->IR: unhandled instruction bytes: 0xC5 0x70 0xC2 0xEB 0x1 0xC4 0xC1 0x60
vex amd64->IR: REX=0 REX.W=0 REX.R=1 REX.X=0 REX.B=0
vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x1 ESC=0F
vex amd64->IR: PFX.66=0 PFX.F2=0 PFX.F3=0
==105358== valgrind: Unrecognised instruction at address 0x6fe1d3e.
==105358==    at 0x6FE1D3E: QRadialFetchSimd<QSimdSse2>::fetch(unsigned int*, unsigned int*, Operator const*, QSpanData const*, double, double, double, double, double) (xmmintrin.h:352)

gdb calls it VXORPS:

=> 0x00007ffff51f6d02 <+258>:  vxorps %xmm1,%xmm1,%xmm1

The disassembly of that function shows a few more instructions which -- if I interpret these lists correctly -- aren't implemented yet:

   0x00007ffff51f6d0b <+267>:  vmovaps 0x80(%rsp),%xmm3
   0x00007ffff51f6d3e <+318>:  vcmpltps %xmm3,%xmm1,%xmm13
   0x00007ffff51f6d48 <+328>:  vsqrtps %xmm11,%xmm11
   0x00007ffff51f6f49 <+841>:  vcvttps2dq %xmm11,%xmm11
I am also seeing the vxorps operation, with "unhandled bytes"

  0xc5 0xf8 0x57 0xc0 0xc5 0xfa 0x11 0x44

The odder issue I am seeing is with another program. I am not sure if this is because of the new processor stuff, or just some incompatibility with the Ubuntu 12.04 version -- or perhaps an error which I introduced while compiling on Ubuntu 12.04.

Eero

The interesting parts of the Valgrind/memcheck report follow:

assumed next %rip = 0x4168E2
actual next %rip = 0x4168E3
vex: the `impossible' happened:
   disInstr_AMD64: disInstr miscalculated next %rip
vex storage: T total 370065048 bytes allocated
vex storage: P total 960 bytes allocated

valgrind: the 'impossible' happened: LibVEX called failure_exit().
==7247==    at 0x38044186: report_and_quit (m_libcassert.c:210)
==7247==    by 0x380441E3: panic (m_libcassert.c:294)
==7247==    by 0x38044398: vgPlain_core_panic_at (m_libcassert.c:299)
==7247==    by 0x380443AA: vgPlain_core_panic (m_libcassert.c:304)
==7247==    by 0x3805B0A2: failure_exit (m_translate.c:700)
==7247==    by 0x380DCA28: vpanic (main_util.c:226)
==7247==    by 0x38144E8B: disInstr_AMD64 (guest_amd64_toIR.c:19144)
==7247==    by 0x380ED4E7: bb_to_IR (guest_generic_bb_to_IR.c:305)
==7247==    by 0x380DB1E0: LibVEX_Translate (main_main.c:508)
==7247==    by 0x3805D302: vgPlain_translate (m_translate.c:1535)
==7247==    by 0x38086CB8: vgPlain_scheduler (scheduler.c:901)
==7247==    by 0x38096CE5: run_a_thread_NORETURN (syswrap-linux.c:98)
VEX r2365 adds support for the following:

VMOVAPD ymm1, ymm2/m256 = VEX.256.66.0F.WIG 29 /r
VMOVAPS ymm1, ymm2/m256 = VEX.256.0F.WIG 29 /r
VPADDQ r/m, rV, r ::: r = rV + r/m
VPSUBW r/m, rV, r ::: r = rV - r/m
VPSUBQ = VEX.NDS.128.66.0F.WIG FB /r
VPINSRQ r64/m64, xmm2, xmm1 = VEX.NDS.128.66.0F3A.W1 22 /r ib
The next step:

vex amd64->IR: unhandled instruction bytes: 0xC4 0x41 0x69 0xC4 0xC9 0x0 0xC4 0x41
vex amd64->IR: REX=0 REX.W=0 REX.R=1 REX.X=0 REX.B=1
vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x2 ESC=0F
vex amd64->IR: PFX.66=1 PFX.F2=0 PFX.F3=0
==21651== valgrind: Unrecognised instruction at address 0x46b2dd

  46b2dd:  c4 41 69 c4 c9 00    vpinsrw $0x0,%r9d,%xmm2,%xmm9
  46b2e3:  c4 41 7b 2c f7       vcvttsd2si %xmm15,%r14d
  46b2e8:  c4 c1 31 c4 c6 01    vpinsrw $0x1,%r14d,%xmm9,%xmm0
vex amd64->IR: unhandled instruction bytes: 0xC5 0xF9 0xFC 0xC1 0xC5 0xF9 0x7F 0x1
vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR: PFX.66=1 PFX.F2=0 PFX.F3=0
==14795== valgrind: Unrecognised instruction at address 0x5e3a250.

  1cc440:  c5 f9 fc c1    vpaddb %xmm1,%xmm0,%xmm0
vex amd64->IR: unhandled instruction bytes: 0xC4 0xE2 0x79 0x18 0x64 0x24 0x54 0x48
vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F38
vex amd64->IR: PFX.66=1 PFX.F2=0 PFX.F3=0

  68056:  c4 e2 79 18 64 24 54    vbroadcastss 0x54(%rsp),%xmm4
What kind of gcc compiled that firefox, that even the most basic instructions like vaddpd or vmulpd (256-bit) don't show up in it? Even as simple a testcase as the following, with -O3 -mavx, doesn't work:

double a[2048], b[1024], c[1024];

int
main ()
{
  int i;
  for (i = 0; i < 1024; i++)
    a[i] = b[i] + c[i];
  for (i = 0; i < 1024; i++)
    a[i + 1024] = b[i] * c[i];
  asm volatile ("" : : : "memory");
  return 0;
}

I'd help with adding support for some AVX instructions, but I think it would be better if Julian converted a few such insns first, to select a style. Say for VADDPS, do you want to duplicate everything, like:

   /* VADDPS xmm3/m128, xmm2, xmm1 = VEX.NDS.128.0F.WIG 58 /r */
   if (haveNo66noF2noF3(pfx) && 0==getVexL(pfx)/*128*/) {
      delta = dis_AVX128_E_V_to_G( uses_vvvv, vbi, pfx, delta,
                                   "vaddps", Iop_Add32Fx4 );
      goto decode_success;
   }
   /* VADDPS ymm3/m128, ymm2, ymm1 = VEX.NDS.256.0F.WIG 58 /r */
   if (haveNo66noF2noF3(pfx) && 1==getVexL(pfx)/*256*/) {
      delta = dis_AVX256_E_V_to_G( uses_vvvv, vbi, pfx, delta,
                                   "vaddps", Iop_Add32Fx8 );
      goto decode_success;
   }

or say have some function to which you pass both the Iop_Add32Fx4 and Iop_Add32Fx8 arguments, and which will DTRT based on getVexL(pfx)?

When trying the gcc avx tests with valgrind, everything stops immediately because CPUID under valgrind doesn't indicate AVX support (nor even SSE3+ support); when that is hacked around, the next crash is on the detection of AVX OS support (the xgetbv instruction, which makes valgrind panic), and when even this is hacked around, it fails on most AVX insns.
(In reply to comment #93)
> What kind of gcc compiled that firefox that even the most basic instructions
> like vaddpd or vmulpd (256-bit) don't show up in it?

I concentrated first on adding support for "gcc-4.7.0 -mavx -g -O2" code. vaddpd etc. only appear when gcc is vectorizing, at -O3. I have been working through the -O3 output the past couple of days now.

> Say for VADDPS, do you want to duplicate everything like:

Yes, there is some level of duplication. It's difficult to get rid of entirely. I do what I can, but the primary emphasis is to ensure that what is implemented is correct.
Created attachment 71767 [details] P

VEX r2379 added support for these:

VMOVUPD xmm2/m128, xmm1 = VEX.128.66.0F.WIG 10 /r
VMOVUPS xmm2/m128, xmm1 = VEX.128.0F.WIG 10 /r
VUNPCKHPS xmm3/m128, xmm2, xmm1 = VEX.NDS.128.0F.WIG 15 /r
VUNPCKLPD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG 14 /r
VUNPCKHPD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG 15 /r
VADDPS ymm3/m256, ymm2, ymm1 = VEX.NDS.256.0F.WIG 58 /r
VADDPD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG 58 /r
VADDPD ymm3/m256, ymm2, ymm1 = VEX.NDS.256.66.0F.WIG 58 /r
VMULPS ymm3/m256, ymm2, ymm1 = VEX.NDS.256.0F.WIG 59 /r
VMULPD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG 59 /r
VMULPD ymm3/m256, ymm2, ymm1 = VEX.NDS.256.66.0F.WIG 59 /r
VSUBPS ymm3/m256, ymm2, ymm1 = VEX.NDS.256.0F.WIG 5C /r
VSUBPD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG 5C /r
VSUBPD ymm3/m256, ymm2, ymm1 = VEX.NDS.256.66.0F.WIG 5C /r
VDIVPS ymm3/m256, ymm2, ymm1 = VEX.NDS.256.0F.WIG 5E /r
VDIVPD ymm3/m256, ymm2, ymm1 = VEX.NDS.256.66.0F.WIG 5E /r
VPSRLQ imm8, xmm2, xmm1 = VEX.NDD.128.66.0F.WIG 73 /2 ib
VPCMPEQQ = VEX.NDS.128.66.0F38.WIG 29 /r
VPCMPGTQ = VEX.NDS.128.66.0F38.WIG 37 /r
VPEXTRQ = VEX.128.66.0F3A.W1 16 /r ib

r2380 added support for these:

VPSLLQ imm8, xmm2, xmm1 = VEX.NDD.128.66.0F.WIG 73 /6 ib
VPEXTRW imm8, xmm1, reg32 = VEX.128.66.0F.W0 C5 /r ib
VPMINUB xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG DA /r
VPMAXUB xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG DE /r
VPMINSW r/m, rV, r ::: r = min-signed16s(rV, r/m)
VPMAXSW r/m, rV, r ::: r = max-signed16s(rV, r/m)
VPMULUDQ xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG F4 /r
VPMINSB r/m, rV, r ::: r = min-signed-8s(rV, r/m)
VPMINUW r/m, rV, r ::: r = min-unsigned-16s(rV, r/m)
VPMINUD r/m, rV, r ::: r = min-unsigned-32s(rV, r/m)
VPMAXSB r/m, rV, r ::: r = max-signed-8s(rV, r/m)
VPMAXUW r/m, rV, r ::: r = max-unsigned-16s(rV, r/m)
VPMAXUD r/m, rV, r ::: r = max-unsigned-32s(rV, r/m)
VPMULLD r/m, rV, r ::: r = mul-32s(rV, r/m)
VPHMINPOSUW xmm2/m128, xmm1 = VEX.128.66.0F38.WIG 41 /r
VPERMILPD imm8, ymm2/m256, ymm1 = VEX.256.66.0F3A.W0 05 /r ib
VPERMILPD imm8, xmm2/m128, xmm1 = VEX.128.66.0F3A.W0 05 /r ib
VPERM2F128 imm8, ymm3/m256, ymm2, ymm1 = VEX.NDS.66.0F3A.W0 06 /r ib
VPEXTRB imm8, xmm2, reg/m8 = VEX.128.66.0F3A.W0 14 /r ib

Note that although programs run ok with --tool=none, 256-bit vector arithmetic (vaddpd etc.) will still crash Memcheck, because the instrumentation stuff for 256-bit vectors is incomplete.
Created attachment 71769 [details] Q

On Tue, Jun 12, 2012 at 03:07:54PM +0000, Julian Seward wrote:
> Note that although programs run ok with --tool=none, 256 bit vector
> arithmetic (vaddpd etc) will still crash Memcheck, because the
> instrumentation stuff for 256 bit vectors is incomplete.

Quick patch to implement 128-bit VDIVPS/VDIVPD and 256-bit VMOVAPS.

Testcase (to be tested with gcc -O3 -mavx and gcc -O3 -mavx -mprefer-avx128):

#define N 1024
double a[N], b[N], c[N];
float d[N], e[N], f[N];

int
main ()
{
  int i;
  for (i = 0; i < N; i++)
    {
      b[i] = i + 1.0;
      c[i] = i + 2.0;
      d[i] = i + 1.0;
      e[i] = i + 2.0;
      asm ("");
    }
  asm volatile ("");
  for (i = 0; i < N; i++)
    {
      a[i] = b[i] / c[i];
      d[i] = e[i] / f[i];
    }
  asm volatile ("");
  for (i = 0; i < N; i++)
    if (a[i] != b[i] / c[i] || d[i] != e[i] / f[i])
      __builtin_abort ();
  return 0;
}
On Tue, Jun 12, 2012 at 03:46:26PM +0000, Jakub Jelinek wrote:
> Quick patch to implement 128-bit VDIVPS/VDIVPD and 256-bit VMOVAPS.

And here is another, to implement V{AND,ANDN,OR,XOR}P{S,D} 256-bit.

Testcase (to be tested with gcc -O3 -mavx and gcc -O3 -mavx -DINTRIN -fno-strict-aliasing; the former is just autovectorization, the latter uses intrinsics):

#include <x86intrin.h>

#define N 1024
long a[N], b[N], c[N];
int d[N], e[N], f[N];

int
main ()
{
  int i;
  for (i = 0; i < N; i++)
    {
      b[i] = i * 0x123456789abcdefUL;
      c[i] = i * 0xfedcba987654321UL;
      d[i] = i * 0x1234567;
      e[i] = i + 0x7654321;
      asm ("");
    }
  asm volatile ("");
#ifdef INTRIN
  double *ad = (double *) &a[0], *bd = (double *) &b[0], *cd = (double *) &c[0];
  for (i = 0; i < N; i += 4)
    _mm256_store_pd (ad + i, _mm256_and_pd (_mm256_load_pd (bd + i),
                                            _mm256_load_pd (cd + i)));
  float *dd = (float *) &d[0], *ed = (float *) &e[0], *fd = (float *) &f[0];
  for (i = 0; i < N; i += 8)
    _mm256_store_ps (dd + i, _mm256_and_ps (_mm256_load_ps (ed + i),
                                            _mm256_load_ps (fd + i)));
#else
  for (i = 0; i < N; i++)
    {
      a[i] = b[i] & c[i];
      d[i] = e[i] & f[i];
    }
#endif
  asm volatile ("");
  for (i = 0; i < N; i++)
    if (a[i] != (b[i] & c[i]) || d[i] != (e[i] & f[i]))
      __builtin_abort ();
  asm volatile ("");
#ifdef INTRIN
  for (i = 0; i < N; i += 4)
    _mm256_store_pd (ad + i, _mm256_andnot_pd (_mm256_load_pd (cd + i),
                                               _mm256_load_pd (bd + i)));
  for (i = 0; i < N; i += 8)
    _mm256_store_ps (dd + i, _mm256_andnot_ps (_mm256_load_ps (fd + i),
                                               _mm256_load_ps (ed + i)));
#else
  for (i = 0; i < N; i++)
    {
      a[i] = b[i] & ~c[i];
      d[i] = e[i] & ~f[i];
    }
#endif
  asm volatile ("");
  for (i = 0; i < N; i++)
    if (a[i] != (b[i] & ~c[i]) || d[i] != (e[i] & ~f[i]))
      __builtin_abort ();
  asm volatile ("");
#ifdef INTRIN
  for (i = 0; i < N; i += 4)
    _mm256_store_pd (ad + i, _mm256_or_pd (_mm256_load_pd (bd + i),
                                           _mm256_load_pd (cd + i)));
  for (i = 0; i < N; i += 8)
    _mm256_store_ps (dd + i, _mm256_or_ps (_mm256_load_ps (ed + i),
                                           _mm256_load_ps (fd + i)));
#else
  for (i = 0; i < N; i++)
    {
      a[i] = b[i] | c[i];
      d[i] = e[i] | f[i];
    }
#endif
  asm volatile ("");
  for (i = 0; i < N; i++)
    if (a[i] != (b[i] | c[i]) || d[i] != (e[i] | f[i]))
      __builtin_abort ();
  asm volatile ("");
#ifdef INTRIN
  for (i = 0; i < N; i += 4)
    _mm256_store_pd (ad + i, _mm256_xor_pd (_mm256_load_pd (bd + i),
                                            _mm256_load_pd (cd + i)));
  for (i = 0; i < N; i += 8)
    _mm256_store_ps (dd + i, _mm256_xor_ps (_mm256_load_ps (ed + i),
                                            _mm256_load_ps (fd + i)));
#else
  for (i = 0; i < N; i++)
    {
      a[i] = b[i] ^ c[i];
      d[i] = e[i] ^ f[i];
    }
#endif
  asm volatile ("");
  for (i = 0; i < N; i++)
    if (a[i] != (b[i] ^ c[i]) || d[i] != (e[i] ^ f[i]))
      __builtin_abort ();
  return 0;
}
I'll hold off on further changes now, to see if that is the desired way to do it.
In the #c97 testcase, please replace:

  d[i] = i * 0x1234567;
  e[i] = i + 0x7654321;

with

  d[i] = i * 0x1234567U;
  e[i] = i * 0x7654321U;

(without the U suffix, gcc 4.8 optimizes the first loop into an endless loop, due to the undefined signed overflow in it).
r2381 adds support for:

VUNPCKLPS ymm3/m256, ymm2, ymm1 = VEX.NDS.256.0F.WIG 14 /r
VUNPCKHPS ymm3/m256, ymm2, ymm1 = VEX.NDS.256.0F.WIG 15 /r
VUNPCKLPD ymm3/m256, ymm2, ymm1 = VEX.NDS.256.66.0F.WIG 14 /r
VUNPCKHPD ymm3/m256, ymm2, ymm1 = VEX.NDS.256.66.0F.WIG 15 /r
VANDPD r/m, rV, r ::: r = rV & r/m (256 bit)
VXORPD r/m, rV, r ::: r = rV ^ r/m (256 bit)
VDIVPD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG 5E /r
VCMPPD xmm3/m64(E=argL), xmm2(V=argR), xmm1(G)
VSHUFPS imm8, ymm3/m256, ymm2, ymm1, ymm2
VCVTDQ2PD xmm2/m64, xmm1 = VEX.128.F3.0F.WIG E6 /r
VBROADCASTSD m64, ymm1 = VEX.256.66.0F38.W0 19 /r

This gives at least moderately usable coverage for code created by "gcc-4.7.0 -mavx -O3". Memcheck will still fail, though.
(In reply to comment #83)
> vex amd64->IR: unhandled instruction bytes: 0xC5 0xF9 0x71 0xD1 0x8 0xC5
> 0xF1 0xDB
> vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
> vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
> vex amd64->IR: PFX.66=1 PFX.F2=0 PFX.F3=0
> ==13668== valgrind: Unrecognised instruction at address 0x874f086.
> ==13668==    at 0x874F086: _Z31comp_func_solid_SourceOver_sse2Pjijj+326 (emmintrin.h:1167)
> ==13668==    by 0x893CAB2: _ZL19blend_color_genericiPK11QT_FT_Span_Pv+306 (qdrawhelper.cpp:3316)
> ==13668==    by 0x894A4BF: gray_convert_glyph+1631 (qgrayraster.c:1756)
> ==13668==    by 0x891491A: _ZN25QRasterPaintEnginePrivate9rasterizeEP14QT_FT_Outline_PFviPK11QT_FT_Span_PvES5_P13QRasterBuffer.part.114+474 (qpaintengine_raster.cpp:3834)

This now became:

vex amd64->IR: unhandled instruction bytes: 0xC5 0xC9 0xFC 0xC0 0xC4 0xC1 0x79 0x7F
vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x6 ESC=0F
vex amd64->IR: PFX.66=1 PFX.F2=0 PFX.F3=0
==3182== valgrind: Unrecognised instruction at address 0x89630be.
==3182==    at 0x89630BE: comp_func_solid_SourceOver_sse2(unsigned int*, int, unsigned int, unsigned int) (emmintrin.h:992)
==3182==    by 0x8B50AB2: blend_color_generic(int, QT_FT_Span_ const*, void*) (qdrawhelper.cpp:3316)
==3182==    by 0x8B5E4BF: gray_convert_glyph (qgrayraster.c:1756)

emmintrin.h comes with gcc-4.6.3. Line 992:

extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_epi8 (__m128i __A, __m128i __B)
{
  return (__m128i)__builtin_ia32_paddb128 ((__v16qi)__A, (__v16qi)__B);  /// <- :992
}

Line 1167:

extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_srli_epi16 (__m128i __A, int __B)
{
  return (__m128i)__builtin_ia32_psrlwi128 ((__v8hi)__A, __B);  /// <- :1167
}
r2382 adds support for:

VMOVHPD m64, xmm1, xmm2 = VEX.NDS.128.66.0F.WIG 16 /r
VMOVAPS ymm2/m256, ymm1 = VEX.256.0F.WIG 28 /r
VCVTPD2PS ymm2/m256, xmm1 = VEX.256.66.0F.WIG 5A /r
VPUNPCKHDQ = VEX.NDS.128.66.0F.WIG 6A /r
VPCMPEQW r/m, rV, r ::: r = rV `eq-by-16s` r/m
VPSUBUSW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG D9 /r
VCVTDQ2PD xmm2/m128, ymm1 = VEX.256.F3.0F.WIG E6 /r
VPADDB r/m, rV, r ::: r = rV + r/m
VBROADCASTSS m32, xmm1 = VEX.128.66.0F38.W0 18 /r
VPMOVSXBW xmm2/m64, xmm1
VPMOVSXWD xmm2/m64, xmm1
VPMOVSXDQ xmm2/m64, xmm1
(In reply to comment #102)
> r2382

Thx! Now it's:

vex amd64->IR: unhandled instruction bytes: 0xC5 0x70 0xC2 0xDA 0x1 0xC5 0xE8 0x58
vex amd64->IR: REX=0 REX.W=0 REX.R=1 REX.X=0 REX.B=0
vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x1 ESC=0F
vex amd64->IR: PFX.66=0 PFX.F2=0 PFX.F3=0
==19107== valgrind: Unrecognised instruction at address 0x89638d6.
==19107==    at 0x89638D6: unsigned int const* qt_fetch_radial_gradient_template<QRadialFetchSimd<QSimdSse2> >(unsigned int*, Operator const*, QSpanData const*, int, int, int) (xmmintrin.h:352)
==19107==    by 0x8B59D55: void handleSpans<BlendSrcGeneric<(SpanMethod)0> >(int, QT_FT_Span_ const*, QSpanData const*, BlendSrcGeneric<(SpanMethod)0>&) (qdrawhelper.cpp:3575)
==19107==    by 0x8B540E1: void blend_src_generic<(SpanMethod)0>(int, QT_FT_Span_ const*, void*) (qdrawhelper.cpp:3599)

extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_cmpgt_ps (__m128 __A, __m128 __B)
{
  return (__m128) __builtin_ia32_cmpgtps ((__v4sf)__A, (__v4sf)__B);
}

Sidenote for the interested: I was surprised that supposedly unknown SSE2 instructions kept showing up. But today I read that -mavx makes the compiler emit those instructions in their VEX-encoded (AVX) forms. :)
> AFAICS, the only not-implemented one so far reported (apart from > Josef's vector ones) are vstmxcsr/vldmxcsr, in comment #52. I don't > have a good way to test those. I'll get to them though. As these instructions still seem to crash my session, here is a minimal test case (in case it is needed). It "works" for me when compiling with -march=native on my i7 processor. Eero #include <xmmintrin.h> int main() { _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON); return 0; }
(In reply to comment #97) > And here is another, to implement V{AND,ANDN,OR,XOR}P{S,D} 256-bit. Committed as r2383, thanks. Patch is fine. Only comment is, ideally you could add test cases to none/tests/amd64/avx-1.c (very easy to do) as you go along.
*** Bug 301967 has been marked as a duplicate of this bug. ***
r2384 adds support for VSTMXCSR m32 = VEX.LZ.0F.WIG AE /3 VLDMXCSR m32 = VEX.LZ.0F.WIG AE /2 0F 01 D0 = XGETBV VPUNPCKHQDQ r/m, rV, r ::: r = interleave-hi-64bitses(rV, r/m) VPSRAW imm8, xmm2, xmm1 = VEX.NDD.128.66.0F.WIG 71 /4 ib VPMULHW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG E5 /r VPERMILPS imm8, ymm2/m256, ymm1 = VEX.256.66.0F3A.W0 04 /r ib It also adds a CPUID emulation that announces AVX support, although this is currently disabled.
Thank you for your work on this problem, and with Valgrind generally! With the latest update I got further; now I am seeing: vex amd64->IR: unhandled instruction bytes: 0xC5 0xFA 0x51 0xC1 0xC5 0xF8 0x2E 0xC0 gdb disas points to: 0x00000000005dd3b9 <+25>: c5 fa 51 c1 vsqrtss %xmm1,%xmm0,%xmm0 Eero
Created attachment 71858 [details] VPALIGNR and VBROADCASTSS (256-bit) support VPALIGNR and VBROADCASTSS patch. No test adjustments yet; I also started hacking on VPINSRW support, but ran out of time for today. Posting here primarily to avoid work duplication.
Created attachment 71909 [details] avx.patch On top of the previous patch, support for further mainly permutation/shuffle insns, which came up during cd /usr/src/gcc/gcc/testsuite/gcc.dg/torture; for i in vshuf*.c; do \ gcc -O2 -mavx -DEXPENSIVE -o /tmp/n $i -lm; \ VALGRIND_LIB=/usr/src/valgrind/.in_place/ /usr/src/valgrind/coregrind/valgrind \ --tool=none --quiet /tmp/n || echo $i $?; \ done testing (still not all insns covered yet, but several tests already pass). Additionally tested with the above command with -mss{e4,se3,e3,e2} instead of -mavx. This adds support for 3 operand VMOVSD, VMOVLPD, VPINSRW, VSHUFPD (128 and 256), VPERMILPS, VBLENDPS (128 and 256), VBLENDPD (128 and 256), VPBLENDW and VPINSRB.
(In reply to comment #109) > Created attachment 71858 [details] (In reply to comment #100) > Created attachment 71909 [details] Committed, r2386, r2387. Thanks!
For that gcc.dg/torture/ set of tests, the only remaining unsupported insn is probably VPERMILPS (the variable version; though it would be nice to also support the VPERMILPD variable version). Not sure how those would be best implemented: new IR opcodes, or perhaps new IR opcodes plus some masking on the selection vector. Julian, could you look at that? In the meantime I'll look at some gcc.target/i386/ unsupported insns...
r2388 adds support for: VMOVUPD ymm1, ymm2/m256 = VEX.256.66.0F.WIG 11 /r VMOVUPS ymm1, ymm2/m256 = VEX.256.0F.WIG 11 /r VCOMISD xmm2/m64, xmm1 = VEX.LIG.66.0F.WIG 2F /r VPCMPGTD r/m, rV, r ::: r = rV `>s-by-32s` r/m VPMOVSXBD xmm2/m32, xmm1 VPMOVZXBD xmm2/m32, xmm1 VDPPD xmm3/m128,xmm2,xmm1 = VEX.NDS.128.66.0F3A.WIG 41 /r ib
(In reply to comment #112) > Julian, could you look at that? Will do.
Created attachment 71917 [details] avx2.patch This patch adds: VMOVUPS ymm2/m256, ymm1 = VEX.256.0F.WIG 10 /r VSQRTSS xmm3/m64(E), xmm2(V), xmm1(G) = VEX.NDS.LIG.F3.0F.WIG 51 /r VSQRTPS xmm2/m128(E), xmm1(G) = VEX.NDS.128.0F.WIG 51 /r VSQRTPS ymm2/m256(E), ymm1(G) = VEX.NDS.256.0F.WIG 51 /r VSQRTPD xmm2/m128(E), xmm1(G) = VEX.NDS.128.66.0F.WIG 51 /r VSQRTPD ymm2/m256(E), ymm1(G) = VEX.NDS.256.66.0F.WIG 51 /r VRSQRTSS xmm3/m64(E), xmm2(V), xmm1(G) = VEX.NDS.LIG.F3.0F.WIG 52 /r VRSQRTPS xmm2/m128(E), xmm1(G) = VEX.NDS.128.0F.WIG 52 /r VRSQRTPS ymm2/m256(E), ymm1(G) = VEX.NDS.256.0F.WIG 52 /r VZEROALL = VEX.256.0F.WIG 77 VMOVDQU ymm1, ymm2/m256 = VEX.256.F3.0F.WIG 7F found during gcc.target/i386/ testing (further changes to come).
Created attachment 71922 [details] avx-3.patch This patch adds: VCVTPS2PD xmm2/m128, ymm1 = VEX.256.0F.WIG 5A /r VCVTPS2DQ ymm2/m256, ymm1 = VEX.256.66.0F.WIG 5B /r VCVTTPS2DQ xmm2/m128, xmm1 = VEX.128.F3.0F.WIG 5B /r VCVTTPS2DQ ymm2/m256, ymm1 = VEX.256.F3.0F.WIG 5B /r VCVTDQ2PS xmm2/m128, xmm1 = VEX.128.0F.WIG 5B /r VCVTDQ2PS ymm2/m256, ymm1 = VEX.256.0F.WIG 5B /r VCVTTPD2DQ xmm2/m128, xmm1 = VEX.128.66.0F.WIG E6 /r VCVTTPD2DQ ymm2/m256, xmm1 = VEX.256.66.0F.WIG E6 /r VCVTPD2DQ xmm2/m128, xmm1 = VEX.128.F2.0F.WIG E6 /r VCVTPD2DQ ymm2/m256, xmm1 = VEX.256.F2.0F.WIG E6 /r (again, from gcc.target/i386/ avx-*.c tests).
(In reply to comment #115) > Created attachment 71917 [details] avx2.patch (In reply to comment #116) Created attachment 71922 [details] avx-3.patch Committed w/ minor fixes, r2390. Thanks!
(In reply to comment #112) > is probably VPERMILPS (the variable version; though it would be nice to also > support VPERMILPD variable version). Not sure how those would be best > implemented, new IR opcodes, or perhaps new IR opcodes plus some masking on > the selection vector. I can't think of a really simple and efficient way to do this. My least-worst suggestion, for VPERMILPS_128(vec, ctrl), where vec and ctrl are the data and control vectors respectively, is to generate this expression ctrl_clean = ShrN32x4(ShlN32x4(ctrl, 30), 30) This zeroes the top 30 bits of each control vector lane, which is important for avoiding false positives in Memcheck. Then add a new primop, Iop_Perm32x4 (put it next to existing Perm8x16). Implement this in the back end using a function call, which is handled via "do_SseAssistedBinary" in iselVecExpr_wrk. Add the relevant helper function in host_generic_simd128.c, in the style of eg h_generic_calc_Mul32x4. Overall result is then Perm32x4(vec, ctrl_clean) To be a bit cleverer, if we know the host actually supports AVX, we could implement Perm32x4 using VPERMILPS, but that means adding a new case for AMD64Instr, which sounds more complexity than it is worth for this one case.
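For anyone following along, the lowering described above can be sketched in plain C. This is illustrative only: `clean_ctrl32x4` and `calc_Perm32x4` are made-up names written in the style of the helpers in host_generic_simd128.c, not actual VEX code.

```c
#include <stdint.h>

/* Model of ctrl_clean = ShrN32x4(ShlN32x4(ctrl, 30), 30): keep only the
   low 2 bits of each 32-bit control lane, so that (for Memcheck) the
   undefined upper bits of the control vector can never affect the result. */
static void clean_ctrl32x4(uint32_t ctrl[4])
{
    for (int i = 0; i < 4; i++)
        ctrl[i] = (ctrl[i] << 30) >> 30;   /* same as ctrl[i] & 3 */
}

/* Helper-style model of the proposed Perm32x4 primop, i.e. the semantics
   of the variable-control 128-bit VPERMILPS: each result lane selects one
   of the four source lanes. */
static void calc_Perm32x4(uint32_t res[4],
                          const uint32_t vec[4], uint32_t ctrl[4])
{
    clean_ctrl32x4(ctrl);
    for (int i = 0; i < 4; i++)
        res[i] = vec[ctrl[i]];
}
```

With this shape, the backend only ever sees a two-operand vector op, which is why it can fall back to a "do_SseAssistedBinary"-style helper call on non-AVX hosts.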
(I have two programs which I would like to run with Valgrind, after the latest patch the other one seems to be OK, great!) From some code written with xmmintrin.h: vex amd64->IR: unhandled instruction bytes: 0xC5 0xFA 0x12 0xFB 0xC5 0xFA 0x16 0xDB vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0 vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F vex amd64->IR: PFX.66=0 PFX.F2=0 PFX.F3=1 ==6596== valgrind: Unrecognised instruction at address 0x51585c. gdb disasm: 0x000000000051585c <+140>: c5 fa 12 fb vmovsldup %xmm3,%xmm7 0x0000000000515860 <+144>: c5 fa 16 db vmovshdup %xmm3,%xmm3 0x0000000000515864 <+148>: c5 e0 58 df vaddps %xmm7,%xmm3,%xmm3 Do you need a test case to be compiled/run ? Eero
Another one, a not so uncommon instruction: vex amd64->IR: unhandled instruction bytes: 0xC4 0x41 0x12 0x10 0xF4 0xC4 0x41 0x79 vex amd64->IR: REX=0 REX.W=0 REX.R=1 REX.X=0 REX.B=1 vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0xD ESC=0F vex amd64->IR: PFX.66=0 PFX.F2=0 PFX.F3=1 ==10414== valgrind: Unrecognised instruction at address 0x46b17e. 46b17e: c4 41 12 10 f4 vmovss %xmm12,%xmm13,%xmm14 This was generated from a _mm_insert_epi64x(vector, value, 0), so gcc without any intrinsics will probably not do this.
Created attachment 71936 [details] VMOVS[LH]DUP patch This patch adds VMOVSLDUP xmm2/m128, xmm1 = VEX.NDS.128.F3.0F.WIG 12 /r VMOVSLDUP ymm2/m256, ymm1 = VEX.NDS.256.F3.0F.WIG 12 /r VMOVSHDUP xmm2/m128, xmm1 = VEX.NDS.128.F3.0F.WIG 16 /r VMOVSHDUP ymm2/m256, ymm1 = VEX.NDS.256.F3.0F.WIG 16 /r
Created attachment 71937 [details] VMOVSS patch VMOVSS xmm3, xmm2, xmm1 = VEX.LIG.F3.0F.WIG 10 /r VMOVSS xmm3, xmm2, xmm1 = VEX.LIG.F3.0F.WIG 11 /r
That was fast. The next one is evil. vex amd64->IR: unhandled instruction bytes: 0xC4 0x42 0x29 0xB 0xD9 0xC4 0xC1 0x79 vex amd64->IR: REX=0 REX.W=0 REX.R=1 REX.X=0 REX.B=1 vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0xA ESC=0F38 vex amd64->IR: PFX.66=1 PFX.F2=0 PFX.F3=0 ==9622== valgrind: Unrecognised instruction at address 0x441de6. ==9622== at 0x441DE6: Eval::Init::material() (tmmintrin.h:124) ==9622== by 0x45809A: Eval::Init::setEvalParameters(Parameters const&) (evalinit.cpp:778) ==9622== by 0x46DA40: Eval::init(Parameters const&) (eval.cpp:91) ==9622== by 0x44FD5F: Game::Game(Console*, Parameters const&, unsigned long, unsigned long) (game.cpp:115) ==9622== by 0x468559: Console::init(int&, char**) (console.cpp:78) ==9622== by 0x427D79: main (main.cpp:28) 441de6: c4 42 29 0b d9 vpmulhrsw %xmm9,%xmm10,%xmm11 441deb: c4 c1 79 c5 eb 01 vpextrw $0x1,%xmm11,%ebp 441df1: c4 c1 79 c5 d3 00 vpextrw $0x0,%xmm11,%edx I shouldn't have used this, I knew it would hurt me later :-) It actually does give a nice speedup, but fortunately for valgrind testing I can easily disable it. It comes from this code #ifdef __SSSE3__ // The SSE version loses half a bit of precision, because it rounds first // and then sums up, where the normal code rounds last. __v8hi score16 = _mm_mulhrs_epi16(weights.data, score.data); int16_t s0 = _mm_extract_epi16( score16, 0 ) + _mm_extract_epi16( score16, 1 ); return s0; #else int o = weights.opening(); int e = weights.endgame(); int s = (o*score.opening() + e*score.endgame() + 0x4000) >> 15; ASSERT(s==s0); return s; #endif If I disable the SSSE3 path, the next stop is at vex amd64->IR: unhandled instruction bytes: 0xC4 0xE2 0x79 0x17 0xDA 0x49 0x8B 0x48 vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0 vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F38 vex amd64->IR: PFX.66=1 PFX.F2=0 PFX.F3=0 ==12207== valgrind: Unrecognised instruction at address 0x443236 which is 443236: c4 e2 79 17 da vptest %xmm2,%xmm3
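For reference, one 16-bit lane of (V)PMULHRSW behaves like the scalar sketch below — multiply, add the 0x4000 rounding constant, then take the high half — which is exactly why the SSE path above rounds before summing. This is an illustrative model (`mulhrs16` is a made-up name), and it assumes gcc's arithmetic right shift on signed values.

```c
#include <stdint.h>

/* One lane of PMULHRSW: signed 16x16 multiply, round to nearest at
   bit 14, then keep the high 16 bits of the (shifted) product.
   Assumes >> on a negative int is an arithmetic shift (true for gcc). */
static int16_t mulhrs16(int16_t a, int16_t b)
{
    int32_t t = (int32_t)a * (int32_t)b;   /* full 32-bit product */
    return (int16_t)((t + 0x4000) >> 15);  /* round, then truncate */
}
```

Summing two such lanes rounds twice, while the scalar fallback in the snippet sums first and rounds once — hence the "half a bit of precision" remark.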
Created attachment 71948 [details] VPSRAD and VPSLLW patch VPSLLW imm8, xmm2, xmm1 = VEX.NDD.128.66.0F.WIG 71 /6 ib VPSRAD imm8, xmm2, xmm1 = VEX.NDD.128.66.0F.WIG 72 /4 ib There is still big amount of unsupported insns, gcc.target/i386/avx-*.c tests that still fail because of unsupported insns are: avx-ceilf-sfix-vec.c avx-ceilf-vec.c avx-ceil-sfix-2-vec.c avx-ceil-sfix-vec.c avx-ceil-vec.c avx-cmpss-1.c avx-cond-1.c avx-floorf-sfix-vec.c avx-floorf-vec.c avx-floor-sfix-2-vec.c avx-floor-sfix-vec.c avx-floor-vec.c avx-mul-1.c avx-pr51581-1.c avx-pr51581-2.c avx-rintf-sfix-vec.c avx-rintf-vec.c avx-rint-sfix-2-vec.c avx-rint-sfix-vec.c avx-rint-vec.c avx-vaddsubpd-1.c avx-vaddsubpd-256-1.c avx-vaddsubps-1.c avx-vaddsubps-256-1.c avx-vaesdec-1.c avx-vaesdeclast-1.c avx-vaesenc-1.c avx-vaesenclast-1.c avx-vaesimc-1.c avx-vaeskeygenassist-1.c avx-vblendvpd-256-1.c avx-vblendvps-256-1.c avx-vcmppd-1.c avx-vcmppd-256-1.c avx-vcmpps-1.c avx-vcmpps-256-1.c avx-vcmpsd-1.c avx-vcmpss-1.c avx-vcvtsd2si-1.c avx-vcvtsd2si-2.c avx-vcvtss2si-1.c avx-vcvtss2si-2.c avx-vdpps-1.c avx-vdpps-2.c avx-vextractps-1.c avx-vhaddpd-1.c avx-vhaddpd-256-1.c avx-vhaddps-1.c avx-vhaddps-256-1.c avx-vhsubpd-1.c avx-vhsubpd-256-1.c avx-vhsubps-1.c avx-vhsubps-256-1.c avx-vlddqu-1.c avx-vlddqu-256-1.c avx-vmaskmovdqu.c avx-vmaskmovpd-1.c avx-vmaskmovpd-256-1.c avx-vmaskmovpd-256-2.c avx-vmaskmovpd-2.c avx-vmaskmovps-1.c avx-vmaskmovps-256-1.c avx-vmaskmovps-256-2.c avx-vmaskmovps-2.c avx-vmaxpd-256-1.c avx-vmaxps-256-1.c avx-vminpd-256-1.c avx-vminps-256-1.c avx-vmovhps-1.c avx-vmovhps-2.c avx-vmovmskpd-1.c avx-vmovmskpd-256-1.c avx-vmovmskps-1.c avx-vmovmskps-256-1.c avx-vmovntdq-256-1.c avx-vmovntdqa-1.c avx-vmovntpd-1.c avx-vmovntpd-256-1.c avx-vmovntps-1.c avx-vmovntps-256-1.c avx-vmpsadbw-1.c avx-vpabsb-1.c avx-vpabsw-1.c avx-vpacksswb-1.c avx-vpackusdw-1.c avx-vpaddsb-1.c avx-vpaddsw-1.c avx-vpavgb-1.c avx-vpavgw-1.c avx-vpcmpestri-1.c avx-vpcmpestri-2.c avx-vpcmpestrm-1.c 
avx-vpcmpestrm-2.c avx-vpcmpgtb-1.c avx-vpcmpgtw-1.c avx-vpcmpistri-1.c avx-vpcmpistri-2.c avx-vpcmpistrm-1.c avx-vpcmpistrm-2.c avx-vpermilpd-256-2.c avx-vpermilpd-2.c avx-vpermilps-256-2.c avx-vpermilps-2.c avx-vphaddd-1.c avx-vphaddsw-1.c avx-vphaddw-1.c avx-vphsubd-1.c avx-vphsubsw-1.c avx-vphsubw-1.c avx-vpmaddubsw-1.c avx-vpmovsxbq-1.c avx-vpmovsxwq-1.c avx-vpmovzxbq-1.c avx-vpmovzxdq-1.c avx-vpmovzxwq-1.c avx-vpmuldq-1.c avx-vpmulhrsw-1.c avx-vpsadbw-1.c avx-vpsignb-1.c avx-vpsignd-1.c avx-vpsignw-1.c avx-vpsubsb-1.c avx-vpsubsw-1.c avx-vptest-1.c avx-vptest-256-1.c avx-vptest-256-2.c avx-vptest-256-3.c avx-vptest-2.c avx-vptest-3.c avx-vrcpps-1.c avx-vrcpps-256-1.c avx-vroundpd-1.c avx-vroundpd-256-1.c avx-vroundpd-256-2.c avx-vroundpd-256-3.c avx-vroundpd-2.c avx-vroundpd-3.c avx-vroundps-256-1.c avx-vtestpd-1.c avx-vtestpd-256-1.c avx-vtestpd-256-2.c avx-vtestpd-256-3.c avx-vtestpd-2.c avx-vtestpd-3.c avx-vtestps-1.c avx-vtestps-256-1.c avx-vtestps-256-2.c avx-vtestps-256-3.c avx-vtestps-2.c avx-vtestps-3.c
Created attachment 71949 [details] gcc.target/i386 hack patch Attaching also the hack I'm using to get the above list of failures. cd /usr/src/gcc/gcc/testsuite/gcc.target/i386/; for i in `grep -L 'dg-do.*compile' avx-*.c`; do gcc -O3 -mavx -g -DNEED_IEEE754_DOUBLE -o /tmp/n $i -lm; VALGRIND_LIB=/usr/src/valgrind/.in_place/ /usr/src/valgrind/coregrind/valgrind --tool=none --quiet /tmp/n || echo $i $?; done (with sse[234]*-.c ssse3-*.c also several tests fail).
(In reply to comment #121) > Created attachment 71936 [details] > VMOVS[LH]DUP patch (In reply to comment #122) > Created attachment 71937 [details] > VMOVSS patch (In reply to comment #124) > Created attachment 71948 [details] > VPSRAD and VPSLLW patch Thanks -- all 3 committed together as r2394/r12656.
Created attachment 71980 [details] VPTEST and VTESTP[SD] VTESTPS xmm2/m128, xmm1 = VEX.128.66.0F38.WIG 0E /r VTESTPS ymm2/m256, ymm1 = VEX.256.66.0F38.WIG 0E /r VTESTPD xmm2/m128, xmm1 = VEX.128.66.0F38.WIG 0F /r VTESTPD ymm2/m256, ymm1 = VEX.256.66.0F38.WIG 0F /r VPTEST xmm2/m128, xmm1 = VEX.128.66.0F38.WIG 17 /r VPTEST ymm2/m256, ymm1 = VEX.256.66.0F38.WIG 17 /r
Created attachment 71984 [details] VPERMILP{S,D} fix vshuf-v8sf.c testcase was crashing in valgrind, as VPERMILP{S,D} variable form doesn't have the immediate byte, so needs to pass 0 as extra_bytes.
Created attachment 71990 [details] Variable 128-bit integer shifts and VBLENDVP{S,D} VPSRLW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG D1 /r VPSRLD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG D2 /r VPSRLQ xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG D3 /r VPSRAW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG E1 /r VPSRAD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG E2 /r VPSLLW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG F1 /r VPSLLD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG F2 /r VPSLLQ xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG F3 /r VBLENDVPS xmm4, xmm3/mem128, xmm2, xmm1 = VEX.NDS.128.66.0F3A.WIG 4A /r /is4 VBLENDVPS ymm4, ymm3/mem256, ymm2, ymm1 = VEX.NDS.256.66.0F3A.WIG 4A /r /is4 VBLENDVPD xmm4, xmm3/mem128, xmm2, xmm1 = VEX.NDS.128.66.0F3A.WIG 4B /r /is4 VBLENDVPD ymm4, ymm3/mem256, ymm2, ymm1 = VEX.NDS.256.66.0F3A.WIG 4B /r /is4 and in addition to that fixes out of bound counts for SSE2 PS{LL,RA,RL}{W,D,Q} - the shift count is 64-bit rather than 32-bit.
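The out-of-bound count behaviour that this patch fixes can be modelled per lane roughly as below. This is an illustrative sketch, not Valgrind code: `psrlw_lane`/`psraw_lane` are made-up names, the count is deliberately a 64-bit value, and the signed case assumes an arithmetic right shift as gcc produces on x86.

```c
#include <stdint.h>

/* PSRLW/VPSRLW lane: logical right shift.  The count comes from the
   full low 64 bits of the source operand, and any count >= 16 zeroes
   the lane -- truncating the count to 32 bits would get this wrong. */
static uint16_t psrlw_lane(uint16_t x, uint64_t count)
{
    return count >= 16 ? 0 : (uint16_t)(x >> count);
}

/* PSRAW/VPSRAW lane: arithmetic right shift.  Counts >= 16 behave
   like a shift by 15, i.e. the lane fills with the sign bit. */
static int16_t psraw_lane(int16_t x, uint64_t count)
{
    if (count >= 16)
        count = 15;
    return (int16_t)(x >> (int)count);
}
```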
Created attachment 71998 [details] VROUND* and VPSUBS[BW] VPSUBSB xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG E8 /r VPSUBSW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG E9 /r VROUNDPS imm8, xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F3A.WIG 08 ib VROUNDPS imm8, ymm3/m256, ymm2, ymm1 = VEX.NDS.256.66.0F3A.WIG 08 ib VROUNDPD imm8, xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F3A.WIG 09 ib VROUNDPD imm8, ymm3/m256, ymm2, ymm1 = VEX.NDS.256.66.0F3A.WIG 09 ib VROUNDSS imm8, xmm3/m32, xmm2, xmm1 = VEX.NDS.128.66.0F3A.WIG 0A ib VROUNDSD imm8, xmm3/m64, xmm2, xmm1 = VEX.NDS.128.66.0F3A.WIG 0B ib
(In reply to comment #128) > Created attachment 71984 [details] (In reply to comment #127) > Created attachment 71980 [details] (In reply to comment #129) > Created attachment 71990 [details] (In reply to comment #130) > Created attachment 71998 [details] Committed, revs 2397, 2398, 2399, 2400 respectively. Thanks.
Created attachment 72008 [details] Another set of random AVX insns from gcc.target/i386 VPCMPGTB = VEX.NDS.128.66.0F.WIG 64 /r VPCMPGTW = VEX.NDS.128.66.0F.WIG 65 /r VCMPPD ymm3/m256, ymm2, ymm1 = VEX.NDS.256.66.0F.WIG C2 /r ib VCMPPS xmm3/m128, xmm2, xmm1 = VEX.NDS.128.0F.WIG C2 /r ib VCMPPS ymm3/m256, ymm2, ymm1 = VEX.NDS.256.0F.WIG C2 /r ib VADDSUBPD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG D0 /r VADDSUBPD ymm3/m256, ymm2, ymm1 = VEX.NDS.256.66.0F.WIG D0 /r VADDSUBPS xmm3/m128, xmm2, xmm1 = VEX.NDS.128.F2.0F.WIG D0 /r VADDSUBPS ymm3/m256, ymm2, ymm1 = VEX.NDS.256.F2.0F.WIG D0 /r VPMADDWD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG F5 /r VPMULDQ xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F38.WIG 28 /r
Created attachment 72009 [details] VCMPPD and VCMPPS incremental fix I had missed an avx-cond-1.c abort, which was due to swapping the arguments incorrectly.
Created attachment 72016 [details] Further AVX insns VCVTSD2SI xmm1/m32, r32 = VEX.LIG.F2.0F.W0 2D /r VCVTSD2SI xmm1/m64, r64 = VEX.LIG.F2.0F.W1 2D /r VCVTSS2SI xmm1/m32, r32 = VEX.LIG.F3.0F.W0 2D /r VCVTSS2SI xmm1/m64, r64 = VEX.LIG.F3.0F.W1 2D /r VHADDPS xmm3/m128, xmm2, xmm1 = VEX.NDS.128.F2.0F.WIG 7C /r VHSUBPS xmm3/m128, xmm2, xmm1 = VEX.NDS.128.F2.0F.WIG 7D /r VHADDPS ymm3/m256, ymm2, ymm1 = VEX.NDS.256.F2.0F.WIG 7C /r VHSUBPS ymm3/m256, ymm2, ymm1 = VEX.NDS.256.F2.0F.WIG 7D /r VHADDPD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG 7C /r VHSUBPD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG 7D /r VHADDPD ymm3/m256, ymm2, ymm1 = VEX.NDS.256.66.0F.WIG 7C /r VHSUBPD ymm3/m256, ymm2, ymm1 = VEX.NDS.256.66.0F.WIG 7D /r VLDDQU m256, ymm1 = VEX.256.F2.0F.WIG F0 /r VLDDQU m128, xmm1 = VEX.128.F2.0F.WIG F0 /r VEXTRACTPS imm8, xmm1, r32/m32 = VEX.128.66.0F3A.WIG 17 /r ib VDPPS imm8, xmm3/m128,xmm2,xmm1 = VEX.NDS.128.66.0F3A.WIG 40 /r ib VDPPS imm8, ymm3/m256,ymm2,ymm1 = VEX.NDS.256.66.0F3A.WIG 40 /r ib
Created attachment 72020 [details] Further insns VMOVHPS m64, xmm1, xmm2 = VEX.NDS.128.0F.WIG 16 /r VMOVHPS xmm1, m64 = VEX.128.0F.WIG 17 /r VMOVNTPD xmm1, m128 = VEX.128.66.0F.WIG 2B /r VMOVNTPS xmm1, m128 = VEX.128.0F.WIG 2B /r VMOVNTPD ymm1, m256 = VEX.256.66.0F.WIG 2B /r VMOVNTPS ymm1, m256 = VEX.256.0F.WIG 2B /r VMOVMSKPD xmm2, r32 = VEX.128.66.0F.WIG 50 /r VMOVMSKPD ymm2, r32 = VEX.256.66.0F.WIG 50 /r VMOVMSKPS xmm2, r32 = VEX.128.0F.WIG 50 /r VMOVMSKPS ymm2, r32 = VEX.256.0F.WIG 50 /r VMINPS ymm3/m256, ymm2, ymm1 = VEX.NDS.256.0F.WIG 5D /r VMINPD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG 5D /r VMINPD ymm3/m256, ymm2, ymm1 = VEX.NDS.256.66.0F.WIG 5D /r VMAXPS ymm3/m256, ymm2, ymm1 = VEX.NDS.256.0F.WIG 5F /r VMAXPD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG 5F /r VMAXPD ymm3/m256, ymm2, ymm1 = VEX.NDS.256.66.0F.WIG 5F /r VMOVNTDQ ymm1, m256 = VEX.256.66.0F.WIG E7 /r VMASKMOVDQU xmm2, xmm1 = VEX.128.66.0F.WIG F7 /r VMOVNTDQA m128, xmm1 = VEX.128.66.0F38.WIG 2A /r
Created attachment 72021 [details] Last patch for today VPACKSSWB xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG 63 /r VPAVGB xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG E0 /r VPAVGW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG E3 /r VPADDSB xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG EC /r VPADDSW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG ED /r VPHADDW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F38.WIG 01 /r VPHADDD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F38.WIG 02 /r VPHADDSW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F38.WIG 03 /r VPMADDUBSW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F38.WIG 04 /r VPHSUBW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F38.WIG 05 /r VPHSUBD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F38.WIG 06 /r VPHSUBSW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F38.WIG 07 /r VPABSB xmm2/m128, xmm1 = VEX.128.66.0F38.WIG 1C /r VPABSW xmm2/m128, xmm1 = VEX.128.66.0F38.WIG 1D /r VPMOVSXBQ xmm2/m16, xmm1 = VEX.128.66.0F38.WIG 22 /r VPMOVSXWQ xmm2/m32, xmm1 = VEX.128.66.0F38.WIG 24 /r VPACKUSDW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F38.WIG 2B /r VPMOVZXBQ xmm2/m16, xmm1 = VEX.128.66.0F38.WIG 32 /r VPMOVZXWQ xmm2/m32, xmm1 = VEX.128.66.0F38.WIG 34 /r VPMOVZXDQ xmm2/m64, xmm1 = VEX.128.66.0F38.WIG 35 /r VMPSADBW imm8, xmm3/m128,xmm2,xmm1 = VEX.NDS.128.66.0F3A.WIG 42 /r ib
Created attachment 72040 [details] Another set of insns VMOVDDUP ymm2/m256, ymm1 = VEX.256.F2.0F.WIG 12 /r VMOVLPS m64, xmm1, xmm2 = VEX.NDS.128.0F.WIG 12 /r VMOVLPS xmm1, m64 = VEX.128.0F.WIG 13 /r VRCPSS xmm3/m64, xmm2, xmm1 = VEX.NDS.LIG.F3.0F.WIG 53 /r VRCPPS xmm2/m128, xmm1 = VEX.NDS.128.0F.WIG 53 /r VRCPPS ymm2/m256, ymm1 = VEX.NDS.256.0F.WIG 53 /r VMOVQ xmm1, m64/r64 = VEX.128.66.0F.W1 7E /r (mem case only) VPSADBW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG F6 /r VPSIGNB xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F38.WIG 08 /r VPSIGNW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F38.WIG 09 /r VPSIGND xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F38.WIG 0A /r VPMULHRSW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F38.WIG 0B /r VBROADCASTF128 m128, ymm1 = VEX.256.66.0F38.WIG 1A /r VPEXTRW imm8, reg/m16, xmm2 = VEX.128.66.0F3A.W0 15 /r ib
Created attachment 72041 [details] AVX encoded AES support VAESIMC xmm2/m128, xmm1 = VEX.128.66.0F38.WIG DB /r VAESENC xmm3/m128, xmm2, xmm1 = VEX.128.66.0F38.WIG DC /r VAESENCLAST xmm3/m128, xmm2, xmm1 = VEX.128.66.0F38.WIG DD /r VAESDEC xmm3/m128, xmm2, xmm1 = VEX.128.66.0F38.WIG DE /r VAESDECLAST xmm3/m128, xmm2, xmm1 = VEX.128.66.0F38.WIG DF /r VAESKEYGENASSIST imm8, xmm2/m128, xmm1 = VEX.128.66.0F3A.WIG DF /r
Created attachment 72045 [details] VPCLMULQDQ VPCLMULQDQ imm8, xmm3/m128,xmm2,xmm1 = VEX.NDS.128.66.0F3A.WIG 44 /r ib
I'm running out of time, have to move back to GCC hacking next week. From gcc.target/i386 remaining failures and eyeballing the AVX opcode tables at the end of the PDF, comparing them to the VEX 0F, 0F38 and 0F3A routines, it seems almost everything is now implemented, with the exception of: 1) VMASKMOVP[SD]: VMASKMOVPS m128, xmm2, xmm1 = VEX.NDS.128.66.0F38.WIG 2C /r VMASKMOVPS m256, ymm2, ymm1 = VEX.NDS.256.66.0F38.WIG 2C /r VMASKMOVPD m128, xmm2, xmm1 = VEX.NDS.128.66.0F38.WIG 2D /r VMASKMOVPD m256, ymm2, ymm1 = VEX.NDS.256.66.0F38.WIG 2D /r VMASKMOVPS xmm2, xmm1, m128 = VEX.NDS.128.66.0F38.WIG 2E /r VMASKMOVPS ymm2, ymm1, m256 = VEX.NDS.256.66.0F38.WIG 2E /r VMASKMOVPD xmm2, xmm1, m128 = VEX.NDS.128.66.0F38.WIG 2F /r VMASKMOVPD ymm2, ymm1, m256 = VEX.NDS.256.66.0F38.WIG 2F /r 2) FMA: VFMADDSUB132PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 96 /r VFMADDSUB132PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 96 /r VFMADDSUB132PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 96 /r VFMADDSUB132PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 96 /r VFMSUBADD132PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 97 /r VFMSUBADD132PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 97 /r VFMSUBADD132PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 97 /r VFMSUBADD132PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 97 /r VFMADD132PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 98 /r VFMADD132PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 98 /r VFMADD132PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 98 /r VFMADD132PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 98 /r VFMADD132SS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 99 /r VFMADD132SD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 99 /r VFMSUB132PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 9A /r VFMSUB132PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 9A /r VFMSUB132PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 9A /r VFMSUB132PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 9A /r 
VFMSUB132SS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 9B /r VFMSUB132SD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 9B /r VFNMADD132PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 9C /r VFNMADD132PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 9C /r VFNMADD132PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 9C /r VFNMADD132PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 9C /r VFNMADD132SS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 9D /r VFNMADD132SD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 9D /r VFNMSUB132PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 9E /r VFNMSUB132PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 9E /r VFNMSUB132PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 9E /r VFNMSUB132PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 9E /r VFNMSUB132SS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 9F /r VFNMSUB132SD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 9F /r VFMADDSUB213PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 A6 /r VFMADDSUB213PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 A6 /r VFMADDSUB213PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 A6 /r VFMADDSUB213PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 A6 /r VFMSUBADD213PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 A7 /r VFMSUBADD213PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 A7 /r VFMSUBADD213PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 A7 /r VFMSUBADD213PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 A7 /r VFMADD213PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 A8 /r VFMADD213PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 A8 /r VFMADD213PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 A8 /r VFMADD213PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 A8 /r VFMADD213SS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 A9 /r VFMADD213SD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 A9 /r VFMSUB213PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 AA /r VFMSUB213PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 AA /r VFMSUB213PD 
xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 AA /r VFMSUB213PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 AA /r VFMSUB213SS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 AB /r VFMSUB213SD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 AB /r VFNMADD213PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 AC /r VFNMADD213PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 AC /r VFNMADD213PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 AC /r VFNMADD213PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 AC /r VFNMADD213SS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 AD /r VFNMADD213SD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 AD /r VFNMSUB213PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 AE /r VFNMSUB213PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 AE /r VFNMSUB213PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 AE /r VFNMSUB213PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 AE /r VFNMSUB213SS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 AF /r VFNMSUB213SD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 AF /r VFMADDSUB231PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 B6 /r VFMADDSUB231PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 B6 /r VFMADDSUB231PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 B6 /r VFMADDSUB231PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 B6 /r VFMSUBADD231PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 B7 /r VFMSUBADD231PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 B7 /r VFMSUBADD231PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 B7 /r VFMSUBADD231PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 B7 /r VFMADD231PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 B8 /r VFMADD231PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 B8 /r VFMADD231PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 B8 /r VFMADD231PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 B8 /r VFMADD231SS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 B9 /r VFMADD231SD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 B9 /r VFMSUB231PS xmm2/m128, 
xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 BA /r VFMSUB231PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 BA /r VFMSUB231PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 BA /r VFMSUB231PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 BA /r VFMSUB231SS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 BB /r VFMSUB231SD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 BB /r VFNMADD231PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 BC /r VFNMADD231PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 BC /r VFNMADD231PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 BC /r VFNMADD231PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 BC /r VFNMADD231SS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 BD /r VFNMADD231SD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 BD /r VFNMSUB231PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 BE /r VFNMSUB231PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 BE /r VFNMSUB231PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 BE /r VFNMSUB231PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 BE /r VFNMSUB231SS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 BF /r VFNMSUB231SD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 BF /r 3) various comparison codes for VCMP[PS][SD]. Various gcc.target/i386 tests fail because of this. Not sure if valgrind differentiates between signalling/non-signalling variants, if it should, then we'll need some hacks to handle them all. 4) various VPCMP[EI]STR[iM] modes (valgrind handles just a subset.
Committed: (In reply to comment #132) > Created attachment 72008 [details] r2404 (In reply to comment #133) > Created attachment 72009 [details] r2405 (In reply to comment #134) > Created attachment 72016 [details] r2406 (In reply to comment #135) > Created attachment 72020 [details] r2407 (In reply to comment #136) > Created attachment 72021 [details] r2408 (In reply to comment #137) > Created attachment 72040 [details] r2409 (In reply to comment #138) > Created attachment 72041 [details] (In reply to comment #139) > Created attachment 72045 [details] r2410 (#138 and #139 together) Thanks!
Thanks. To add to the list of what is left unimplemented: XGETBV (which e.g. the gcc.dg/i386 tests use to detect whether the OS has support for saving/restoring the AVX state). Given that valgrind doesn't actually use AVX insns itself, it might be fine to always claim that support is present.
(In reply to comment #142) > Thanks. To add to the list that is left unimplemented is XGETBV (that e.g. I implemented XGETBV about a week back, along with a CPUID that claims it is an AVX-capable CPU. However, the latter is currently disabled. Once I fix up the Memcheck side of things for the new IRops, I'll re-enable the new CPUID, and then the gcc tests should work ok.
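For context, the detection sequence those gcc tests perform — and which the emulated CPUID/XGETBV has to satisfy — looks roughly like the sketch below. This is illustrative only (`os_supports_avx` is a made-up name); the bit positions follow the Intel manuals, and XGETBV is written out as its 0F 01 D0 byte sequence, the same encoding listed above for r2384.

```c
#include <stdint.h>

/* Sketch of the usual runtime AVX check: CPUID.1:ECX must report both
   OSXSAVE (bit 27) and AVX (bit 28), and XGETBV(XCR0) must show that
   the OS saves/restores both XMM and YMM state (bits 1 and 2).
   Returns -1 when not compiled for x86. */
static int os_supports_avx(void)
{
#if defined(__x86_64__) || defined(__i386__)
    uint32_t eax, ebx, ecx, edx;
    __asm__ ("cpuid" : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                     : "a"(1), "c"(0));
    if (!(ecx & (1u << 27)) || !(ecx & (1u << 28)))
        return 0;                       /* no OSXSAVE or no AVX bit */
    uint32_t lo, hi;
    /* xgetbv with ECX=0, spelled as raw bytes (0F 01 D0) for old as. */
    __asm__ (".byte 0x0f, 0x01, 0xd0" : "=a"(lo), "=d"(hi) : "c"(0));
    return (lo & 6) == 6;               /* XMM and YMM state enabled */
#else
    return -1;
#endif
}
```

Note the ordering matters: executing XGETBV without first checking the OSXSAVE bit would raise #UD on CPUs/OSes that don't expose it.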
AVX support is now enabled by default, on processors that support it, and works properly with Memcheck. I now consider it usable. Please let me know (file bug reports) if you find any problems with it. Many thanks to Jakub Jelinek for helping out with this.
(In reply to comment #140) > 2) FMA: > [... big list of FMA insns ...] The processor I have (Intel(R) Core(TM) i5-2300 CPU @ 2.80GHz) doesn't support FMA, AFAICS. Is FMA support generally available yet? In any case, FMA is a different (although related) insn set extension, and should have its own bug report. I propose to close this one as resolved-fixed unless anybody objects.
You are right, FMA seems to be deferred until Haswell, for which we don't support AVX2 either. Adding 1), VMASKMOVP[SD] support, would be nice though; GCC emits those at least for the intrinsics, and hopefully for 4.8 also in vectorized code. A slight complication with those insns is that faults shouldn't be reported for masked-out loads or stores (e.g. if the mask is all zeros, or if the memory access crosses a page boundary and the mask elements past the end of the mapped region don't have their sign bits set). So it could either be implemented as a scalar loop, testing individual mask bits and conditionally doing a load resp. store, or using a new IR opcode. For AVX2, VGATHER* will be even harder to do...
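The scalar-loop option can be sketched like this for the 128-bit VMASKMOVPS case (illustrative C with made-up names, not Valgrind code). The point is that lanes whose mask sign bit is clear never touch memory at all, so a masked-out tail that crosses into an unmapped page cannot fault and should not be reported.

```c
#include <stdint.h>

/* Masked load: lane i is loaded only if mask[i] has its sign bit set;
   otherwise the lane is zeroed and src[i] is never dereferenced. */
static void maskmov_load4(float dst[4], const float *src,
                          const int32_t mask[4])
{
    for (int i = 0; i < 4; i++)
        dst[i] = (mask[i] < 0) ? src[i] : 0.0f;
}

/* Masked store: lanes with a clear sign bit leave memory untouched
   (no write at all, not a write-back of the old value). */
static void maskmov_store4(float *dst, const float src[4],
                           const int32_t mask[4])
{
    for (int i = 0; i < 4; i++)
        if (mask[i] < 0)
            dst[i] = src[i];
}
```

A per-lane conditional load/store like this is also what a tool would need so that Memcheck only tracks definedness for the lanes actually accessed.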
(In reply to comment #146) > For AVX2 VGATHER* will be even harder to do... ARM/NEON already has scatter/gather loads (kind of) and we do a very poor, although correct, translation of them. Maybe some IR enhancements to support scatter/gather better will help with both VGATHER and NEON.
Closing. Please report followup any problems in new bug reports.
*** Bug 302656 has been marked as a duplicate of this bug. ***
*** Bug 298227 has been marked as a duplicate of this bug. ***
*** Bug 298335 has been marked as a duplicate of this bug. ***
*** Bug 303466 has been marked as a duplicate of this bug. ***
*** Bug 306721 has been marked as a duplicate of this bug. ***
*** Bug 307612 has been marked as a duplicate of this bug. ***
I've built the latest valgrind from SVN and I am still getting vex-related issues: vex x86->IR: unhandled instruction bytes: 0xC5 0xF9 0x6E 0x40 ==27173== valgrind: Unrecognised instruction at address 0x4014e00. ==27173== at 0x4014E00: _dl_sysdep_start (in /lib32/ld-2.15.so) ==27173== by 0xF770EFFF: ??? I'm using gentoo x86_64 built with -march=native and trying to run valgrind on a -m32 built binary. Have I missed anything?
(In reply to comment #155) > I'm using gentoo x86_64 built with -march=native and trying to run valgrind > on -m32 built binary. Have I missed anything? Yes, -m32 built binary is not x86_64, and AVX is only supported on x86_64, not on i?86 in valgrind.
Are there any plans to support plain x86 AVX instructions in upcoming valgrind releases? Should I create a separate issue for it?
There already is one - bug #301967 but as Julian says on that bug it is unlikely to be supported.
I might be a little late to this issue, but for anyone still having trouble with AVX on gentoo (maybe the x86 people?), there is no need to avoid -march=native... just set CFLAGS="${CFLAGS} -march=native -mno-avx" which will enable all relevant optimizations for your CPU without AVX. Then 'emerge -eav @world @system' and you are good to go!