Bug 273475 - Add support for AVX instructions
Summary: Add support for AVX instructions
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: vex (show other bugs)
Version: 3.7 SVN
Platform: unspecified All
: NOR normal
Target Milestone: ---
Assignee: Julian Seward
URL:
Keywords:
: 268314 273230 280835 284864 285725 286497 286596 287307 288995 289656 292300 292493 292841 298227 298335 299104 299803 299804 299805 302656 303466 306721 307612 (view as bug list)
Depends on:
Blocks:
 
Reported: 2011-05-17 12:02 UTC by Corentin Chary
Modified: 2016-02-01 06:45 UTC (History)
42 users (show)

See Also:
Latest Commit:
Version Fixed In:


Attachments
analyze-x86.py (13.19 KB, text/x-python)
2011-05-17 14:28 UTC, Corentin Chary
Details
AVX support -- 13 May 2011 -- WIP -- Valgrind changes (8.85 KB, patch)
2012-05-13 17:09 UTC, Julian Seward
Details
AVX support -- 13 May 2011 -- WIP -- Vex changes (91.95 KB, patch)
2012-05-13 17:14 UTC, Julian Seward
Details
AVX support -- 18 May 2011 -- WIP -- Valgrind changes (13.45 KB, patch)
2012-05-17 23:30 UTC, Julian Seward
Details
AVX support -- 18 May 2011 -- WIP -- Vex changes (185.30 KB, patch)
2012-05-17 23:35 UTC, Julian Seward
Details
P (2.20 KB, text/plain)
2012-06-12 15:46 UTC, Jakub Jelinek
Details
Q (8.57 KB, text/plain)
2012-06-12 17:10 UTC, Jakub Jelinek
Details
VPALIGNR and VBROADCASTSS (256-bit) support (8.25 KB, patch)
2012-06-15 16:54 UTC, Jakub Jelinek
Details
avx.patch (31.32 KB, patch)
2012-06-18 13:05 UTC, Jakub Jelinek
Details
avx2.patch (11.91 KB, patch)
2012-06-18 16:19 UTC, Jakub Jelinek
Details
avx-3.patch (22.32 KB, patch)
2012-06-18 19:21 UTC, Jakub Jelinek
Details
VMOVS[LH]DUP patch (8.14 KB, patch)
2012-06-19 11:09 UTC, Jakub Jelinek
Details
VMOVSS patch (3.23 KB, patch)
2012-06-19 12:10 UTC, Jakub Jelinek
Details
VPSRAD and VPSLLW patch (4.49 KB, patch)
2012-06-19 14:48 UTC, Jakub Jelinek
Details
gcc.target/i386 hack patch (2.38 KB, patch)
2012-06-19 14:58 UTC, Jakub Jelinek
Details
VPTEST and VTESTP[SD] (14.24 KB, patch)
2012-06-20 11:52 UTC, Jakub Jelinek
Details
VPERMILP{S,D} fix (1.88 KB, patch)
2012-06-20 12:40 UTC, Jakub Jelinek
Details
Variable 128-bit integer shifts and VBLENDVP{S,D} (19.16 KB, patch)
2012-06-20 16:32 UTC, Jakub Jelinek
Details
VROUND* and VPSUBS[BW] (18.43 KB, patch)
2012-06-20 19:56 UTC, Jakub Jelinek
Details
Another set of random AVX insns from gcc.target/i386 (27.75 KB, patch)
2012-06-21 09:49 UTC, Jakub Jelinek
Details
VCMPPD and VCMPPS incremental fix (753 bytes, patch)
2012-06-21 10:38 UTC, Jakub Jelinek
Details
Further AVX insns (33.81 KB, patch)
2012-06-21 13:53 UTC, Jakub Jelinek
Details
Further insns (31.04 KB, patch)
2012-06-21 16:03 UTC, Jakub Jelinek
Details
Last patch for today (43.12 KB, patch)
2012-06-21 18:39 UTC, Jakub Jelinek
Details
Another set of insns (25.73 KB, patch)
2012-06-22 14:00 UTC, Jakub Jelinek
Details
AVX encoded AES support (18.16 KB, patch)
2012-06-22 15:09 UTC, Jakub Jelinek
Details
VPCLMULQDQ (7.15 KB, patch)
2012-06-22 15:31 UTC, Jakub Jelinek
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Corentin Chary 2011-05-17 12:02:22 UTC
I know it's scheduled for 3.7.0, but since it was low priority, I'd like to "vote" for this feature.

I use gentoo, and my system is built with '-march=native', that means I can't use valgrind because AVX instructions are used everywhere.

I think I'll have to fallback to '-march=nocona' untill valgrind support this new instruction set.

Thanks,
Comment 1 Corentin Chary 2011-05-17 14:28:53 UTC
Created attachment 60079 [details]
analyze-x86.py
Comment 2 Corentin Chary 2011-05-17 14:29:11 UTC
FYI: running analyse-x86.py on Xorg gives:

cpuid:  0       nop: 15938      call: 17409
i86     280079
i386    619
i686    1023
MMX     2066
AVX     6258

So anyone who build with -march=native on a sandy bridge cpu will get a *lot* of AVX instructions. That will be far more common than SSE4.
Comment 3 Julian Seward 2011-10-05 07:53:33 UTC
*** Bug 273230 has been marked as a duplicate of this bug. ***
Comment 4 Julian Seward 2011-10-12 09:47:30 UTC
*** Bug 268314 has been marked as a duplicate of this bug. ***
Comment 5 Julian Seward 2011-10-12 09:56:40 UTC
Let's use this as the canonical add-AVX-support bug.  There are
various duplicates.

Last weekend I made a branch for AVX support, and did a bunch of
preliminary refactoring necessary to make it possible to add AVX
instructions in a sane way.  Right now there are none actually
supported, however I hope that support will begin to appear in the
next three or four weeks.  It won't make 3.7.0 unfortunately.

The branch can be checked out with
  svn co svn://svn.valgrind.org/valgrind/branches/AVX
Comment 6 Julian Seward 2011-10-13 08:49:11 UTC
*** Bug 280835 has been marked as a duplicate of this bug. ***
Comment 7 Tom Hughes 2011-10-24 13:40:39 UTC
*** Bug 284864 has been marked as a duplicate of this bug. ***
Comment 8 Tom Hughes 2011-11-04 08:22:56 UTC
*** Bug 285725 has been marked as a duplicate of this bug. ***
Comment 9 Tom Hughes 2011-11-14 14:25:11 UTC
*** Bug 286596 has been marked as a duplicate of this bug. ***
Comment 10 Tom Hughes 2011-11-22 22:56:41 UTC
*** Bug 287307 has been marked as a duplicate of this bug. ***
Comment 11 Jérôme Carretero 2011-11-24 08:43:46 UTC
*** This bug has been confirmed by popular vote. ***
Comment 12 Tom Hughes 2011-12-23 11:24:17 UTC
*** Bug 289656 has been marked as a duplicate of this bug. ***
Comment 13 blueness 2012-01-11 17:45:54 UTC
Attaching a link for a downstream bug on this issue:

   https://bugs.gentoo.org/show_bug.cgi?id=398447
Comment 14 Tom Hughes 2012-01-24 00:40:17 UTC
*** Bug 292300 has been marked as a duplicate of this bug. ***
Comment 15 Julian Seward 2012-01-25 12:44:58 UTC
*** Bug 286497 has been marked as a duplicate of this bug. ***
Comment 16 Julian Seward 2012-01-25 22:29:38 UTC
*** Bug 288995 has been marked as a duplicate of this bug. ***
Comment 17 Tom Hughes 2012-01-26 18:35:29 UTC
*** Bug 292493 has been marked as a duplicate of this bug. ***
Comment 18 Tom Hughes 2012-01-30 08:45:24 UTC
*** Bug 292841 has been marked as a duplicate of this bug. ***
Comment 19 Marc-Antoine Perennou 2012-04-17 16:03:54 UTC
This really starts to be critical as most of the recent machines have this feature enabled.
Can we get an ETA for this and/or an estimation of what has been and needs to be done ?
Comment 20 Tom Hughes 2012-04-17 16:38:49 UTC
It's hardly critical - even if your machine supports these instructions nobody is forcing you to compile with -march=native.

Just compile code you want to valgrind for a more basic CPU and you'll be fine.
Comment 21 Corentin Chary 2012-04-17 16:41:58 UTC
(In reply to comment #20)
> It's hardly critical - even if your machine supports these instructions
> nobody is forcing you to compile with -march=native.
> 
> Just compile code you want to valgrind for a more basic CPU and you'll be
> fine.

And all the related libraries, (that means Qt and KDElibs for any KDE application)...

That also prevent any developper using Gentoo to use --march=native system wide, which is really bad.

For me, it's critical.
Comment 22 Julian Seward 2012-04-17 16:52:34 UTC
Doing complete AVX support is going to be a pretty big task.  I suspect a
minimal implementation that just provides coverage for the instructions
emitted by gcc-4.7.0 -march=sandybridge (or whatever the relevant magic
flag is) would keep folks happy in the short term whilst avoiding having to
implement the majority of the instructions.
Comment 23 Patrick J. LoPresti 2012-04-17 17:08:42 UTC
Just want to add my voice to those who think this is closer to "critical" than "minor nuisance".

One great feature of Valgrind is that we do NOT have to recompile our applications; we can run real production binaries under memcheck.

Note that not all the world uses GCC (we use the Intel C compiler).  More to the point, performance-critical code is likely to be hand-optimized using the AVX intrinsics:

http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/index.htm#intref_cls/common/intref_bk_advectorext.htm

...by which the entire AVX instruction set comes into play.

I know this is a big task.  But I bet there is also a big community who would contribute to the effort if they (a) knew exactly what was needed, (b) knew their work was likely to be accepted, and (c) knew that they were not duplicating work already in progress.
Comment 24 Marc-Antoine Perennou 2012-04-17 17:54:26 UTC
(In reply to comment #23)
> Just want to add my voice to those who think this is closer to "critical"
> than "minor nuisance".
> 
> One great feature of Valgrind is that we do NOT have to recompile our
> applications; we can run real production binaries under memcheck.
> 
> Note that not all the world uses GCC (we use the Intel C compiler).  More to
> the point, performance-critical code is likely to be hand-optimized using
> the AVX intrinsics:
> 
> http://software.intel.com/sites/products/documentation/studio/composer/en-us/
> 2011/compiler_c/index.htm#intref_cls/common/intref_bk_advectorext.htm
> 
> ...by which the entire AVX instruction set comes into play.
> 
> I know this is a big task.  But I bet there is also a big community who
> would contribute to the effort if they (a) knew exactly what was needed, (b)
> knew their work was likely to be accepted, and (c) knew that they were not
> duplicating work already in progress.

That is _exactly_ my point. That's why I asked for what has been done for know, and what needs to be done.
Implementing the logic for everything maybe isn't that critical, but at least not aborting when finding instructions generated by gcc -march=corei7-avx and the intel compiler is. Recompiling the whole system, in the case you're not using a binary distribution is _not_ an acceptable workaround.
Comment 25 Julian Seward 2012-04-19 10:24:08 UTC
I need to finish off and merge branches/TCHAIN to trunk.  I expect to
do that in the next two days, after which I will attend to the AVX
stuff.  It's clear it's time to make a start on it.

What needs to happen, that I could use help with, to get started, is:

* a list of insns generated by gcc -march=corei7-avx that need to be
  implemented.  It doesn't have to be exact, but we have to start
  somewhere.

* a test program for these instructions, in the style
  (comprehensiveness, mostly) of none/tests/amd64/sse4-64.c.  It isn't
  exciting stuff, but getting new instructions working reliably
  without such coverage is essentially impossible.

I need to mess with the IR infrastructure and the x86_64 back end
(instruction selector, insn emitters) to handle 256-bit vectors.  Also
the front end (guest_amd64_toIR.c) will need work to get it to parse
the 2 kinds of VEX prefixes.
Comment 26 Tom Hughes 2012-04-30 14:02:25 UTC
*** Bug 299104 has been marked as a duplicate of this bug. ***
Comment 27 Julian Seward 2012-05-01 16:13:32 UTC
This is in progress now.  A few insns have been implemented, and infrastructure
(256 bit IR extensions, VEX prefix decoding/encoding, 256 bit instruction selection)
is in place.  Early next week I'll put up a patch which will cover at least some of
the instructions emitted by gcc-4.7.0 -mavx.
Comment 28 Marc-Antoine Perennou 2012-05-03 14:47:05 UTC
(In reply to comment #27)
> This is in progress now.  A few insns have been implemented, and
> infrastructure
> (256 bit IR extensions, VEX prefix decoding/encoding, 256 bit instruction
> selection)
> is in place.  Early next week I'll put up a patch which will cover at least
> some of
> the instructions emitted by gcc-4.7.0 -mavx.

This is good news !
If you need some infos or need people to test stuff, feel free to ask
Comment 29 Sergey Kishchenko 2012-05-05 21:42:46 UTC
(In reply to comment #27)
> This is in progress now.  A few insns have been implemented, and
> infrastructure
> (256 bit IR extensions, VEX prefix decoding/encoding, 256 bit instruction
> selection)
> is in place.  Early next week I'll put up a patch which will cover at least
> some of
> the instructions emitted by gcc-4.7.0 -mavx.

It's great! Is there any public patches/source tree branches to try it out? I have the same problem with AVX instructions and I can help with testing the patches
Comment 30 Tom Hughes 2012-05-11 09:37:36 UTC
*** Bug 299803 has been marked as a duplicate of this bug. ***
Comment 31 Tom Hughes 2012-05-11 09:38:16 UTC
*** Bug 299805 has been marked as a duplicate of this bug. ***
Comment 32 Tom Hughes 2012-05-11 09:38:23 UTC
*** Bug 299804 has been marked as a duplicate of this bug. ***
Comment 33 Julian Seward 2012-05-13 17:09:43 UTC
Created attachment 71072 [details]
AVX support -- 13 May 2011 -- WIP -- Valgrind changes
Comment 34 Julian Seward 2012-05-13 17:14:49 UTC
Created attachment 71073 [details]
AVX support -- 13 May 2011 -- WIP -- Vex changes

Very, very incomplete.  Expect breakage.  Implemented insns are:
VMOVDQA ymm2/m256, ymm1
VMOVDQU ymm2/m256, ymm1 
VMOVDQA xmm2/m128, xmm1 
VMOVDQU xmm2/m128, xmm1
VPSLLD imm8, xmm2, xmm1
VPCMPEQD r/m, rV, r
VZEROUPPER
VMOVDQA ymm1, ymm2/m256
VMOVDQA xmm1, xmm2/m128 
VMOVDQU xmm1, xmm2/m128
VPOR r/m, rV, r 
VPXOR r/m, rV, r
VPSUBB r/m, rV, r
VPADDD r/m, rV, r 
VPSHUFB r/m, rV, r
VINSERTF128 r/m, rV, rD
VEXTRACTF128 rS, r/m
VPBLENDVB xmmG, xmmE/memE, xmmV, xmmIS4
Comment 35 Julian Seward 2012-05-13 17:19:11 UTC
A comment on how those instructions got chosen for implementation: I'm
working through the failing instructions resulting from running the
executable created by

  gcc-4.7.0 -O3 -mavx -o bz2-64 perf/bz2-64.c -I.

This appears to have quite a lot of VEX-encoded XMM instructions
resulting from auto-vectorisation.  It's all integer stuff.  I haven't
started on the FP stuff yet.  I am also not even at the point where
the above executable will run yet.
Comment 36 Julian Seward 2012-05-17 23:30:12 UTC
Created attachment 71170 [details]
AVX support -- 18 May 2011 -- WIP -- Valgrind changes
Comment 37 Julian Seward 2012-05-17 23:35:15 UTC
Created attachment 71171 [details]
AVX support -- 18 May 2011 -- WIP -- Vex changes

Majorly expands the set of supported instructions.  Is now able to run
large amounts of integer and FP code generated by gcc-4.7.0 -mavx -O2.
This is the first version of the patch that is able to run useful
amounts of AVX code.

Unfortunately only works with --tool=none so far (massif too, maybe).
In particular Memcheck doesn't work, pending rework of the back end
code generator to deal with Memcheck's instrumentation code for 256
bit vectors.  I hope to get it fixed this weekend.
Comment 38 Julian Seward 2012-05-21 10:22:29 UTC
Initial support has been committed now, as revs 2330/12569.  These do
not provide anything close to complete AXV coverage, but they do
provide support for code created by "gcc-4.7.0 -mavx -O2", which I
think should cover the majority of duplicates of this bug report.  The
only tool currently supported is Memcheck, although it appears that
Massif, Helgrind and DRD also work.

Give it a try.  There may be breakage, but I will be working to tidy
everything up, make the other tools work again, etc, this week.
Comment 39 Jan Kundrát 2012-05-21 10:57:09 UTC
On Gentoo, gcc (Gentoo Hardened 4.5.3-r2 p1.1, pie-0.4.7) 4.5.3, system built with CFLAGS/CXXFLAGS="-O2 -pipe -march=native -mavx -maes -ggdb" (built on i5-2520M CPU) and using recent SVN of valgrind and VEX with revision numbers as you specified, a sample program using Qt fails immediately inside one of QString's helper functions. Not sure if it's due to -maes or -mavx.

jkt@svist ~/work/prog/_trojita-build/desktop-debug $ valgrind ./src/Gui/trojita 
==51259== Memcheck, a memory error detector
==51259== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
==51259== Using Valgrind-3.8.0.SVN and LibVEX; rerun with -h for copyright info
==51259== Command: ./src/Gui/trojita
==51259== 
vex amd64->IR: unhandled instruction bytes: 0xC5 0xF9 0x60 0xD1 0xC5 0xF9 0x68 0xC1
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR:   PFX.66=1 PFX.F2=0 PFX.F3=0
==51259== valgrind: Unrecognised instruction at address 0x824eb9b.
==51259==    at 0x824EB9B: QString::fromLatin1_helper(char const*, int) (emmintrin.h:703)
==51259==    by 0x158D2B: QString::QString(QLatin1String const&) (qstring.h:694)
==51259==    by 0x26587B: __static_initialization_and_destruction_0(int, int) (SettingsNames.cpp:26)
==51259==    by 0x266690: global constructors keyed to SettingsNames.cpp (SettingsNames.cpp:73)
==51259==    by 0x266895: ??? (in /home/jkt/work/prog/_trojita-build/desktop-debug/src/Gui/trojita)
==51259==    by 0x148DB2: ??? (in /home/jkt/work/prog/_trojita-build/desktop-debug/src/Gui/trojita)
==51259==    by 0x573F62F: ??? (in /usr/lib64/qt4/libQtWebKit.so.4.9.0)
==51259==    by 0x266814: __libc_csu_init (in /home/jkt/work/prog/_trojita-build/desktop-debug/src/Gui/trojita)
==51259==    by 0x97DA1FF: (below main) (libc-start.c:193)
Comment 40 Julian Seward 2012-05-21 11:15:43 UTC
I suspect that is some variant of PUNPCKLBW.  It's not easy to deduce
which one from the failure output, though.  Can you use objdump -d to
find the insn?  You may find it useful to rerun V with --demangle=no
--sym-offsets=yes, so as to get the function name and offset inside
the function that the insn lives at.
Comment 41 Jan Kundrát 2012-05-21 11:40:12 UTC
This is what valgrind gives me now (on a new build of that binary; this is a project I keep working on, but on a completely unrelated place; I've nonetheless copied *this* one and all further reports will use *this* version of this binary):

vex amd64->IR: unhandled instruction bytes: 0xC5 0xF9 0x60 0xD1 0xC5 0xF9 0x68 0xC1
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR:   PFX.66=1 PFX.F2=0 PFX.F3=0
==52919== valgrind: Unrecognised instruction at address 0x824eb9b.
==52919==    at 0x824EB9B: _ZN7QString17fromLatin1_helperEPKci+235 (emmintrin.h:703)
==52919==    by 0x158D2B: _ZN7QStringC1ERK13QLatin1String+55 (qstring.h:694)
==52919==    by 0x26587B: _Z41__static_initialization_and_destruction_0ii+107 (SettingsNames.cpp:26)
==52919==    by 0x266690: _GLOBAL__I_SettingsNames.cpp+37 (SettingsNames.cpp:73)
==52919==    by 0x266895: ??? (in /home/jkt/work/prog/_trojita-build/desktop-debug/src/Gui/trojita)
==52919==    by 0x148DB2: ??? (in /home/jkt/work/prog/_trojita-build/desktop-debug/src/Gui/trojita)
==52919==    by 0x573F62F: ??? (in /usr/lib64/qt4/libQtWebKit.so.4.9.0)
==52919==    by 0x266814: __libc_csu_init+68 (in /home/jkt/work/prog/_trojita-build/desktop-debug/src/Gui/trojita)
==52919==    by 0x97DA1FF: (below main)+143 (libc-start.c:193)

When I look at the output of the objdump, I see the following:

00000000000455b8 <_ZN7QString17fromLatin1_helperEPKci@plt>:
   455b8:       ff 25 aa 9c 39 00       jmpq   *0x399caa(%rip)        # 3df268 <_GLOBAL_OFFSET_TABLE_+0x2410>
   455be:       68 7f 04 00 00          pushq  $0x47f
   455c3:       e9 f0 b7 ff ff          jmpq   40db8 <_init+0x18>

That looks like just some jump table entry, but I won't pretend that I'm familiar with assembly. I'm sorry for not doing my homework properly, but I don't know how to get to the offending instructions from here (and is it even in the application itself? Or shall I instead be looking at libQtCore.so?).

Anyway, the objdump output is at http://dev.gentoo.org/~jkt/tmp/objdump-trojita.bz2 , the binary in question at http://dev.gentoo.org/~jkt/tmp/trojita.for.valgrind.bz2 and the QtCore library at http://dev.gentoo.org/~jkt/tmp/libQtCore.so.4.8.1.bz2 .

As a *very* blind shot, this is a piece of objdump of the libQtCore.so.4.8.1 here:

00000000000daab0 <_ZN7QString17fromLatin1_helperEPKci>:
   daab0:       55                      push   %rbp
   daab1:       53                      push   %rbx
   daab2:       48 89 fb                mov    %rdi,%rbx
   daab5:       48 83 ec 28             sub    $0x28,%rsp
   daab9:       64 48 8b 04 25 28 00    mov    %fs:0x28,%rax
   daac0:       00 00 
   daac2:       48 89 44 24 18          mov    %rax,0x18(%rsp)
   daac7:       31 c0                   xor    %eax,%eax
   daac9:       48 85 ff                test   %rdi,%rdi
   daacc:       0f 84 ee 00 00 00       je     dabc0 <_ZN7QString17fromLatin1_helperEPKci+0x110>
   daad2:       85 f6                   test   %esi,%esi
   daad4:       0f 84 7e 00 00 00       je     dab58 <_ZN7QString17fromLatin1_helperEPKci+0xa8>
   daada:       89 f0                   mov    %esi,%eax
   daadc:       c1 e8 1f                shr    $0x1f,%eax
   daadf:       84 c0                   test   %al,%al
   daae1:       75 6d                   jne    dab50 <_ZN7QString17fromLatin1_helperEPKci+0xa0>
   daae3:       48 63 ee                movslq %esi,%rbp
   daae6:       89 34 24                mov    %esi,(%rsp)
   daae9:       48 8d 7c 2d 20          lea    0x20(%rbp,%rbp,1),%rdi
   daaee:       e8 1d 51 fa ff          callq  7fc10 <_Z7qMallocm>
   daaf3:       48 85 c0                test   %rax,%rax
   daaf6:       8b 34 24                mov    (%rsp),%esi
   daaf9:       0f 84 e1 00 00 00       je     dabe0 <_ZN7QString17fromLatin1_helperEPKci+0x130>
   daaff:       48 8d 50 1a             lea    0x1a(%rax),%rdx
   dab03:       80 60 18 e0             andb   $0xe0,0x18(%rax)
   dab07:       83 fe 0f                cmp    $0xf,%esi
   dab0a:       c7 00 01 00 00 00       movl   $0x1,(%rax)
   dab10:       89 70 08                mov    %esi,0x8(%rax)
   dab13:       89 70 04                mov    %esi,0x4(%rax)
   dab16:       48 89 50 10             mov    %rdx,0x10(%rax)
   dab1a:       66 c7 44 68 1a 00 00    movw   $0x0,0x1a(%rax,%rbp,2)
   dab21:       7f 5d                   jg     dab80 <_ZN7QString17fromLatin1_helperEPKci+0xd0>
   dab23:       85 f6                   test   %esi,%esi
   dab25:       74 3e                   je     dab65 <_ZN7QString17fromLatin1_helperEPKci+0xb5>
   dab27:       48 8d 4b 01             lea    0x1(%rbx),%rcx
   dab2b:       83 ee 01                sub    $0x1,%esi
   dab2e:       48 8d 34 31             lea    (%rcx,%rsi,1),%rsi
   dab32:       eb 08                   jmp    dab3c <_ZN7QString17fromLatin1_helperEPKci+0x8c>
   dab34:       0f 1f 40 00             nopl   0x0(%rax)
   dab38:       48 83 c1 01             add    $0x1,%rcx
   dab3c:       0f b6 1b                movzbl (%rbx),%ebx
   dab3f:       66 89 1a                mov    %bx,(%rdx)
   dab42:       48 83 c2 02             add    $0x2,%rdx
   dab46:       48 39 f1                cmp    %rsi,%rcx
   dab49:       48 89 cb                mov    %rcx,%rbx
   dab4c:       75 ea                   jne    dab38 <_ZN7QString17fromLatin1_helperEPKci+0x88>
   dab4e:       eb 15                   jmp    dab65 <_ZN7QString17fromLatin1_helperEPKci+0xb5>
   dab50:       80 3f 00                cmpb   $0x0,(%rdi)
   dab53:       75 7a                   jne    dabcf <_ZN7QString17fromLatin1_helperEPKci+0x11f>
   dab55:       0f 1f 00                nopl   (%rax)
   dab58:       48 8b 05 61 32 44 00    mov    0x443261(%rip),%rax        # 51ddc0 <_ZTI16QEventTransition+0xdc0>
   dab5f:       f0 ff 00                lock incl (%rax)
   dab62:       0f 95 c2                setne  %dl
   dab65:       48 8b 54 24 18          mov    0x18(%rsp),%rdx
   dab6a:       64 48 33 14 25 28 00    xor    %fs:0x28,%rdx
   dab71:       00 00 
   dab73:       0f 85 7e 00 00 00       jne    dabf7 <_ZN7QString17fromLatin1_helperEPKci+0x147>
   dab79:       48 83 c4 28             add    $0x28,%rsp
   dab7d:       5b                      pop    %rbx
   dab7e:       5d                      pop    %rbp
   dab7f:       c3                      retq   
   dab80:       c5 f1 ef c9             vpxor  %xmm1,%xmm1,%xmm1
   dab84:       89 f7                   mov    %esi,%edi
   dab86:       c1 ff 04                sar    $0x4,%edi
   dab89:       31 c9                   xor    %ecx,%ecx
   dab8b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
   dab90:       c5 fa 6f 03             vmovdqu (%rbx),%xmm0
   dab94:       83 c1 01                add    $0x1,%ecx
   dab97:       48 83 c3 10             add    $0x10,%rbx
   dab9b:       c5 f9 60 d1             vpunpcklbw %xmm1,%xmm0,%xmm2
   dab9f:       c5 f9 68 c1             vpunpckhbw %xmm1,%xmm0,%xmm0
   daba3:       c5 fa 7f 12             vmovdqu %xmm2,(%rdx)
   daba7:       c5 fa 7f 42 10          vmovdqu %xmm0,0x10(%rdx)
   dabac:       48 83 c2 20             add    $0x20,%rdx
   dabb0:       39 cf                   cmp    %ecx,%edi
   dabb2:       7f dc                   jg     dab90 <_ZN7QString17fromLatin1_helperEPKci+0xe0>
   dabb4:       83 e6 0f                and    $0xf,%esi
   dabb7:       e9 67 ff ff ff          jmpq   dab23 <_ZN7QString17fromLatin1_helperEPKci+0x73>
   dabbc:       0f 1f 40 00             nopl   0x0(%rax)
   dabc0:       48 8b 05 69 2f 44 00    mov    0x442f69(%rip),%rax        # 51db30 <_ZTI16QEventTransition+0xb30>
   dabc7:       f0 ff 00                lock incl (%rax)
   dabca:       0f 95 c2                setne  %dl
   dabcd:       eb 96                   jmp    dab65 <_ZN7QString17fromLatin1_helperEPKci+0xb5>
   dabcf:       e8 14 6b f8 ff          callq  616e8 <strlen@plt>
   dabd4:       89 c6                   mov    %eax,%esi
   dabd6:       e9 08 ff ff ff          jmpq   daae3 <_ZN7QString17fromLatin1_helperEPKci+0x33>
   dabdb:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
   dabe0:       48 89 44 24 08          mov    %rax,0x8(%rsp)
   dabe5:       e8 e6 07 fa ff          callq  7b3d0 <_Z9qBadAllocv>
   dabea:       8b 34 24                mov    (%rsp),%esi
   dabed:       48 8b 44 24 08          mov    0x8(%rsp),%rax
   dabf2:       e9 08 ff ff ff          jmpq   daaff <_ZN7QString17fromLatin1_helperEPKci+0x4f>
   dabf7:       e8 4c 70 f8 ff          callq  61c48 <__stack_chk_fail@plt>
   dabfc:       0f 1f 40 00             nopl   0x0(%rax)

...and there indeed is a couple of vpunpcklbw instructions.
Comment 42 Gunther Piez 2012-05-21 11:41:44 UTC
Another one:

==14865== Memcheck, a memory error detector
==14865== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
==14865== Using Valgrind-3.8.0.SVN and LibVEX; rerun with -h for copyright info
==14865== Command: ./hayabusa go infinite
==14865== 
vex amd64->IR: unhandled instruction bytes: 0xC4 0xE2 0x79 0x1E 0xC9 0xC4 0xE2 0x79
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F38
vex amd64->IR:   PFX.66=1 PFX.F2=0 PFX.F3=0
==14865== valgrind: Unrecognised instruction at address 0x47dc30.
==14865==    at 0x47DC30: Eval::initTables() (eval.cpp:98)
==14865==    by 0x427140: Console::init(int&, char**) (console.cpp:72)
==14865==    by 0x4244C9: main (main.cpp:28)

The relevant part of the code:
   for (int x0=0; x0<8; ++x0)
        for (int y0=0; y0<8; ++y0)
            for (int x1=0; x1<8; ++x1)
                for (int y1=0; y1<8; ++y1)
                    distance[x0+8*y0][x1+8*y1] = std::max(abs(x0-x1), abs(y0-y1));
  47dc26:       c4 c1 79 fa cb          vpsubd %xmm11,%xmm0,%xmm1
  47dc2b:       c4 c1 79 fa c2          vpsubd %xmm10,%xmm0,%xmm0
  47dc30:       c4 e2 79 1e c9          vpabsd %xmm1,%xmm1  <----- *** crash ***
  47dc35:       c4 e2 79 1e c0          vpabsd %xmm0,%xmm0
Comment 43 Julian Seward 2012-05-22 09:15:34 UTC
(In reply to comment #39)
> vex amd64->IR: unhandled instruction bytes: 0xC5 0xF9 0x60 0xD1 0xC5

I implemented PUNPCKLBW and HBW in r2337; svn up and try again.
Comment 44 Jan Kundrát 2012-05-22 09:45:46 UTC
Thanks for the update. Now it dies at the following place:

vex amd64->IR: unhandled instruction bytes: 0xC5 0xFA 0x2C 0xC1 0x89 0x43 0x3C 0xC5
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=1
==86082== valgrind: Unrecognised instruction at address 0xa3af004.
==86082==    at 0xA3AF004: _uhash_allocate (uhash.c:241)
==86082==    by 0xA3AF4EB: _uhash_create (uhash.c:265)
==86082==    by 0xA3DC7AA: entryOpen (uresbund.c:274)
==86082==    by 0xA3DDA38: ures_open_48 (uresbund.c:2051)
==86082==    by 0xDFEF926: ucol_open_internal_48 (ucol_res.cpp:186)
==86082==    by 0x82639F5: qt_initIcu(QString const&) (qlocale_icu.cpp:147)
==86082==    by 0x822103E: QLocalePrivate::updateSystemPrivate() (qlocale.cpp:529)
==86082==    by 0x82212AA: systemPrivate() (qlocale.cpp:540)
==86082==    by 0x822130C: defaultPrivate() (qlocale.cpp:551)
==86082==    by 0x82214FD: QLocale::QLocale() (qlocale.cpp:673)
==86082==    by 0x82BF3DC: QResourceFileEngine::QResourceFileEngine(QString const&) (qresource.cpp:1200)
==86082==    by 0x82ED16E: _q_resolveEntryAndCreateLegacyEngine_recursive(QFileSystemEntry&, QFileSystemMetaData&, QAbstractFileEngine*&, bool) (qfilesystemengine.cpp:162)
==86082==    by 0x82ED2EC: QFileSystemEngine::resolveEntryAndCreateLegacyEngine(QFileSystemEntry&, QFileSystemMetaData&) (qfilesystemengine.cpp:208)
==86082==    by 0x829D75C: QFileInfo::QFileInfo(QString const&) (qfileinfo_p.h:103)
==86082==    by 0x8298498: QFile::exists(QString const&) (qfile.cpp:615)
==86082==    by 0x81F193D: QLibraryInfoPrivate::findConfiguration() (qlibraryinfo.cpp:116)
==86082==    by 0x81F1B3C: QLibrarySettings::QLibrarySettings() (qlibraryinfo.cpp:102)
==86082==    by 0x81F1BF5: qt_library_settings() (qlibraryinfo.cpp:82)
==86082==    by 0x81F1EFF: QLibraryInfo::location(QLibraryInfo::LibraryLocation) (qlibraryinfo.cpp:96)
==86082==    by 0x8322682: QCoreApplication::libraryPaths() (qcoreapplication.cpp:2389)
==86082==    by 0x8322E67: QCoreApplication::init() (qcoreapplication.cpp:707)
==86082==    by 0x83230A4: QCoreApplication::QCoreApplication(QCoreApplicationPrivate&) (qcoreapplication.cpp:596)
==86082==    by 0x6FF5E06: QApplication::QApplication(int&, char**, int) (qapplication.cpp:740)
==86082==    by 0x155C65: main (main.cpp:28)

I wasn't able to find the _uhash_allocate symbol in the output of objdump -d of any of the files which I can see in `pmap $pidOfMyApp`, so I cannot provide assembly output at this point.
Comment 45 Tom Hughes 2012-05-22 09:59:47 UTC
I think, if I'm reading things right, that 0xC5 0xFA 0x2C is vcvttss2si.
Comment 46 Julian Seward 2012-05-22 10:49:26 UTC
(In reply to comment #45)
> I think, if I'm reading things right, that 0xC5 0xFA 0x2C is vcvttss2si.

Done, r2338.  svn up and try again ..
Comment 47 Jan Kundrát 2012-05-22 11:30:25 UTC
vex amd64->IR: unhandled instruction bytes: 0xC4 0xE3 0x79 0x60 0xCA 0x45 0xC4 0xE3
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F3A
vex amd64->IR:   PFX.66=1 PFX.F2=0 PFX.F3=0
==98468== valgrind: Unrecognised instruction at address 0x824928e.
==98468==    at 0x824928E: _ZL15toLatin1_helperPK5QChari+238 (smmintrin.h:182)

gdb shows it as the following (and I'm including the rest of the function code as well):

   0x00007ffff464f288 <+232>:   vmovdqu (%r12),%xmm2
=> 0x00007ffff464f28e <+238>:   vpcmpestrm $0x45,%xmm2,%xmm1
   0x00007ffff464f294 <+244>:   vpblendvb %xmm0,%xmm3,%xmm2,%xmm4
   0x00007ffff464f29a <+250>:   vmovdqu 0x10(%r12),%xmm2
   0x00007ffff464f2a1 <+257>:   vpcmpestrm $0x45,%xmm2,%xmm1
   0x00007ffff464f2a7 <+263>:   add    $0x1,%esi
   0x00007ffff464f2aa <+266>:   add    $0x20,%r12
   0x00007ffff464f2ae <+270>:   vpblendvb %xmm0,%xmm3,%xmm2,%xmm2
   0x00007ffff464f2b4 <+276>:   vpackuswb %xmm2,%xmm4,%xmm4
   0x00007ffff464f2b8 <+280>:   vmovdqu %xmm4,(%rcx)
   0x00007ffff464f2bc <+284>:   add    $0x10,%rcx
   0x00007ffff464f2c0 <+288>:   cmp    %esi,%edi
   0x00007ffff464f2c2 <+290>:   jg     0x7ffff464f288 <toLatin1_helper(QChar const*, int)+232>
   0x00007ffff464f2c4 <+292>:   and    $0xf,%ebp
   0x00007ffff464f2c7 <+295>:   jne    0x7ffff464f224 <toLatin1_helper(QChar const*, int)+132>
   0x00007ffff464f2cd <+301>:   jmpq   0x7ffff464f1d4 <toLatin1_helper(QChar const*, int)+52>
   0x00007ffff464f2d2 <+306>:   nopw   0x0(%rax,%rax,1)
   0x00007ffff464f2d8 <+312>:   mov    0x10(%rax),%rcx
   0x00007ffff464f2dc <+316>:   lea    0x18(%rax),%rdx
   0x00007ffff464f2e0 <+320>:   cmp    %rdx,%rcx
   0x00007ffff464f2e3 <+323>:   jne    0x7ffff464f20d <toLatin1_helper(QChar const*, int)+109>
   0x00007ffff464f2e9 <+329>:   jmpq   0x7ffff464f21f <toLatin1_helper(QChar const*, int)+127>
   0x00007ffff464f2ee <+334>:   callq  0x7ffff45dbc48 <__stack_chk_fail@plt>
   0x00007ffff464f2f3 <+339>:   mov    %rax,%rbp
   0x00007ffff464f2f6 <+342>:   mov    %rbx,%rdi
   0x00007ffff464f2f9 <+345>:   callq  0x7ffff45df610 <QByteArray::~QByteArray()>
   0x00007ffff464f2fe <+350>:   mov    %rbp,%rdi
   0x00007ffff464f301 <+353>:   callq  0x7ffff45dc188 <_Unwind_Resume@plt>
Comment 48 Sascha Jopen 2012-05-22 11:44:50 UTC
This is another one. I'm not sure, if it helps to post every unrecognised insn i encounter. If it does, i will continue testing. If the analyse-x86.py script would be nearly up to date one could gather all unknown insns at once?

Sascha

vex amd64->IR: unhandled instruction bytes: 0xC4 0xE1 0xF9 0x7E 0xC2 0x48 0x89 0xD1
vex amd64->IR:   REX=0 REX.W=1 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR:   PFX.66=1 PFX.F2=0 PFX.F3=0
==3945== valgrind: Unrecognised instruction at address 0x368ec3d080.
==3945==    at 0x368EC3D080: __floor_c (in /lib64/libm-2.15.so)

=>  368ec3d080:   c4 e1 f9 7e c2          vmovq  %xmm0,%rdx
  368ec3d085:   48 89 d1                mov    %rdx,%rcx
  368ec3d088:   48 c1 f9 34             sar    $0x34,%rcx
  368ec3d08c:   81 e1 ff 07 00 00       and    $0x7ff,%ecx
  368ec3d092:   81 e9 ff 03 00 00       sub    $0x3ff,%ecx
  368ec3d098:   83 f9 33                cmp    $0x33,%ecx
Comment 49 Julian Seward 2012-05-22 11:52:30 UTC
(In reply to comment #48)
> This is another one. I'm not sure, if it helps to post every unrecognised
> insn i encounter.

I think it is an effective way to target limited available development
effort to getting this working asap, so keep going.  I don't have time
to do any more this afternoon, but will continue tonight.  IME this
will go on for a couple of days, but we will quickly reach the point
where we have covered the entire subset of AVX insns that gcc can
produce.
Comment 50 Gunther Piez 2012-05-22 15:10:03 UTC
Now it dies a little later for me:

vex amd64->IR: unhandled instruction bytes: 0xC5 0xF9 0xE7 0x1 0xC5 0xF9 0xE7 0x41
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR:   PFX.66=1 PFX.F2=0 PFX.F3=0

which is 
  44480c:       c5 f9 e7 01             vmovntdq %xmm0,(%rcx)
  444810:       c5 f9 e7 41 10          vmovntdq %xmm0,0x10(%rcx)
  444815:       4c 8d 84 07 c0 00 00    lea    0xc0(%rdi,%rax,1),%r8
  44481c:       00 
  44481d:       c5 f9 e7 41 20          vmovntdq %xmm0,0x20(%rcx)
  444822:       c5 f9 e7 41 30          vmovntdq %xmm0,0x30(%rcx)
  444827:       4c 8d 8c 07 00 01 00    lea    0x100(%rdi,%rax,1),%r9

This is from a fast memset library which presumably uses "_mm_stream_si128". I can work around this, so this is not high priority.
Comment 51 Gunther Piez 2012-05-22 15:45:34 UTC
After that it dies at code generated from standard C floating point:

vex amd64->IR: unhandled instruction bytes: 0xC5 0x30 0x16 0xD7 0xC5 0x78 0x11 0x10
vex amd64->IR:   REX=0 REX.W=0 REX.R=1 REX.X=0 REX.B=0
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0x9 ESC=0F
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
==23405== valgrind: Unrecognised instruction at address 0x4694c7.
==23405==    at 0x4694C7: Parameters::add(std::string, float) (parameters.cpp:72)



This is the "vmovlhps %xmm7,%xmm9,%xmm10" instruction below:

       if (value >= 0.0) {
  4694ab:       c5 f8 2e d4             vucomiss %xmm4,%xmm2
  4694af:       72 24                   jb     4694d5 <_ZN10Parameters3addESsf+0xd5>
            min = 0.0;
            max = 2.0*value; }
  4694b1:       c5 ea 58 f2             vaddss %xmm2,%xmm2,%xmm6

    Parm(T value):
        value(value),
        var(value/8) {
        if (value >= 0.0) {
            min = 0.0;
  4694b5:       c5 f8 28 ec             vmovaps %xmm4,%xmm5
  4694b9:       c5 7a 10 44 24 0c       vmovss 0xc(%rsp),%xmm8
  4694bf:       c5 c8 14 f9             vunpcklps %xmm1,%xmm6,%xmm7
  4694c3:       c5 38 14 cd             vunpcklps %xmm5,%xmm8,%xmm9
*** die here:  4694c7:       c5 30 16 d7             vmovlhps %xmm7,%xmm9,%xmm10 ***
  4694cb:       c5 78 11 10             vmovups %xmm10,(%rax)
  4694cf:       48 83 c4 30             add    $0x30,%rsp
  4694d3:       5b                      pop    %rbx
  4694d4:       c3                      retq
Comment 52 Marc-Antoine Perennou 2012-05-22 16:17:58 UTC
Just hit another one:

vex amd64->IR: unhandled instruction bytes: 0xC5 0xF8 0xAE 0x5C 0x24 0xFC 0x81 0x4C
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
==28881== valgrind: Unrecognised instruction at address 0x402080.
==28881==    at 0x402080: set_fast_math (crtfastmath.c:127)

0000000000402080 <set_fast_math>:
  402080:       c5 f8 ae 5c 24 fc       vstmxcsr -0x4(%rsp)
  402086:       81 4c 24 fc 40 80 00    orl    $0x8040,-0x4(%rsp)
  40208d:       00
  40208e:       c5 f8 ae 54 24 fc       vldmxcsr -0x4(%rsp)
  402094:       c3                      retq
Comment 53 Julian Seward 2012-05-22 23:13:24 UTC
(In reply to comment #48)
> vex amd64->IR: unhandled instruction bytes: 0xC4 0xE1 0xF9 0x7E 0xC2 0x48
> =>  368ec3d080:   c4 e1 f9 7e c2          vmovq  %xmm0,%rdx

Done, r2339.  That was fun -- I couldn't find any description of it in
the Intel manuals.  Maybe I was looking in the wrong place.
Comment 54 Gunther Piez 2012-05-22 23:24:48 UTC
The description is under MOVD (not Q), page 737 in the reference manual.
Comment 55 Julian Seward 2012-05-22 23:37:12 UTC
(In reply to comment #51)
> vex amd64->IR: unhandled instruction bytes: 0xC5 0x30 0x16 0xD7 0xC5 0x78
> 4694c7:       c5 30 16 d7             vmovlhps %xmm7,%xmm9,%xmm10

Done, r2340.
Comment 56 Gunther Piez 2012-05-22 23:45:22 UTC
(In reply to comment #55)
> (In reply to comment #51)
> > vex amd64->IR: unhandled instruction bytes: 0xC5 0x30 0x16 0xD7 0xC5 0x78
> > 4694c7:       c5 30 16 d7             vmovlhps %xmm7,%xmm9,%xmm10
> 
> Done, r2340.

Updated to r2340, and for me it is still crashing at exatcly this instruction at the very same position.
Comment 57 Julian Seward 2012-05-23 00:01:02 UTC
(In reply to comment #56)
> > > 4694c7:       c5 30 16 d7             vmovlhps %xmm7,%xmm9,%xmm10
> Updated to r2340, and for me it is still crashing at exatcly this instruction at the very same position.

Urr, I implemented vmovhlps by mistake, not vmovlhps.  Will try again in the morning.
Comment 58 Julian Seward 2012-05-23 05:58:07 UTC
(In reply to comment #42)
>   47dc30:       c4 e2 79 1e c9          vpabsd %xmm1,%xmm1
Done, r2341.
Comment 59 Julian Seward 2012-05-23 06:17:45 UTC
(In reply to comment #57)
> > > > 4694c7:       c5 30 16 d7             vmovlhps %xmm7,%xmm9,%xmm10

Done (second try); r2342.
Comment 60 Gunther Piez 2012-05-23 07:08:54 UTC
Now it dies one instruction later, at "vmovups".

vex amd64->IR: unhandled instruction bytes: 0xC5 0x78 0x11 0x10 0x48 0x83 0xC4 0x30
vex amd64->IR:   REX=0 REX.W=0 REX.R=1 REX.X=0 REX.B=0
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
==2694== valgrind: Unrecognised instruction at address 0x4694cb.
==2694==    at 0x4694CB: Parameters::add(std::string, float) (parameters.cpp:72)
==2694==    by 0x46549A: Parameters::init() (parameters.cpp:100)
==2694==    by 0x46BD44: Console::init(int&, char**) (console.cpp:74)
==2694==    by 0x427429: main (main.cpp:28)

  4694c7:       c5 30 16 d7             vmovlhps %xmm7,%xmm9,%xmm10
  4694cb:       c5 78 11 10             vmovups %xmm10,(%rax)                    <----- *** crash here ***
  4694cf:       48 83 c4 30             add    $0x30,%rsp
  4694d3:       5b                      pop    %rbx
  4694d4:       c3                      retq

Possibly the way we have to go is somewhat longer than you expected :-)
But keep up the good work :-)
Comment 61 Corentin Chary 2012-05-23 07:13:12 UTC
Is there a way to generate a list of all instructions handled by valgrind ? If yes we could also generate a list of unsupported instructions from a pool of binaries.
Comment 62 Josef Weidendorfer 2012-05-23 08:56:36 UTC
Here is another one (from a small matrix-multiply code), at VEX revision r2342:

vex amd64->IR: unhandled instruction bytes: 0xC5 0xFA 0xE6 0xD9 0xC5 0xF1 0xFE 0xE2
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=1
==88909== valgrind: Unrecognised instruction at address 0x4004d0.

This is in

  4004cc:       c5 f9 6f cc             vmovdqa %xmm4,%xmm1
  4004d0:       c5 fa e6 d9             vcvtdq2pd %xmm1,%xmm3
  4004d4:       c5 f1 fe e2             vpaddd %xmm2,%xmm1,%xmm4
  4004d8:       c5 f9 70 c9 ee          vpshufd $0xee,%xmm1,%xmm1
  4004dd:       c5 e1 58 d8             vaddpd %xmm0,%xmm3,%xmm3
  4004e1:       c5 fa e6 c9             vcvtdq2pd %xmm1,%xmm1
  4004e5:       c5 f1 58 c8             vaddpd %xmm0,%xmm1,%xmm1
  4004e9:       c5 f9 29 19             vmovapd %xmm3,(%rcx)
  4004ed:       c5 f9 29 49 10          vmovapd %xmm1,0x10(%rcx)
Comment 63 Sascha Jopen 2012-05-23 09:14:27 UTC
This is the same instruction vmovq, but the other way around.

vex amd64->IR: unhandled instruction bytes: 0xC4 0xE1 0xF9 0x6E 0xC0 0xC3 0x81 0xF9
vex amd64->IR:   REX=0 REX.W=1 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR:   PFX.66=1 PFX.F2=0 PFX.F3=0
==6258== valgrind: Unrecognised instruction at address 0x368ec3d0c6.
==6258==    at 0x368EC3D0C6: __floor_c+70 (in /lib64/libm-2.15.so)

  368ec3d0c6:   c4 e1 f9 6e c0          vmovq  %rax,%xmm0
  368ec3d0cb:   c3                      retq   
  368ec3d0cc:   81 f9 00 04 00 00       cmp    $0x400,%ecx
  368ec3d0d2:   75 0c                   jne    368ec3d0e0 <__signbitl+0x51a0>
  368ec3d0d4:   c5 fb 58 c0             vaddsd %xmm0,%xmm0,%xmm0
  368ec3d0d8:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
  368ec3d0df:   00 
  368ec3d0e0:   f3 c3                   repz retq 
  368ec3d0e2:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  368ec3d0e8:   c5 fb 58 05 78 8c 03    vaddsd 0x38c78(%rip),%xmm0,%xmm0        # 368ec75d68 <__signbitl+0x3de28>
Comment 64 Julian Seward 2012-05-23 10:58:49 UTC
(In reply to comment #62)
> Here is another one (from a small matrix-multiply code), at VEX revision
> r2342:
>   4004e1:       c5 fa e6 c9             vcvtdq2pd %xmm1,%xmm1
>   4004e5:       c5 f1 58 c8             vaddpd %xmm0,%xmm1,%xmm1

These are vector floating point (xxxPD), so they are either from
handwritten assembly or not created by gcc at -O2 or below (yes?)
since gcc only does vectorization at -O3, IIUC.  Anyway, I am
concentrating on getting complete coverage for gcc-4.7.0 -O2, so I
won't do these yet since getting the vector FP support working is a
whole bunch of work and obviously the scalar FP support is still
incomplete.
Comment 65 Julian Seward 2012-05-23 11:37:44 UTC
(In reply to comment #50)
>   44480c:       c5 f9 e7 01             vmovntdq %xmm0,(%rcx)

(In reply to comment #60)
> 4694cb: c5 78 11 10 vmovups %xmm10,(%rax) 

(In reply to comment #63)
368ec3d0c6: c4 e1 f9 6e c0 vmovq %rax,%xmm0

These 3 are fixed in r2343 now.
Comment 66 Josef Weidendorfer 2012-05-23 12:40:36 UTC
(In reply to comment #64)
> >   4004e1:       c5 fa e6 c9             vcvtdq2pd %xmm1,%xmm1
> These are vector floating point (xxxPD), so they are either from
> handwritten assembly or not created by gcc at -O2 or below (yes?)

Oops. You are right. This is with -O3.
With -O2, it runs (even with cachegrind). Cool.
Comment 67 Julian Seward 2012-05-23 12:47:38 UTC
(In reply to comment #47)
> => 0x00007ffff464f28e <+238>:   vpcmpestrm $0x45,%xmm2,%xmm1

Done, r2344.

AFAICS, the only not-implemented one so far reported (apart from
Josef's vector ones) are vstmxcsr/vldmxcsr, in comment #52.  I don't
have a good way to test those.  I'll get to them though.

Let me know if I missed anything.
Comment 68 Jan Kundrát 2012-05-23 12:59:01 UTC
Using VEX r2344:

vex amd64->IR: unhandled instruction bytes: 0xC5 0xD9 0x67 0xE2 0xC5 0xFA 0x7F 0x21
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0x4 ESC=0F
vex amd64->IR:   PFX.66=1 PFX.F2=0 PFX.F3=0
==128337== valgrind: Unrecognised instruction at address 0x82492b4.
==128337==    at 0x82492B4: _ZL15toLatin1_helperPK5QChari+276 (emmintrin.h:703)
==128337==    by 0x824E946: _ZNK7QString8toLatin1Ev+38 (qstring.cpp:3709)
==128337==    by 0x82D75A5: _ZN16QSettingsPrivate15stringToVariantERK7QString+213 (qsettings.cpp:556)
==128337==    by 0x82D81B5: 

The exact offset which valgrind reports is not avialble in gdb's backtrace, though, so I'm not sure how relevant is the disassembly:

   0x00007ffff4654b94 <+228>:   add    $0x1,%ecx
   0x00007ffff4654b97 <+231>:   add    $0x10,%rbx
=> 0x00007ffff4654b9b <+235>:   vpunpcklbw %xmm1,%xmm0,%xmm2
   0x00007ffff4654b9f <+239>:   vpunpckhbw %xmm1,%xmm0,%xmm0
   0x00007ffff4654ba3 <+243>:   vmovdqu %xmm2,(%rdx)
   0x00007ffff4654ba7 <+247>:   vmovdqu %xmm0,0x10(%rdx)
   0x00007ffff4654bac <+252>:   add    $0x20,%rdx
   0x00007ffff4654bb0 <+256>:   cmp    %ecx,%edi
   0x00007ffff4654bb2 <+258>:   jg     0x7ffff4654b90 <QString::fromLatin1_helper(char const*, int)+224>
   0x00007ffff4654bb4 <+260>:   and    $0xf,%esi
   0x00007ffff4654bb7 <+263>:   jmpq   0x7ffff4654b23 <QString::fromLatin1_helper(char const*, int)+115>
   0x00007ffff4654bbc <+268>:   nopl   0x0(%rax)
   0x00007ffff4654bc0 <+272>:   mov    0x442f69(%rip),%rax        # 0x7ffff4a97b30
   0x00007ffff4654bc7 <+279>:   lock incl (%rax)
   0x00007ffff4654bca <+282>:   setne  %dl
   0x00007ffff4654bcd <+285>:   jmp    0x7ffff4654b65 <QString::fromLatin1_helper(char const*, int)+181>
   0x00007ffff4654bcf <+287>:   callq  0x7ffff45db6e8 <strlen@plt>
   0x00007ffff4654bd4 <+292>:   mov    %eax,%esi
   0x00007ffff4654bd6 <+294>:   jmpq   0x7ffff4654ae3 <QString::fromLatin1_helper(char const*, int)+51>
   0x00007ffff4654bdb <+299>:   nopl   0x0(%rax,%rax,1)
   0x00007ffff4654be0 <+304>:   mov    %rax,0x8(%rsp)
   0x00007ffff4654be5 <+309>:   callq  0x7ffff45f53d0 <qBadAlloc()>
   0x00007ffff4654bea <+314>:   mov    (%rsp),%esi
   0x00007ffff4654bed <+317>:   mov    0x8(%rsp),%rax
   0x00007ffff4654bf2 <+322>:   jmpq   0x7ffff4654aff <QString::fromLatin1_helper(char const*, int)+79>
   0x00007ffff4654bf7 <+327>:   callq  0x7ffff45dbc48 <__stack_chk_fail@plt>
Comment 69 Jakub Jelinek 2012-05-23 14:29:39 UTC
If you want good testing coverage, e.g. gcc's gcc/testsuite/gcc.target/i386/ should cover lots of insns, both for AVX, older ISAs and newer ISAs (XOP, AVX2, HLE/RTM, BMI, BMI2, ...).
Pick up dg-do run tests, compile/link them with the corresponding gcc (please use 4.7 or better, for HLE/RTM you need trunk) using the options mentioned in dg-options comment, run them both without valgrind and with valgrind.
Comment 70 Gunther Piez 2012-05-23 16:10:18 UTC
A new one, "vmovd xmm0,edx", this time in glibc-2.15:

vex amd64->IR: unhandled instruction bytes: 0xC5 0xF9 0x7E 0xC2 0x81 0xFA 0xFF 0xFF
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR:   PFX.66=1 PFX.F2=0 PFX.F3=0
==24740== valgrind: Unrecognised instruction at address 0x3001c273c0.
==24740==    at 0x3001C273C0: __logf_finite (e_logf.c:39)
==24740==    by 0x44A58F: Parameters::mutate(float) (cmath:361)
==24740==    by 0x44ACFD: Evolution::initFixed(int) (evolution.cpp:54)
==24740==    by 0x469D63: Console::init(int&, char**) (console.cpp:77)
==24740==    by 0x4271E9: main (main.cpp:28)

(gdb) disas 0x3001C273C0,+100
Dump of assembler code from 0x3001c273c0 to 0x3001c27424:
   0x0000003001c273c0:  vmovd  %xmm0,%edx
   0x0000003001c273c4:  cmp    $0x7fffff,%edx
   0x0000003001c273ca:  jg     0x3001c273f8
   0x0000003001c273cc:  test   $0x7fffffff,%edx
   0x0000003001c273d2:  je     0x3001c275ed
   0x0000003001c273d8:  test   %edx,%edx
Comment 71 Sascha Jopen 2012-05-23 18:51:01 UTC
Insn vcvtsd2ss is not implemented. vcvtss2sd occurs a few insns later, which seems to be missing, too.

vex amd64->IR: unhandled instruction bytes: 0xC5 0xFB 0x5A 0xC0 0xC5 0xF8 0x28 0xC8
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR:   PFX.66=0 PFX.F2=1 PFX.F3=0
==22268== valgrind: Unrecognised instruction at address 0x368ec22498.
==22268==    at 0x368EC22498: exp+88 (in /lib64/libm-2.15.so)


368ec2248c:   48 8b 05 0d cb 2c 00    mov    0x2ccb0d(%rip),%rax        # 368eeeefa0 <__signbitl+0x2b7060>
  368ec22493:   83 38 ff                cmpl   $0xffffffff,(%rax)
  368ec22496:   74 c4                   je     368ec2245c <exp+0x1c>
  368ec22498:   c5 fb 5a c0             vcvtsd2ss %xmm0,%xmm0,%xmm0
  368ec2249c:   c5 f8 28 c8             vmovaps %xmm0,%xmm1
  368ec224a0:   bf 07 00 00 00          mov    $0x7,%edi
  368ec224a5:   e8 86 6b fe ff          callq  368ec09030 <matherr@plt+0x28d0>
  368ec224aa:   c5 fa 5a c0             vcvtss2sd %xmm0,%xmm0,%xmm0
  368ec224ae:   eb d7                   jmp    368ec22487 <exp+0x47>
Comment 72 Julian Seward 2012-05-23 21:31:28 UTC
(In reply to comment #68)
> ==128337== valgrind: Unrecognised instruction at address 0x82492b4.
> The exact offset which valgrind reports is not avialble in gdb's backtrace,
> though, so I'm not sure how relevant is the disassembly:
> => 0x00007ffff4654b9b <+235>:   vpunpcklbw %xmm1,%xmm0,%xmm2

Yes, the page offset (b9b) is completely wrong (vs 2b4).  Anyway, by
looking at the other output, I'd guess the insn is some VPACKUSWB
variant.  Can you try again to find it?
Comment 73 Gunther Piez 2012-05-23 21:47:05 UTC
You are right C5 xx 67 (where xx is any byte) is the 128bit vex-encoded VPACKUSWB instruction. If you don't know it already, you can use www.sandpile.org for an opcode lookup.
Comment 74 Julian Seward 2012-05-23 23:56:52 UTC
(In reply to comment #70)
>    0x0000003001c273c0:  vmovd  %xmm0,%edx

(In reply to comment #71)
368ec22498: c5 fb 5a c0 vcvtsd2ss %xmm0,%xmm0,%xmm0
368ec224aa: c5 fa 5a c0 vcvtss2sd %xmm0,%xmm0,%xmm0

Done, r2345.
Comment 75 Julian Seward 2012-05-24 00:10:04 UTC
(In reply to comment #73)
> You are right C5 xx 67 (where xx is any byte) is the 128bit vex-encoded
> VPACKUSWB instruction.

Done, r2346.
Comment 76 Jan Kundrát 2012-05-24 00:49:16 UTC
vex amd64->IR: unhandled instruction bytes: 0xC5 0xC0 0xC6 0xFF 0x0 0x45 0x19 0xC0
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0x7 ESC=0F
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
==38183== valgrind: Unrecognised instruction at address 0x6fe1ca7.
==38183==    at 0x6FE1CA7: _ZN16QRadialFetchSimdI9QSimdSse2E5fetchEPjS2_PK8OperatorPK9QSpanDataddddd+167 (xmmintrin.h:866)
==38183==    by 0x6FE2361: _Z33qt_fetch_radial_gradient_templateI16QRadialFetchSimdI9QSimdSse2EEPKjPjPK8OperatorPK9QSpanDataiii+945 (qdrawhelper_p.h:439)

Looks like it's:

=> 0x00007ffff51f6ca7 <+167>:   vshufps $0x0,%xmm7,%xmm7,%xmm7
Comment 77 Gunther Piez 2012-05-24 11:25:42 UTC
Next is VCVTTSS2SI, float to integer conversion.

vex amd64->IR: unhandled instruction bytes: 0xC4 0x61 0xFA 0x2C 0x9C 0x24 0xE0 0xB
vex amd64->IR:   REX=0 REX.W=1 REX.R=1 REX.X=0 REX.B=0
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=1
==28050== valgrind: Unrecognised instruction at address 0x45699d.


  45699d:       c4 61 fa 2c 9c 24 e0    vcvttss2si 0xbe0(%rsp),%r11 *** crash here***
  4569a4:       0b 00 00 
  4569a7:       c5 f9 d6 4c 24 68       vmovq  %xmm1,0x68(%rsp)
Comment 78 Gunther Piez 2012-05-24 11:29:52 UTC
Actually CVTTSS2SI is float to integer with truncation rounding toward zero.
Comment 79 Sascha Jopen 2012-05-24 12:26:00 UTC
Hey,

it seems that all instructions for my current test setup are implemented. Thanks for your great work Julian. I'll report back as soon as i find new ones :-)

Regards,
Sascha
Comment 80 Julian Seward 2012-05-24 16:30:57 UTC
(In reply to comment #77)
>   45699d:       c4 61 fa 2c 9c 24 e0    vcvttss2si 0xbe0(%rsp),%r11 ***

Done (+ 3 others) in r2349.
Comment 81 Marc-Antoine Perennou 2012-05-24 16:49:52 UTC
Another one

vex amd64->IR: unhandled instruction bytes: 0xC5 0xF9 0x6C 0xC0 0xC5 0xFA 0x6F 0xE
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR:   PFX.66=1 PFX.F2=0 PFX.F3=0
==16615== valgrind: Unrecognised instruction at address 0x54420cc.
==16615==    at 0x54420CC: read_alias_file+1164 (in /usr/lib64/libc-2.15.so)

   340cc:       c5 f9 6c c0             vpunpcklqdq %xmm0,%xmm0,%xmm0
   340d0:       c5 fa 6f 0e             vmovdqu (%rsi),%xmm1
   340d4:       48 83 c7 01             add    $0x1,%rdi
   340d8:       c5 f1 d4 c8             vpaddq %xmm0,%xmm1,%xmm1
   340dc:       c5 fa 7f 0e             vmovdqu %xmm1,(%rsi)
   340e0:       48 83 c6 10             add    $0x10,%rsi
   340e4:       48 3b bd 28 fe ff ff    cmp    -0x1d8(%rbp),%rdi
   340eb:       75 e3                   jne    340d0 <read_alias_file+0x490>
   340ed:       e9 0c ff ff ff          jmpq   33ffe <read_alias_file+0x3be>
   340f2:       66 66 66 66 66 2e 0f    data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
   340f9:       1f 84 00 00 00 00 00
Comment 82 Marc-Antoine Perennou 2012-05-24 16:55:01 UTC
And yet another

vex amd64->IR: unhandled instruction bytes: 0xC4 0xE2 0x79 0x20 0xC8 0xC5 0xF9 0x73
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F38
vex amd64->IR:   PFX.66=1 PFX.F2=0 PFX.F3=0
==16936== valgrind: Unrecognised instruction at address 0x5c6d1ed.
==16936==    at 0x5C6D1ED: _ZNSt8numpunctIwE22_M_initialize_numpunctEP15__locale_struct+317 (in /usr/lib64/libstdc++.so.6.0.17)  

   881ed:       c4 e2 79 20 c8          vpmovsxbw %xmm0,%xmm1
   881f2:       c5 f9 73 d8 08          vpsrldq $0x8,%xmm0,%xmm0
   881f7:       c4 e2 79 23 d1          vpmovsxwd %xmm1,%xmm2
   881fc:       c5 f1 73 d9 08          vpsrldq $0x8,%xmm1,%xmm1
   88201:       c4 e2 79 20 c0          vpmovsxbw %xmm0,%xmm0
   88206:       c4 e2 79 23 c9          vpmovsxwd %xmm1,%xmm1
   8820b:       c5 fa 7f 51 50          vmovdqu %xmm2,0x50(%rcx)
   88210:       c5 fa 7f 49 60          vmovdqu %xmm1,0x60(%rcx)
   88215:       c4 e2 79 23 c8          vpmovsxwd %xmm0,%xmm1
   8821a:       c5 f9 73 d8 08          vpsrldq $0x8,%xmm0,%xmm0
   8821f:       c4 e2 79 23 c0          vpmovsxwd %xmm0,%xmm0
   88224:       c5 fa 7f 49 70          vmovdqu %xmm1,0x70(%rcx)
   88229:       c5 fa 7f 81 80 00 00    vmovdqu %xmm0,0x80(%rcx)
   88230:       00
   88231:       c5 fa 6f 46 10          vmovdqu 0x10(%rsi),%xmm0
   88236:       c4 e2 79 20 c8          vpmovsxbw %xmm0,%xmm1
   8823b:       c5 f9 73 d8 08          vpsrldq $0x8,%xmm0,%xmm0
   88240:       c4 e2 79 23 d1          vpmovsxwd %xmm1,%xmm2
   88245:       c5 f1 73 d9 08          vpsrldq $0x8,%xmm1,%xmm1
   8824a:       c4 e2 79 20 c0          vpmovsxbw %xmm0,%xmm0
   8824f:       c4 e2 79 23 c9          vpmovsxwd %xmm1,%xmm1
   88254:       c5 fa 7f 91 90 00 00    vmovdqu %xmm2,0x90(%rcx)
   8825b:       00
   8825c:       c5 fa 7f 89 a0 00 00    vmovdqu %xmm1,0xa0(%rcx)
   88263:       00
   88264:       c4 e2 79 23 c8          vpmovsxwd %xmm0,%xmm1
   88269:       c5 f9 73 d8 08          vpsrldq $0x8,%xmm0,%xmm0
   8826e:       c4 e2 79 23 c0          vpmovsxwd %xmm0,%xmm0
   88273:       c5 fa 7f 89 b0 00 00    vmovdqu %xmm1,0xb0(%rcx)
   8827a:       00
   8827b:       c5 fa 7f 81 c0 00 00    vmovdqu %xmm0,0xc0(%rcx)
   88282:       00
Comment 83 Franz Trischberger 2012-05-24 17:00:31 UTC
callgrind on dolphin

vex amd64->IR: unhandled instruction bytes: 0xC5 0xF9 0x71 0xD1 0x8 0xC5 0xF1 0xDB
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR:   PFX.66=1 PFX.F2=0 PFX.F3=0
==13668== valgrind: Unrecognised instruction at address 0x874f086.
==13668==    at 0x874F086: _Z31comp_func_solid_SourceOver_sse2Pjijj+326 (emmintrin.h:1167)
==13668==    by 0x893CAB2: _ZL19blend_color_genericiPK11QT_FT_Span_Pv+306 (qdrawhelper.cpp:3316)
==13668==    by 0x894A4BF: gray_convert_glyph+1631 (qgrayraster.c:1756)
==13668==    by 0x891491A: _ZN25QRasterPaintEnginePrivate9rasterizeEP14QT_FT_Outline_PFviPK11QT_FT_Span_PvES5_P13QRasterBuffer.part.114+474 (qpaintengine_raster.cpp:3834)
==13668==    by 0x892041D: _ZN18QRasterPaintEngine4fillERK11QVectorPathRK6QBrush+509 (qpaintengine_raster.cpp:1753)
==13668==    by 0x8892D2C: _ZN14QPaintEngineEx4drawERK11QVectorPath+124 (qpaintengineex.cpp:599)
==13668==    by 0x889462D: _ZN14QPaintEngineEx15drawRoundedRectERK6QRectFddN2Qt8SizeModeE+477 (qpaintengineex.cpp:779)
==13668==    by 0x88A9099: _ZN8QPainter15drawRoundedRectERK6QRectFddN2Qt8SizeModeE+57 (qpainter.cpp:4238)
==13668==    by 0x10286343: _ZN6Bespin8Elements12sunkenShadowEib+403 (elements.cpp:135)
==13668==    by 0x10037B75: _ZN6Bespin5Style15generatePixmapsEv+869 (genpixmaps.cpp:75)
==13668==    by 0x10043C69: _ZN6Bespin5Style4initEPK9QSettings+249 (init.cpp:688)
==13668==    by 0x1002A4D0: _ZN6Bespin5StyleC1Ev+96 (bespin.cpp:300)
==13668==    by 0x10030661: _ZN17BespinStylePlugin6createERK7QString+97 (bespin.cpp:79)
==13668==    by 0x8A54178: _ZN13QStyleFactory6createERK7QString+408 (qstylefactory.cpp:193)
==13668==    by 0x875C728: _ZN12QApplication5styleEv+184 (qapplication.cpp:1462)
==13668==    by 0x87CC0C3: _ZL20qt_set_x11_resourcesPKcS0_S0_S0_+499 (qapplication_x11.cpp:1289)
==13668==    by 0x87D0350: _Z7qt_initP19QApplicationPrivateiP9_XDisplaymm+5552 (qapplication_x11.cpp:2397)
==13668==    by 0x875D1B3: _ZN19QApplicationPrivate9constructEP9_XDisplaymm+211 (qapplication.cpp:839)
==13668==    by 0x875D8E9: _ZN12QApplicationC1ERiPPcbi+121 (qapplication.cpp:772)
==13668==    by 0x750EE06: _ZN12KApplicationC1Eb+54 (kapplication.cpp:346)
==13668==    by 0x4C5867D: _ZN18DolphinApplicationC1Ev+29 (dolphinapplication.cpp:31)
==13668==    by 0x4C6F829: kdemain+3529 (main.cpp:85)
==13668==    by 0x4ED138C: (below main)+236 (libc-start.c:226)

The whole system is compiled using gcc-4.6.3 with march=corei7-avx -O2
Comment 84 Gunther Piez 2012-05-24 17:27:12 UTC
The next one is a  VPINSRQ, which reulted from a call to 
vector = _mm_insert_epi64(vector, k, 1);
I guess this counts as integer vector code. Anyway, here is the relevant error message:

vex amd64->IR: unhandled instruction bytes: 0xC4 0xE3 0xE9 0x22 0xDB 0x1 0xC5 0xF9
vex amd64->IR:   REX=0 REX.W=1 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0x2 ESC=0F3A
vex amd64->IR:   PFX.66=1 PFX.F2=0 PFX.F3=0
==30797== valgrind: Unrecognised instruction at address 0x443068.
...
 443068:       c4 e3 e9 22 db 01       vpinsrq $0x1,%rbx,%xmm2,%xmm3
...
Comment 85 Julian Seward 2012-05-25 13:54:22 UTC
Vex r2350 adds support for the following:

VADDPS xmm3/m128, xmm2, xmm1 = VEX.NDS.128.0F.WIG 58 /r
VMULPS xmm3/m128, xmm2, xmm1 = VEX.NDS.128.0F.WIG 59 /r
VCVTPS2DQ xmm2/m128, xmm1 = VEX.128.66.0F.WIG 5B /r
VSUBPS xmm3/m128, xmm2, xmm1 = VEX.NDS.128.0F.WIG 5C /r
VMINPS xmm3/m128, xmm2, xmm1 = VEX.NDS.128.0F.WIG 5D /r
VMAXPS xmm3/m128, xmm2, xmm1 = VEX.NDS.128.0F.WIG 5F /r
VPUNPCKLWD r/m, rV, r ::: r = interleave-lo-words(rV, r/m)
VPUNPCKHWD r/m, rV, r ::: r = interleave-hi-words(rV, r/m)
VPSHUFLW imm8, xmm2/m128, xmm1 = VEX.128.F2.0F.WIG 70 /r ib
VPSHUFHW imm8, xmm2/m128, xmm1 = VEX.128.F3.0F.WIG 70 /r ib
VPSRLD imm8, xmm2, xmm1 = VEX.NDD.128.66.0F.WIG 72 /2 ib
VPSLLDQ imm8, xmm2, xmm1 = VEX.NDD.128.66.0F.WIG 73 /7 ib
VSHUFPS imm8, xmm3/m128, xmm2, xmm1, xmm2
VPMULLW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG D5 /r
VPSUBUSB xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG D8 /r
VPANDN r/m, rV, r ::: r = rV & ~r/m (is that correct, re the ~ ?)
VPADDUSB xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG DC /r
VPADDUSW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG DD /r
VPMULHUW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG E4 /r
Comment 86 Julian Seward 2012-05-25 15:56:45 UTC
Vex r2351 adds support for the following:

VPUNPCKLDQ r/m, rV, r ::: r = interleave-lo-dwords(rV, r/m)
VPACKSSDW r/m, rV, r ::: r = QNarrowBin32Sto16Sx8(rV, r/m)
VPUNPCKLQDQ r/m, rV, r ::: r = interleave-lo-64bitses(rV, r/m)
VPSRLW imm8, xmm2, xmm1 = VEX.NDD.128.66.0F.WIG 71 /2 ib
VPADDW r/m, rV, r ::: r = rV + r/m 
VPINSRD r32/m32, xmm2, xmm1 = VEX.NDS.128.66.0F3A.W0 22 /r ib
Comment 87 Jan Kundrát 2012-05-27 18:53:44 UTC
Using VEX r2358, my Qt style (Oxygen) fails with this inside QRadialFetchSimd<QSimdSse2>::fetch(unsigned int*, unsigned int*, Operator const*, QSpanData const*, double, double, double, double, double) (xmmintrin.h:352):

vex amd64->IR: unhandled instruction bytes: 0xC5 0x70 0xC2 0xEB 0x1 0xC4 0xC1 0x60
vex amd64->IR:   REX=0 REX.W=0 REX.R=1 REX.X=0 REX.B=0
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0x1 ESC=0F
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
==105358== valgrind: Unrecognised instruction at address 0x6fe1d3e.
==105358==    at 0x6FE1D3E: QRadialFetchSimd<QSimdSse2>::fetch(unsigned int*, unsigned int*, Operator const*, QSpanData const*, double, double, double, double, double) (xmmintrin.h:352)

gdb calls it VXORPS:

=> 0x00007ffff51f6d02 <+258>:   vxorps %xmm1,%xmm1,%xmm1

The disassembly of that function shows a few more instructions which -- if I interpret these lists correctly -- aren't implemented yet:

- 0x00007ffff51f6d0b <+267>:   vmovaps 0x80(%rsp),%xmm3
- 0x00007ffff51f6d3e <+318>:   vcmpltps %xmm3,%xmm1,%xmm13
- 0x00007ffff51f6d48 <+328>:   vsqrtps %xmm11,%xmm11
- 0x00007ffff51f6f49 <+841>:   vcvttps2dq %xmm11,%xmm11
Comment 88 Eero Pajarre 2012-05-29 10:27:22 UTC
I am also seeing the vxorps operation with "unhandled bytes" 
0xc5 0xf8 0x57 0xc0 0xc5 0xfa 0x11 0x44

The "more odd" issue I am seeing is with an other program.
I am not sure if this is because of the new processor stuff, or is it
just some incompatibility with the Ubuntu 12.04 version. Or perhaps
an error which I have introduced while compiling to Ubuntu 12.04

   Eero

Interesting parts of the Valgrind/memcheck report follows:
 
assumed next %rip = 0x4168E2
 actual next %rip = 0x4168E3

vex: the `impossible' happened:
   disInstr_AMD64: disInstr miscalculated next %rip
vex storage: T total 370065048 bytes allocated
vex storage: P total 960 bytes allocated

valgrind: the 'impossible' happened:
   LibVEX called failure_exit().
==7247==    at 0x38044186: report_and_quit (m_libcassert.c:210)
==7247==    by 0x380441E3: panic (m_libcassert.c:294)
==7247==    by 0x38044398: vgPlain_core_panic_at (m_libcassert.c:299)
==7247==    by 0x380443AA: vgPlain_core_panic (m_libcassert.c:304)
==7247==    by 0x3805B0A2: failure_exit (m_translate.c:700)
==7247==    by 0x380DCA28: vpanic (main_util.c:226)
==7247==    by 0x38144E8B: disInstr_AMD64 (guest_amd64_toIR.c:19144)
==7247==    by 0x380ED4E7: bb_to_IR (guest_generic_bb_to_IR.c:305)
==7247==    by 0x380DB1E0: LibVEX_Translate (main_main.c:508)
==7247==    by 0x3805D302: vgPlain_translate (m_translate.c:1535)
==7247==    by 0x38086CB8: vgPlain_scheduler (scheduler.c:901)
==7247==    by 0x38096CE5: run_a_thread_NORETURN (syswrap-linux.c:98)
Comment 89 Julian Seward 2012-06-02 11:59:03 UTC
vex r2365 adds support for the following:

VMOVAPD ymm1, ymm2/m256 = VEX.256.66.0F.WIG 29 /r
VMOVAPS ymm1, ymm2/m256 = VEX.256.0F.WIG 29 /r
VPADDQ r/m, rV, r ::: r = rV + r/m
VPSUBW r/m, rV, r ::: r = rV - r/m
VPSUBQ = VEX.NDS.128.66.0F.WIG FB /r
VPINSRQ r64/m64, xmm2, xmm1 = VEX.NDS.128.66.0F3A.W1 22 /r ib
Comment 90 Gunther Piez 2012-06-04 00:02:44 UTC
The next step:

vex amd64->IR: unhandled instruction bytes: 0xC4 0x41 0x69 0xC4 0xC9 0x0 0xC4 0x41
vex amd64->IR:   REX=0 REX.W=0 REX.R=1 REX.X=0 REX.B=1
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0x2 ESC=0F
vex amd64->IR:   PFX.66=1 PFX.F2=0 PFX.F3=0
==21651== valgrind: Unrecognised instruction at address 0x46b2dd

  46b2dd:       c4 41 69 c4 c9 00       vpinsrw $0x0,%r9d,%xmm2,%xmm9
  46b2e3:       c4 41 7b 2c f7          vcvttsd2si %xmm15,%r14d
  46b2e8:       c4 c1 31 c4 c6 01       vpinsrw $0x1,%r14d,%xmm9,%xmm0
Comment 91 Jens Dieskau 2012-06-04 22:30:42 UTC
vex amd64->IR: unhandled instruction bytes: 0xC5 0xF9 0xFC 0xC1 0xC5 0xF9 0x7F 0x1
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR:   PFX.66=1 PFX.F2=0 PFX.F3=0
==14795== valgrind: Unrecognised instruction at address 0x5e3a250.

1cc440:       c5 f9 fc c1             vpaddb %xmm1,%xmm0,%xmm0
Comment 92 Jens Dieskau 2012-06-04 22:31:50 UTC
vex amd64->IR: unhandled instruction bytes: 0xC4 0xE2 0x79 0x18 0x64 0x24 0x54 0x48
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F38
vex amd64->IR:   PFX.66=1 PFX.F2=0 PFX.F3=0

   68056:       c4 e2 79 18 64 24 54    vbroadcastss 0x54(%rsp),%xmm4
Comment 93 Jakub Jelinek 2012-06-07 12:11:21 UTC
What kind of gcc compiled that firefox that even the most basic instructions like vaddpd or vmulpd (256-bit) don't show up in it?

Even as simple testcase as following with -O3 -mavx doesn't work.

double a[2048], b[1024], c[1024];
int
main ()
{
  int i;
  for (i = 0; i < 1024; i++)
    a[i] = b[i] + c[i];
  for (i = 0; i < 1024; i++)
    a[i + 1024] = b[i] * c[i];
  asm volatile ("" : : : "memory");
  return 0;
}

I'd help with adding support for some AVX instructions, but I think it would be better if Julian
converted a few such insns first to select a style.

Say for VADDPS, do you want to duplicate everything like:
      /* VADDPS xmm3/m128, xmm2, xmm1 = VEX.NDS.128.0F.WIG 58 /r */
      if (haveNo66noF2noF3(pfx) && 0==getVexL(pfx)/*128*/) {
         delta = dis_AVX128_E_V_to_G(
                    uses_vvvv, vbi, pfx, delta, "vaddps", Iop_Add32Fx4 );
         goto decode_success;
      }
      /* VADDPS ymm3/m128, ymm2, ymm1 = VEX.NDS.256.0F.WIG 58 /r */
      if (haveNo66noF2noF3(pfx) && 1==getVexL(pfx)/*256*/) {
         delta = dis_AVX256_E_V_to_G(
                    uses_vvvv, vbi, pfx, delta, "vaddps", Iop_Add32Fx8 );
         goto decode_success;
      }
or say have some function to which you pass both Iop_Add32Fx4 and Iop_Add32Fx8 arguments
and it will DTRT based on getVexL(pfx)?  When trying gcc avx tests with valgrind, everything stops immediately because CPUID under valgrind doesn't indicate  AVX support (but not even SSE3+ support), when hacked around the next crash is on the detection of AVX OS support (xgetbv instruction which makes valgrind panic), and when even this is hacked around fails on most AVX insns.
Comment 94 Julian Seward 2012-06-12 15:05:15 UTC
(In reply to comment #93)
> What kind of gcc compiled that firefox that even the most basic instructions
> like vaddpd or vmulpd (256-bit) don't show up in it?

I concentrated first on adding support for "gcc-4.7.0 -mavx -g -O2"
code.  vaddpd etc only appear when gcc is vectorizing, at -O3.  I have
been working through the -O3 output the past couple days now.

> Say for VADDPS, do you want to duplicate everything like:

Yes, there is some level of duplication.  It's difficult to get rid of
entirely.  I do what I can, but the primary emphasis is to ensure that
what is implemented is correct.
Comment 95 Julian Seward 2012-06-12 15:07:54 UTC
Created attachment 71767 [details]
P

VEX r2379 added support for these:

    VMOVUPD xmm2/m128, xmm1 = VEX.128.66.0F.WIG 10 /r 
    VMOVUPS xmm2/m128, xmm1 = VEX.128.0F.WIG 10 /r
    VUNPCKHPS xmm3/m128, xmm2, xmm1 = VEX.NDS.128.0F.WIG 15 /r
    VUNPCKLPD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG 14 /r
    VUNPCKHPD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG 15 /r
    VADDPS ymm3/m256, ymm2, ymm1 = VEX.NDS.256.0F.WIG 58 /r
    VADDPD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG 58 /r
    VADDPD ymm3/m256, ymm2, ymm1 = VEX.NDS.256.66.0F.WIG 58 /r
    VMULPS ymm3/m256, ymm2, ymm1 = VEX.NDS.256.0F.WIG 59 /r
    VMULPD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG 59 /r
    VMULPD ymm3/m256, ymm2, ymm1 = VEX.NDS.256.66.0F.WIG 59 /r
    VSUBPS ymm3/m256, ymm2, ymm1 = VEX.NDS.256.0F.WIG 5C /r
    VSUBPD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG 5C /r
    VSUBPD ymm3/m256, ymm2, ymm1 = VEX.NDS.256.66.0F.WIG 5C /r
    VDIVPS ymm3/m256, ymm2, ymm1 = VEX.NDS.256.0F.WIG 5E /r
    VDIVPD ymm3/m256, ymm2, ymm1 = VEX.NDS.256.66.0F.WIG 5E /r
    VPSRLQ  imm8, xmm2, xmm1 = VEX.NDD.128.66.0F.WIG 73 /2 ib
    VPCMPEQQ = VEX.NDS.128.66.0F38.WIG 29 /r
    VPCMPGTQ = VEX.NDS.128.66.0F38.WIG 37 /r 
    VPEXTRQ = VEX.128.66.0F3A.W1 16 /r ib

r2380 added support for these:

    VPSLLQ  imm8, xmm2, xmm1 = VEX.NDD.128.66.0F.WIG 73 /6 ib
    VPEXTRW imm8, xmm1, reg32 = VEX.128.66.0F.W0 C5 /r ib
    VPMINUB xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG DA /r
    VPMAXUB xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG DE /r
    VPMINSW r/m, rV, r ::: r = min-signed16s(rV, r/m)
    VPMAXSW r/m, rV, r ::: r = max-signed16s(rV, r/m)
    VPMULUDQ xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG F4 /r
    VPMINSB r/m, rV, r ::: r = min-signed-8s(rV, r/m)
    VPMINUW r/m, rV, r ::: r = min-unsigned-16s(rV, r/m)
    VPMINUD r/m, rV, r ::: r = min-unsigned-32s(rV, r/m)
    VPMAXSB r/m, rV, r ::: r = max-signed-8s(rV, r/m)
    VPMAXUW r/m, rV, r ::: r = max-unsigned-16s(rV, r/m)
    VPMAXUD r/m, rV, r ::: r = max-unsigned-32s(rV, r/m)
    VPMULLD r/m, rV, r ::: r = mul-32s(rV, r/m)
    VPHMINPOSUW xmm2/m128, xmm1 = VEX.128.66.0F38.WIG 41 /r
    VPERMILPD imm8, ymm2/m256, ymm1 = VEX.256.66.0F3A.W0 05 /r ib
    VPERMILPD imm8, xmm2/m128, xmm1 = VEX.128.66.0F3A.W0 05 /r ib
    VPERM2F128 imm8, ymm3/m256, ymm2, ymm1 = VEX.NDS.66.0F3A.W0 06 /r ib
    VPEXTRB imm8, xmm2, reg/m8 = VEX.128.66.0F3A.W0 14 /r ib

Note that although programs run ok with --tool=none, 256 bit vector
arithmetic (vaddpd etc) will still crash Memcheck, because the
instrumentation stuff for 256 bit vectors is incomplete.
Comment 96 Jakub Jelinek 2012-06-12 15:46:26 UTC
Created attachment 71769 [details]
Q

On Tue, Jun 12, 2012 at 03:07:54PM +0000, Julian Seward wrote:
> Note that although programs run ok with --tool=none, 256 bit vector
> arithmetic (vaddpd etc) will still crash Memcheck, because the
> instrumentation stuff for 256 bit vectors is incomplete.

Quick patch to implement 128-bit VDIVPS/VDIVPD and 256-bit VMOVAPS.
Testcase (to be tested with gcc -O3 -mavx and gcc -O3 -mavx -mprefer-avx128):

#define N 1024
double a[N], b[N], c[N];
float d[N], e[N], f[N];

int
main ()
{
  int i;
  for (i = 0; i < N; i++)
    {
      b[i] = i + 1.0;
      c[i] = i + 2.0;
      d[i] = i + 1.0;
      e[i] = i + 2.0;
      asm ("");
    }
  asm volatile ("");
  for (i = 0; i < N; i++)
    {
      a[i] = b[i] / c[i];
      d[i] = e[i] / f[i];
    }
  asm volatile ("");
  for (i = 0; i < N; i++)
    if (a[i] != b[i] / c[i] || d[i] != e[i] / f[i])
      __builtin_abort ();
  return 0;
}
Comment 97 Jakub Jelinek 2012-06-12 17:10:15 UTC
On Tue, Jun 12, 2012 at 03:46:26PM +0000, Jakub Jelinek wrote:
> Quick patch to implement 128-bit VDIVPS/VDIVPD and 256-bit VMOVAPS.

And here is another, to implement V{AND,ANDN,OR,XOR}P{S,D} 256-bit.
Testcase (to be tested with gcc -O3 -mavx and
gcc -O3 -mavx -DINTRIN -fno-strict-aliasing
(the former is just autovectorization, the latter using intrinsics)):

#include <x86intrin.h>

#define N 1024
long a[N], b[N], c[N];
int d[N], e[N], f[N];

int
main ()
{
  int i;
  for (i = 0; i < N; i++)
    {
      b[i] = i * 0x123456789abcdefUL;
      c[i] = i * 0xfedcba987654321UL;
      d[i] = i * 0x1234567;
      e[i] = i + 0x7654321;
      asm ("");
    }
  asm volatile ("");
#ifdef INTRIN
  double *ad = (double *) &a[0], *bd = (double *) &b[0], *cd = (double *) &c[0];
  for (i = 0; i < N; i += 4)
    _mm256_store_pd (ad + i, _mm256_and_pd (_mm256_load_pd (bd + i), _mm256_load_pd (cd + i)));
  float *dd = (float *) &d[0], *ed = (float *) &e[0], *fd = (float *) &f[0];
  for (i = 0; i < N; i += 8)
    _mm256_store_ps (dd + i, _mm256_and_ps (_mm256_load_ps (ed + i), _mm256_load_ps (fd + i)));
#else
  for (i = 0; i < N; i++)
    {
      a[i] = b[i] & c[i];
      d[i] = e[i] & f[i];
    }
#endif
  asm volatile ("");
  for (i = 0; i < N; i++)
    if (a[i] != (b[i] & c[i]) || d[i] != (e[i] & f[i]))
      __builtin_abort ();
  asm volatile ("");
#ifdef INTRIN
  for (i = 0; i < N; i += 4)
    _mm256_store_pd (ad + i, _mm256_andnot_pd (_mm256_load_pd (cd + i), _mm256_load_pd (bd + i)));
  for (i = 0; i < N; i += 8)
    _mm256_store_ps (dd + i, _mm256_andnot_ps (_mm256_load_ps (fd + i), _mm256_load_ps (ed + i)));
#else
  for (i = 0; i < N; i++)
    {
      a[i] = b[i] & ~c[i];
      d[i] = e[i] & ~f[i];
    }
#endif
  asm volatile ("");
  for (i = 0; i < N; i++)
    if (a[i] != (b[i] & ~c[i]) || d[i] != (e[i] & ~f[i]))
      __builtin_abort ();
  asm volatile ("");
#ifdef INTRIN
  for (i = 0; i < N; i += 4)
    _mm256_store_pd (ad + i, _mm256_or_pd (_mm256_load_pd (bd + i), _mm256_load_pd (cd + i)));
  for (i = 0; i < N; i += 8)
    _mm256_store_ps (dd + i, _mm256_or_ps (_mm256_load_ps (ed + i), _mm256_load_ps (fd + i)));
#else
  for (i = 0; i < N; i++)
    {
      a[i] = b[i] | c[i];
      d[i] = e[i] | f[i];
    }
#endif
  asm volatile ("");
  for (i = 0; i < N; i++)
    if (a[i] != (b[i] | c[i]) || d[i] != (e[i] | f[i]))
      __builtin_abort ();
  asm volatile ("");
#ifdef INTRIN
  for (i = 0; i < N; i += 4)
    _mm256_store_pd (ad + i, _mm256_xor_pd (_mm256_load_pd (bd + i), _mm256_load_pd (cd + i)));
  for (i = 0; i < N; i += 8)
    _mm256_store_ps (dd + i, _mm256_xor_ps (_mm256_load_ps (ed + i), _mm256_load_ps (fd + i)));
#else
  for (i = 0; i < N; i++)
    {
      a[i] = b[i] ^ c[i];
      d[i] = e[i] ^ f[i];
    }
#endif
  asm volatile ("");
  for (i = 0; i < N; i++)
    if (a[i] != (b[i] ^ c[i]) || d[i] != (e[i] ^ f[i]))
      __builtin_abort ();
  return 0;
}
Comment 98 Jakub Jelinek 2012-06-12 17:11:43 UTC
I'll wait with further changes now to see if that is the desired way to do it.
Comment 99 Jakub Jelinek 2012-06-13 07:12:14 UTC
In the #c97 testcase please replace:
    d[i] = i * 0x1234567;
    e[i] = i + 0x7654321;
with
    d[i] = i * 0x1234567U;
    e[i] = i * 0x7654321U;
(without U suffix gcc 4.8 optimizes the first loop into endless loop due to undefined signed overflow in it).
Comment 100 Julian Seward 2012-06-13 11:16:24 UTC
r2381 adds support for:

VUNPCKLPS ymm3/m256, ymm2, ymm1 = VEX.NDS.256.0F.WIG 14 /r
VUNPCKHPS ymm3/m256, ymm2, ymm1 = VEX.NDS.256.0F.WIG 15 /r
VUNPCKLPD ymm3/m256, ymm2, ymm1 = VEX.NDS.256.66.0F.WIG 14 /r
VUNPCKHPD ymm3/m256, ymm2, ymm1 = VEX.NDS.256.66.0F.WIG 15 /r
VANDPD r/m, rV, r ::: r = rV & r/m (256 bit)
VXORPD r/m, rV, r ::: r = rV ^ r/m (256 bit)
VDIVPD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG 5E /r
VCMPPD xmm3/m64(E=argL), xmm2(V=argR), xmm1(G)
VSHUFPS imm8, ymm3/m256, ymm2, ymm1, ymm2
VCVTDQ2PD xmm2/m64, xmm1 = VEX.128.F3.0F.WIG E6 /r
VBROADCASTSD m64, ymm1 = VEX.256.66.0F38.W0 19 /r

This gives at least moderately usable coverage for code created by
"gcc-4.7.0 -mavx -O3".  Memcheck will still fail, though.
Comment 101 Franz Trischberger 2012-06-14 07:12:42 UTC
(In reply to comment #83)
> vex amd64->IR: unhandled instruction bytes: 0xC5 0xF9 0x71 0xD1 0x8 0xC5
> 0xF1 0xDB
> vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
> vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
> vex amd64->IR:   PFX.66=1 PFX.F2=0 PFX.F3=0
> ==13668== valgrind: Unrecognised instruction at address 0x874f086.
> ==13668==    at 0x874F086: _Z31comp_func_solid_SourceOver_sse2Pjijj+326
> (emmintrin.h:1167)
> ==13668==    by 0x893CAB2: _ZL19blend_color_genericiPK11QT_FT_Span_Pv+306
> (qdrawhelper.cpp:3316)
> ==13668==    by 0x894A4BF: gray_convert_glyph+1631 (qgrayraster.c:1756)
> ==13668==    by 0x891491A:
> _ZN25QRasterPaintEnginePrivate9rasterizeEP14QT_FT_Outline_PFviPK11QT_FT_Span_
> PvES5_P13QRasterBuffer.part.114+474 (qpaintengine_raster.cpp:3834)

This now became:
vex amd64->IR: unhandled instruction bytes: 0xC5 0xC9 0xFC 0xC0 0xC4 0xC1 0x79 0x7F
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0x6 ESC=0F
vex amd64->IR:   PFX.66=1 PFX.F2=0 PFX.F3=0
==3182== valgrind: Unrecognised instruction at address 0x89630be.
==3182==    at 0x89630BE: comp_func_solid_SourceOver_sse2(unsigned int*, int, unsigned int, unsigned int) (emmintrin.h:992)
==3182==    by 0x8B50AB2: blend_color_generic(int, QT_FT_Span_ const*, void*) (qdrawhelper.cpp:3316)
==3182==    by 0x8B5E4BF: gray_convert_glyph (qgrayraster.c:1756)

emmintrin.h comes with gcc-4.6.3.
Line 992:
extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_epi8 (__m128i __A, __m128i __B)
{
  return (__m128i)__builtin_ia32_paddb128 ((__v16qi)__A, (__v16qi)__B); /// <- :992
}

Line 1167:
extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_srli_epi16 (__m128i __A, int __B)
{
  return (__m128i)__builtin_ia32_psrlwi128 ((__v8hi)__A, __B); /// <- :1167
}
Comment 102 Julian Seward 2012-06-14 08:55:06 UTC
r2382 adds support for

    VMOVHPD m64, xmm1, xmm2 = VEX.NDS.128.66.0F.WIG 16 /r
    VMOVAPS ymm2/m256, ymm1 = VEX.256.0F.WIG 28 /r
    VCVTPD2PS ymm2/m256, xmm1 = VEX.256.66.0F.WIG 5A /r
    VPUNPCKHDQ = VEX.NDS.128.66.0F.WIG 6A /r
    VPCMPEQW r/m, rV, r ::: r = rV `eq-by-16s` r/m
    VPSUBUSW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG D9 /r
    VCVTDQ2PD xmm2/m128, ymm1 = VEX.256.F3.0F.WIG E6 /r
    VPADDB r/m, rV, r ::: r = rV + r/m
    VBROADCASTSS m32, xmm1 = VEX.128.66.0F38.W0 18 /r
    VPMOVSXBW xmm2/m64, xmm1
    VPMOVSXWD xmm2/m64, xmm1
    VPMOVSXDQ xmm2/m64, xmm1
Comment 103 Franz Trischberger 2012-06-14 09:10:18 UTC
(In reply to comment #102)
> r2382

Thx!
Now its:

vex amd64->IR: unhandled instruction bytes: 0xC5 0x70 0xC2 0xDA 0x1 0xC5 0xE8 0x58
vex amd64->IR:   REX=0 REX.W=0 REX.R=1 REX.X=0 REX.B=0
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0x1 ESC=0F
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
==19107== valgrind: Unrecognised instruction at address 0x89638d6.
==19107==    at 0x89638D6: unsigned int const* qt_fetch_radial_gradient_template<QRadialFetchSimd<QSimdSse2> >(unsigned int*, Operator const*, QSpanData const*, int, int, int) (xmmintrin.h:352)
==19107==    by 0x8B59D55: void handleSpans<BlendSrcGeneric<(SpanMethod)0> >(int, QT_FT_Span_ const*, QSpanData const*, BlendSrcGeneric<(SpanMethod)0>&) (qdrawhelper.cpp:3575)
==19107==    by 0x8B540E1: void blend_src_generic<(SpanMethod)0>(int, QT_FT_Span_ const*, void*) (qdrawhelper.cpp:3599)

extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_cmpgt_ps (__m128 __A, __m128 __B)
{
  return (__m128) __builtin_ia32_cmpgtps ((__v4sf)__A, (__v4sf)__B);
}

Sidenote for the interested:
I wondered that there should be unknown SSE2-instructions. But today I read, that -mavx translates those instructions to AVX. :)
Comment 104 Eero Pajarre 2012-06-14 10:43:02 UTC
> AFAICS, the only not-implemented one so far reported (apart from
> Josef's vector ones) are vstmxcsr/vldmxcsr, in comment #52.  I don't
> have a good way to test those.  I'll get to them though.

As these instructions still seem to crash my session, here is a minimal test case (in case it is needed)

It "works" for me when compiling for -march=native on my I7 processor

  Eero


#include <xmmintrin.h>
 
int main()
{
 _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
  return 0;
}
Comment 105 Julian Seward 2012-06-14 23:35:30 UTC
(In reply to comment #97)
> And here is another, to implement V{AND,ANDN,OR,XOR}P{S,D} 256-bit.

Committed as r2383, thanks.  Patch is fine.  Only comment is, ideally
you could add test cases to none/tests/amd64/avx-1.c (very easy to
do) as you go along.
Comment 106 Tom Hughes 2012-06-15 14:02:49 UTC
*** Bug 301967 has been marked as a duplicate of this bug. ***
Comment 107 Julian Seward 2012-06-15 15:50:25 UTC
r2384 adds support for 

VSTMXCSR m32 = VEX.LZ.0F.WIG AE /3
VLDMXCSR m32 = VEX.LZ.0F.WIG AE /2
0F 01 D0 = XGETBV
VPUNPCKHQDQ r/m, rV, r ::: r = interleave-hi-64bitses(rV, r/m)
VPSRAW imm8, xmm2, xmm1 = VEX.NDD.128.66.0F.WIG 71 /4 ib
VPMULHW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG E5 /r
VPERMILPS imm8, ymm2/m256, ymm1 = VEX.256.66.0F3A.W0 04 /r ib

It also adds a CPUID emulation that announces AVX support, although this
is currently disabled.
Comment 108 Eero Pajarre 2012-06-15 16:27:42 UTC
Thank you for your work on this problem, and with Valgrind generally!

With the latest update I got forward, now I am seeing:

vex amd64->IR: unhandled instruction bytes: 0xC5 0xFA 0x51 0xC1 0xC5 0xF8 0x2E 0xC0

gdb disas points to:
0x00000000005dd3b9 <+25>:	c5 fa 51 c1	vsqrtss %xmm1,%xmm0,%xmm0
   
  Eero
Comment 109 Jakub Jelinek 2012-06-15 16:54:26 UTC
Created attachment 71858 [details]
VPALIGNR and VBROADCASTSS (256-bit) support

VPALIGNR and VBROADCASTSS patch.  No test adjustment yet, started hacking also on VPINSRW support, but run out of time for today.  Posting here primarily to avoid work duplication.
Comment 110 Jakub Jelinek 2012-06-18 13:05:34 UTC
Created attachment 71909 [details]
avx.patch

On top of the previous patch, support for further mainly permutation/shuffle insns, which came up during
cd /usr/src/gcc/gcc/testsuite/gcc.dg/torture; for i in vshuf*.c; do \
  gcc -O2 -mavx -DEXPENSIVE -o /tmp/n $i -lm; \
  VALGRIND_LIB=/usr/src/valgrind/.in_place/ /usr/src/valgrind/coregrind/valgrind \
    --tool=none --quiet /tmp/n || echo $i $?; \
done
testing (still not all insns covered yet, but several tests already pass).  Additionally tested
with the above command with -mss{e4,se3,e3,e2} instead of -mavx.

This adds support for 3 operand VMOVSD, VMOVLPD, VPINSRW, VSHUFPD (128 and 256), VPERMILPS, VBLENDPS (128 and 256), VBLENDPD (128 and 256), VPBLENDW and VPINSRB.
Comment 111 Julian Seward 2012-06-18 14:07:29 UTC
(In reply to comment #109)
> Created attachment 71858 [details]

(In reply to comment #100)
> Created attachment 71909 [details]

Committed, r2386, r2387.  Thanks!
Comment 112 Jakub Jelinek 2012-06-18 14:12:45 UTC
For that gcc.dg/torture/ set of tests, the only remaining unsupported insn is probably VPERMILPS (the variable version; though it would be nice to also support VPERMILPD variable version).  Not sure how those would be best implemented, new IR opcodes, or perhaps new IR opcodes plus some masking on the selection vector.  Julian, could you look at that?
I'll in the mean time look at some gcc.target/i386/ unsupported insns...
Comment 113 Julian Seward 2012-06-18 15:02:23 UTC
r2388 adds support for:

VMOVUPD ymm1, ymm2/m256 = VEX.256.66.0F.WIG 11 /r
VMOVUPS ymm1, ymm2/m256 = VEX.256.0F.WIG 11 /r
VCOMISD  xmm2/m64, xmm1 = VEX.LIG.66.0F.WIG 2F /r
VPCMPGTD r/m, rV, r ::: r = rV `>s-by-32s` r/m
VPMOVSXBD xmm2/m32, xmm1
VPMOVZXBD xmm2/m32, xmm1
VDPPD xmm3/m128,xmm2,xmm1 = VEX.NDS.128.66.0F3A.WIG 41 /r ib
Comment 114 Julian Seward 2012-06-18 15:03:37 UTC
(In reply to comment #112)
> Julian, could you look at that?

Will do.
Comment 115 Jakub Jelinek 2012-06-18 16:19:06 UTC
Created attachment 71917 [details]
avx2.patch

This patch adds:

VMOVUPS ymm2/m256, ymm1 = VEX.256.0F.WIG 10 /r
VSQRTSS xmm3/m64(E), xmm2(V), xmm1(G) = VEX.NDS.LIG.F3.0F.WIG 51 /r
VSQRTPS xmm2/m128(E), xmm1(G) = VEX.NDS.128.0F.WIG 51 /r
VSQRTPS ymm2/m256(E), ymm1(G) = VEX.NDS.256.0F.WIG 51 /r
VSQRTPD xmm2/m128(E), xmm1(G) = VEX.NDS.128.66.0F.WIG 51 /r
VSQRTPD ymm2/m256(E), ymm1(G) = VEX.NDS.256.66.0F.WIG 51 /r
VRSQRTSS xmm3/m64(E), xmm2(V), xmm1(G) = VEX.NDS.LIG.F3.0F.WIG 52 /r
VRSQRTPS xmm2/m128(E), xmm1(G) = VEX.NDS.128.0F.WIG 52 /r
VRSQRTPS ymm2/m256(E), ymm1(G) = VEX.NDS.256.0F.WIG 52 /r
VZEROALL = VEX.256.0F.WIG 77
VMOVDQU ymm1, ymm2/m256 = VEX.256.F3.0F.WIG 7F

found during gcc.target/i386/ testing (further changes to come).
Comment 116 Jakub Jelinek 2012-06-18 19:21:22 UTC
Created attachment 71922 [details]
avx-3.patch

This patch adds:

VCVTPS2PD xmm2/m128, ymm1 = VEX.256.0F.WIG 5A /r
VCVTPS2DQ ymm2/m256, ymm1 = VEX.256.66.0F.WIG 5B /r
VCVTTPS2DQ xmm2/m128, xmm1 = VEX.128.F3.0F.WIG 5B /r
VCVTTPS2DQ ymm2/m256, ymm1 = VEX.256.F3.0F.WIG 5B /r
VCVTDQ2PS xmm2/m128, xmm1 = VEX.128.0F.WIG 5B /r
VCVTDQ2PS ymm2/m256, ymm1 = VEX.256.0F.WIG 5B /r
VCVTTPD2DQ xmm2/m128, xmm1 = VEX.128.66.0F.WIG E6 /r
VCVTTPD2DQ ymm2/m256, xmm1 = VEX.256.66.0F.WIG E6 /r
VCVTPD2DQ xmm2/m128, xmm1 = VEX.128.F2.0F.WIG E6 /r
VCVTPD2DQ ymm2/m256, xmm1 = VEX.256.F2.0F.WIG E6 /r

(again, from gcc.target/i386/ avx-*.c tests).
Comment 117 Julian Seward 2012-06-18 23:17:13 UTC
(In reply to comment #115)
> Created attachment 71917 [details] avx2.patch

(In reply to comment #116)
Created attachment 71922 [details] avx-3.patch

Committed w/ minor fixes, r2390.  Thanks!
Comment 118 Julian Seward 2012-06-19 06:10:33 UTC
(In reply to comment #112)
> is probably VPERMILPS (the variable version; though it would be nice to also
> support VPERMILPD variable version).  Not sure how those would be best
> implemented, new IR opcodes, or perhaps new IR opcodes plus some masking on
> the selection vector.

I can't think of a really simple and efficient way to do this.  My
least-worst suggestion, for VPERMILPS_128(vec, ctrl), where vec and
ctrl are the data and control vectors respectively, is to generate
this expression

   ctrl_clean = ShrN32x4(ShlN32x4(ctrl, 30), 30)

This zeroes the top 30 bits of each control vector lane, which is
important for avoiding false positives in Memcheck.

Then add a new primop, Iop_Perm32x4 (put it next to existing
Perm8x16).  Implement this in the back end using a function call,
which is handled via "do_SseAssistedBinary" in iselVecExpr_wrk.
Add the relevant helper function in host_generic_simd128.c, in the
style of eg h_generic_calc_Mul32x4.

Overall result is then

   Perm32x4(vec, ctrl_clean)

To be a bit cleverer, if we know the host actually supports AVX,
we could implement Perm32x4 using VPERMILPS, but that means adding
a new case for AMD64Instr, which sounds more complexity than it is
worth for this one case.
Comment 119 Eero Pajarre 2012-06-19 07:44:19 UTC
(I have two programs which I would like to run with Valgrind, after the latest patch the other one seems to be OK, great!)

From some code written with xmmintrin.h:

vex amd64->IR: unhandled instruction bytes: 0xC5 0xFA 0x12 0xFB 0xC5 0xFA 0x16 0xDB
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=1
==6596== valgrind: Unrecognised instruction at address 0x51585c.

gdb disasm:
 0x000000000051585c <+140>:	c5 fa 12 fb	        vmovsldup %xmm3,%xmm7
 0x0000000000515860 <+144>:	c5 fa 16 db	vmovshdup %xmm3,%xmm3
 0x0000000000515864 <+148>:	c5 e0 58 df	vaddps %xmm7,%xmm3,%xmm3
 
Do you need a test case to be compiled/run ?

Eero
Comment 120 Gunther Piez 2012-06-19 11:06:39 UTC
Another one, a not so uncommon instruction:

vex amd64->IR: unhandled instruction bytes: 0xC4 0x41 0x12 0x10 0xF4 0xC4 0x41 0x79
vex amd64->IR:   REX=0 REX.W=0 REX.R=1 REX.X=0 REX.B=1
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0xD ESC=0F
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=1
==10414== valgrind: Unrecognised instruction at address 0x46b17e.

 46b17e:       c4 41 12 10 f4          vmovss %xmm12,%xmm13,%xmm14
 
This was generated from a _mm_insert_epi64x(vector, value, 0), so gcc without any intrinsics will probably not do this.
Comment 121 Jakub Jelinek 2012-06-19 11:09:11 UTC
Created attachment 71936 [details]
VMOVS[LH]DUP patch

This patch adds
VMOVSLDUP xmm2/m128, xmm1 = VEX.NDS.128.F3.0F.WIG 12 /r
VMOVSLDUP ymm2/m256, ymm1 = VEX.NDS.256.F3.0F.WIG 12 /r
VMOVSHDUP xmm2/m128, xmm1 = VEX.NDS.128.F3.0F.WIG 16 /r
VMOVSHDUP ymm2/m256, ymm1 = VEX.NDS.256.F3.0F.WIG 16 /r
Comment 122 Jakub Jelinek 2012-06-19 12:10:13 UTC
Created attachment 71937 [details]
VMOVSS patch

VMOVSS xmm3, xmm2, xmm1 = VEX.LIG.F3.0F.WIG 10 /r
VMOVSS xmm3, xmm2, xmm1 = VEX.LIG.F3.0F.WIG 11 /r
Comment 123 Gunther Piez 2012-06-19 13:17:30 UTC
That was fast. The next one is evil.

vex amd64->IR: unhandled instruction bytes: 0xC4 0x42 0x29 0xB 0xD9 0xC4 0xC1 0x79
vex amd64->IR:   REX=0 REX.W=0 REX.R=1 REX.X=0 REX.B=1
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0xA ESC=0F38
vex amd64->IR:   PFX.66=1 PFX.F2=0 PFX.F3=0
==9622== valgrind: Unrecognised instruction at address 0x441de6.
==9622==    at 0x441DE6: Eval::Init::material() (tmmintrin.h:124)
==9622==    by 0x45809A: Eval::Init::setEvalParameters(Parameters const&) (evalinit.cpp:778)
==9622==    by 0x46DA40: Eval::init(Parameters const&) (eval.cpp:91)
==9622==    by 0x44FD5F: Game::Game(Console*, Parameters const&, unsigned long, unsigned long) (game.cpp:115)
==9622==    by 0x468559: Console::init(int&, char**) (console.cpp:78)
==9622==    by 0x427D79: main (main.cpp:28)

  441de6:       c4 42 29 0b d9          vpmulhrsw %xmm9,%xmm10,%xmm11
  441deb:       c4 c1 79 c5 eb 01       vpextrw $0x1,%xmm11,%ebp
  441df1:       c4 c1 79 c5 d3 00       vpextrw $0x0,%xmm11,%edx

I shouldn't have used this, I knew it would hurt me later :-) It actually does give a nice speedup, but fortunately for valgrind testing I can easily disable it. It comes from this code

    #ifdef __SSSE3__
    	// The SSE version loses half a bit of precision, because is rounds first
    	// and then sums up, where the normal code rounds last.
        __v8hi score16 = _mm_mulhrs_epi16(weights.data, score.data);
        int16_t s0 = _mm_extract_epi16( score16, 0 ) + _mm_extract_epi16( score16, 1 );
        return s0;
    #else    
    	int o = weights.opening();
        int e = weights.endgame();
        int s = (o*score.opening() + e*score.endgame() + 0x4000) >> 15;
        ASSERT(s==s0);
        return s;
    #endif

If I disable the SSSE3 path, the next stop ist at

vex amd64->IR: unhandled instruction bytes: 0xC4 0xE2 0x79 0x17 0xDA 0x49 0x8B 0x48
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F38
vex amd64->IR:   PFX.66=1 PFX.F2=0 PFX.F3=0
==12207== valgrind: Unrecognised instruction at address 0x443236

which is
  443236:       c4 e2 79 17 da          vptest %xmm2,%xmm3
Comment 124 Jakub Jelinek 2012-06-19 14:48:56 UTC
Created attachment 71948 [details]
VPSRAD and VPSLLW patch

VPSLLW imm8, xmm2, xmm1 = VEX.NDD.128.66.0F.WIG 71 /6 ib
VPSRAD imm8, xmm2, xmm1 = VEX.NDD.128.66.0F.WIG 72 /4 ib

There is still big amount of unsupported insns, gcc.target/i386/avx-*.c tests that still fail because of unsupported insns are:
avx-ceilf-sfix-vec.c avx-ceilf-vec.c avx-ceil-sfix-2-vec.c avx-ceil-sfix-vec.c avx-ceil-vec.c avx-cmpss-1.c avx-cond-1.c avx-floorf-sfix-vec.c avx-floorf-vec.c avx-floor-sfix-2-vec.c avx-floor-sfix-vec.c avx-floor-vec.c avx-mul-1.c avx-pr51581-1.c avx-pr51581-2.c avx-rintf-sfix-vec.c avx-rintf-vec.c avx-rint-sfix-2-vec.c avx-rint-sfix-vec.c avx-rint-vec.c avx-vaddsubpd-1.c avx-vaddsubpd-256-1.c avx-vaddsubps-1.c avx-vaddsubps-256-1.c avx-vaesdec-1.c avx-vaesdeclast-1.c avx-vaesenc-1.c avx-vaesenclast-1.c avx-vaesimc-1.c avx-vaeskeygenassist-1.c avx-vblendvpd-256-1.c avx-vblendvps-256-1.c avx-vcmppd-1.c avx-vcmppd-256-1.c avx-vcmpps-1.c avx-vcmpps-256-1.c avx-vcmpsd-1.c avx-vcmpss-1.c avx-vcvtsd2si-1.c avx-vcvtsd2si-2.c avx-vcvtss2si-1.c avx-vcvtss2si-2.c avx-vdpps-1.c avx-vdpps-2.c avx-vextractps-1.c avx-vhaddpd-1.c avx-vhaddpd-256-1.c avx-vhaddps-1.c avx-vhaddps-256-1.c avx-vhsubpd-1.c avx-vhsubpd-256-1.c avx-vhsubps-1.c avx-vhsubps-256-1.c avx-vlddqu-1.c avx-vlddqu-256-1.c avx-vmaskmovdqu.c avx-vmaskmovpd-1.c avx-vmaskmovpd-256-1.c avx-vmaskmovpd-256-2.c avx-vmaskmovpd-2.c avx-vmaskmovps-1.c avx-vmaskmovps-256-1.c avx-vmaskmovps-256-2.c avx-vmaskmovps-2.c avx-vmaxpd-256-1.c avx-vmaxps-256-1.c avx-vminpd-256-1.c avx-vminps-256-1.c avx-vmovhps-1.c avx-vmovhps-2.c avx-vmovmskpd-1.c avx-vmovmskpd-256-1.c avx-vmovmskps-1.c avx-vmovmskps-256-1.c avx-vmovntdq-256-1.c avx-vmovntdqa-1.c avx-vmovntpd-1.c avx-vmovntpd-256-1.c avx-vmovntps-1.c avx-vmovntps-256-1.c avx-vmpsadbw-1.c avx-vpabsb-1.c avx-vpabsw-1.c avx-vpacksswb-1.c avx-vpackusdw-1.c avx-vpaddsb-1.c avx-vpaddsw-1.c avx-vpavgb-1.c avx-vpavgw-1.c avx-vpcmpestri-1.c avx-vpcmpestri-2.c avx-vpcmpestrm-1.c avx-vpcmpestrm-2.c avx-vpcmpgtb-1.c avx-vpcmpgtw-1.c avx-vpcmpistri-1.c avx-vpcmpistri-2.c avx-vpcmpistrm-1.c avx-vpcmpistrm-2.c avx-vpermilpd-256-2.c avx-vpermilpd-2.c avx-vpermilps-256-2.c avx-vpermilps-2.c avx-vphaddd-1.c avx-vphaddsw-1.c avx-vphaddw-1.c avx-vphsubd-1.c avx-vphsubsw-1.c avx-vphsubw-1.c avx-vpmaddubsw-1.c avx-vpmovsxbq-1.c avx-vpmovsxwq-1.c avx-vpmovzxbq-1.c avx-vpmovzxdq-1.c avx-vpmovzxwq-1.c avx-vpmuldq-1.c avx-vpmulhrsw-1.c avx-vpsadbw-1.c avx-vpsignb-1.c avx-vpsignd-1.c avx-vpsignw-1.c avx-vpsubsb-1.c avx-vpsubsw-1.c avx-vptest-1.c avx-vptest-256-1.c avx-vptest-256-2.c avx-vptest-256-3.c avx-vptest-2.c avx-vptest-3.c avx-vrcpps-1.c avx-vrcpps-256-1.c avx-vroundpd-1.c avx-vroundpd-256-1.c avx-vroundpd-256-2.c avx-vroundpd-256-3.c avx-vroundpd-2.c avx-vroundpd-3.c avx-vroundps-256-1.c avx-vtestpd-1.c avx-vtestpd-256-1.c avx-vtestpd-256-2.c avx-vtestpd-256-3.c avx-vtestpd-2.c avx-vtestpd-3.c avx-vtestps-1.c avx-vtestps-256-1.c avx-vtestps-256-2.c avx-vtestps-256-3.c avx-vtestps-2.c avx-vtestps-3.c
Comment 125 Jakub Jelinek 2012-06-19 14:58:44 UTC
Created attachment 71949 [details]
gcc.target/i386 hack patch

Attaching also the hack I'm using to get the above list of failures.
cd /usr/src/gcc/gcc/testsuite/gcc.target/i386/; for i in `grep -L 'dg-do.*compile' avx-*.c`; do gcc -O3 -mavx -g -DNEED_IEEE754_DOUBLE -o /tmp/n $i -lm; VALGRIND_LIB=/usr/src/valgrind/.in_place/ /usr/src/valgrind/coregrind/valgrind --tool=none --quiet /tmp/n || echo $i $?; done

(with sse[234]*-.c ssse3-*.c also several tests fail).
Comment 126 Julian Seward 2012-06-20 10:23:04 UTC
(In reply to comment #121)
> Created attachment 71936 [details]
> VMOVS[LH]DUP patch

(In reply to comment #122)
> Created attachment 71937 [details]
> VMOVSS patch

(In reply to comment #124)
> Created attachment 71948 [details]
> VPSRAD and VPSLLW patch

Thanks -- all 3 committed together as r2394/r12656.
Comment 127 Jakub Jelinek 2012-06-20 11:52:08 UTC
Created attachment 71980 [details]
VPTEST and VTESTP[SD]

VTESTPS xmm2/m128, xmm1 = VEX.128.66.0F38.WIG 0E /r
VTESTPS ymm2/m256, ymm1 = VEX.256.66.0F38.WIG 0E /r
VTESTPD xmm2/m128, xmm1 = VEX.128.66.0F38.WIG 0F /r
VTESTPD ymm2/m256, ymm1 = VEX.256.66.0F38.WIG 0F /r
VPTEST xmm2/m128, xmm1 = VEX.128.66.0F38.WIG 17 /r
VPTEST ymm2/m256, ymm1 = VEX.256.66.0F38.WIG 17 /r
Comment 128 Jakub Jelinek 2012-06-20 12:40:15 UTC
Created attachment 71984 [details]
VPERMILP{S,D} fix

vshuf-v8sf.c testcase was crashing in valgrind, as VPERMILP{S,D} variable form doesn't have the immediate byte, so needs to pass 0 as extra_bytes.
Comment 129 Jakub Jelinek 2012-06-20 16:32:17 UTC
Created attachment 71990 [details]
Variable 128-bit integer shifts and VBLENDVP{S,D}

VPSRLW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG D1 /r
VPSRLD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG D2 /r
VPSRLQ xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG D3 /r
VPSRAW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG E1 /r
VPSRAD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG E2 /r
VPSLLW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG F1 /r
VPSLLD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG F2 /r
VPSLLQ xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG F3 /r
VBLENDVPS xmm4, xmm3/mem128, xmm2, xmm1 = VEX.NDS.128.66.0F3A.WIG 4A /r /is4
VBLENDVPS ymm4, ymm3/mem256, ymm2, ymm1 = VEX.NDS.256.66.0F3A.WIG 4A /r /is4
VBLENDVPD xmm4, xmm3/mem128, xmm2, xmm1 = VEX.NDS.128.66.0F3A.WIG 4B /r /is4
VBLENDVPD ymm4, ymm3/mem256, ymm2, ymm1 = VEX.NDS.256.66.0F3A.WIG 4B /r /is4

and in addition to that fixes out of bound counts for SSE2 PS{LL,RA,RL}{W,D,Q} - the shift count is 64-bit rather than 32-bit.
Comment 130 Jakub Jelinek 2012-06-20 19:56:41 UTC
Created attachment 71998 [details]
VROUND* and VPSUBS[BW]

VPSUBSB xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG E8 /r
VPSUBSW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG E9 /r
VROUNDPS imm8, xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F3A.WIG 08 ib
VROUNDPS imm8, ymm3/m256, ymm2, ymm1 = VEX.NDS.256.66.0F3A.WIG 08 ib
VROUNDPD imm8, xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F3A.WIG 09 ib
VROUNDPD imm8, ymm3/m256, ymm2, ymm1 = VEX.NDS.256.66.0F3A.WIG 09 ib
VROUNDSS imm8, xmm3/m32, xmm2, xmm1 = VEX.NDS.128.66.0F3A.WIG 0A ib
VROUNDSD imm8, xmm3/m64, xmm2, xmm1 = VEX.NDS.128.66.0F3A.WIG 0B ib
Comment 131 Julian Seward 2012-06-21 09:28:05 UTC
(In reply to comment #128)
> Created attachment 71984 [details]

(In reply to comment #127)
> Created attachment 71980 [details]

(In reply to comment #129)
> Created attachment 71990 [details]

(In reply to comment #130)
> Created attachment 71998 [details]

Committed, revs 2397, 2398, 2399, 2400 respectively.  Thanks.
Comment 132 Jakub Jelinek 2012-06-21 09:49:12 UTC
Created attachment 72008 [details]
Another set of random AVX insns from gcc.target/i386

VPCMPGTB = VEX.NDS.128.66.0F.WIG 64 /r
VPCMPGTW = VEX.NDS.128.66.0F.WIG 65 /r
VCMPPD ymm3/m256, ymm2, ymm1 = VEX.NDS.256.66.0F.WIG C2 /r ib
VCMPPS xmm3/m128, xmm2, xmm1 = VEX.NDS.128.0F.WIG C2 /r ib
VCMPPS ymm3/m256, ymm2, ymm1 = VEX.NDS.256.0F.WIG C2 /r ib
VADDSUBPD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG D0 /r
VADDSUBPD ymm3/m256, ymm2, ymm1 = VEX.NDS.256.66.0F.WIG D0 /r
VADDSUBPS xmm3/m128, xmm2, xmm1 = VEX.NDS.128.F2.0F.WIG D0 /r
VADDSUBPS ymm3/m256, ymm2, ymm1 = VEX.NDS.256.F2.0F.WIG D0 /r
VPMADDWD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG F5 /r
VPMULDQ xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F38.WIG 28 /r
Comment 133 Jakub Jelinek 2012-06-21 10:38:07 UTC
Created attachment 72009 [details]
VCMPPD and VCMPPS incremental fix

I've missed avx-cond-1.c abort, which was due to swapping the arguments incorrectly.
Comment 134 Jakub Jelinek 2012-06-21 13:53:15 UTC
Created attachment 72016 [details]
Further AVX insns

VCVTSD2SI xmm1/m32, r32 = VEX.LIG.F2.0F.W0 2D /r
VCVTSD2SI xmm1/m64, r64 = VEX.LIG.F2.0F.W1 2D /r
VCVTSS2SI xmm1/m32, r32 = VEX.LIG.F3.0F.W0 2D /r
VCVTSS2SI xmm1/m64, r64 = VEX.LIG.F3.0F.W1 2D /r
VHADDPS xmm3/m128, xmm2, xmm1 = VEX.NDS.128.F2.0F.WIG 7C /r
VHSUBPS xmm3/m128, xmm2, xmm1 = VEX.NDS.128.F2.0F.WIG 7D /r
VHADDPS ymm3/m256, ymm2, ymm1 = VEX.NDS.256.F2.0F.WIG 7C /r
VHSUBPS ymm3/m256, ymm2, ymm1 = VEX.NDS.256.F2.0F.WIG 7D /r
VHADDPD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG 7C /r
VHSUBPD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG 7D /r
VHADDPD ymm3/m256, ymm2, ymm1 = VEX.NDS.256.66.0F.WIG 7C /r
VHSUBPD ymm3/m256, ymm2, ymm1 = VEX.NDS.256.66.0F.WIG 7D /r
VLDDQU m256, ymm1 = VEX.256.F2.0F.WIG F0 /r
VLDDQU m128, xmm1 = VEX.128.F2.0F.WIG F0 /r
VEXTRACTPS imm8, xmm1, r32/m32 = VEX.128.66.0F3A.WIG 17 /r ib
VDPPS imm8, xmm3/m128,xmm2,xmm1 = VEX.NDS.128.66.0F3A.WIG 40 /r ib
VDPPS imm8, ymm3/m128,ymm2,ymm1 = VEX.NDS.128.66.0F3A.WIG 40 /r ib
Comment 135 Jakub Jelinek 2012-06-21 16:03:01 UTC
Created attachment 72020 [details]
Further insns

VMOVHPS m64, xmm1, xmm2 = VEX.NDS.128.0F.WIG 16 /r
VMOVHPS xmm1, m64 = VEX.128.0F.WIG 17 /r
VMOVNTPD xmm1, m128 = VEX.128.66.0F.WIG 2B /r
VMOVNTPS xmm1, m128 = VEX.128.0F.WIG 2B /r
VMOVNTPD ymm1, m256 = VEX.256.66.0F.WIG 2B /r
VMOVNTPS ymm1, m256 = VEX.256.0F.WIG 2B /r
VMOVMSKPD xmm2, r32 = VEX.128.66.0F.WIG 50 /r
VMOVMSKPD ymm2, r32 = VEX.256.66.0F.WIG 50 /r
VMOVMSKPS xmm2, r32 = VEX.128.0F.WIG 50 /r
VMOVMSKPS ymm2, r32 = VEX.256.0F.WIG 50 /r
VMINPS ymm3/m256, ymm2, ymm1 = VEX.NDS.256.0F.WIG 5D /r
VMINPD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG 5D /r
VMINPD ymm3/m256, ymm2, ymm1 = VEX.NDS.256.66.0F.WIG 5D /r
VMAXPS ymm3/m256, ymm2, ymm1 = VEX.NDS.256.0F.WIG 5F /r
VMAXPD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG 5F /r
VMAXPD ymm3/m256, ymm2, ymm1 = VEX.NDS.256.66.0F.WIG 5F /r
VMOVNTDQ ymm1, m256 = VEX.256.66.0F.WIG E7 /r
VMASKMOVDQU xmm2, xmm1 = VEX.128.66.0F.WIG F7 /r
VMOVNTDQA m128, xmm1 = VEX.128.66.0F38.WIG 2A /r
Comment 136 Jakub Jelinek 2012-06-21 18:39:28 UTC
Created attachment 72021 [details]
Last patch for today

VPACKSSWB xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG 63 /r
VPAVGB xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG E0 /r         
VPAVGW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG E3 /r          
VPADDSB xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG EC /r     
VPADDSW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG ED /r     
VPHADDW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F38.WIG 01 /r   
VPHADDD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F38.WIG 02 /r   
VPHADDSW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F38.WIG 03 /r  
VPMADDUBSW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F38.WIG 04 /r
VPHSUBW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F38.WIG 05 /r 
VPHSUBD xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F38.WIG 06 /r 
VPHSUBSW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F38.WIG 07 /r
VPABSB xmm2/m128, xmm1 = VEX.128.66.0F38.WIG 1C /r  
VPABSW xmm2/m128, xmm1 = VEX.128.66.0F38.WIG 1D /r  
VPMOVSXBQ xmm2/m16, xmm1 = VEX.128.66.0F38.WIG 22 /r
VPMOVSXWQ xmm2/m32, xmm1 = VEX.128.66.0F38.WIG 24 /r
VPACKUSDW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F38.WIG 2B /r
VPMOVZXBQ xmm2/m16, xmm1 = VEX.128.66.0F38.WIG 32 /r
VPMOVZXWQ xmm2/m32, xmm1 = VEX.128.66.0F38.WIG 34 /r
VPMOVZXDQ xmm2/m64, xmm1 = VEX.128.66.0F38.WIG 35 /r
VMPSADBW imm8, xmm3/m128,xmm2,xmm1 = VEX.NDS.128.66.0F3A.WIG 42 /r ib
Comment 137 Jakub Jelinek 2012-06-22 14:00:42 UTC
Created attachment 72040 [details]
Another set of insns

VMOVDDUP ymm2/m256, ymm1 = VEX.256.F2.0F.WIG /12 r
VMOVLPS m64, xmm1, xmm2 = VEX.NDS.128.0F.WIG 12 /r
VMOVLPS xmm1, m64 = VEX.128.0F.WIG 13 /r
VRCPSS xmm3/m64, xmm2, xmm1 = VEX.NDS.LIG.F3.0F.WIG 53 /r
VRCPPS xmm2/m128, xmm1 = VEX.NDS.128.0F.WIG 53 /r
VRCPPS ymm2/m256, ymm1 = VEX.NDS.256.0F.WIG 53 /r
VMOVQ xmm1, m64/r64 = VEX.128.66.0F.W1 7E /r (mem case only)
VPSADBW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F.WIG F6 /r
VPSIGNB xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F38.WIG 08 /r
VPSIGNW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F38.WIG 09 /r
VPSIGND xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F38.WIG 0A /r
VPMULHRSW xmm3/m128, xmm2, xmm1 = VEX.NDS.128.66.0F38.WIG 0B /r
VBROADCASTF128 m128, ymm1 = VEX.256.66.0F38.WIG 1A /r
VPEXTRW imm8, reg/m16, xmm2 = VEX.128.66.0F3A.W0 15 /r ib
Comment 138 Jakub Jelinek 2012-06-22 15:09:05 UTC
Created attachment 72041 [details]
AVX encoded AES support

VAESIMC xmm2/m128, xmm1 = VEX.128.66.0F38.WIG DB /r
VAESENC xmm3/m128, xmm2, xmm1 = VEX.128.66.0F38.WIG DC /r
VAESENCLAST xmm3/m128, xmm2, xmm1 = VEX.128.66.0F38.WIG DD /r
VAESDEC xmm3/m128, xmm2, xmm1 = VEX.128.66.0F38.WIG DE /r
VAESDECLAST xmm3/m128, xmm2, xmm1 = VEX.128.66.0F38.WIG DF /r
VAESKEYGENASSIST imm8, xmm2/m128, xmm1 = VEX.128.66.0F3A.WIG DF /r
Comment 139 Jakub Jelinek 2012-06-22 15:31:14 UTC
Created attachment 72045 [details]
VPCLMULQDQ

VPCLMULQDQ imm8, xmm3/m128,xmm2,xmm1 = VEX.NDS.128.66.0F3A.WIG 44 /r ib
Comment 140 Jakub Jelinek 2012-06-22 15:56:04 UTC
I'm running out of time, have to move back to GCC hacking next week.
From gcc.target/i386 remaining failures and eyeballing the AVX opcode tables at the end of the PDF, comparing them to the VEX 0F, 0F38 and 0F3A routines, it seems almost everything is now implemented, with the exception of:
1) VMASKMOVP[SD]:
VMASKMOVPS m128, xmm2, xmm1 = VEX.NDS.128.66.0F38.WIG 2C /r
VMASKMOVPS m256, ymm2, ymm1 = VEX.NDS.256.66.0F38.WIG 2C /r
VMASKMOVPD m128, xmm2, xmm1 = VEX.NDS.128.66.0F38.WIG 2D /r
VMASKMOVPD m256, ymm2, ymm1 = VEX.NDS.256.66.0F38.WIG 2D /r
VMASKMOVPS xmm2, xmm1, m128 = VEX.NDS.128.66.0F38.WIG 2E /r
VMASKMOVPS ymm2, ymm1, m256 = VEX.NDS.256.66.0F38.WIG 2E /r
VMASKMOVPD xmm2, xmm1, m128 = VEX.NDS.128.66.0F38.WIG 2F /r
VMASKMOVPD ymm2, ymm1, m256 = VEX.NDS.256.66.0F38.WIG 2F /r
2) FMA:
VFMADDSUB132PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 96 /r 
VFMADDSUB132PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 96 /r 
VFMADDSUB132PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 96 /r 
VFMADDSUB132PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 96 /r 
VFMSUBADD132PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 97 /r 
VFMSUBADD132PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 97 /r 
VFMSUBADD132PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 97 /r 
VFMSUBADD132PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 97 /r 
VFMADD132PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 98 /r 
VFMADD132PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 98 /r 
VFMADD132PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 98 /r 
VFMADD132PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 98 /r 
VFMADD132SS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 99 /r 
VFMADD132SD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 99 /r 
VFMSUB132PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 9A /r 
VFMSUB132PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 9A /r 
VFMSUB132PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 9A /r 
VFMSUB132PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 9A /r 
VFMSUB132SS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 9B /r 
VFMSUB132SD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 9B /r 
VFNMADD132PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 9C /r 
VFNMADD132PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 9C /r 
VFNMADD132PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 9C /r 
VFNMADD132PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 9C /r 
VFNMADD132SS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 9D /r 
VFNMADD132SD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 9D /r 
VFNMSUB132PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 9E /r 
VFNMSUB132PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 9E /r 
VFNMSUB132PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 9E /r 
VFNMSUB132PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 9E /r 
VFNMSUB132SS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 9F /r 
VFNMSUB132SD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 9F /r 
VFMADDSUB213PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 A6 /r 
VFMADDSUB213PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 A6 /r 
VFMADDSUB213PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 A6 /r 
VFMADDSUB213PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 A6 /r 
VFMSUBADD213PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 A7 /r 
VFMSUBADD213PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 A7 /r 
VFMSUBADD213PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 A7 /r 
VFMSUBADD213PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 A7 /r 
VFMADD213PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 A8 /r 
VFMADD213PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 A8 /r 
VFMADD213PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 A8 /r 
VFMADD213PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 A8 /r 
VFMADD213SS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 A9 /r 
VFMADD213SD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 A9 /r 
VFMSUB213PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 AA /r 
VFMSUB213PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 AA /r 
VFMSUB213PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 AA /r 
VFMSUB213PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 AA /r 
VFMSUB213SS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 AB /r 
VFMSUB213SD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 AB /r 
VFNMADD213PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 AC /r 
VFNMADD213PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 AC /r 
VFNMADD213PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 AC /r 
VFNMADD213PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 AC /r 
VFNMADD213SS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 AD /r 
VFNMADD213SD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 AD /r 
VFNMSUB213PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 AE /r 
VFNMSUB213PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 AE /r 
VFNMSUB213PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 AE /r 
VFNMSUB213PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 AE /r 
VFNMSUB213SS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 AF /r 
VFNMSUB213SD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 AF /r 
VFMADDSUB231PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 B6 /r 
VFMADDSUB231PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 B6 /r 
VFMADDSUB231PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 B6 /r 
VFMADDSUB231PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 B6 /r 
VFMSUBADD231PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 B7 /r 
VFMSUBADD231PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 B7 /r 
VFMSUBADD231PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 B7 /r 
VFMSUBADD231PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 B7 /r 
VFMADD231PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 B8 /r 
VFMADD231PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 B8 /r 
VFMADD231PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 B8 /r 
VFMADD231PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 B8 /r 
VFMADD231SS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 B9 /r 
VFMADD231SD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 B9 /r 
VFMSUB231PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 BA /r 
VFMSUB231PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 BA /r 
VFMSUB231PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 BA /r 
VFMSUB231PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 BA /r 
VFMSUB231SS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 BB /r 
VFMSUB231SD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 BB /r 
VFNMADD231PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 BC /r 
VFNMADD231PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 BC /r 
VFNMADD231PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 BC /r 
VFNMADD231PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 BC /r 
VFNMADD231SS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 BD /r 
VFNMADD231SD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 BD /r 
VFNMSUB231PS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 BE /r 
VFNMSUB231PS ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W0 BE /r 
VFNMSUB231PD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 BE /r 
VFNMSUB231PD ymm2/m256, ymm1, ymm0 = VEX.DDS.256.66.0F38.W1 BE /r 
VFNMSUB231SS xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W0 BF /r 
VFNMSUB231SD xmm2/m128, xmm1, xmm0 = VEX.DDS.128.66.0F38.W1 BF /r 
3) various comparison codes for VCMP[PS][SD].  Various gcc.target/i386 tests fail because of this.  Not sure if valgrind differentiates between signalling/non-signalling variants, if it should, then we'll need some hacks to handle them all.
4) various VPCMP[EI]STR[iM] modes (valgrind handles just a subset.
Comment 141 Julian Seward 2012-06-24 15:17:03 UTC
Committed:

(In reply to comment #132)
> Created attachment 72008 [details]
r2404

(In reply to comment #133)
> Created attachment 72009 [details]
r2405

(In reply to comment #134)
> Created attachment 72016 [details]
r2406

(In reply to comment #135)
> Created attachment 72020 [details]
r2407

(In reply to comment #136)
> Created attachment 72021 [details]
r2408

(In reply to comment #137)
> Created attachment 72040 [details]
r2409

(In reply to comment #138)
> Created attachment 72041 [details]
(In reply to comment #139)
> Created attachment 72045 [details]
r2410 (#138 and #139 together)

Thanks!
Comment 142 Jakub Jelinek 2012-06-24 17:03:43 UTC
Thanks.  To add to the list that is left unimplemented is XGETBV (that e.g. the gcc.dg/i386 tests use to detect whether OS has support for saving/restoring of the AVX state.  Given that valgrind doesn't actually use AVX insns itself, it might be fine to always claim it has support.
Comment 143 Julian Seward 2012-06-24 18:54:25 UTC
(In reply to comment #142)
> Thanks.  To add to the list that is left unimplemented is XGETBV (that e.g.

I implemented XGETBV about a week back, along with a CPUID that claims it
is an AVX capable CPU.  However, the latter is currently disabled.  Once I fix up
the Memcheck side of things for the new IRops, I'll re-enable the new CPUID, and
then the gcc tests should then work ok.
Comment 144 Julian Seward 2012-06-25 08:05:33 UTC
AVX support is now enabled by default, on processors that support it,
and works properly with Memcheck.  I now consider it usable.  Please
let me know (file bug reports) if you find any problems with it.  Many
thanks to Jakub Jelinek for helping out with this.
Comment 145 Julian Seward 2012-06-25 08:08:25 UTC
(In reply to comment #140)
> 2) FMA:
> [... big list of FMA insns ...]

The processor I have (Intel(R) Core(TM) i5-2300 CPU @ 2.80GHz) doesn't
support FMA, AFAICS.  Is FMA support generally available yet?

In any case, FMA is a different (although related) insn set extension,
and should have its own bug report.  I propose to close this one as
resolved-fixed unless anybody objects.
Comment 146 Jakub Jelinek 2012-06-25 08:32:18 UTC
You are right, FMA seems to be deferred for Haswell, for which we don't support AVX2 either.
Adding 1), VMASKMOVP[SD] support would be nice though, GCC emits those at least for the intrinsics and hopefully for 4.8 also in vectorized code.  A slight complication with those insns is that faults shouldn't be reported for masked out loads or stores (e.g. if mask is all zeros, or if the memory load or store crosses a page boundary and after the end of cliff mask doesn't have sign bits set).  So either it could be implemented as a scalar loop, always testing a single mask bits and conditionally doing a load resp. store, or using new IR opcode.  For AVX2 VGATHER* will be even harder to do...
Comment 147 Julian Seward 2012-06-25 08:53:53 UTC
(In reply to comment #146)
> For AVX2 VGATHER* will be even harder to do...

ARM/NEON already has scatter/gather loads (kind of) and we do a very
poor, although correct, translation of them.  Maybe some IR
enhancements to support scatter/gather better will help with both
VGATHER and NEON.
Comment 148 Julian Seward 2012-06-25 08:55:00 UTC
Closing.  Please report followup any problems in new bug reports.
Comment 149 Tom Hughes 2012-06-27 23:56:47 UTC
*** Bug 302656 has been marked as a duplicate of this bug. ***
Comment 150 Julian Seward 2012-07-05 07:49:15 UTC
*** Bug 298227 has been marked as a duplicate of this bug. ***
Comment 151 Julian Seward 2012-07-05 07:51:59 UTC
*** Bug 298335 has been marked as a duplicate of this bug. ***
Comment 152 Julian Seward 2012-07-13 12:21:05 UTC
*** Bug 303466 has been marked as a duplicate of this bug. ***
Comment 153 Tom Hughes 2012-09-13 09:22:03 UTC
*** Bug 306721 has been marked as a duplicate of this bug. ***
Comment 154 Tom Hughes 2012-09-30 11:59:56 UTC
*** Bug 307612 has been marked as a duplicate of this bug. ***
Comment 155 Sergey Kishchenko 2012-11-02 16:24:37 UTC
I've built latest valgrind from the SVN and I am still getting vex related issues:

 vex x86->IR: unhandled instruction bytes: 0xC5 0xF9 0x6E 0x40
==27173== valgrind: Unrecognised instruction at address 0x4014e00.
==27173==    at 0x4014E00: _dl_sysdep_start (in /lib32/ld-2.15.so)
==27173==    by 0xF770EFFF: ???

I'm using gentoo x86_64 built with -march=native and trying to run valgrind on -m32 built binary. Have I missed anything?
Comment 156 Jakub Jelinek 2012-11-02 16:28:59 UTC
(In reply to comment #155)
> I'm using gentoo x86_64 built with -march=native and trying to run valgrind
> on -m32 built binary. Have I missed anything?

Yes,  -m32 built binary is not x86_64, and AVX is only supported on x86_64, not on i?86 in valgrind.
Comment 157 Sergey Kishchenko 2012-11-02 16:43:23 UTC
Are there any plans to support plain x86 AVX instructions in upcoming valgrind releases? Should I create a separate issue for it?
Comment 158 Tom Hughes 2012-11-02 16:48:18 UTC
There already is one - bug #301967 but as Julian says on that bug it is unlikely to be supported.
Comment 159 Damien Levac 2013-10-11 17:12:54 UTC
I might be a little late about this issue, but for anyone still having issue with AVX on gentoo (maybe for x86 people?), there is no need to not use -march=native... just do CFLAGS="{$CFLAGS} -march=native -mno-avx" which will enable all relevant optimization for your CPU without AVX. then 'emerge -eav @world @system' and you are good to go!