Bug 222560

Summary: ARM NEON support
Product: [Developer tools] valgrind
Reporter: Dmitry Zhurikhin <zhur>
Component: vex
Assignee: Julian Seward <jseward>
Status: RESOLVED FIXED
Severity: wishlist
CC: batuzovk, glider
Priority: NOR    
Version First Reported In: unspecified   
Target Milestone: ---   
Platform: unspecified   
OS: All   
Latest Commit:
Version Fixed In:
Sentry Crash Report:
Attachments: Initial NEON support patch for VEX SVN revision 1954
An example program with instructions mentioned in comment #1
Integer NEON support patch for VEX SVN revision 1963
Integer NEON support patch for Valgrind SVN revision 11057
NEON support patch for VEX SVN revision 1964
NEON support patch for Valgrind SVN revision 11094
FCONSTS VFP instruction support
Fix for the bug from the previous comment
NEON support patch for VEX SVN revision 1974
NEON support patch for Valgrind SVN revision 11104
NEON support patch for VEX SVN revision 1974
NEON support patch for Valgrind SVN revision 11104

Description Dmitry Zhurikhin 2010-01-13 17:22:24 UTC
Created attachment 39850 [details]
Initial NEON support patch for VEX SVN revision 1954

Hello. This is an effort to add support for NEON (ARM Advanced SIMD extension) instruction set to Valgrind.

The initial version of the patch has infrastructure for disassembling and reassembling NEON instructions, and about 35 NEON instructions implemented (but not extensively tested yet). We are planning to add the full NEON instruction set incrementally.

This patch is not ready for submission yet. We just want to announce that there is some work in progress, in order to avoid duplication of effort. The resulting patch will be rather big, so before submission we will do some cleanup and split it into several smaller ones. If anyone wants to help with testing or development please feel free to contact me. Any comments are welcome too.
Comment 1 Alexander Potapenko 2010-02-03 13:43:52 UTC
I'm ready to help testing the patch on Google Chromium tests built for ARM
(http://code.google.com/p/chromium/wiki/LinuxChromiumArm). The list of
instructions currently used by base_unittests built for NEON is the following:
vabs.f64
vadd.f64
vcmpe.f32
vcmpe.f64
vcmp.f64
vcvt.f32.f64
vcvt.f32.s32
vcvt.f64.f32
vcvt.f64.s32
vcvt.f64.u32
vcvt.s32.f32
vcvt.s32.f64
vcvt.u32.f64
vdiv.f32
vdiv.f64
vldmia
vldr
vmla.f64
vmov
vmov.f64
vmrs
vmul.f32
vmul.f64
vneg.f64
vpop
vpush
vsqrt.f64
vstr
vsub.f64

Unfortunately even the lack of vldr instruction support stops me from running
the tests with NEON instructions enabled.
Comment 2 Dmitry Zhurikhin 2010-02-03 15:18:15 UTC
That's mostly floating-point instructions. We are currently busy implementing the integer part of NEON, as we work under contract with Samsung and those are the first priority. We plan to make a patch with enough (integer) NEON instructions to be usable before the end of the month. Then we would probably work on the floating-point ones. Any help in implementing these will be highly appreciated.

PS: Some of the NEON instructions you listed (including vldr) are actually VFP instructions. They should be handled by trunk Valgrind, though it may not know about the d16..d31 registers.
Comment 3 Alexander Potapenko 2010-02-03 17:25:24 UTC
In fact they are not implemented yet, except for vmov (I've patched V to recognize d16..d31).

Thanks anyway, I'll be looking forward to your patches. I can also try to contribute once I have some time to experiment with the floating point instructions.
Comment 4 Kirill Batuzov 2010-02-04 17:29:03 UTC
Created attachment 40531 [details]
An example program with instructions mentioned in comment #1

Actually all these instructions are VFP and are supported by Valgrind. I created a small example which includes all these instructions. Today's Valgrind/VEX trunk successfully processed this example after I commented out two "vassert(0); //ATC" (at lines 3114 and 3240 in guest_arm_toIR.c).

You probably got confused by the fact that older (pre-UAL) syntax is used in Valgrind. So vadd.f64 is called faddd, vcvt.f32.f64 is called fcvtsd and so on.

Looks like the only instructions from your list which may be NEON instructions (or may be VFP, as in my example; it depends on the arguments) are:

vcvt.s32.f32
vcvt.f32.s32
vmul.f32
vmov
vmov.f64
vmrs
Comment 5 Alexander Potapenko 2010-02-04 18:02:02 UTC
Thanks for the clarification! I'll try to test those already implemented instructions then.
Comment 6 Dmitry Zhurikhin 2010-03-03 11:59:29 UTC
Created attachment 41281 [details]
Integer NEON support patch for VEX SVN revision 1963

Hello again.  This is an updated patch which adds ARM NEON support to Valgrind.  Important changes from the previous patch include:
* All integer NEON instructions are implemented and somewhat tested
* HWCAPS are introduced for ARM (they are needed during instruction selection)
* Tests for NEON instruction disassembly/reassembly added (but not included in 'make regtest')
Things to note:
* No floating-point NEON support is present yet
* HWCAPS detection may be inaccurate
Comment 7 Dmitry Zhurikhin 2010-03-03 12:04:08 UTC
Created attachment 41282 [details]
Integer NEON support patch for Valgrind SVN revision 11057
Comment 8 Dmitry Zhurikhin 2010-03-23 12:27:14 UTC
Created attachment 42197 [details]
NEON support patch for VEX SVN revision 1964

Hello.  Here's an update to the patch.  We've implemented all floating-point NEON instructions.  This work completes NEON support for Valgrind.  Please review the patches and comment on what should be done for our work to be included in the trunk.  One specific question is whether we need to rewrite the 'switch' statements in the disassembling code into a large number of 'if' statements (as the VFP and scalar disassembly is written now; this would increase the code size enormously but may be more convenient for maintenance and for adding new features).

Known issues:
 - Flags IDC, IXC, UFC, OFC, DZC, IOC are not set.  We do not think this will pose any problems any time soon because: a) they are not actually set in the existing VFP code either, and there are no complaints about that in the bugzilla; b) we did not manage to find any program that reads these flags using Google.
 - The DIP macro is not called for NEON instructions.
 - VCVT floating-point <-> fixed-point was not tested at all because we do not have hardware with the corresponding extension.
 - NEON tests should be compiled with -O0 because GCC tends to miscompile them.  See http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43440 for details.
 - You may need FCONSTS instruction support to run these tests.
Comment 9 Dmitry Zhurikhin 2010-03-23 12:28:12 UTC
Created attachment 42198 [details]
NEON support patch for Valgrind SVN revision 11094
Comment 10 Dmitry Zhurikhin 2010-03-23 12:29:46 UTC
Created attachment 42199 [details]
FCONSTS VFP instruction support
Comment 11 Julian Seward 2010-04-02 16:08:44 UTC
I gave it a try.  Although it passes its own regression tests,
I got an assertion failure when running real code.  Any ideas?

vex: priv/host_arm_isel.c:1982 (iselNeon64Expr_wrk): Assertion `ty == Ity_I64' failed.

This is with the patches on comments 8 and 9 but not the one
on comment 10 ("FCONSTS VFP instruction support").  Is that
also necessary?

This was running with --tool=none.  It was somewhere in this
block:

        0x5FFDF30:  stmdb r13!, {0x4070}

              ------ IMark(0x5FFDF30, 4) ------
              t0 = GET:I32(52)
              t1 = t0
              PUT(52) = Sub32(t0,0x10:I32)
              STle(Sub32(t1,0x4:I32)) = GET:I32(56)
              STle(Sub32(t1,0x8:I32)) = GET:I32(24)
              STle(Sub32(t1,0xC:I32)) = GET:I32(20)
              STle(Sub32(t1,0x10:I32)) = GET:I32(16)

        0x5FFDF34:  fstmdbd r13!, {d8-d8}

              ------ IMark(0x5FFDF34, 4) ------
              PUT(60) = 0x5FFDF34:I32
              t2 = GET:I32(52)
              t3 = Sub32(t2,0x8:I32)
              t4 = t3
              PUT(52) = t3
              STle(Add32(t4,0x0:I32)) = GET:F64(168)

        0x5FFDF38:  mov r5, r0

              ------ IMark(0x5FFDF38, 4) ------
              PUT(60) = 0x5FFDF38:I32
              t6 = GET:I32(0)
              t5 = t6
              t7 = t5
              PUT(20) = t7

        0x5FFDF3C:  vmov d8, r2, r3

              ------ IMark(0x5FFDF3C, 4) ------
              PUT(60) = 0x5FFDF3C:I32
              PUT(168) = ReinterpI64asF64(32HLto64(GET:I32(12),GET:I32(8)))

        0x5FFDF40:  ldr r4, [r15, #+184]

              ------ IMark(0x5FFDF40, 4) ------
              PUT(60) = 0x5FFDF40:I32
              t8 = Add32(0x5FFDF48:I32,0xB8:I32)
              t9 = 0x5FFDF48:I32
              PUT(16) = LDle:I32(t8)

        0x5FFDF44:  bl 0x5FFDDF8

              ------ IMark(0x5FFDF44, 4) ------
              PUT(60) = 0x5FFDF44:I32
              PUT(56) = 0x5FFDF48:I32

        0x5FFDDF8:  stmdb r13!, {0x40F0}

              ------ IMark(0x5FFDDF8, 4) ------
              PUT(60) = 0x5FFDDF8:I32
              t10 = GET:I32(52)
              t11 = t10
              PUT(52) = Sub32(t10,0x14:I32)
              STle(Sub32(t11,0x4:I32)) = GET:I32(56)
              STle(Sub32(t11,0x8:I32)) = GET:I32(28)
              STle(Sub32(t11,0xC:I32)) = GET:I32(24)
              STle(Sub32(t11,0x10:I32)) = GET:I32(20)
              STle(Sub32(t11,0x14:I32)) = GET:I32(16)

        0x5FFDDFC:  ldr r4, [r15, #+136]

              ------ IMark(0x5FFDDFC, 4) ------
              PUT(60) = 0x5FFDDFC:I32
              t12 = Add32(0x5FFDE04:I32,0x88:I32)
              t13 = 0x5FFDE04:I32
              PUT(16) = LDle:I32(t12)

        0x5FFDE00:  ldr r5, [r15, #+136]

              ------ IMark(0x5FFDE00, 4) ------
              PUT(60) = 0x5FFDE00:I32
              t14 = Add32(0x5FFDE08:I32,0x88:I32)
              t15 = 0x5FFDE08:I32
              PUT(20) = LDle:I32(t14)

        0x5FFDE04:  sub r13, r13, #0x14

              ------ IMark(0x5FFDE04, 4) ------
              PUT(60) = 0x5FFDE04:I32
              t16 = GET:I32(52)
              t17 = 0x14:I32
              t18 = Sub32(t16,t17)
              PUT(52) = t18

        0x5FFDE08:  add r4, r15, r4

              ------ IMark(0x5FFDE08, 4) ------
              PUT(60) = 0x5FFDE08:I32
              t19 = 0x5FFDE10:I32
              t21 = GET:I32(16)
              t20 = t21
              t22 = Add32(t19,t20)
              PUT(16) = t22

        0x5FFDE0C:  add r7, r4, r5

              ------ IMark(0x5FFDE0C, 4) ------
              PUT(60) = 0x5FFDE0C:I32
              t23 = GET:I32(16)
              t25 = GET:I32(20)
              t24 = t25
              t26 = Add32(t23,t24)
              PUT(28) = t26

        0x5FFDE10:  ldr r6, [r7, #+4]

              ------ IMark(0x5FFDE10, 4) ------
              PUT(60) = 0x5FFDE10:I32
              t27 = Add32(GET:I32(28),0x4:I32)
              t28 = GET:I32(28)
              PUT(24) = LDle:I32(t27)

        0x5FFDE14:  cmp r6, #0x0

              ------ IMark(0x5FFDE14, 4) ------
              PUT(60) = 0x5FFDE14:I32
              t29 = GET:I32(24)
              t30 = 0x0:I32
              t31 = 0x0:I32
              PUT(64) = 0x2:I32
              PUT(68) = t29
              PUT(72) = t30
              PUT(76) = t31

        0x5FFDE18:  b{eq} 0x5FFDE2C

              ------ IMark(0x5FFDE18, 4) ------
              PUT(60) = 0x5FFDE18:I32
              t32 = armg_calculate_condition[mcx=0x9]{0x38135ac4}(Or32(GET:I32(64),0x0:I32),GET:I32(68),GET:I32(72),GET:I32(76)):I32
              if (32to1(t32)) goto {Boring} 0x5FFDE2C:I32
              goto {Boring} 0x5FFDE1C:I32
Comment 12 Dmitry Zhurikhin 2010-04-02 18:34:42 UTC
Created attachment 42446 [details]
Fix for the bug from the previous comment

Ouch.  This is a "caused-by-lack-of-attention" bug.  Please use this patch.
Comment 13 Julian Seward 2010-04-04 14:00:28 UTC
Nice work, first of all.  With the fix in comment 12, it seems to work
ok at least from my very limited testing.

This is a huge patch pair and it will take some time to review/check
it all.  Here are some comments about the VEX part.  I have a number
of small comments and two larger concerns.  The small comments first:


----------

+/* CALLED FROM GENERATED CODE: CLEAN HELPER */
+/* Calculate the QC flag from the thunk components, in the lowest bit
+   of the word (bit 0). */
+UInt armg_calculate_flag_qc ( UInt resL1, UInt resL2,
+                              UInt resR1, UInt resR2 )
+{
+   if (resL1 != resR1 || resL2 != resR2)
+      return 1;
+   else
+      return 0;
+}

Wouldn't it be faster to compute this in-line?  The call/return
overhead is large compared to the amount of useful work.  Also,
inlining might allow ir_opt to improve it if any of the args are
constants.

   1Uto32(
      CmpNE32(
         Or32(Xor32(resL1, resR1), Xor32(resL2, resR2)),
         mkU32(0)
      )
   )
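
Purely as an illustration (not code from the patch), the helper and the
suggested inline IR compute the same value; in scalar C terms:

   #include <stdint.h>

   /* Helper-style computation, as in armg_calculate_flag_qc above. */
   static uint32_t qc_helper ( uint32_t resL1, uint32_t resL2,
                               uint32_t resR1, uint32_t resR2 )
   {
      return (resL1 != resR1 || resL2 != resR2) ? 1 : 0;
   }

   /* Inline-style computation, mirroring the IR fragment: the OR of the
      two XORs is nonzero exactly when either pair differs. */
   static uint32_t qc_inline ( uint32_t resL1, uint32_t resL2,
                               uint32_t resR1, uint32_t resR2 )
   {
      return ((resL1 ^ resR1) | (resL2 ^ resR2)) != 0;
   }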


----------

+static inline
+UInt dis_neon_reg_d(UInt theInstr)

Please use "dis_" only for functions which generate IR.  This function
and its friends produce a register number AFAICS.  A better name would
be get_neon_d_regno.



----------

No panicking/asserting on unrecognised instructions:

+            if(C == 3) {
+               vpanic("dis_neon_data_3same: VHSUB encountered undefined instruction\n");
+            }

In this and other places: we should never panic or assert when
encountering an instruction that is not recognised.  Instead
control must flow back to disArmInstr_WRK so that the
standard "generate a SIGILL" piece of IR can be generated.


----------

NO TABS

Please make sure there aren't any tab characters in the patch.  I long
ago told my emacs to only insert spaces, not tabs.


----------

Excessively verbose switches.  In many places in both the front
and back ends, you have simple switch statements which take up
many lines of code.  Please compact these where it is possible to
do so without exceeding an 80-char line length.  eg

+                  switch (size) {
+                     case 0:
+                        op = Iop_Max8Sx8;
+                        break;
+                     case 1:
+                        op = Iop_Max16Sx4;
+                        break;
+                     case 2:
+                        op = Iop_Max32Sx2;
+                        break;
+                     case 3:
+                        return False;
+                     default:
+                        vassert(0);
+                  }

becomes

+                  switch (size) {
+                     case 0:  op = Iop_Max8Sx8;  break;
+                     case 1:  op = Iop_Max16Sx4; break;
+                     case 2:  op = Iop_Max32Sx2; break;
+                     case 3:  return False;
+                     default: vassert(0);
+                  }


-----------

Space after if's:  "if(" is ugly; use "if ("

+            if(U == 1) {
+            if (U == 1) {


------------

If you have #if 0'd one of these "vassert(0); //ATC" assertions, just
delete it.  ATC means Awaiting Test Case, so I assume you found such
a test case for each one you #if 0'd out.

+#if 0
          vassert(0); //ATC
+#endif


------------

In comment 8 you write:

> - DIP macro is not called for NEON instructions.

Please ensure the front end can print _all_ NEON instructions
that it can translate into IR.  It is impossible to
maintain/debug the front ends without proper printing.



------------

+extern ARMAModeN* mkARMAModeN_RR ( HReg, HReg );
+extern ARMAModeN* mkARMAModeN_R ( HReg );

Is the _RR form actually used?  I thought the only allowable addressing
mode for Neon was _R.


------------

+      ARMin_NShift,
+      ARMin_NeonImm,
+      /* This is not a NEON instruction. Actually there is no corresponding
+         instruction in ARM instruction set at all. We need this one to
+         generate spill/reload of 128-bit registers since current register
+         allocator demands them to consist of no more than two instructions.
+         We will split this instruction into 2 or 3 ARM instructions on the
+         emiting phase.
+
+         NOTE: source and destination registers should be different!
*/

I did not understand this comment.  What does spilling have to do with
generating Neon immediates?  And why do Neon spills require > 2
instructions?


------------

+               typeOfIRExpr(env->type_env, e->Iex.Binop.arg2) != Ity_I8) {
+                  vassert("ARM supports Iop_VDup with constant "
+                          "second argument less than 16 only\n");
+            }
+            imm4 = e->Iex.Binop.arg2->Iex.Const.con->Ico.U8;
+            if (imm4 >= 16) {
+               vassert("ARM supports Iop_VDup with constant "
+                       "second argument less than 16 only\n");
+            }

that's not right.  You mean to call vpanic, not vassert, yes?


------------

Larger concerns:

(0) Regressions.  Did you check that the resulting Valgrind still
builds and runs on 32- and 64-bit x86 linux, and that the
regression test results on those platforms are unchanged?


(1) QC flag handling.  You've added support for QC -- is that a
sticky saturation flag?  How does this interact, if at all, with
the existing NZCV flag handling?  Where in the guest state is QC
stored?


(2) New primops.  This is my #1 concern.  A large number of new
IROps have been added.  Normally I try to avoid adding primops
unless it's really hard to avoid, because (a) adding primops
impacts all tools built on Valgrind, and (b) it's impossible to
get rid of them later.

Are you sure all of them are really necessary?  I wonder if some
of them don't duplicate existing primops, or can be synthesised
using a few existing primops.  I will study the added primops and
see if it is possible to reduce the number of new ones.
Comment 14 Dmitry Zhurikhin 2010-04-05 12:26:12 UTC
Thanks a lot for these comments!

Regarding the minor issues: we'll correct them (and have already corrected several) and will post an updated patch as soon as possible.  To clarify, the comment after "ARMin_NeonImm," refers to the next instruction, "ARMin_Add32", not to "ARMin_NeonImm" itself.  Two or more instructions are needed because the ARM immediate field is sometimes not wide enough to address NEON registers in the guest state.  So one instruction does the load/store, one calculates the offset (r12 = r8 + immediate), and one more is needed if the immediate can't be represented in the 8-4 form.
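
For illustration only (a sketch, not code from the patch; the helper name is made up), the 8-4 form appears to be the usual ARM data-processing immediate, i.e. an 8-bit value rotated right by twice a 4-bit amount.  Assuming that reading, a check for it amounts to:

   #include <stdbool.h>
   #include <stdint.h>

   /* Hypothetical check: can 'imm' be encoded as imm8 rotated right by
      2 * rot4, for some rot4 in 0..15? */
   static bool fits_in_8_4_form ( uint32_t imm )
   {
      for (uint32_t rot4 = 0; rot4 < 16; rot4++) {
         uint32_t r = 2 * rot4;
         /* Rotate imm left by r; if the result fits in 8 bits, then
            rotating that 8-bit value right by r reproduces imm. */
         uint32_t unrotated =
            (r == 0) ? imm : ((imm << r) | (imm >> (32 - r)));
         if (unrotated <= 0xFF)
            return true;
      }
      return false;
   }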

Now to the bigger concerns.

(0) We tested that Valgrind runs on x86 and amd64 but didn't run the test suite.  We'll surely do this, but I think there shouldn't be any problems.

(1) Yes, QC is a saturation flag.  It is stored along with NZCV in the FPSCR register and is read and written via special instructions (VMRS and VMSR).

(2) Actually we tried not to introduce unneeded new IROps.  But unfortunately there are many very specific NEON instructions; e.g. most (if not all) saturating instructions can't be represented with simpler ones.  Beyond those, AFAIK we mostly added IROps for missing sizes or types of existing operations.  We'll go over the new IROps again and try to reduce their number.
Comment 15 Dmitry Zhurikhin 2010-04-19 14:45:52 UTC
Created attachment 42893 [details]
NEON support patch for VEX SVN revision 1974

Here's one more update, correcting most of the issues raised.

> Wouldn't it be faster to compute this in-line?
For some unknown reason it is not.  We implemented the inline computation and measured the execution time of 10^9 VQADD instructions: the helper call was faster.  Anyway, right now both variants are present in the code and one of them is chosen by an #if directive.

DIP is now implemented for all NEON instructions.

> Is the _RR form actually used?  I thought the only allowable
> addressing mode form Neon was _R.
The _RR form exists.  It is not "[rN + rM]" but "[rN], rM", which means
"read from address rN and then increment rN by rM".

About the new IROps:
After some review we managed to get rid of Iop_PolynomialMul16x4, Iop_PolynomialMul32x2, Iop_PolynomialMul16x8, Iop_PolynomialMul32x4, Iop_SetElem8x16, Iop_SetElem16x8, Iop_SetElem32x4, Iop_SetElem64x2 and Iop_VDup<all sizes>.
What is left: conversions, saturating operations, pairwise operations, some data-shuffling operations, and missing sizes for existing operations.  These are hard to represent with a few existing primops.  Actually we are thinking about adding new IROps for VRHADD, VHADD, VRHSUB, VHSUB: they are currently built from existing primops, but this has a serious impact on performance.

Also, cleanup and style fixes were done (thanks again for the comments).

We also lightly tested this patch on x86 and amd64 and it showed no problems.

NB: This patch contains some code that should be removed before committing (such as the FCONSTS VFP instruction patch, debugging code, and some fixes that allow building with the newest CodeSourcery 2009q3 compiler).  Once approved, we will clean up the NEON patch and make additional separate patches.
Comment 16 Dmitry Zhurikhin 2010-04-19 14:47:10 UTC
Created attachment 42894 [details]
NEON support patch for Valgrind SVN revision 11104
Comment 17 Julian Seward 2010-05-24 12:21:26 UTC
Thanks for the revised patch.  I'm still unhappy about the large
number of new IROps, but I see they are mostly unavoidable.

Could you provide better documentation in comments about what the new
primops do?  In many cases I can't even guess what the arity, type or
behaviour is.  It would be good to give an example (eg) like this for
all the new primops, or at least for each group:

   Iop_Foo32x2( [a,b], [c,d] ) = [a+d, c-b]

Also, there are "P" and "D" in many of the primop names.  What
do they mean?


+      /* Convertion to/from int */
+      Iop_I32UtoFx2,     Iop_I32StoFx2,       /* I32x4 -> F32x4 */
Doesn't take a rounding mode?  What's the behaviour in the case
where the value needs to be rounded?


+      Iop_F32ToFixed32Ux2, Iop_F32ToFixed32Sx2, /* fp -> fixed-point */
+      Iop_Fixed32UToF32x2, Iop_Fixed32SToF32x2, /* fixed-point -> fp */
What is this Fixed32 format?


+      Iop_PMax32Fx2,     Iop_PMin32Fx2,
What does the P mean here?


+      Iop_Abd32Fx2,
Abd ?


+      /* ADDITION (pairwise) */
+      /* The first element of the resulting vector is a sum of the first and
+         the second elements of the left operand, second - a sum of the third
+         and the forth elements of the left operand etc. */
+      Iop_PAdd8x8, Iop_PAdd16x4, Iop_PAdd32x2,
I don't understand.  If a,b,c,d are all 32-bit int values,
what does (eg)
   Iop_PAdd32x2( [a,b], [c,d] )
produce?


+      /* Longening variant is unary. The resulting vector contains two times
+         less elements than operand, but they are two times wider. */
+      Iop_PAddL8Ux8, Iop_PAddL16Ux4, Iop_PAddL32Ux2,
+      Iop_PAddL8Sx8, Iop_PAddL16Sx4, Iop_PAddL32Sx2,
again, what does (eg)
   Iop_PAddL32Ux2( [a,b] )
produce?


+      /* ABSOLUTE VALUE */
+      Iop_Abs8x8, Iop_Abs16x4, Iop_Abs32x2,
what is the result when the argument value is the lowest negative
value (-0x80, -0x8000, -0x80000000) ?


+      Iop_QDMulHi16Sx4, Iop_QDMulHi32Sx2,
+      Iop_QRDMulHi16Sx4, Iop_QRDMulHi32Sx2,
what are these?


+      /* MIN/MAX (pairwise) */
+      Iop_PMax8Sx8, Iop_PMax16Sx4, Iop_PMax32Sx2,
+      Iop_PMax8Ux8, Iop_PMax16Ux4, Iop_PMax32Ux2,
+      Iop_PMin8Sx8, Iop_PMin16Sx4, Iop_PMin32Sx2,
+      Iop_PMin8Ux8, Iop_PMin16Ux4, Iop_PMin32Ux2,
what are these?


+      /* COUNT ones / leading zeroes / leading sign bits (not including topmost
+         bit) */
+      Iop_Cnt8x8,
+      Iop_Clz8Sx8, Iop_Clz16Sx4, Iop_Clz32Sx2,
+      Iop_Cls8Sx8, Iop_Cls16Sx4, Iop_Cls32Sx2,
if Iop_Cnt8x8 counts leading ones, wouldn't Iop_Clo8x8 be a more
consistent name?


+      /* GET / SET elements of VECTOR
+         GET is binop (I64, I8) -> I<elem_size>
+         SET is triop (I64, I8, I<elem_size>) -> I64 */
+      Iop_GetElem8x8, Iop_GetElem16x4, Iop_GetElem32x2,
+      Iop_SetElem8x8, Iop_SetElem16x4, Iop_SetElem32x2,
what happens if the I8 is not known at JIT time, only at
run time?  Can the back end handle this case?


+      /* ZIP / UNZIP
+          unzipped           zipped
+         ( A B C D ) <--> ( G C H D )
+         ( E F G H )      ( E A F B ) */
+      Iop_ZipHI8x8, Iop_ZipHI16x4, Iop_ZipHI32x2,
+      Iop_ZipLO8x8, Iop_ZipLO16x4, Iop_ZipLO32x2,
+      Iop_UnZipHI8x8, Iop_UnZipHI16x4, Iop_UnZipHI32x2,
+      Iop_UnZipLO8x8, Iop_UnZipLO16x4, Iop_UnZipLO32x2,
I didn't understand the comment.  What does (eg)
   ZipHI16x4( [a,b,c,d], [e,f,g,h] )
produce?

How do these relate to Iop_Interleave{HI,LO}* and
Iop_Cat{Even,Odd}Lanes* ?  I wonder if there is some
redundancy here.


+      /* EXTRACT -- copy 8-N highest bytes from argL to 8-N lowest bytes
+         of result and N lowest bytes of argR to N highest bytes of result. */
+      Iop_Extract64,
What's N ?  I can't even guess the arity of this primop from
the description.  (1-arg?  2-arg?)


+      /* REVERSE in Half-words, Words, Double-words */
+      Iop_Reverse16_8x8,
+      Iop_Reverse32_8x8, Iop_Reverse32_16x4,
+      Iop_Reverse64_8x8, Iop_Reverse64_16x4, Iop_Reverse64_32x2,
what are these?


+      /* Vector Reciprocal Estimate finds an approximate reciprocal of each
+      element in the operand vector, and places the results in the destination
+      vector.  */
+      Iop_Recip32Fx2, Iop_Recip32x2,
+      Iop_Recps32Fx2,
+
+      Iop_Rsqrts32Fx2,
+      Iop_Rsqrte32Fx2, Iop_Rsqrte32x2,
What's the difference between  Iop_Recip32Fx2 and Iop_Recps32Fx2 ?
What's the difference between Iop_Rsqrts32Fx2 and Iop_Rsqrte32Fx2 ?
How many good bits are there in the estimate (see Iop_Est5RFSqrt) ?
What happens if the argument is zero?


+      /* Vector Absolute */
+      Iop_Abd32Fx4,
+      Iop_Abs32Fx4,
arity? type? example?


+      /* WIDENING */
+      /* Longening - opposite to shortening, I64->V128 */
+      Iop_Longen8Ux8, Iop_Longen16Ux4, Iop_Longen32Ux2,
+      Iop_Longen8Sx8, Iop_Longen16Sx4, Iop_Longen32Sx2,
please, give an example
Comment 18 Dmitry Zhurikhin 2010-06-01 16:26:44 UTC
> Thanks for the revised patch.  I'm still unhappy about the large
> number of new IROps, but I see they are mostly unavoidable.
>
> Could you provide better documentation in comments about what the new
> primops do?  In many cases I can't even guess what the arity, type or
> behaviour is.  It would be good to give an example (eg) like this for
> all the new primops, or at least for each group:
>
>     Iop_Foo32x2( [a,b], [c,d] ) = [a+d, c-b]
>
> Also, there are "P" and "D" in many of the primop names.  What
> do they mean?
> +      /* Convertion to/from int */
> +      Iop_I32UtoFx2,     Iop_I32StoFx2,       /* I32x4 ->  F32x4 */
> Doesn't take a rounding mode?  What's the behaviour in the case
> where the value needs to be rounded?
>
>
> +      Iop_F32ToFixed32Ux2, Iop_F32ToFixed32Sx2, /* fp ->  fixed-point */
> +      Iop_Fixed32UToF32x2, Iop_Fixed32SToF32x2, /* fixed-point ->  fp */
> What is this Fixed32 format?

The Fixed32 format is a fixed-point number with a fixed number of fraction bits. The number of fraction bits is passed as the second argument.
The floating-point to fixed-point operation uses the Round towards Zero rounding mode. The fixed-point to floating-point operation uses the Round to Nearest rounding mode. So no rounding-mode arguments are passed to these conversion operations. If that is not acceptable, and a rounding mode should always be passed to conversion operations and asserted to have a certain value later in the instruction selection phase, we can change these IROps accordingly.
Iop_I32UtoFx2 and Iop_I32StoFx2 use the Round to Nearest rounding mode.
We can rename them by adding _RN and _RZ suffixes, like Iop_FtoI32Ux4_RZ, if needed.
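
As a rough scalar sketch of the intended semantics (illustration only; it ignores saturation and double-rounding subtleties, and assumes frac_bits is in 1..31):

   #include <stdint.h>

   /* F32 -> Fixed32S: C's float-to-int conversion truncates, i.e. rounds
      towards zero, matching the description above. */
   static int32_t f32_to_fixed32s ( float x, unsigned frac_bits )
   {
      return (int32_t)(x * (float)(1u << frac_bits));
   }

   /* Fixed32S -> F32: the int-to-float conversion rounds to nearest under
      the default IEEE mode, and dividing by a power of two is exact. */
   static float fixed32s_to_f32 ( int32_t x, unsigned frac_bits )
   {
      return (float)x / (float)(1u << frac_bits);
   }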


> +      Iop_PMax32Fx2,     Iop_PMin32Fx2,
> What does the P mean here?
>
> +      /* MIN/MAX (pairwise) */
> +      Iop_PMax8Sx8, Iop_PMax16Sx4, Iop_PMax32Sx2,
> +      Iop_PMax8Ux8, Iop_PMax16Ux4, Iop_PMax32Ux2,
> +      Iop_PMin8Sx8, Iop_PMin16Sx4, Iop_PMin32Sx2,
> +      Iop_PMin8Ux8, Iop_PMin16Ux4, Iop_PMin32Ux2,
> what are these?
>
> +      /* ADDITION (pairwise) */
> +      /* The first element of the resulting vector is a sum of the first and
> +         the second elements of the left operand, second - a sum of the third
> +         and the forth elements of the left operand etc. */
> +      Iop_PAdd8x8, Iop_PAdd16x4, Iop_PAdd32x2,
> I don't understand.  If a,b,c,d are all 32-bit int values,
> what does (eg)
>     Iop_PAdd32x2( [a,b], [c,d] )
> produce?

P means pairwise.
Iop_PFoo32x4([a,b,c,d],[e,f,g,h]) =
[Foo32(a,b), Foo32(c,d), Foo32(e,f), Foo32(g,h)]


>
>
> +      /* Longening variant is unary. The resulting vector contains two times
> +         less elements than operand, but they are two times wider. */
> +      Iop_PAddL8Ux8, Iop_PAddL16Ux4, Iop_PAddL32Ux2,
> +      Iop_PAddL8Sx8, Iop_PAddL16Sx4, Iop_PAddL32Sx2,
> again, what does (eg)
>     Iop_PAddL32Ux2( [a,b] )
> produce?

Imagine the argument vector consists of 4 16-bit values a, b, c, d. The resulting vector then consists of 2 32-bit values a+b and c+d.
Iop_PAddL16Ux4( [a,b,c,d] ) = [a+b,c+d]
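
A scalar reference sketch of both forms, using the same element order as the [a,b,c,d] notation (illustration only):

   #include <stdint.h>

   /* Iop_PAdd16x4( [a,b,c,d], [e,f,g,h] ) = [a+b, c+d, e+f, g+h] */
   static void padd16x4 ( const uint16_t l[4], const uint16_t r[4],
                          uint16_t res[4] )
   {
      res[0] = (uint16_t)(l[0] + l[1]);
      res[1] = (uint16_t)(l[2] + l[3]);
      res[2] = (uint16_t)(r[0] + r[1]);
      res[3] = (uint16_t)(r[2] + r[3]);
   }

   /* Iop_PAddL16Ux4( [a,b,c,d] ) = [a+b, c+d], widened to 32 bits */
   static void paddl16ux4 ( const uint16_t a[4], uint32_t res[2] )
   {
      res[0] = (uint32_t)a[0] + a[1];
      res[1] = (uint32_t)a[2] + a[3];
   }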


> +      /* ABSOLUTE VALUE */
> +      Iop_Abs8x8, Iop_Abs16x4, Iop_Abs32x2,
> what is the result when the argument value is the lowest negative
> value (-0x80, -0x8000, -0x80000000) ?

The result will be equal to the argument in this case. It is always guaranteed that the result is the absolute value of the argument when treated as an unsigned integer.
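
A tiny illustration (not code from the patch) of one 8-bit lane:

   #include <stdint.h>

   /* Iop_Abs8x8, one lane: abs(-0x80) wraps back to 0x80, which is the
      correct absolute value when the result is read as unsigned. */
   static uint8_t abs8_lane ( int8_t x )
   {
      return (uint8_t)(x < 0 ? -x : x);   /* abs8_lane(-0x80) == 0x80 */
   }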


> +      Iop_QDMulHi16Sx4, Iop_QDMulHi32Sx2,
> +      Iop_QRDMulHi16Sx4, Iop_QRDMulHi32Sx2,
> what are these?

They are Vector Saturating Doubling Multiply Returning High Half and Vector Saturating Rounding Doubling Multiply Returning High Half.
They multiply corresponding elements in two vectors, double the results, and place the most significant half of the final results in the destination vector. The results are truncated or rounded. If any of the results overflow, they are saturated.

Example:
Iop_QDMulHi16Sx4([a,b,c,d],[e,f,g,h]) = [SignedSat((a * e * 2) >> 16, 16), SignedSat((b * f * 2) >> 16, 16), SignedSat((c * g * 2) >> 16, 16), SignedSat((d * h * 2) >> 16, 16)]
Iop_QRDMulHi16Sx4([a,b,c,d],[e,f,g,h]) = [SignedSat((a * e * 2 + (1 << 15)) >> 16, 16), SignedSat((b * f * 2 + (1 << 15)) >> 16, 16), SignedSat((c * g * 2 + (1 << 15)) >> 16, 16), SignedSat((d * h * 2 + (1 << 15)) >> 16, 16)]
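
A per-lane scalar sketch of these formulas (illustration only; it mirrors the ">> 16" above, so an arithmetic right shift is assumed):

   #include <stdint.h>

   static int16_t signed_sat_32_to_16 ( int32_t x )
   {
      if (x > 32767)  return 32767;
      if (x < -32768) return -32768;
      return (int16_t)x;
   }

   /* One lane of Iop_QDMulHi16Sx4: SignedSat((a * e * 2) >> 16, 16) */
   static int16_t qdmulhi16_lane ( int16_t a, int16_t e )
   {
      int64_t p = 2 * (int64_t)a * (int64_t)e;
      return signed_sat_32_to_16((int32_t)(p >> 16));
   }

   /* One lane of Iop_QRDMulHi16Sx4: the rounding variant adds 1 << 15
      before taking the high half. */
   static int16_t qrdmulhi16_lane ( int16_t a, int16_t e )
   {
      int64_t p = 2 * (int64_t)a * (int64_t)e + (1 << 15);
      return signed_sat_32_to_16((int32_t)(p >> 16));
   }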


> +      /* COUNT ones / leading zeroes / leading sign bits (not including
> topmost
> +         bit) */
> +      Iop_Cnt8x8,
> +      Iop_Clz8Sx8, Iop_Clz16Sx4, Iop_Clz32Sx2,
> +      Iop_Cls8Sx8, Iop_Cls16Sx4, Iop_Cls32Sx2,
> if Iop_Cnt8x8 counts leading ones, would't Iop_Clo8x8 be a more
> consistent name?

Iop_Cnt8x8 counts all ones in each byte of the input vector, not only leading ones.


> +      /* GET / SET elements of VECTOR
> +         GET is binop (I64, I8) ->  I<elem_size>
> +         SET is triop (I64, I8, I<elem_size>) ->  I64 */
> +      Iop_GetElem8x8, Iop_GetElem16x4, Iop_GetElem32x2,
> +      Iop_SetElem8x8, Iop_SetElem16x4, Iop_SetElem32x2,
> what happens if the I8 is not known at JIT time, only at
> run time?  Can the back end handle this case?

On ARM the I8 is hardcoded into the instruction. The back end cannot handle the case where the I8 is known only at run time. There is an assert with a proper error message in the back end.
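
For example, a scalar sketch of Iop_GetElem32x2 (illustration only; it assumes lane 0 lives in the low 32 bits of the I64 value):

   #include <stdint.h>

   /* Pick 32-bit lane 'ix' (0 or 1) out of a 64-bit vector value. */
   static uint32_t getElem32x2 ( uint64_t vec, uint8_t ix )
   {
      return (uint32_t)(vec >> (32 * (ix & 1)));
   }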


> +      /* ZIP / UNZIP
> +          unzipped           zipped
> +         ( A B C D )<-->  ( G C H D )
> +         ( E F G H )      ( E A F B ) */
> +      Iop_ZipHI8x8, Iop_ZipHI16x4, Iop_ZipHI32x2,
> +      Iop_ZipLO8x8, Iop_ZipLO16x4, Iop_ZipLO32x2,
> +      Iop_UnZipHI8x8, Iop_UnZipHI16x4, Iop_UnZipHI32x2,
> +      Iop_UnZipLO8x8, Iop_UnZipLO16x4, Iop_UnZipLO32x2,
> I didn't understand the comment.  What does (eg)
>     ZipHI16x4( [a,b,c,d], [e,f,g,h] )
> produce?

ZipHI16x4( [a,b,c,d], [e,f,g,h] ) = [g, c, h, d]
ZipLO16x4( [a,b,c,d], [e,f,g,h] ) = [e, a, f, b]
UnZipHI16x4( [g, c, h, d], [e, a, f, b] ) = [a, b, c, d]
UnZipLO16x4( [g, c, h, d], [e, a, f, b] ) = [e, f, g, h]

> How do these relate to Iop_Interleave{HI,LO}* and
> Iop_Cat{Even,Odd}Lanes* ?  I wonder if there is some
> redundancy here.

This is a difficult question. We don't actually have a clear understanding of their semantics. Where should we look for information about them?


> +      /* EXTRACT -- copy 8-N highest bytes from argL to 8-N lowest bytes
> +         of result and N lowest bytes of argR to N highest bytes of result. */
> +      Iop_Extract64,
> What's N ?  I can't even guess the arity of this primop from
> the description.  (1-arg?  2-arg?)

EXTRACT -- copy 8-arg3 highest bytes from arg1 to 8-arg3 lowest bytes
of result and arg3 lowest bytes of arg2 to arg3 highest bytes of result.
(I64, I64, I8) -> I64
The ARM back end handles only a constant arg3 in this operation.
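
A scalar sketch of that description (illustration only; "highest" is taken to mean most significant):

   #include <stdint.h>

   /* Iop_Extract64(arg1, arg2, n): the 8-n highest bytes of arg1 become
      the lowest bytes of the result, and the n lowest bytes of arg2
      become the highest bytes of the result.  n is 0..7. */
   static uint64_t extract64 ( uint64_t arg1, uint64_t arg2, unsigned n )
   {
      if (n == 0)
         return arg1;   /* avoid shifting by 64 */
      return (arg1 >> (8 * n)) | (arg2 << (8 * (8 - n)));
   }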

> +      /* REVERSE in Half-words, Words, Double-words */
> +      Iop_Reverse16_8x8,
> +      Iop_Reverse32_8x8, Iop_Reverse32_16x4,
> +      Iop_Reverse64_8x8, Iop_Reverse64_16x4, Iop_Reverse64_32x2,
> what are these?

REVERSE the order of elements in each Half-word, Word or Double-word.
Reverse16_8x8([a,b,c,d,e,f,g,h]) = [b,a,d,c,f,e,h,g]
Reverse32_8x8([a,b,c,d,e,f,g,h]) = [d,c,b,a,h,g,f,e]
Reverse64_8x8([a,b,c,d,e,f,g,h]) = [h,g,f,e,d,c,b,a]
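
As a scalar sketch (illustration only), Reverse16_8x8 on a 64-bit value with one byte per lane is just a byte swap within each half-word:

   #include <stdint.h>

   /* Swap adjacent bytes within each 16-bit half-word of a 64-bit vector. */
   static uint64_t reverse16_8x8 ( uint64_t v )
   {
      return ((v & 0x00FF00FF00FF00FFULL) << 8)
           | ((v >> 8) & 0x00FF00FF00FF00FFULL);
   }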


> +      /* Vector Reciprocal Estimate finds an approximate reciprocal of each
> +      element in the operand vector, and places the results in the destination
> +      vector.  */
> +      Iop_Recip32Fx2, Iop_Recip32x2,
> +      Iop_Recps32Fx2,
> +
> +      Iop_Rsqrts32Fx2,
> +      Iop_Rsqrte32Fx2, Iop_Rsqrte32x2,
> What's the difference between  Iop_Recip32Fx2 and Iop_Recps32Fx2 ?
> What's the difference between Iop_Rsqrts32Fx2 and Iop_Rsqrte32Fx2 ?
> How many good bits are there in the estimate (see Iop_Est5RFSqrt) ?
> What happens if the argument is zero?

Iop_Recip finds a reciprocal estimate of each element in the operand vector. If the result of Iop_Recip(d) is used as x[0] of the recurrence
x[n+1] = x[n] * (2 - d*x[n])
then the sequence x[n] converges to 1/d.
If the argument is zero the result is infinity.
Iop_Recps represents one iteration of the recurrence above. It computes
(2.0 - arg1 * arg2).
Iop_Rsqrte finds a reciprocal square root estimate of each element in the operand vector. If the result of Iop_Rsqrte(d) is used as x[0] of the recurrence
x[n+1] = x[n] * (3.0 - d*x[n]*x[n]) / 2
then the sequence x[n] converges to 1/sqrt(d).
Iop_Rsqrts represents one iteration of the recurrence above. It computes
(3.0 - arg1 * arg2) / 2.0
We have no idea how many good bits are in the estimate -- it will need additional study.
We are going to replace Recps and Rsqrts with explicit computations of their formulas.
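
To show how the pieces fit together, here is a sketch (not code from the patch) of the refinement loop built from the Recps step:

   /* recps(a, b) = 2.0 - a*b, as described above. */
   static float recps_step ( float a, float b )
   {
      return 2.0f - a * b;
   }

   /* Newton-Raphson refinement: x[n+1] = x[n] * (2 - d*x[n]).
      'x0' stands for whatever initial estimate Iop_Recip produces. */
   static float refine_recip ( float d, float x0 )
   {
      float x = x0;
      for (int i = 0; i < 2; i++)
         x = x * recps_step(d, x);
      return x;
   }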


> +      /* Vector Absolute */
> +      Iop_Abd32Fx4,
> +      Iop_Abs32Fx4,
> arity? type? example?
>
>
> +      Iop_Abd32Fx2,
> Abd ?
>
Abd means absolute difference. So Abs is a unary operation
Abs32Fx4( [a, b, c, d] ) = [ abs(a), abs(b), abs(c), abs(d) ]
and Abd is binary
Abd32Fx4( [a, b, c, d], [e, f, g, h] ) =
[ abs(a - e), abs(b - f), abs(c - g), abs(d - h) ]
Abd is probably redundant and can be expressed via Abs and Sub, but the behaviour for extreme values (NaN, ±inf and so on) needs to be checked.


> +      /* WIDENING */
> +      /* Longening - opposite to shortening, I64->V128 */
> +      Iop_Longen8Ux8, Iop_Longen16Ux4, Iop_Longen32Ux2,
> +      Iop_Longen8Sx8, Iop_Longen16Sx4, Iop_Longen32Sx2,
> please, give an example

Longening sign- or zero-extends each element of the argument vector to twice its original size. The resulting vector consists of the same number of elements, but each element, and hence the vector itself, is twice as wide.
Iop_Longen32Sx2( [a, b] ) = [c, d]
where c = Iop_32Sto64(a) and d = Iop_32Sto64(b)
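
A scalar sketch of Iop_Longen32Sx2 (illustration only; lane 0 is taken to be the low 32 bits of the I64 argument):

   #include <stdint.h>

   /* I64 -> V128: sign-extend each 32-bit lane to 64 bits. */
   static void longen32sx2 ( uint64_t arg, uint64_t res[2] )
   {
      res[0] = (uint64_t)(int64_t)(int32_t)(uint32_t)arg;         /* lane 0 */
      res[1] = (uint64_t)(int64_t)(int32_t)(uint32_t)(arg >> 32); /* lane 1 */
   }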
Comment 19 Julian Seward 2010-06-01 17:33:43 UTC
> ZipHI16x4( [a,b,c,d], [e,f,g,h] ) = [g, c, h, d]
> ZipLO16x4( [a,b,c,d], [e,f,g,h] ) = [e, a, f, b]
> UnZipHI16x4( [g, c, h, d], [e, a, f, b] ) = [a, b, c, d]
> UnZipLO16x4( [g, c, h, d], [e, a, f, b] ) = [e, f, g, h]
> 
> > How do these relate to Iop_Interleave{HI,LO}* and
> > Iop_Cat{Even,Odd}Lanes* ?  I wonder if there is some
> > redundancy here.
> 
> This is a difficult question. We actually don't understand clearly what their
> semanthics is. Where should we look for information about them?

Hmm, they never got properly documented.  I think it is this:

  // InterleaveLO interleaves lanes from the low halves
  // of the operand vectors, and InterleaveHI from the
  // high parts of the operand vectors:

  InterleaveLO( [a,b,c,d], [e,f,g,h] ) = [c,g,d,h]
  InterleaveHI( [a,b,c,d], [e,f,g,h] ) = [a,e,b,f]

  // CatOdd concatenates odd-numbered lanes from the
  // operand vectors; CatEven concatentates even-numbered
  // lanes from the operand vectors

  CatOddLanes ( [a,b,c,d], [e,f,g,h] ) = [a,c,e,g]
  CatEvenLanes( [a,b,c,d], [e,f,g,h] ) = [b,d,f,h]

so ZipHI( X, Y ) == InterleaveLO( Y, X )
   ZipLO( X, Y ) == InterleaveHI( Y, X )

   UnzipHI( X, Y ) == CatEven( Y, X )
   UnzipLO( X, Y ) == CatOdd( Y, X )
Comment 20 Dmitry Zhurikhin 2010-06-03 15:49:16 UTC
Created attachment 47636 [details]
NEON support patch for VEX SVN revision 1974

An updated patch.

* A lot of comments are added to libvex_ir.h
* Zip/UnZip primops are removed
* Interleave{Odd,Even}Lanes are added for VTRN instruction disassembly
  Examples:
    InterleaveOddLanes( [a,b,c,d], [e,f,g,h] ) = [b,f,d,h]
    InterleaveEvenLanes( [a,b,c,d], [e,f,g,h] ) = [a,e,c,g]
* Abd primops are removed
* Tests for NaN and INFs are added
* Fixed a bug in NaN comparison.  This results in the addition of the Iop_CmpGE32Fx2 primop (Iop_CmpGE32Fx4 already existed)
* Recps and Rsqrts cannot easily be removed because of the way they handle zero and infinite arguments
Comment 21 Dmitry Zhurikhin 2010-06-03 16:04:32 UTC
Created attachment 47637 [details]
NEON support patch for Valgrind SVN revision 11104

This patch is gzipped because it's too big for the bugzilla.
Comment 22 Julian Seward 2010-08-01 21:59:42 UTC
UInt armg_calculate_flag_idc ( UInt res1, UInt res2,
                              UInt res3, UInt res4 )
{
   UInt exp1 = (res1 >> 23) & 0xff;
   UInt exp2 = (res2 >> 23) & 0xff;
   UInt exp3 = (res3 >> 23) & 0xff;
   UInt exp4 = (res4 >> 23) & 0xff;
   if ((exp1 == 0) || (exp2 == 0) || (exp3 == 0) || (exp3 == 0))
      return 1;                                      ^^^^
   else                                              here
      return 0;
}

You mean 'exp4', right?
Comment 23 Dmitry Zhurikhin 2010-08-02 09:45:27 UTC
Yes, thanks.
Comment 24 Julian Seward 2010-08-07 12:49:18 UTC
Committed, thanks.  revs 2002, 11251, 11252.  On branches/THUMB so
it can be integrated with Thumb/Thumb2 support.
Comment 25 Julian Seward 2010-09-27 13:04:00 UTC
Closing.  NEON support is now available in the trunk.