Created attachment 39850 [details] Initial NEON support patch for VEX SVN revision 1954 Hello. This is an effort to add support for NEON (ARM Advanced SIMD extension) instruction set to Valgrind. Initial version of the patch has infrastructure for disassembling and reassembling of NEON instructions and about 35 NEON instructions implemented (but not tested extensively yet). We are planning to add full NEON instruction set incrementally. This patch is not ready for submission yet. We just want to announce that there are some work in progress in order to avoid effort duplication. The resulting patch should be rather big so before submission we will do some clean up and split it into several smaller ones. If anyone wants to help with testing or development please feel free to contact me. Any comments are welcome too.
I'm ready to help testing the patch on Google Chromium tests built for ARM (http://code.google.com/p/chromium/wiki/LinuxChromiumArm). The list of instructions currently used by base_unittests built for NEON is the following one: vabs.f64 vadd.f64 vcmpe.f32 vcmpe.f64 vcmp.f64 vcvt.f32.f64 vcvt.f32.s32 vcvt.f64.f32 vcvt.f64.s32 vcvt.f64.u32 vcvt.s32.f32 vcvt.s32.f64 vcvt.u32.f64 vdiv.f32 vdiv.f64 vldmia vldr vmla.f64 vmov vmov.f64 vmrs vmul.f32 vmul.f64 vneg.f64 vpop vpush vsqrt.f64 vstr vsub.f64 Unfortunately even the lack of vldr instruction support stops me from running the tests with NEON instructions enabled.
That's mostly floating point instructions. We are currently busy with implementing the integer part of NEON as we work under contract with Samsung and those are first priority. We plan to make a patch with a sufficient to be usable number of (integer) NEON instructions before the end of the month. Then we would probably work on floating point ones. Any help in implementing these will be highly appreciated. PS: Some of these NEON instructions you listed (including vldr) are VFP instructions. They should be handled by trunk Valgrind though they may not know about d16..d31 registers.
In fact they are not implemented yet, except for vmov (I've patched V to recognize d16..d31). Thanks anyway, I'll be looking forward for your patches. I can also try to contribute once I have some time to experiment with the floating point instructions.
Created attachment 40531 [details] An example program with instructions mentioned in comment #1 Actually all these instructions are VFP and are supported by Valgrind. I created a small example which includes all these instructions. Today's Valgrind/VEX trunk successfully processed this example after I commented out two "vassert(0); //ATC" (at lines 3114 and 3240 in guest_arm_toIR.c). You probably got confused by the fact that older (pre-UAL) syntax is used in Valgrind. So vadd.f64 is called faddd, vcvt.f32.f64 is called fcvtsd and so on. Looks like the only instructions from your list which may be Neon instructions (or may be VFP like in my example; it depends on arguments) are: vcvt.s32.f32 vcvt.f32.s32 vmul.f32 vmov vmov.f64 vmrs
Thanks for the clarification! I'll try to test those already implemented instructions then.
Created attachment 41281 [details] Integer NEON support patch for VEX SVN revision 1963 Hello again. This is an updated patch which adds ARM NEON support to Valgrind. Important changes from the previous patch include: * All integer NEON instructions are implemented and somewhat tested * HWCAPS are introduced for ARM (they are needed during instruction selection) * Tests for NEON instructions disassemble/reassemble added (but not included in 'make regtest') Things to note: * No floating point NEON support currently present * HWCAPS detection may be inaccurate
Created attachment 41282 [details] Integer NEON support patch for Valgrind SVN revision 11057
Created attachment 42197 [details] NEON support patch for VEX SVN revision 1964 Hello. Here's an update to the patch. We've implemented all floating-point NEON instructions. This work concludes NEON support for Valgrind. Please review the patches and comment what things should be done for our work to be included into the trunk. One specific question is whether we need to rewrite 'switch' statements from disassembling code into large number of 'if' statements (as VFP and scalar disassembling is written now; this will increase the code size enormously but may be more convenient in supporting and adding new features). Known issues: - Flags IDC, IXC, UFC, OFC, DZC, IOC are not set. We do not think this will impose any problems any time soon because: a) they are not set in the VFP code too actually and still there are no complains in the bugzilla; b) we did not manage to find any program reading these flags using google. - DIP macro is not called for NEON instructions. - VCVT floating-point <-> fixed-point was not tested at all because we do not have a hardware with the corresponding extension. - Neon tests should be compiled with -O0 because GCC tends to miscompile them. See http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43440 for details. - You may need a FCONSTS instruction support to run these tests.
Created attachment 42198 [details] NEON support patch for Valgrind SVN revision 11094
Created attachment 42199 [details] FCONSTS VFP instruction support
I gave it a try. Although it passes its own regression tests, I got an assertion failure when running real code. Any ideas? vex: priv/host_arm_isel.c:1982 (iselNeon64Expr_wrk): Assertion `ty == Ity_I64' failed. This is with the patches on comments 8 and 9 but not the one on comment 10 ("FCONSTS VFP instruction support"). Is that also necessary? this was running with --tool=none. It was somewhere in this block: 0x5FFDF30: stmdb r13!, {0x4070} ------ IMark(0x5FFDF30, 4) ------ t0 = GET:I32(52) t1 = t0 PUT(52) = Sub32(t0,0x10:I32) STle(Sub32(t1,0x4:I32)) = GET:I32(56) STle(Sub32(t1,0x8:I32)) = GET:I32(24) STle(Sub32(t1,0xC:I32)) = GET:I32(20) STle(Sub32(t1,0x10:I32)) = GET:I32(16) 0x5FFDF34: fstmdbd r13!, {d8-d8} ------ IMark(0x5FFDF34, 4) ------ PUT(60) = 0x5FFDF34:I32 t2 = GET:I32(52) t3 = Sub32(t2,0x8:I32) t4 = t3 PUT(52) = t3 STle(Add32(t4,0x0:I32)) = GET:F64(168) 0x5FFDF38: mov r5, r0 ------ IMark(0x5FFDF38, 4) ------ PUT(60) = 0x5FFDF38:I32 t6 = GET:I32(0) t5 = t6 t7 = t5 PUT(20) = t7 0x5FFDF3C: vmov d8, r2, r3 ------ IMark(0x5FFDF3C, 4) ------ PUT(60) = 0x5FFDF3C:I32 PUT(168) = ReinterpI64asF64(32HLto64(GET:I32(12),GET:I32(8))) 0x5FFDF40: ldr r4, [r15, #+184] ------ IMark(0x5FFDF40, 4) ------ PUT(60) = 0x5FFDF40:I32 t8 = Add32(0x5FFDF48:I32,0xB8:I32) t9 = 0x5FFDF48:I32 PUT(16) = LDle:I32(t8) 0x5FFDF44: bl 0x5FFDDF8 ------ IMark(0x5FFDF44, 4) ------ PUT(60) = 0x5FFDF44:I32 PUT(56) = 0x5FFDF48:I32 0x5FFDDF8: stmdb r13!, {0x40F0} ------ IMark(0x5FFDDF8, 4) ------ PUT(60) = 0x5FFDDF8:I32 t10 = GET:I32(52) t11 = t10 PUT(52) = Sub32(t10,0x14:I32) STle(Sub32(t11,0x4:I32)) = GET:I32(56) STle(Sub32(t11,0x8:I32)) = GET:I32(28) STle(Sub32(t11,0xC:I32)) = GET:I32(24) STle(Sub32(t11,0x10:I32)) = GET:I32(20) STle(Sub32(t11,0x14:I32)) = GET:I32(16) 0x5FFDDFC: ldr r4, [r15, #+136] ------ IMark(0x5FFDDFC, 4) ------ PUT(60) = 0x5FFDDFC:I32 t12 = Add32(0x5FFDE04:I32,0x88:I32) t13 = 0x5FFDE04:I32 PUT(16) = LDle:I32(t12) 0x5FFDE00: ldr r5, [r15, #+136] ------ IMark(0x5FFDE00, 4) ------ PUT(60) = 0x5FFDE00:I32 t14 = Add32(0x5FFDE08:I32,0x88:I32) t15 = 0x5FFDE08:I32 PUT(20) = LDle:I32(t14) 0x5FFDE04: sub r13, r13, #0x14 ------ IMark(0x5FFDE04, 4) ------ PUT(60) = 0x5FFDE04:I32 t16 = GET:I32(52) t17 = 0x14:I32 t18 = Sub32(t16,t17) PUT(52) = t18 0x5FFDE08: add r4, r15, r4 ------ IMark(0x5FFDE08, 4) ------ PUT(60) = 0x5FFDE08:I32 t19 = 0x5FFDE10:I32 t21 = GET:I32(16) t20 = t21 t22 = Add32(t19,t20) PUT(16) = t22 0x5FFDE0C: add r7, r4, r5 ------ IMark(0x5FFDE0C, 4) ------ PUT(60) = 0x5FFDE0C:I32 t23 = GET:I32(16) t25 = GET:I32(20) t24 = t25 t26 = Add32(t23,t24) PUT(28) = t26 0x5FFDE10: ldr r6, [r7, #+4] ------ IMark(0x5FFDE10, 4) ------ PUT(60) = 0x5FFDE10:I32 t27 = Add32(GET:I32(28),0x4:I32) t28 = GET:I32(28) PUT(24) = LDle:I32(t27) 0x5FFDE14: cmp r6, #0x0 ------ IMark(0x5FFDE14, 4) ------ PUT(60) = 0x5FFDE14:I32 t29 = GET:I32(24) t30 = 0x0:I32 t31 = 0x0:I32 PUT(64) = 0x2:I32 PUT(68) = t29 PUT(72) = t30 PUT(76) = t31 0x5FFDE18: b{eq} 0x5FFDE2C ------ IMark(0x5FFDE18, 4) ------ PUT(60) = 0x5FFDE18:I32 t32 = armg_calculate_condition[mcx=0x9]{0x38135ac4}(Or32(GET:I32(64),0x0:I32),GET:I32(68),GET:I32(72),GET:I32(76)):I32 if (32to1(t32)) goto {Boring} 0x5FFDE2C:I32 goto {Boring} 0x5FFDE1C:I32
Created attachment 42446 [details] Fix for the bug from the previous comment Ouch. This is a "caused-by-lack-of-attention" bug. Please use this patch.
Nice work, first of all. With the fix in comment 12, it seems to work ok at least from my very limited testing. This is a huge patch pair and it will take some time to review/check it all. Here are some comments about the VEX part. I have a number of small comments and two larger concerns. The small comments first: ---------- +/* CALLED FROM GENERATED CODE: CLEAN HELPER */ +/* Calculate the QC flag from the thunk components, in the lowest bit + of the word (bit 0). */ +UInt armg_calculate_flag_qc ( UInt resL1, UInt resL2, + UInt resR1, UInt resR2 ) +{ + if (resL1 != resR1 || resL2 != resR2) + return 1; + else + return 0; +} Wouldn't it be faster to compute this in-line? The call/return overhead is large compared to the amount of useful work. Also, inlining might allow ir_opt to improve it if any of the args are constants. 1Uto32( CmpNE32( Or32(Xor32(resL1, resR1), Xor32(resL2, resR2)), mkU32(0) ) ) ---------- +static inline +UInt dis_neon_reg_d(UInt theInstr) Please use "dis_" only for functions which generate IR. This function and its friends produce a register number AFAICS. A better name would be get_neon_d_regno. ---------- No panicing/asserting on unrecognised instructions: + if(C == 3) { + vpanic("dis_neon_data_3same: VHSUB encountered undefined instruction\n"); + } In this and other places: we should never panic or assert when encountering an instruction that is not recognised. Instead control must flow back to disArmInstr_WRK so that the standard "generate a SIGILL" piece of IR can be generated. ---------- NO TABS Please make sure there aren't any tab characters in the patch. I long ago told my emacs to only insert spaces, not tabs. ---------- Excessively verbose switches. In many places in both the front and back ends, you have simple switch statements which take up many lines of code. Please compact these where it is possible to do so without exceeding an 80-char line length. eg + switch (size) { + case 0: + op = Iop_Max8Sx8; + break; + case 1: + op = Iop_Max16Sx4; + break; + case 2: + op = Iop_Max32Sx2; + break; + case 3: + return False; + default: + vassert(0); + } becomes + switch (size) { + case 0: op = Iop_Max8Sx8; break; + case 1: op = Iop_Max16Sx4; break; + case 2: op = Iop_Max32Sx2; break; + case 3: return False; + default: vassert(0); + } ----------- Space after if's: "if(" is ugly; use "if (" + if(U == 1) { + if (U == 1) { ------------ If you have #if 0'd one of these "vassert(0); //ATC" assertions, just delete it. ATC means Awaiting Test Case, so I assume you found such a test case for each one you #if 0'd out. +#if 0 vassert(0); //ATC +#endif ------------ In comment 8 you write: > - DIP macro is not called for NEON instructions. Please ensure the front end can print _all_ NEON instructions that it can translate into IR. It is impossible to maintain/debug the front ends without proper printing. ------------ +extern ARMAModeN* mkARMAModeN_RR ( HReg, HReg ); +extern ARMAModeN* mkARMAModeN_R ( HReg ); Is the _RR form actually used? I thought the only allowable addressing mode form Neon was _R. ------------ + ARMin_NShift, + ARMin_NeonImm, + /* This is not a NEON instruction. Actually there is no corresponding + instruction in ARM instruction set at all. We need this one to + generate spill/reload of 128-bit registers since current register + allocator demands them to consist of no more than two instructions. + We will split this instruction into 2 or 3 ARM instructions on the + emiting phase. + + NOTE: source and destination registers should be different! */ I did not understand this comment. What does spilling have to do with generating Neon immediates? And why do Neon spills require > 2 instructions? ------------ + typeOfIRExpr(env->type_env, e->Iex.Binop.arg2) != Ity_I8) { + vassert("ARM supports Iop_VDup with constant " + "second argument less than 16 only\n"); + } + imm4 = e->Iex.Binop.arg2->Iex.Const.con->Ico.U8; + if (imm4 >= 16) { + vassert("ARM supports Iop_VDup with constant " + "second argument less than 16 only\n"); + } that's not right. You mean to call vpanic, not vassert, yes? ------------ Larger concerns: (0) Regressions. Did you check that the resulting Valgrind still builds and runs on 32- and 64-bit x86 linux, and that the regression test results on those platforms are unchanged? (1) QC flag handling. You've added support for QC -- is that a sticky saturation flag? How does this interact, if at all, with the existing NZCV flag handling? Where in the guest state is QC stored? (2) New primops. This is my #1 concern. A large number of new IROps have been added. Normally I try to avoid adding primops unless it's really hard to avoid, because (a) adding primops impacts all tools built on Valgrind, and (b) it's impossible to get rid of them later. Are you sure all of them are really necessary? I wonder if some of them don't duplicate existing primops, or can be synthesised using a few existing primops. I will study the added primops and see if it is possible to reduce the number of new ones.
Thanks alot for these comments! Regarding the minor issues we'll correct (and already did several of) them and will post an updated patch as soon as. To clarify, the comment after "ARMin_NeonImm," goes to the next instruction "ARMin_Add32", not to "ARMin_NeonImm" itself. Two or more instructions are needed because ARM immediate length is sometimes not enough to address NEON registers in guest state. So one instruction for load/store, one for calculating offset (r12 = r8 + immediate) and one more if immediate can't be represented in 8-4 form. Now to the bigger concerns. (0) We tested that Valgrind runs on x86 and amd64 but didn't run a testsuite. We'll surely do this but I think there shouldn't be any problems. (1) Yes, QC is a saturation flag. It is stored along with NCVZ in FPSCR register and read and written via special instructions (VMRS and VRMS). (2) Actually we tried to not introduce unneeded new IRops. But unfortunately there are many very specfic NEON instructions, e.g. most (if not all) saturating instructions can't be represented with more simple ones. Besides them AFAIK we mostly added IRops for instructions of absent sizes or types. We'll get over new IRops too to try and reduce their number.
Created attachment 42893 [details] NEON support patch for VEX SVN revision 1974 Here's one more update for correcting most uncovered issues. > Wouldn't it be faster to compute this in-line? For some unknown reasons it is not. We've implemented inline computation and measured the execution time of 10^9 VQADD instructions: helper call was faster. Anyway right now both cases are present in the code and one of them is chosen by #if directive. DIP is now implemented for all NEON instructions. > Is the _RR form actually used? I thought the only allowable > addressing mode form Neon was _R. _RR form exists. It is not "[rN + rM]" but "[rN], rM" which means "read from address rN and increment rN by rM". About new IROp's: After some reviewing we managed to get rid of Iop_PolynomialMul16x4, Iop_PolynomialMul32x2, Iop_PolynomialMul16x8, Iop_PolynomialMul32x4, Iop_SetElem8x16, Iop_SetElem16x8, Iop_SetElem32x4, Iop_SetElem64x2, Iop_VDup<all sizes> What is left: conversion, saturating operations, pairwise operations, some data-shuffling operations and missing sizes for existing operations. They are hardly representable with few existing ones. Actually we are thinking about adding new IROps for VRHADD, VHADD, VRHSUB, VHSUB: they are now built from existing primops but this gives serious impact on performance. Also cleanup and styling-fixing were done (thanks again for comments). We slightly tested this patch also on x86 and amd64 and it showed no problems. NB: This patch contains some code that should be removed before comminting (such as FCONSTS VFP instruction patch, debugging code and some fixes that allow building with the newest CodeSourcery 2009q3 compiler). Once approved we will clean the NEON patch and make additional separate patches.
Created attachment 42894 [details] NEON support patch for Valgrind SVN revision 11104
Thanks for the revised patch. I'm still unhappy about the large number of new IROps, but I see they are mostly unavoidable. Could you provide better documentation in comments about what the new primops do? In many cases I can't even guess what the arity, type or behaviour is. It would be good to give an example (eg) like this for all the new primops, or at least for each group: Iop_Foo32x2( [a,b], [c,d] ) = [a+d, c-b] Also, there are "P" and "D" in many of the primop names. What do they mean? + /* Convertion to/from int */ + Iop_I32UtoFx2, Iop_I32StoFx2, /* I32x4 -> F32x4 */ Doesn't take a rounding mode? What's the behaviour in the case where the value needs to be rounded? + Iop_F32ToFixed32Ux2, Iop_F32ToFixed32Sx2, /* fp -> fixed-point */ + Iop_Fixed32UToF32x2, Iop_Fixed32SToF32x2, /* fixed-point -> fp */ What is this Fixed32 format? + Iop_PMax32Fx2, Iop_PMin32Fx2, What does the P mean here? + Iop_Abd32Fx2, Abd ? + /* ADDITION (pairwise) */ + /* The first element of the resulting vector is a sum of the first and + the second elements of the left operand, second - a sum of the third + and the forth elements of the left operand etc. */ + Iop_PAdd8x8, Iop_PAdd16x4, Iop_PAdd32x2, I don't understand. If a,b,c,d are all 32-bit int values, what does (eg) Iop_PAdd32x2( [a,b], [c,d] ) produce? + /* Longening variant is unary. The resulting vector contains two times + less elements than operand, but they are two times wider. */ + Iop_PAddL8Ux8, Iop_PAddL16Ux4, Iop_PAddL32Ux2, + Iop_PAddL8Sx8, Iop_PAddL16Sx4, Iop_PAddL32Sx2, again, what does (eg) Iop_PAddL32Ux2( [a,b] ) produce? + /* ABSOLUTE VALUE */ + Iop_Abs8x8, Iop_Abs16x4, Iop_Abs32x2, what is the result when the argument value is the lowest negative value (-0x80, -0x8000, -0x80000000) ? + Iop_QDMulHi16Sx4, Iop_QDMulHi32Sx2, + Iop_QRDMulHi16Sx4, Iop_QRDMulHi32Sx2, what are these? + /* MIN/MAX (pairwise) */ + Iop_PMax8Sx8, Iop_PMax16Sx4, Iop_PMax32Sx2, + Iop_PMax8Ux8, Iop_PMax16Ux4, Iop_PMax32Ux2, + Iop_PMin8Sx8, Iop_PMin16Sx4, Iop_PMin32Sx2, + Iop_PMin8Ux8, Iop_PMin16Ux4, Iop_PMin32Ux2, what are these? + /* COUNT ones / leading zeroes / leading sign bits (not including topmost + bit) */ + Iop_Cnt8x8, + Iop_Clz8Sx8, Iop_Clz16Sx4, Iop_Clz32Sx2, + Iop_Cls8Sx8, Iop_Cls16Sx4, Iop_Cls32Sx2, if Iop_Cnt8x8 counts leading ones, would't Iop_Clo8x8 be a more consistent name? + /* GET / SET elements of VECTOR + GET is binop (I64, I8) -> I<elem_size> + SET is triop (I64, I8, I<elem_size>) -> I64 */ + Iop_GetElem8x8, Iop_GetElem16x4, Iop_GetElem32x2, + Iop_SetElem8x8, Iop_SetElem16x4, Iop_SetElem32x2, what happens if the I8 is not known at JIT time, only at run time? Can the back end handle this case? + /* ZIP / UNZIP + unzipped zipped + ( A B C D ) <--> ( G C H D ) + ( E F G H ) ( E A F B ) */ + Iop_ZipHI8x8, Iop_ZipHI16x4, Iop_ZipHI32x2, + Iop_ZipLO8x8, Iop_ZipLO16x4, Iop_ZipLO32x2, + Iop_UnZipHI8x8, Iop_UnZipHI16x4, Iop_UnZipHI32x2, + Iop_UnZipLO8x8, Iop_UnZipLO16x4, Iop_UnZipLO32x2, I didn't understand the comment. What does (eg) ZipHI16x4( [a,b,c,d], [e,f,g,h] ) produce? How do these relate to Iop_Interleave{HI,LO}* and Iop_Cat{Even,Odd}Lanes* ? I wonder if there is some redundancy here. + /* EXTRACT -- copy 8-N highest bytes from argL to 8-N lowest bytes + of result and N lowest bytes of argR to N highest bytes of result. */ + Iop_Extract64, What's N ? I can't even guess the arity of this primop from the description. (1-arg? 2-arg?) + /* REVERSE in Half-words, Words, Double-words */ + Iop_Reverse16_8x8, + Iop_Reverse32_8x8, Iop_Reverse32_16x4, + Iop_Reverse64_8x8, Iop_Reverse64_16x4, Iop_Reverse64_32x2, what are these? + /* Vector Reciprocal Estimate finds an approximate reciprocal of each + element in the operand vector, and places the results in the destination + vector. */ + Iop_Recip32Fx2, Iop_Recip32x2, + Iop_Recps32Fx2, + + Iop_Rsqrts32Fx2, + Iop_Rsqrte32Fx2, Iop_Rsqrte32x2, What's the difference between Iop_Recip32Fx2 and Iop_Recps32Fx2 ? What's the difference between Iop_Rsqrts32Fx2 and Iop_Rsqrte32Fx2 ? How many good bits are there in the estimate (see Iop_Est5RFSqrt) ? What happens if the argument is zero? + /* Vector Absolute */ + Iop_Abd32Fx4, + Iop_Abs32Fx4, arity? type? example? + /* WIDENING */ + /* Longening - opposite to shortening, I64->V128 */ + Iop_Longen8Ux8, Iop_Longen16Ux4, Iop_Longen32Ux2, + Iop_Longen8Sx8, Iop_Longen16Sx4, Iop_Longen32Sx2, please, give an example
> Thanks for the revised patch. I'm still unhappy about the large > number of new IROps, but I see they are mostly unavoidable. > > Could you provide better documentation in comments about what the new > primops do? In many cases I can't even guess what the arity, type or > behaviour is. It would be good to give an example (eg) like this for > all the new primops, or at least for each group: > > Iop_Foo32x2( [a,b], [c,d] ) = [a+d, c-b] > > Also, there are "P" and "D" in many of the primop names. What > do they mean? > + /* Convertion to/from int */ > + Iop_I32UtoFx2, Iop_I32StoFx2, /* I32x4 -> F32x4 */ > Doesn't take a rounding mode? What's the behaviour in the case > where the value needs to be rounded? > > > + Iop_F32ToFixed32Ux2, Iop_F32ToFixed32Sx2, /* fp -> fixed-point */ > + Iop_Fixed32UToF32x2, Iop_Fixed32SToF32x2, /* fixed-point -> fp */ > What is this Fixed32 format? Fixed32 format is floating-point number with fixed number of fraction bits. The number of fraction bits is passed as a second argument. The floating-point to fixed-point operation uses the Round towards Zero rounding mode. The fixed-point to floating-point operation uses the Round to Nearest rounding mode. So no Rounding-Mode arguments are passed to this conversion operations. If it is not correct and Rounding-Mode should be always passed to conversion operations and asserted to be of a certain value later on the instruction selection phase, we can change these IROp's accordingly. Iop_I32UtoFx2 and Iop_I32StoFx2 use round to nearest rounding mode. We can rename them by adding _RN and _RZ trailers like Iop_FtoI32Ux4_RZ if needed. > + Iop_PMax32Fx2, Iop_PMin32Fx2, > What does the P mean here? > > + /* MIN/MAX (pairwise) */ > + Iop_PMax8Sx8, Iop_PMax16Sx4, Iop_PMax32Sx2, > + Iop_PMax8Ux8, Iop_PMax16Ux4, Iop_PMax32Ux2, > + Iop_PMin8Sx8, Iop_PMin16Sx4, Iop_PMin32Sx2, > + Iop_PMin8Ux8, Iop_PMin16Ux4, Iop_PMin32Ux2, > what are these? > > + /* ADDITION (pairwise) */ > + /* The first element of the resulting vector is a sum of the first and > + the second elements of the left operand, second - a sum of the third > + and the forth elements of the left operand etc. */ > + Iop_PAdd8x8, Iop_PAdd16x4, Iop_PAdd32x2, > I don't understand. If a,b,c,d are all 32-bit int values, > what does (eg) > Iop_PAdd32x2( [a,b], [c,d] ) > produce? P means pairwise. Iop_PFoo32x4([a,b,c,d],[e,f,g,h]) = [Foo32(a,b), Foo32(c,d), Foo32(e,f), Foo32(g,h)] > > > + /* Longening variant is unary. The resulting vector contains two times > + less elements than operand, but they are two times wider. */ > + Iop_PAddL8Ux8, Iop_PAddL16Ux4, Iop_PAddL32Ux2, > + Iop_PAddL8Sx8, Iop_PAddL16Sx4, Iop_PAddL32Sx2, > again, what does (eg) > Iop_PAddL32Ux2( [a,b] ) > produce? Imagine the argument vector consists of 4 16-bit values a, b, c, d. The resulting vector then consists of 2 32-bit values a+b and c+d. Iop_PAddL16Ux4( [a,b,c,d] ) = [a+b,c+d] > + /* ABSOLUTE VALUE */ > + Iop_Abs8x8, Iop_Abs16x4, Iop_Abs32x2, > what is the result when the argument value is the lowest negative > value (-0x80, -0x8000, -0x80000000) ? The result will be equal to argument in this case. It is always guaranteed that the result is absolute value of argument if treated as an unsigned integer. > + Iop_QDMulHi16Sx4, Iop_QDMulHi32Sx2, > + Iop_QRDMulHi16Sx4, Iop_QRDMulHi32Sx2, > what are these? They are Vector Saturating Doubling Multiply Returning High Half and Vector Saturating Rounding Doubling Multiply Returning High Half. They multiply corresponding elements in two vectors, double the results, and place the most significant half of the final results in the destination vector. The results are truncated or rounded. If any of the results overflow, they are saturated. Example: Iop_QDMulHi16Sx4([a,b,c,d],[e,f,g,h]) = [SignedSat((a * e * 2) >> 16, 16), SignedSat((b * f * 2) >> 16, 16), SignedSat((c * g * 2) >> 16, 16), SignedSat((d * h * 2) >> 16, 16)] Iop_QRDMulHi16Sx4([a,b,c,d],[e,f,g,h]) = [SignedSat((a * e * 2 + (1 << 15)) >> 16, 16), SignedSat((b * f * 2 + (1 << 15)) >> 16, 16), SignedSat((c * g * 2 + (1 << 15)) >> 16, 16), SignedSat((d * h * 2 + (1 << 15)) >> 16, 16)] > + /* COUNT ones / leading zeroes / leading sign bits (not including > topmost > + bit) */ > + Iop_Cnt8x8, > + Iop_Clz8Sx8, Iop_Clz16Sx4, Iop_Clz32Sx2, > + Iop_Cls8Sx8, Iop_Cls16Sx4, Iop_Cls32Sx2, > if Iop_Cnt8x8 counts leading ones, would't Iop_Clo8x8 be a more > consistent name? Iop_Cnt8x8 counts all ones in each byte of input vector, not only leading. > + /* GET / SET elements of VECTOR > + GET is binop (I64, I8) -> I<elem_size> > + SET is triop (I64, I8, I<elem_size>) -> I64 */ > + Iop_GetElem8x8, Iop_GetElem16x4, Iop_GetElem32x2, > + Iop_SetElem8x8, Iop_SetElem16x4, Iop_SetElem32x2, > what happens if the I8 is not known at JIT time, only at > run time? Can the back end handle this case? On ARM I8 is hardcoded into the instruction. The back end can not handle the case when I8 is known only at run time. There is an assert with proper error message in the back end. > + /* ZIP / UNZIP > + unzipped zipped > + ( A B C D )<--> ( G C H D ) > + ( E F G H ) ( E A F B ) */ > + Iop_ZipHI8x8, Iop_ZipHI16x4, Iop_ZipHI32x2, > + Iop_ZipLO8x8, Iop_ZipLO16x4, Iop_ZipLO32x2, > + Iop_UnZipHI8x8, Iop_UnZipHI16x4, Iop_UnZipHI32x2, > + Iop_UnZipLO8x8, Iop_UnZipLO16x4, Iop_UnZipLO32x2, > I didn't understand the comment. What does (eg) > ZipHI16x4( [a,b,c,d], [e,f,g,h] ) > produce? ZipHI16x4( [a,b,c,d], [e,f,g,h] ) = [g, c, h, d] ZipLO16x4( [a,b,c,d], [e,f,g,h] ) = [e, a, f, b] UnZipHI16x4( [g, c, h, d], [e, a, f, b] ) = [a, b, c, d] UnZipLO16x4( [g, c, h, d], [e, a, f, b] ) = [e, f, g, h] > How do these relate to Iop_Interleave{HI,LO}* and > Iop_Cat{Even,Odd}Lanes* ? I wonder if there is some > redundancy here. This is a difficult question. We actually don't understand clearly what their semanthics is. Where should we look for information about them? > + /* EXTRACT -- copy 8-N highest bytes from argL to 8-N lowest bytes > + of result and N lowest bytes of argR to N highest bytes of result. */ > + Iop_Extract64, > What's N ? I can't even guess the arity of this primop from > the description. (1-arg? 2-arg?) EXTRACT -- copy 8-arg3 highest bytes from arg1 to 8-arg3 lowest bytes of result and arg3 lowest bytes of arg2 to arg3 highest bytes of result. (I64, I64, I8) -> I64 The ARM back end handles only constant arg3 in this operation. > + /* REVERSE in Half-words, Words, Double-words */ > + Iop_Reverse16_8x8, > + Iop_Reverse32_8x8, Iop_Reverse32_16x4, > + Iop_Reverse64_8x8, Iop_Reverse64_16x4, Iop_Reverse64_32x2, > what are these? REVERSE the order of elements in each Half-word, Word or Double-word. Reverse16_8x8([a,b,c,d,e,f,g,h]) = [b,a,d,c,f,e,h,g] Reverse32_8x8([a,b,c,d,e,f,g,h]) = [d,c,b,a,h,g,f,e] Reverse64_8x8([a,b,c,d,e,f,g,h]) = [h,g,f,e,d,c,b,a] > + /* Vector Reciprocal Estimate finds an approximate reciprocal of each > + element in the operand vector, and places the results in the destination > + vector. */ > + Iop_Recip32Fx2, Iop_Recip32x2, > + Iop_Recps32Fx2, > + > + Iop_Rsqrts32Fx2, > + Iop_Rsqrte32Fx2, Iop_Rsqrte32x2, > What's the difference between Iop_Recip32Fx2 and Iop_Recps32Fx2 ? > What's the difference between Iop_Rsqrts32Fx2 and Iop_Rsqrte32Fx2 ? > How many good bits are there in the estimate (see Iop_Est5RFSqrt) ? > What happens if the argument is zero? Iop_Recip finds an reciprocal estimate of each element in the operand vector. If the result of Iop_Recip(d) is used as x[0] of the following recurrent equation x[n+1] = x[n](2 - d*x[n]) the sequence x[n] converges to 1/d. If the argument is zero the result is infinity. Iop_Recps represents one iteration of the equation above. It computes (2.0 - arg1 * arg2). Iop_Rsqrte finds an reciprocal of square root estimate of each element in the operand vector. If the result of Iop_Rsqrte(d) is used as x[0] of the following recurrent equation x[n+1] = x[n] * (3.0 - d*x[n]*x[n]) / 2 the sequence x[n] converges to 1/sqrt(d). Iop_Rsqrts represents one iteration of the equation above. It computes (3.0 - arg1 * arg2) / 2.0 We have no clue how many good bits are out there - it will need additional study. We are going to replace Recps and Rsqrts with explicit computations of their formulas. > + /* Vector Absolute */ > + Iop_Abd32Fx4, > + Iop_Abs32Fx4, > arity? type? example? > > > + Iop_Abd32Fx2, > Abd ? > Abd meens aboslute difference. So abs is an unary operation Abs32Fx4( [a, b, c, d] ) = [abs(a), abs(b), abs(c), abs(d) ] and Abd is binary Abd32Fx4( [a, b, c, d], [e, f, g, h] ) = [ abs(a - e), abs(b - f), abs(c - g), abs(d - h) ] Abd is probably redundant and can be expressed via Abs and Sub but different extreme values (NaN, ± inf and so on) need to be checked. > + /* WIDENING */ > + /* Longening - opposite to shortening, I64->V128 */ > + Iop_Longen8Ux8, Iop_Longen16Ux4, Iop_Longen32Ux2, > + Iop_Longen8Sx8, Iop_Longen16Sx4, Iop_Longen32Sx2, > please, give an example Longening --- sign or zero extended each element of the argument vector to the twice original size. The resulting vector consists of the same number of elements but each element and the vector itself are two times wider. Iop_Longen32Sx2( [a, b] ) = [c, d] where c = Iop_32Sto64(a) and d = Iop_32Sto64(b)
> ZipHI16x4( [a,b,c,d], [e,f,g,h] ) = [g, c, h, d] > ZipLO16x4( [a,b,c,d], [e,f,g,h] ) = [e, a, f, b] > UnZipHI16x4( [g, c, h, d], [e, a, f, b] ) = [a, b, c, d] > UnZipLO16x4( [g, c, h, d], [e, a, f, b] ) = [e, f, g, h] > > > How do these relate to Iop_Interleave{HI,LO}* and > > Iop_Cat{Even,Odd}Lanes* ? I wonder if there is some > > redundancy here. > > This is a difficult question. We actually don't understand clearly what their > semanthics is. Where should we look for information about them? Hmm, they never got properly documented. I think it is this: // InterleaveLO interleaves lanes from the low halves // of the operand vectors, and InterleaveHI from the // high parts of the operand vectors: InterleaveLO( [a,b,c,d], [e,f,g,h] ) = [c,g,d,h] InterleaveHI( [a,b,c,d], [e,f,g,h] ) = [a,e,b,f] // CatOdd concatenates odd-numbered lanes from the // operand vectors; CatEven concatentates even-numbered // lanes from the operand vectors CatOddLanes ( [a,b,c,d], [e,f,g,h] ) = [a,c,e,g] CatEvenLanes( [a,b,c,d], [e,f,g,h] ) = [b,d,f,h] so ZipHI( X, Y ) == InterleaveLO( Y, X ) ZipLO( X, Y ) == InterleaveHI( Y, X ) UnzipHI( X, Y ) == CatEven( Y, X ) UnzipLO( X, Y ) == CaddOdd( Y, X )
Created attachment 47636 [details] NEON support patch for VEX SVN revision 1974 An updated patch. * A lot of comments are added to libvex_ir.h * Zip/UnZip primops are removed * Interleave{Odd,Even}Lanes are added for VTRN instruction disassemble Examples: InterleaveOddLanes( [a,b,c,d], [e,f,g,h] ) = [b,f,d,h] InterleaveEvenLanes( [a,b,c,d], [e,f,g,h] ) = [a,e,c,g] * Abd primops are removed * Tests for NaN and INFs are added * Fixed bug in NaN comparison. This results in addition of Iop_CmpGE32Fx2 primop (Iop_CmpGE32Fx4 has already existed) * Recps and Rsqrts can not be easily removed because of the way they handle zero and infinite arguments
Created attachment 47637 [details] NEON support patch for Valgrind SVN revision 11104 This patch is gzip'ped because it's too big for the bugzilla.
UInt armg_calculate_flag_idc ( UInt res1, UInt res2, UInt res3, UInt res4 ) { UInt exp1 = (res1 >> 23) & 0xff; UInt exp2 = (res2 >> 23) & 0xff; UInt exp3 = (res3 >> 23) & 0xff; UInt exp4 = (res4 >> 23) & 0xff; if ((exp1 == 0) || (exp2 == 0) || (exp3 == 0) || (exp3 == 0)) return 1; ^^^^ else here return 0; } You mean 'exp4', right?
Yes, thanks.
Committed, thanks. revs 2002, 11251, 11252. On branches/THUMB so it can be integrated with Thumb/Thumb2 support.
Closing. NEON support is now available in the trunk.