The vperm instruction uses a 4-bit mask to extract the index value for the permute instruction. It should be a 5-bit mask. The ISA says the index value is Let index be the value specified by bits 3:7 of byte element i of VR[VRC]. This patch fixes the bug.
Created attachment 108147 [details] vperm instuction, fix index mask Add patch to fix issue
I looked at the IR generation for vperm (case 0x2B: { // vperm (Permute, AV p218), etc) and it looks correct to me. Which test case is failing and for which target? (32-be, 64-be, 64-le) ? The code already does use 5 bits to select. It uses 4 bits here: assign( vC_andF, binop(Iop_AndV128, mkexpr(vC), unop(Iop_Dup8x16, mkU8(0xF))) ); and one from here: // mask[i8] = (vC[i8]_4 == 1) ? 0xFF : 0x0 assign( mask, binop(Iop_SarN8x16, binop(Iop_ShlN8x16, mkexpr(vC), mkU8(3)), mkU8(7)) ); by computing, for each 8-bit lane, (lane << 3) >>signed 7. This copies bit[3] in the IBM encoding into all 8 bits of the lane, and hence makes it into a usable mask. I am wondering now if the problem is one of endianness. The IR definition of Iop_Perm8x16 only allows values 0 .. 15 in the selector fields, which the patch ends up violating. Given the above I suspect there's some other way to fix this.
I will go back and look at this again. The issue was actually seen in the re-working of the vpermr, xxperm, xxpermr. The code from the vperm was used to implement the other instructions with the needed tweaks to do the reverse order and fetch from the other register file. This patch then applied the same change to the vperm instruction.
The issue is with the vpermr and xxpermr instructions that were reimplemented based on the vperm instruction. In those implementations, the mask needed to be 0x1F instead of 0xF and the mask assignment uses perm_val or vC_andF depending on the instruction rather then the vector containing the indexes. The vpermr and xxpermr instructions do the reverse permutation, specifically the permute value is 31 - index value for each vector entry. The subtraction causes the bits to get flipped. The real issue is that the implementation was doing the masking each of the indexes in the vector then subtracting them from 31. The indexes should be subtracted from 31 first, then the result is masked with 0xF. Once the order of operations is corrected, then the mask assignment is done against the vector-31. So, the comments and the implementation of the vpermr, xxperm and xxpermr instructions need some cleaning up to make the mask 0xF not 0x1F, etc. The patch to fix the vperm instruction need to be replaced with a clean up patch.
Created attachment 108173 [details] fix mask for vpermr, xxperm, xxpermr instruction Replace the vperm instruction patch with a clean up patch for the vpermr, xxperm, xxpermr instructions so their implementation is correct and "mirror" the implementation of the vperm instruction.
Random observation, w.r.t new patch: Instead of this binop( Iop_64HLtoV128, mkU64( 0x1F1F1F1F1F1F1F1F ), mkU64( 0x1F1F1F1F1F1F1F1F ) ) you can just do this - unop( Iop_Dup8x16, mkU8( 0x1F ) ) as is actually done only a few lines further up in the patch.
Created attachment 108190 [details] fix mask for vpermr, xxperm, xxpermr instruction patch cleanup suggested by Julian
patch committed. commit 856d45eb7e3661a61ace32be2cfa10bf198620c8