Bug 383010 (valgrind-avx512)

Summary: Add support for AVX-512 instructions
Product: [Developer tools] valgrind Reporter: Tanya <tatyana.a.volnina>
Component: vexAssignee: Julian Seward <jseward>
Status: CONFIRMED ---    
Severity: normal CC: ahajkova, arkangath, arnaud.desitter, aymaneltemsahi, bart.vanassche+kde, bill200491, fchen0000, gabravier, henri, hjl.tools, kde, kde, kobalicek.petr, ktstngubljie, leferi99, lopresti, manodeep, mark, Melven.Roehrig-Zoellner, mfrancis, mgsexton1, mingfei.gao, moog621, mr284, oli.huber, paola_arc, pjfloyd, qomanaut, rashawn.l.knapp, Rincebrain, rjiejie, rjones, sam, shaveer.bajpeyi, simon.fenney, smueller, steve+kde, tanya_mineeva, tatyana.a.volnina, thiago, tiborgyri, toitran4195, tusooa, ytrezq
Priority: VHI    
Version: 3.14 SVN   
Target Milestone: ---   
Platform: unspecified   
OS: Linux   
Latest Commit: Version Fixed In:
Sentry Crash Report:
Attachments: Prototype implementation of several AVX-512 instructions
Corrected prototype implementation
Updated AVX-512 implementation prototype
Nulgrind test for the AVX-512 instructions
Nulgrind test for the AVX instructions on AVX-512 machine
Nulgrind test for the AVX-2 instructions on AVX-512 machine
Nulgrind test for the serial vFMA instructions
Updated AVX-512 implementation prototype with all KNL instructions
Nulgrind tests for AVX-512 machine, part 1
Nulgrind tests for AVX-512 machine, part 2
Nulgrind tests for AVX-512 machine, part 3
Nulgrind tests for AVX-512 machine, part 4
Nulgrind tests for AVX-512 machine, part 5
Refactored implementation for Skylake machines
patch
patch
patch
patch
patch
Part 1 of AVX-512 patch - main implementation
Part 2 of AVX-512 patch - auto-generated files
Part 3 of AVX-512 patch - AVX-512 tests
Part 4 of AVX-512 patch - (optional) Generator of AVX-512 .c files
Version 2: Part 1 of AVX-512 patch - main implementation
Version 2: Part 2 of AVX-512 patch - auto-generated files
Version 2: Part 3 of AVX-512 patch - AVX-512 tests
Version 2: Part 4 of AVX-512 patch - (optional) Generator of AVX-512 .c files
Version 3: Part 1 of AVX-512 patch - main implementation
Version 3: Part 2 of AVX-512 patch - auto-generated files
Version 3: Part 3 of AVX-512 patch - AVX-512 tests
Version 3: Part 4 of AVX-512 patch - (optional) Generator of AVX-512 .c files
Demonstrates misbehaving `vmovdqu8 %ymm7, (%r9){%k5}`
valgrind-avx512-rollup-fixes-2021Dec29.diff
Fix copyright notices on the new AVX-512 files
Fix handling of no-mask reg-reg versions of VEXPAND* and VCOMPRESS*
Update for AVX-512 Valgrind regression tests to spot differences between mask k0 and no mask
Version 4: Part 2 of AVX-512 patch - auto-generated files
Version 4: Part 4 of AVX-512 patch - (optional) Generator of AVX-512 .c files

Description Tanya 2017-08-01 16:36:58 UTC
Created attachment 107012 [details]
Prototype implementation of several AVX-512 instructions

Valgrind does not support AVX-512 yet.
AVX-512 programming reference and hardware emulator are available at https://software.intel.com/en-us/isa-extensions
AVX-512 is supported by GCC 4.9 and later.

Attached a prototype implementation of several AVX-512 instructions in Nulgrind.
A test for these instructions is located in none/tests/amd64/avx512.c
Comment 1 Julian Seward 2017-08-04 14:41:03 UTC
This is a quite good start.  Here are some initial comments.  This isn't a
full review but should give you some initial feedback.

------------------------------

+// AVX-512 - use a real CPUid
+// If XGETBV, XSAVE and XRSTOR fail, might need to update XSaveOpt
+void amd64g_dirtyhelper_CPUID_avx512 ( VexGuestAMD64State* st ) {

I see that you're doing this in order to get started quickly.  But in
general this doesn't work, because there's no fixed relationship between
what the CPUID of the host claims is supported and what you actually support
in your patch.  In the end you'll need to return your own values from CPUID,
like all the other CPUID helper function variants do.

----------------------------------

+static IRExpr* mkV512 ( UInt val )
+{
+   return IRExpr_Const(IRConst_V512(val));
+}

Assuming that IRConst_V512 contains 64 bits, one for each byte of the value,
this is wrong.  UInt is a 32 bit value.  You need to use ULong, which is
always 64 bits.

----------------------------------

@@ -2511,7 +2796,7 @@ static IRTemp disAMode_copy2tmp ( IRExpr* addr64 )
 static 
 IRTemp disAMode ( /*OUT*/Int* len,
                   const VexAbiInfo* vbi, Prefix pfx, Long delta, 
-                  /*OUT*/HChar* buf, Int extra_bytes )
+                  /*OUT*/HChar* buf, Int extra_bytes )

Nit: please don't include whitespace changes

----------------------------------

@@ -17797,7 +18164,7 @@ static Long dis_AESx ( const VexAbiInfo* vbi, Prefix pfx,
       use of mkU64 rather than mkIRExpr_HWord implies the
       assumption that the host's word size is 64-bit. */
    UInt gstOffD = ymmGuestRegOffset(rG);
-   UInt gstOffL = regNoL == 16 ? OFFB_YMM16 : ymmGuestRegOffset(regNoL);
+   UInt gstOffL = regNoL == 32 ? OFFB_ZMM32 : ymmGuestRegOffset(regNoL);

Did you mean to use zmmGuestRegOffset here?  (I don't know if that even
exists, but there's a Y/Z inconsistency here, when comparing to the
OFFB_YMM16/ZMM32 text.

----------------------------------

+        default:
+            // All other AVX-512 instructions. Crash before things get worse.
+            vex_printf("dis_ESC_0F__EVEX - UNRECOGNIZED OPCODE 0x%x\n", opcode);
+            vpanic(" bye\n ");
+            break;

Please, don't cause the instruction decoder to panic on unimplemented
instructions.  It might be a while before you implement them all.  Instead,
use whatever the return conventions are for this function, to indicate
decode failure (I think it is to return delta unchanged, but you should
check) and let V's illegal-instruction handling deal with it in the normal
way.

The same applies for all the dis_ESC_* functions that you've added.

----------------------------------

Please use the house style as much as possible, in particular, 3-space
indents.  I notice you have 4 spaces in many places.


+   if (pfx & PFX_EVEX)
+   {
+       delta=ParseEVEX(&pfx, &esc, delta);
+   }

House style please: { at the end of the condition.  And spaces on
each side of the =.

----------------------------------

+   if (pfx & PFX_EVEX) {
+       /* EVEX prefixed instruction */
+       Bool uses_vvvv = False;
+       switch (esc) {
...
+           default:
+               vassert(0);
+       }
+

Per comments above, please ensure we can't get to the vassert(0) unless
the decoder is buggy.  We should never get there for valid but unhandled
instructions.

----------------------------------

diff --git a/VEX/priv/host_amd64_isel.c b/VEX/priv/host_amd64_isel.c

+        - vregmap2 and vregmap3 are used for 512-bit vector IRTemps.
+          Maybe it'd be better to use them for 256-bit, too

Allocating these is going to be expensive for the 99.9% of blocks
that don't require them.  It would be nice to only allocate them
if we know they will be required.

----------------------------------

+         }
+do_F64AssistedUnary:
+         {

The use of gotos is OK (it makes the code simpler) but please indent
them at the same level as surrounding code.  Definitely not at column 1:

+         }
+         do_F64AssistedUnary:
+         {

----------------------------------

+    for (int i=0; i<4; i++) {
+        dst[i] = newVRegV(env);
+    }

(1) please, house style: spaces before/after = and <
(2) where does '4' come from?  Can you make it less like a magic number
(give it a name?)


+IRConst* IRConst_V512 ( UInt con )
+{
+   IRConst* c  = LibVEX_Alloc_inline(sizeof(IRConst));
+   c->tag      = Ico_V512;
+   c->Ico.V512 = con;
+   return c;
+}

Per comments above, this is definitely wrong if you intend to have one bit
per byte for a 512 bit value.

----------------------------------

 /* A union for doing 128-bit vector primitives conveniently. */
 typedef
    union {
@@ -78,6 +81,7 @@ typedef
       UShort w16[8];
       UInt   w32[4];
       ULong  w64[2];
+      double f64[2];

use the house types: Double, not double.

----------------------------------

+         UInt   V512;   /* 64-bit value; see Ico_V512 comment above */

per comments above

----------------------------------

+      /* Conflict detection */
+      Iop_Clz32x16,
+      Iop_CfD32x16,
+      Iop_BcMask_W2D,

Please describe the semantics of these a bit.  What do they do?

----------------------------------

       case Ity_V256:
          tmp1 = assignNew('V', mce, Ity_I64,  unop(Iop_1Sto64, tmp1));
-         tmp1 = assignNew('V', mce, Ity_V128, binop(Iop_64HLtoV128,
-                                                    tmp1, tmp1));
-         tmp1 = assignNew('V', mce, Ity_V256, binop(Iop_V128HLtoV256,
-                                                    tmp1, tmp1));
+         tmp1 = assignNew('V', mce, Ity_V128, binop(Iop_64HLtoV128, tmp1, tmp1));
+         tmp1 = assignNew('V', mce, Ity_V256, binop(Iop_V128HLtoV256, tmp1, tmp1));
+         return tmp1;
+      case Ity_V512:
+         tmp1 = assignNew('V', mce, Ity_I64,  unop(Iop_1Sto64, tmp1));
+         tmp1 = assignNew('V', mce, Ity_V128, binop(Iop_64HLtoV128, tmp1, tmp1));
+         tmp1 = assignNew('V', mce, Ity_V256, binop(Iop_V128HLtoV256, tmp1, tmp1));
+         tmp1 = assignNew('V', mce, Ity_V512, binop(Iop_V256HLtoV512, tmp1, tmp1));
          return tmp1;

Why does this change the V256 case?  This code is fragile and it would be
better not to change handling of existing types.

----------------------------------

-   if (UNLIKELY(ty == Ity_V256)) {
+   if (UNLIKELY((ty == Ity_V512))) {
+      Int     offQ[8];
+      IRDirty *diQ[8];
+      IRAtom  *addrQ[8], *vdataQ[8], *eBiasQ[8];
+      if (end == Iend_LE) {
+          for (int i=0; i<8; i++)
+              offQ[i]=i*8;
+      } else {
+          for (int i=0; i<8; i++)
+              offQ[7-i]=i*8;
+      }
+      for (int i=0; i<8; i++){
+      eBiasQ[i] = tyAddr==Ity_I32 ? mkU32(bias+offQ[i]) : mkU64(bias+offQ[i]);

House style:
- spaces around =, <
- 3 char indent
- proper indentation inside the loop
- use house types (Int, not int)
Comment 2 Tanya 2017-08-18 19:04:02 UTC
Created attachment 107354 [details]
Corrected prototype implementation

Julian,
thank you for the comments!

Attached a corrected version. Not fixed yet:
- Return our own values from the CPUID
- Allocate vregmap2 and vregmap3 only when needed
I'm working on these issues.
Comment 3 Julian Seward 2017-11-10 11:14:05 UTC
Tanya, any progress on this?
Comment 4 Tanya 2017-11-14 12:23:46 UTC
(In reply to Julian Seward from comment #3)
> Tanya, any progress on this?

Julian,
More AVX-512 instructions are added to Nulgrind (at the moment, 168 out of 347 are added), and some AVX-512 functionality (masking and displacement encoding) is updated. I will clean up the code and attach the updated patch in a few days.

Regards,
Tanya
Comment 5 Tanya 2017-11-21 18:53:26 UTC
Created attachment 109000 [details]
Updated AVX-512 implementation prototype
Comment 6 Tanya 2017-11-21 18:56:10 UTC
Created attachment 109001 [details]
Nulgrind test for the AVX-512 instructions

The test is based on the existing Nulgrind AVX and AVX-2 tests
Comment 7 Tanya 2017-11-21 19:01:05 UTC
Created attachment 109002 [details]
Nulgrind test for the AVX instructions on AVX-512 machine

AVX regression test based on the existing Nulgrind AVX test, outputs ZMM registers instead of YMM
Comment 8 Tanya 2017-11-21 19:02:17 UTC
Created attachment 109004 [details]
Nulgrind test for the AVX-2 instructions on AVX-512 machine

AVX-2 regression test based on the existing Nulgrind AVX-2 test, outputs ZMM instead of YMM
Comment 9 Tanya 2017-11-21 19:07:09 UTC
Created attachment 109005 [details]
Nulgrind test for the serial vFMA instructions

The "AVX-512_prototype_v3" patch changes the behaviour of the serial vFMA instructions. For these instructions, Valgrind sets destination bits [127:64] or [127:32] to zero; according to the ISE, these bits should remain unchanged. The test covers the changed instructions.
Comment 10 Tanya 2017-11-21 19:10:18 UTC
Hello, 
Sorry for the late response.

Please find the attachment for the patch AVX-512_prototype_v3.patch. It mainly adds new instructions to Nulgrind; most of these instructions are not added to Memcheck yet. The patch also enables AVX-512 in VGDB.

The patch has these updates to the AVX/AVX2 functionality:
1) AVX and AVX-2 instructions now set the unused upper half of destination ZMM vector register to all zeros.
2) Serial FMA instructions are supposed to keep the destination bits [127:32] for 32-bit instructions or [127:64] for the 64-bit unchanged; original Valgrind sets them to zero. Please let us know if this should be a separate patch.

The tests for AVX-512, AVX-2, AVX and serial vFMA instructions are attached. The vFMA test must run correctly on the AVX machines, too.


Please let me know if you would like us to provide the list of the implemented AVX-512 instructions.

Regards,
Tanya
Comment 11 Julian Seward 2017-12-08 18:04:08 UTC
Sorry for the delay.  I will review in the coming week (11-15 Dec).
Comment 12 Julian Seward 2018-01-02 11:41:53 UTC
(In reply to Tanya from comment #5)
> Created attachment 109000 [details]
> Updated AVX-512 implementation prototype

This looks pretty good.  I have a number of comments, many of them
just style/layout, but a few are a bit more than that, but nothing
serious.

As a side note -- before this lands, I would want to do some performance
runs to check that this doesn't impact performance (or correctness) of
existing IA support.

-----

 static
+Bool have_avx512(void) {

   return (vai.hwcaps & VEX_HWCAPS_AMD64_AVX512 ? True : False);

Potential operator precedence problelsm for & vs ?: make me nervous.  Please
use

   (vai.hwcaps & VEX_HWCAPS_AMD64_AVX512) ? True : False

so we don't accidentally end up with

   vai.hwcaps & (VEX_HWCAPS_AMD64_AVX512 ? True : False)


-----

valgrind-wip/valgrind/memcheck/mc_machine.c

@@ -873,7 +888,6 @@
 
    if (o == GOF(FPSCR)    && sz == 4) return -1;
    if (o == GOF(TPIDRURO) && sz == 4) return -1;
-   if (o == GOF(TPIDRURW) && sz == 4) return -1;
    if (o == GOF(ITSTATE)  && sz == 4) return -1;

Looks like you need to update your -wip copy.  The TPIDRURW line
was added recently for ARM.

-----

 void mc_LOADV_128_or_256_slow ( /*OUT*/ULong* res,
                                 Addr a, SizeT nBits, Bool bigendian )
 {
-   ULong  pessim[4];     /* only used when p-l-ok=yes */
+   ULong  pessim[16];     /* only used when p-l-ok=yes */

IIUC, you're now using this also to handle the slow case for 512 bit
loads.  Correct?

If so,
(1) please rename it to mc_LOADV_128_or_256_or_512_slow
(2) please verify that the 4->16 transition here is correct -- I suspect
it is not.  A ULong is 64 bits, so 'ULong  pessim[4]' contains one byte for
each byte in a 256-bit load.  So to make it work for 512 bit loads, don't
you only need to change it to 'ULong pessim[8]' ?

-----

+void amd64g_dirtyhelper_CPUID_avx512 ( VexGuestAMD64State* st ) {

Please add a comment saying which processor this is taken from (or most
closely represents), like with all the other CPUID helper functions.

-----

+#define OFFB_YMM32     offsetof(VexGuestAMD64State,guest_ZMM32) //NULL

What does //NULL here mean?  Maybe remove it?

-----

+#define PFX_EVEXb  (1<<30)   /* EVEX b bit, if EVEX present, else 0 */
+#define PFX_EVEXTT (0xF<<32) /* EVEX tuple type (4-bits) if EVEX present, else 0 */

Please change these (also, all the pre-existing PFX_ values like PFX_ASO etc)
like this

   1ULL << 30
   0xFULL << 32

so that we don't get problems with (eg) the compiler shifting 0xF as 32-bits
left by 32 (giving zero) and *then* widening to 64 bits.

-----

+static Int tuple_disp_map[14][3] = {

I assume that you want to access this using a TupleType for the first
index.  Please add this:

STATIC_ASSERT(FullVectorMask == 14)

just before the definition of tuple_disp_map.

-----

+static int getMult(Prefix pfx) {
+   int type = getEvexType(pfx);

Please, use house types (Int instead of int).  Int is signed-32 on all
platforms that valgrind supports.

-----

I'm not sure what function these are in, but anyway ..

+
+   // High 256 bits of ZMM0-ZMM15 registers
+   for (reg = 0; reg < 16; reg++) {
+      stmt( IRStmt_StoreG(
+               Iend_LE,
+               binop(Iop_Add64, mkexpr(addr), mkU64(1152 + reg * 32)),

and later

+                  binop(Iop_Add64, mkexpr(addr), mkU64(1664 + lane*16 + reg*64)),

Please don't use magic numbers 1152 and 1664.  Where do these come from?
Are they guest-state offsets?  If yes, please make them be derived from
OFFB_* values.  If no, please at least document where the numbers come
from and what else assumes those same numbers.

-----

       IRExpr* ea  = binop(Iop_Add64, mkexpr(addr), mkU64(576 + reg * 16));

same

+      IRExpr* ea  = binop(Iop_Add64, mkexpr(addr), mkU64(1152 + reg * 32));

etc

-----

+static IRExpr* mask_expr(Prefix pfx, IRExpr* unmasked, IRExpr* original) {

Nit: for big functions like this, please place the { on its own line.

In the same function:

+      default:
+      break;

If it is an internal logic bug that we get to this point, it would be better
to 'vassert(0)'.  But only if it's a bug in the code -- not if it is an
undecoded instruction.

-----

+      assign( e1, amt >= size
+            ? mkV512(0)
+            : binop(op, mkexpr(e0), mkU8(amt))
+            );

Please fix the indentation here and below, to make it clearer that
there are 2 args.  (In emacs C-mode, pressing TAB often fixes it for me).

+      assign( e1, amt >= size
        +            ? mkV512(0)
        +            : binop(op, mkexpr(e0), mkU8(amt))
+            );

-----

+   if (!getEvexMask(pfx)) {
    putYMMRegLoAndZU( rG, unop(op, mkexpr(arg)) );
+   }

Indent the then-clause!

-----

+static
+Long dis_AVX512_cmp_V_E_to_k ( /*OUT*/Bool* uses_vvvv,

+   IRTemp addr = IRTemp_INVALID;
+   Int alen = 0;
+   HChar dis_buf[50];
+   UInt mask = getEvexMask(pfx);
+   UChar modrm = getUChar(delta);
+   UInt rG = gregOfRexRM(pfx, modrm);
+   UInt rV = getVexNvvvv(pfx);
+   *uses_vvvv = True;
+   IRTemp tD = newTemp(Ity_V512);
+   IRExpr* unmasked;
+   UInt imm8;

Please line up the declarationdb like in many other cases, to
make it easier to read.

-----

+static Long opmask_operation_decode (const VexAbiInfo* vbi, Prefix pfx, Long* delta) {

{ on its own line

-----

+static IRTemp math_VPERMD_512 ( IRTemp ctrlV, IRTemp dataV ) {
+   /* In the control vector, zero out all but the bottom four bits of each 32-bit lane. */

Please stay within an 80 column limit.

-----

+      if (!isYMM) {
       putYMMRegLane128( rG, 1, mkV128(0) );
+      }

Indent!

-----

@@ -30062,170 +31359,2467 @@
                 nameIRegG(size,pfx,rm));
             delta += alen;
          }
-
-         /* First mask off bits not set in mask, they are ignored
-            and it should be fine if they contain undefined values.  */
-         IRExpr* masked = binop(mkSizedOp(ty,Iop_And8),
-                                mkexpr(src), mkexpr(mask));
-         IRExpr** args = mkIRExprVec_2( widenUto64(masked),
-                                        widenUto64(mkexpr(mask)) );
-         putIRegG( size, pfx, rm,
-                   narrowTo(ty, mkIRExprCCall(Ity_I64, 0/*regparms*/,
-                                              "amd64g_calculate_pext",
-                                              &amd64g_calculate_pext, args)) );
-         *uses_vvvv = True;
-         /* Flags aren't modified.  */
-         goto decode_success;
+
+         /* First mask off bits not set in mask, they are ignored
+            and it should be fine if they contain undefined values.  */
+         IRExpr* masked = binop(mkSizedOp(ty,Iop_And8),
+                                mkexpr(src), mkexpr(mask));
+         IRExpr** args = mkIRExprVec_2( widenUto64(masked),
+                                        widenUto64(mkexpr(mask)) );
+         putIRegG( size, pfx, rm,
+                   narrowTo(ty, mkIRExprCCall(Ity_I64, 0/*regparms*/,
+                                              "amd64g_calculate_pext",
+                                              &amd64g_calculate_pext, args)) );
+         *uses_vvvv = True;
+         /* Flags aren't modified.  */
+         goto decode_success;

What actually changed here?  I can't see any difference.  If it's just
whitespace, don't include it.

------

+#define MULT512 4

I see some uses of MULT512, but it's unclear what it means.  Please
add a comment.  (Later) I see you're using it to mean 'the number of
128 bit registers into which an Ity_I512 expression is computed'.  Is
there a better name for this?  I couldn't have guessed the meaning just
from the existing name.

------

 static HReg iselIntExpr_R_wrk ( ISelEnv* env, const IRExpr* e )

+            Int octet = e->Iex.Unop.op - Iop_V512to64_0;
+            vec = temp[octet/2];

Use 'UInt octet' so that we're sure the division just turns into a shift.

-----

@@ -2669,6 +2793,49 @@
       return dst;
    }
 
+   if (e->tag == Iex_Unop && e->Iex.Unop.op == Iop_ExtractExpF32) {
+      HReg dst = newVRegV(env);
+      HReg arg = iselFltExpr(env, e->Iex.Unop.arg);
+      sub_from_rsp(env, 32);
+      addInstr(env, AMD64Instr_Lea64(AMD64AMode_IR(0, hregAMD64_RSP()), hregAMD64_RDI()));
+      addInstr(env, AMD64Instr_Lea64(AMD64AMode_IR(4, hregAMD64_RSP()), hregAMD64_RSI()));

It was a bit unclear to me what you're using the RDI value for.  I had to
look at this a couple of times to see that it is the first (only?) arg to
h_generic_calc_GetExp32.  Please add a short comment saying that; also
for the clauses that follow (calls to h_generic_calc_GetMant32 and
h_generic_calc_RoundScaleF32)

-----

static void iselVecExpr_wrk_512 ( /*OUT*/ HReg *dst, ISelEnv* env, IRExpr* e ) {

{ on its own line

-----

-- valgrind-clean/valgrind/VEX/priv/host_generic_fixup.c	1969-12-31 18:00:00.000000000 -0600
+++ valgrind-wip/valgrind/VEX/priv/host_generic_fixup.c	2017-11-17 12:26:07.224622000 -0600

I see that you need a new source file here (fine), but

(1) "fixup" is a pretty meaningless name -- it could mean anything.
    Please choose something that describes better what it does, eg
    host_generic_avx512_helpers.c

(2) It needs a copyright notice.  Copy from any other file
    (eg host_generic_simd256.c and s/OpenWorks etc/Intel (or whoever))

(3) Start/end comment blocks in the file would be nice.

-----

--- valgrind-clean/valgrind/VEX/priv/host_generic_simd512.c	1969-12-31 18:00:00.000000000 -0600
+++ valgrind-wip/valgrind/VEX/priv/host_generic_simd512.c	2017-11-17 12:21:00.513156000 -0600

In the copyright, s/OpenWorks GBR/Intel (or whoever)

Please use Int instead of int in this file.

-----

--- valgrind-clean/valgrind/VEX/priv/host_generic_simd512.h	1969-12-31 18:00:00.000000000 -0600
+++ valgrind-wip/valgrind/VEX/priv/host_generic_simd512.h	2017-11-17 12:21:00.519156000 -0600

Change copyright notice name.

-----

--- valgrind-clean/valgrind/VEX/pub/libvex_basictypes.h	2017-11-21 07:10:54.339770000 -0600
+++ valgrind-wip/valgrind/VEX/pub/libvex_basictypes.h	2017-11-14 09:46:54.822380000 -0600
@@ -71,6 +71,13 @@
 /* Always 256 bits. */
 typedef  UInt  U256[8];
 
+/* Always 512 bits. */
+typedef  UInt  U512[16];
+
+/* Floating point. */
+typedef  float   Float;    /* IEEE754 single-precision (32-bit) value */
+typedef  double  Double;   /* IEEE754 double-precision (64-bit) value */
+

Please add (at this point)

STATIC_ASSERT(sizeof(Float) == 4)
STATIC_ASSERT(sizeof(Double) == 8)

(Call me paranoid.  I don't care :-)

-----

          UInt   V256;   /* 32-bit value; see Ico_V256 comment above */
+         ULong V512;   /* 64-bit value; see Ico_V512 comment above */

Please align the 'V512' -- just to make it pretty

-----

+      /* MASKING */
+      Iop_Mask32to512, Iop_Mask64to512, 
+      Iop_Mask32to256, Iop_Mask64to256, 

Add a 1-line comment explaining roughly what these do, eg like the
INTERLEAVING case just below.

-----

+      /* Detect duplicate values. Compare each element of the source vector
+       * for equality with all other elements closer to the least significant one,
+       * combine the comparison results to a bit vector */
+      Iop_CfD32x16,

Can you give this a better name?  The comment makes sense, but I couldn't
guess that from 'CfD'.  What does CfD stand for?

-----

+      // V512 -> I64, must be sequential

You could (partly) enforce this at compile time with

STATIC_ASSERT(Iop_V512to64_7 == Iop_V512to64_0 + 7)

-----

+      Iop_Align32x16,  Iop_Align64x8,
+      Iop_Expand32x16, Iop_Expand64x8,
+      Iop_Compress32x16,  Iop_Compress64x8,
+      Iop_Ternlog32x16,   Iop_Ternlog64x8,
+

Please add a short comment explaining approximately what these operations do.

-----
Comment 13 Julian Seward 2018-01-02 12:12:23 UTC
(In reply to Tanya from comment #6)
> Created attachment 109001 [details]
> Nulgrind test for the AVX-512 instructions
> 
> The test is based on the existing Nulgrind AVX and AVX-2 tests

Looks fine.  Just one comment:

-----

UChar _randArray[2062] __attribute__((used));

Where does the 2062 number come from?  I'd prefer a #define with
some explanation.
Comment 14 Julian Seward 2018-01-02 12:14:26 UTC
(In reply to Tanya from comment #7)
> Created attachment 109002 [details]
> Nulgrind test for the AVX instructions on AVX-512 machine
> 
> AVX regression test based on the existing Nulgrind AVX test, outputs ZMM
> registers instead of YMM

Is this intended to replace the existing AVX test?  Or is it a new
test?  This is unclear.
Comment 15 Julian Seward 2018-01-02 12:15:41 UTC
(In reply to Tanya from comment #8)
> Created attachment 109004 [details]
> Nulgrind test for the AVX-2 instructions on AVX-512 machine
> 
> AVX-2 regression test based on the existing Nulgrind AVX-2 test, outputs ZMM
> instead of YMM

Same question here as in comment 14.
Comment 16 Julian Seward 2018-01-02 12:18:41 UTC
(In reply to Tanya from comment #9)
> Created attachment 109005 [details]
> Nulgrind test for the serial vFMA instructions
> 
> The "AVX-512_prototype_v3" patch changes the behaviour of the serial vFMA
> instructions. For these instructions, Valgrind sets destination bits
> [127:64] or [127:32] to zero; according to the ISE, these bits should remain
> unchanged. The test covers the changed instructions.

If I understand this right, that means the existing cases for serial vFMA
insns are wrong, and also the VEX implementation is wrong.  Is that correct?

If so, shouldn't we just fix both the test case and implementation?
Comment 17 Tanya 2018-02-07 18:52:40 UTC
Hello Julian,

Sorry for a late reply.
Thank you very much for the comments. We have fixed most of these bugs, and hope to finish adding and debugging KNL AVX-512 instructions in about a month.

Regarding the performance on AVX-2 code:
>> As a side note -- before this lands, I would want to do some performance 
>> runs to check that this doesn't impact performance (or correctness) of 
>> existing IA support.
On a few tiny AVX-2 benchmarks, Memcheck overhead is 0-1% bigger than that of a "clean" Valgrind version. We will run and measure it on bigger AVX-2 benchmarks.
Do you have any obligatory benchmarks for Valgrind correctness and performance? 


Regarding the test files:
>> Is this intended to replace the existing AVX test?  Or is it a new test? 
>> This is unclear.
The attached tests are new tests, usable on AVX-512 machines only. They recheck AVX and AVX-2 instructions on bigger vector registers, similarly to how avx-2.c test rechecks avx-1 instructions on ymm registers. Would it be ok to keep them as three separate tests files for AVX-512, or should they be merged into one avx-512.c test file?


Regarding the FMA instructions:
>> If I understand this right, that means the existing cases for serial vFMA 
>> insns are wrong, and also the VEX implementation is wrong. Is that 
>> correct? If so, shouldn't we just fix both the test case and 
>> implementation?
The issue was, for serial (32- and 64-bit) FMA instructions, Valgrind used to set bits [128:32] or [128:64] of the destination to zero, while they should be left unchanged. We have fixed the implementation and added a new test, because the none/tests/amd64/fma.c test seems to be designed to only verify one float or one double value. 
Would you prefer us to provide not-AVX-512-related changes as a separate patches?


I also have a question on our implementation of translation of EVEX instructions to IR. 
Currently, we use separate functions for VEX- and EVEX- prefixed instructions (file VEX/priv/guest_amd64_toIR.c, functions, for example, dis_ESC_0F38__VEX and dis_ESC_0F38__EVEX, respectively). 

However, looking at the next Intel AVX-512 instruction sets, the VL (Vector Length) set allows to run EVEX-prefixed instructions on xmm and ymm registers, so it basically duplicates the VEX code (for example, EVEX-prefixed "vmovpdd xmm1, xmm2" is an equivalent of VEX-prefixed "vmovpdd xmm1, xmm2").

The easiest way to implement it would be to unite the EVEX- and VEX- translator functions into something like "dis_ESC_0F38__VEX_EVEX". On the upside, there would be less duplicated code. On the downside, it means that EVEX-related code would no longer be contained in separate __EVEX functions, so it would probably be more difficult to review.
An alternate approach would be to add VL code (basically, copy the __VEX translations) to the __EVEX functions. As a downside, it may be bothersome to maintain the __VEX and the EVEX VL implementations identical.

It we were to implement those instructions in the future, what would be a preferable approach?


Thank you,
Tanya
Comment 18 Hou, Ao-ping 2018-02-11 08:59:21 UTC
Hello, 
I have patched the "AVX-512_prototype_v3" patch.
But I can not use this patch for _mm512_setr_pd(...)
Could you please help me to solve it ?

I encounter the following ERROR:

[elliot@blacksburg ~]$ avx512-valgrind ./aaa 
==4021== Memcheck, a memory error detector
==4021== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==4021== Using Valgrind-3.14.0.GIT and LibVEX; rerun with -h for copyright info
==4021== Command: ./aaa
==4021== 
dis_ESC_0F3A__EVEX - UNRECOGNIZED OPCODE 0x1A
==4021== Invalid read of size 4
==4021==    at 0x400AB6: simd_assign(int, double*, double*) (in /home/elliot/aaa)
==4021==    by 0x400825: main (in /home/elliot/aaa)
==4021==  Address 0xfffffffffffffff1 is not stack'd, malloc'd or (recently) free'd
==4021== 
==4021== 
==4021== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==4021==  Access not within mapped region at address 0xFFFFFFFFFFFFFFF1
==4021==    at 0x400AB6: simd_assign(int, double*, double*) (in /home/elliot/aaa)
==4021==    by 0x400825: main (in /home/elliot/aaa)
==4021==  If you believe this happened as a result of a stack
==4021==  overflow in your program's main thread (unlikely but
==4021==  possible), you can try to increase the size of the
==4021==  main thread stack using the --main-stacksize= flag.
==4021==  The main thread stack size used in this run was 8388608.


And my code is:
#include <xmmintrin.h>
#include <emmintrin.h>
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <float.h>

#include <avx512fintrin.h>
#include <avx512dqintrin.h>
#include <avx512cdintrin.h>

typedef double real;
typedef int64_t integer;
typedef __m512d realVect ;
typedef integer intVec __attribute__ ((vector_size(512)));
#define VECTOR_SIZE 8
#define SHIFT_SIZE 3

#define modSize(a) (((a>>SHIFT_SIZE )+1)<< SHIFT_SIZE)
#define NRM_RAND_VALE (5.0 - 10.0*(real)(rand()%10000)/(real)10000.0)

real *new_real(int n)
{
	real *c = (real *)aligned_alloc(1024, sizeof(real)*modSize(n));
	if(c==NULL){
		fprintf(stderr, "%s Error --- Cannot allocate real array\n",__FUNCTION__);
		exit(1);
	}
	return c;
}

integer *new_integer(int n)
{
	integer *c = (integer *)aligned_alloc(1024, sizeof(integer)*n);
	if(c==NULL){
		fprintf(stderr, "%s Error --- Cannot allocate int array\n",__FUNCTION__);
		exit(1);
	}
	return c;
}

inline void simd_assign(int n, real *c, real *a)
{
	realVect *rvc=(realVect *)c;
	for(int i=0; i<n; i+=VECTOR_SIZE){
		rvc[i>>SHIFT_SIZE] = _mm512_setr_pd(a[i], a[i+1], a[i+2], a[i+3], a[i+4], a[i+5], a[i+6], a[i+7]);
	}
}




int main(int argc, char *argv[])
{
	int N=100000;
	real *a=new_real(N);
	real *c=new_real(N);
	
	for(int i=0; i<N; i++)
		a[i] = NRM_RAND_VALE;
	
	simd_assign(N,c,a);
	for(int i=0;i<N; i++){
		if(a[i] != c[i]){
			printf("%s vector assign FAIL %d --- %f != %f\n",argv[0],i,c[i],a[i]);
			return 1;
		}
	}
	printf("PASS\n");
	return 0;
}


Elliot
Comment 19 Tanya 2018-02-13 15:37:18 UTC
(In reply to Hou, Ao-ping from comment #18)
> Hello, 
> I have patched the "AVX-512_prototype_v3" patch.
> But I can not use this patch for _mm512_setr_pd(...)
> Could you please help me to solve it ?
> 
> I encounter the following ERROR:
> ...
> 
> Elliot


Hello Elliot,

I can not reproduce the issue, but it looks like the application uses AVX-512 "vinsertf64x4" instruciton, which haven't been implemented in the patch.
We still have a few AVX-512 instructions missing. I'll attach the updated patch with all AVX-512 instructions next week. 

Thank you,
Tanya
Comment 20 Hou, Ao-ping 2018-02-14 06:58:51 UTC
(In reply to Tanya from comment #19)
> (In reply to Hou, Ao-ping from comment #18)
> > Hello, 
> > I have patched the "AVX-512_prototype_v3" patch.
> > But I can not use this patch for _mm512_setr_pd(...)
> > Could you please help me to solve it ?
> > 
> > I encounter the following ERROR:
> > ...
> > 
> > Elliot
> 
> 
> Hello Elliot,
> 
> I can not reproduce the issue, but it looks like the application uses
> AVX-512 "vinsertf64x4" instruciton, which haven't been implemented in the
> patch.
> We still have a few AVX-512 instructions missing. I'll attach the updated
> patch with all AVX-512 instructions next week. 
> 
> Thank you,
> Tanya

Hi, Tanya,

Thank you so much.
Once I get  new patch, I will try it for basic operations in AVX-512 double precision.

Best Regards

Elliot
Comment 21 Tanya 2018-03-28 10:38:45 UTC
Created attachment 111693 [details]
Updated AVX-512 implementation prototype with all KNL instructions

The patch (AVX-512_prototype_v4_all_knl_insns.patch) implements all instructions from four subsets of AVX-512 (F, ER, CD, PF), available on KNL machines, in Nulgrind and in Memcheck. It also enables AVX-512 registers in vgdb.

New source files, added by this patch:
- VEX/priv/host_generic_simd512.c - Implementation of AVX-512 instrucitons in C, similar to VEX/priv/host_generic_simd256.c
- VEX/priv/host_generic_vrcp14.c - Reference implementations of VRCP14 and VRSQRT14 instructions. 
- VEX/priv/host_generic_avx512er.c - Reference implementations of AVX-512 ER instruction set (VRCP28, VRSQRT28, and VEXP2 instructions).
The latter two files are copied, with minor modifications, from https://software.intel.com/en-us/articles/reference-implementations-for-IA-approximation-instructions-vrcp14-vrsqrt14-vrcp28-vrsqrt28-vexp2 , files RECIP14.c and RECIP28EXP2.c, correspondingly

AVX-512 KNL benchmarks that pass result verification under Nulgrind (not tested under Memcheck yet) with the patch:
NPB IS, IOR, STREAM, STRIDE, AMG-2013, LAMMPS (both GCC-7.1.0 and Intel Compiler 17.0.4 builds).

NPB benchmarks other than IS and the QMCPACK benchmark fail result verification or crash.
Comment 22 Tanya 2018-03-28 10:39:47 UTC
Created attachment 111694 [details]
Nulgrind tests for AVX-512 machine, part 1
Comment 23 Tanya 2018-03-28 10:40:23 UTC
Created attachment 111695 [details]
Nulgrind tests for AVX-512 machine, part 2
Comment 24 Tanya 2018-03-28 10:41:39 UTC
Created attachment 111696 [details]
Nulgrind tests for AVX-512 machine, part 3
Comment 25 Tanya 2018-03-28 10:42:46 UTC
Created attachment 111697 [details]
Nulgrind tests for AVX-512 machine, part 4
Comment 26 Tanya 2018-03-28 10:43:45 UTC
Created attachment 111698 [details]
Nulgrind tests for AVX-512 machine, part 5

Files AVX-512_prototype_v4_tests.patch_p1 - AVX-512_prototype_v4_tests.patch_p5 are parts of new Nulgrind tests, split up because of site file size limit.
To join them together, please run "cat AVX-512_prototype_v4_tests.patch_p? > AVX-512_prototype_v4_tests.patch"

New test files, added by this patch (all in none/tests/amd64/):
avx512.vgtest, avx512.c, avx512.stdout.exp, avx512.stderr.exp - test for AVX_512 instructions. VGETEXPPS and VSCALEFPS instruction tests fail on denormal values.
avx-1_zmm.vgtest, avx-1_zmm.c, avx-1_zmm.stdout.exp, avx-1_zmm.stderr.exp - test  for avx-1 instructions on AVX-512 machine. The only difference from original avx-1 test file is that it check bytes 511:256 of the used registers
avx2-1_zmm.vgtest, avx2-1_zmm.c, avx2-1_zmm.stdout.exp avx2-1_zmm.stderr.exp - test for avx-2 instructions on AVX-512 machine. The only difference from the original file is that it check bytes 511:256 of the used registers
Comment 27 Tanya 2018-03-28 10:46:06 UTC
> Hi, Tanya,
> 
> Thank you so much.
> Once I get  new patch, I will try it for basic operations in AVX-512 double
> precision.
> 
> Best Regards
> 
> Elliot

Hello Elliot,
Attached a new patch. Sorry for the delay.

Thank you,
Tanya
Comment 28 Chris Samuel 2018-03-30 05:13:11 UTC
Hi there,

We hit the illegal instruction message on Skylake (6140) CPUs running valgrind on programs linked against OpenMPI 3.0.0; the cause being that with our GCC 5.5 and higher installs memset() causes the VPXORD command to be emitted during various initialisation routines which Valgrind can't handle.

It took a bit of digging to track down what was going on, so we're certainly interested in this work!

All the best,
Chris (HPC sysadmin at Swinburne University of Technology in Melbourne, Australia)
Comment 29 Hou, Ao-ping 2018-04-10 22:10:43 UTC
(In reply to Tanya from comment #27)
> > Hi, Tanya,
> > 
> > Thank you so much.
> > Once I get  new patch, I will try it for basic operations in AVX-512 double
> > precision.
> > 
> > Best Regards
> > 
> > Elliot
> 
> Hello Elliot,
> Attached a new patch. Sorry for the delay.
> 
> Thank you,
> Tanya

Hi, Tanya,

thank you.
I'll try these codes later.
Again, thank you so much.

Best Regards.

Elliot Hou, Ao-ping
Comment 30 Hou, Ao-ping 2018-04-11 06:32:55 UTC
(In reply to Tanya from comment #27)
> > Hi, Tanya,
> > 
> > Thank you so much.
> > Once I get  new patch, I will try it for basic operations in AVX-512 double
> > precision.
> > 
> > Best Regards
> > 
> > Elliot
> 
> Hello Elliot,
> Attached a new patch. Sorry for the delay.
> 
> Thank you,
> Tanya

Hi, Tanya,

I cannnot merge with the following commands:
git apply --stat ../AVX-512_prototype_v4_all_knl_insns.patch

And there are two errors in this patch:
memcheck/mc_translate.c
@@ -3191,6 +3341,33 @@ IRAtom* expr2vbits_Triop ( MCEnv* mce,
                                 unary64Fx2_w_rm(mce, vatom1, vatom2),
                                 unary64Fx2_w_rm(mce, vatom1, vatom3)));
 
And
memcheck/tests/vbit-test/irops.c
@@ -1131,12 +1143,169 @@ static irop_t irops[] = {
   { DEFOP(Iop_Rotx32, UNDEF_ALL), },
   { DEFOP(Iop_Rotx64, UNDEF_ALL), },
   { DEFOP(Iop_PwBitMtxXpose64x2, UNDEF_64x2_TRANSPOSE), .ppc64 = 1, .ppc32 = 1 },

Could you help me to solve it.

THank you

Elliot Hou, Ao-ping
Comment 31 Jacek Tomaka 2018-09-08 07:20:17 UTC
*** This bug has been confirmed by popular vote. ***
Comment 32 Tom Hughes 2020-05-01 06:12:05 UTC
*** Bug 420834 has been marked as a duplicate of this bug. ***
Comment 33 Tom Hughes 2020-05-01 06:12:29 UTC
*** Bug 393351 has been marked as a duplicate of this bug. ***
Comment 34 Tanya 2020-05-26 11:54:21 UTC
Created attachment 128799 [details]
Refactored implementation for Skylake machines

Attached patch "AVX-512_prototype_all_skx_insns.patch". It implements Skylake AVX-512 instruction subsets (AVX-512F, AVX-512CD, AVX-512VL, AVX-512BW, AVX-512 DQ) and refactors AVX-512 functionality in the following way:
- AVX-512 functionality is separated from the main code and only available under "AVX_512" define
- AVX-512 instruction information is moved to C structures. They can be automatically generated form a master .csv file
- Minimal AVX-512 Memcheck enabling

The patch is in alpha-stage. It has been tested on AVX-512 NPB benchmarks: some benchmarks pass validation under Nulgrind and Memcheck, but validation failures and crashes still occur, and Memcheck often reports false-positive errors.
Comment 35 Patrick J. LoPresti 2020-05-26 15:18:33 UTC
(In reply to Tanya from comment #34)

Nice to see work resuming on this.

Valgrind is almost useless for us at this point, because AVX-512 CPUs are now ubiquitous and the performance benefits are too large to ignore.
Comment 36 Alexandra Hajkova 2020-05-28 12:01:26 UTC
(In reply to Tanya from comment #34)
> Created attachment 128799 [details]
> Refactored implementation for Skylake machines
> 
> Attached patch "AVX-512_prototype_all_skx_insns.patch". It implements
> Skylake AVX-512 instruction subsets (AVX-512F, AVX-512CD, AVX-512VL,
> AVX-512BW, AVX-512 DQ) and refactors AVX-512 functionality in the following
> way:
> - AVX-512 functionality is separated from the main code and only available
> under "AVX_512" define
> - AVX-512 instruction information is moved to C structures. They can be
> automatically generated form a master .csv file
> - Minimal AVX-512 Memcheck enabling
> 
> The patch is in alpha-stage. It has been tested on AVX-512 NPB benchmarks:
> some benchmarks pass validation under Nulgrind and Memcheck, but validation
> failures and crashes still occur, and Memcheck often reports false-positive
> errors.

Hello Tanya,

This work is great. But do you have any special reason to not to use git format-patch? This patch does not applies. 

Thank you,
Alexandra
Comment 37 Alexandra Hajkova 2020-05-28 12:44:59 UTC
Created attachment 128856 [details]
patch

Hello Tanya,

I slighly modified your patch to make it appliable (removed trailing whitespaces, etc.) and used git format-patch.
Comment 38 Tanya 2020-05-28 13:50:05 UTC
(In reply to Alexandra Hajkova from comment #37)
> Created attachment 128856 [details]
> patch
> 
> Hello Tanya,
> 
> I slighly modified your patch to make it appliable (removed trailing
> whitespaces, etc.) and used git format-patch.

Hello Alexandra, 
Thank you! I will use this format for further patches.
Comment 39 Tom Hughes 2020-06-18 19:28:07 UTC
*** Bug 423182 has been marked as a duplicate of this bug. ***
Comment 40 Alexandra Hajkova 2020-06-19 12:32:43 UTC
Created attachment 129522 [details]
patch
Comment 41 Alexandra Hajkova 2020-06-19 13:19:15 UTC
Created attachment 129525 [details]
patch
Comment 42 Alexandra Hajkova 2020-06-19 17:15:33 UTC
Created attachment 129528 [details]
patch
Comment 43 Alexandra Hajkova 2020-06-24 21:32:33 UTC
My git repo:
https://github.com/sasshka/valgrind/commit/2a9d9c2a5e6021cd1b928e03aabb9e493cde5cdd
I tested the patch on Knights Landing. I had to rename gdbserver xml files for gdbserver tests to pass - they used to have avx with upper cases:
64bit-avx512-valgrind-s2.xml
64bit-avx512-valgrind-s1.xml
64bit-avx512.xml

The patch doesn't seem to handle AVX512 subsets, the knights landing I used didn't have KADD instruction which was added for BW/DQ (according to https://en.wikipedia.org/wiki/AVX-512#New_instructions_by_sets)

avx512.stdout.exp is missing in the patch with makes avx512 test to always fail.
Comment 44 Alexandra Hajkova 2020-06-24 21:33:28 UTC
Created attachment 129652 [details]
patch
Comment 45 Tom Hughes 2020-09-09 07:26:32 UTC
*** Bug 426330 has been marked as a duplicate of this bug. ***
Comment 46 Tom Hughes 2021-02-19 22:52:52 UTC
*** Bug 433272 has been marked as a duplicate of this bug. ***
Comment 47 Tanya 2021-02-20 13:37:50 UTC
Created attachment 135971 [details]
Part 1 of AVX-512 patch - main implementation
Comment 48 Tanya 2021-02-20 13:38:22 UTC
Created attachment 135972 [details]
Part 2 of AVX-512 patch - auto-generated files
Comment 49 Tanya 2021-02-20 13:38:51 UTC
Created attachment 135973 [details]
Part 3 of AVX-512 patch - AVX-512 tests
Comment 50 Tanya 2021-02-20 13:43:39 UTC
Created attachment 135974 [details]
Part 4 of AVX-512 patch - (optional) Generator of AVX-512 .c files
Comment 51 Tanya 2021-02-20 13:45:22 UTC
Hello,

Attached updated AVX-512 patch for KNL and SkyLake.
It is based on the GIT master from Feb. 20 (commit 1c9a0bf58a47e855e6e5bf78a30bcee0af835804)

Attached files
~~~~~~~~~~~~~~
- AVX-512_KNL_SKX_p1_main.patch - Main AVX-512 Valgrind patch
- AVX-512_KNL_SKX_p2_data.patch - Files, automatically generated by a script. These files are required for the build
- AVX-512_KNL_SKX_p3_test.patch - AVX-512 regression tests
- AVX-512_KNL_SKX_p4_filegen.patch - Generator of Valgrind AVX-512 .c files from a file with instruction descriptions. Not required for build or usage; useful for adding new instructions or modifying AVX-512 behaviour

Functionality
~~~~~~~~~~~~~
- No known regressions on AVX2 machines
- Nulgrind is functional (no known failures of small AVX-512 applications) on KNL and SkyLake
- Memcheck does not crash on AVX-512 code; the analysis might still be incorrect

Limitations
~~~~~~~~~~~
- Needs GCC version 8 or newer
- Cannot emulate AVX-512 code on AVX-2 or older machine (some AVX-512 instructions are emulated through intrinsics)

Build and test
~~~~~~~~~~~~~~
Clone Valgrind master:
> git clone https://sourceware.org/git/valgrind.git
> cd valgrind
Apply the patch:
> git apply AVX-512_KNL_SKX_p1_main.patch
> git apply AVX-512_KNL_SKX_p2_data.patch
> git apply AVX-512_KNL_SKX_p3_test.patch
Optional development tool, not required for the build process:
> git apply AVX-512_KNL_SKX_p4_filegen.patch

Build:

Please check that GCC version is 8 or higher.
> ./autogen.sh
> ./configure --prefix=<install-path>
On AVX-512 machine, verify that AVX-512 version will be built: open config.log and check that "BUILD_AVX512_TESTS_TRUE=''" and "CFLAGS=' -DAVX_512'" lines exist.
> make install

Quick test:
<install-path>/bin/valgrind ls

Invoke a script to generate reference AVX-512 tests results (they are not provided with the patch because file size exceeds limit set by bugtracker):
> ./scripts/generate_test_results.sh
Run regression tests:
> make regtest
Comment 52 Mark Wielaard 2021-02-28 21:57:36 UTC
*** Bug 428004 has been marked as a duplicate of this bug. ***
Comment 53 Tanya 2021-07-13 12:13:23 UTC
Created attachment 140025 [details]
Version 2: Part 1 of AVX-512 patch - main implementation
Comment 54 Tanya 2021-07-13 12:13:58 UTC
Created attachment 140026 [details]
Version 2: Part 2 of AVX-512 patch - auto-generated files
Comment 55 Tanya 2021-07-13 12:14:43 UTC
Created attachment 140027 [details]
Version 2: Part 3 of AVX-512 patch - AVX-512 tests
Comment 56 Tanya 2021-07-13 12:15:05 UTC
Created attachment 140028 [details]
Version 2: Part 4 of AVX-512 patch - (optional) Generator of AVX-512 .c files
Comment 57 Tanya 2021-07-13 12:15:40 UTC
Attached an updated version of the AVX-512 patch for KNL and SkyLake. It is based on Valgrind 3.18 (GIT master from Jul, 13, commit 61307ee83121aa5f0b57a12a80e90fc2f414380a)

The major improvements are:
- Fixes for runtime crashes
- Changed the way masked instructions access memory

For build and test instructions, please refer to comment #51 (https://bugs.kde.org/show_bug.cgi?id=383010#c51)
Comment 58 Tanya 2021-10-29 12:26:50 UTC
Created attachment 142992 [details]
Version 3: Part 1 of AVX-512 patch - main implementation
Comment 59 Tanya 2021-10-29 12:27:19 UTC
Created attachment 142993 [details]
Version 3: Part 2 of AVX-512 patch - auto-generated files
Comment 60 Tanya 2021-10-29 12:27:49 UTC
Created attachment 142994 [details]
Version 3: Part 3 of AVX-512 patch - AVX-512 tests
Comment 61 Tanya 2021-10-29 12:28:19 UTC
Created attachment 142995 [details]
Version 3: Part 4 of AVX-512 patch - (optional) Generator of AVX-512 .c files
Comment 62 Tanya 2021-10-29 12:28:45 UTC
Attached an updated version of the AVX-512 patch for KNL and SkyLake. It does not contain any principal changes comparing to the previous version, just bug fixes.
It is based on Valgrind 3.18 (GIT master from Oct 22, commit b77dbefe72e4a5c7bcf1576a02c909010bd56991)
For build and test instructions, please refer to comment #51 (https://bugs.kde.org/show_bug.cgi?id=383010#c51)
Comment 63 Patrick J. LoPresti 2021-10-29 14:16:09 UTC
Could someone please work on getting these changes into mainline? Or at least make this a branch in the git repository? (Is there anything I can do to help?)

AVX-512 is now a "must have" for us.
Comment 64 Tanya 2021-10-29 14:34:35 UTC
(In reply to Patrick J. LoPresti from comment #63)
> Could someone please work on getting these changes into mainline? Or at
> least make this a branch in the git repository? (Is there anything I can do
> to help?)
> 
> AVX-512 is now a "must have" for us.

Hello,

If you would make the patch into a new branch - would you be interested in its internal commit history, perhaps for git blame?

If yes, preparing it with git-format-patch turned out to be too verbose. If it'd be useful, would it be possible to grant me access to this branch, perhaps temporarily?
Comment 65 Patrick J. LoPresti 2021-10-29 16:45:23 UTC
(In reply to Tanya from comment #64)
> 
> If you would make the patch into a new branch - would you be interested in
> its internal commit history, perhaps for git blame?

I am not a Valgrind maintainer; I have only submitted a few patches. I just did not want to ask for something without offering something...

I just think it would be really nice if this support could live on a branch in the official repository, if not (yet) delivered to the mainline.
Comment 66 Tanya 2021-11-03 09:19:45 UTC
*** Bug 441609 has been marked as a duplicate of this bug. ***
Comment 67 Julian Seward 2021-12-28 14:26:37 UTC
Created attachment 144910 [details]
Demonstrates misbehaving `vmovdqu8 %ymm7, (%r9){%k5}`

I've been testing the patches from comment 58, 59, 60 against the trunk, using
Fedora 35 running on a Core i5-1135G7.  It passes the tests in the comment 60
patch, but causes regressions in various other tests.  I tracked one problem
down to an incorrect implementation of 256-bit stores that use a guard
register (k1 .. k7).  This causes glibc's memset() to misbehave, hence causing
--tool=none runs to fail.

Testcase is attached.  I imagine it's caused by an incorrect translation into
IR, but I haven't figured out how that translation is done.
Comment 68 Julian Seward 2021-12-29 12:19:37 UTC
Created attachment 144930 [details]
valgrind-avx512-rollup-fixes-2021Dec29.diff

Rollup fixes to be applied on top of (after) the patches in comments 58, 59 and 60:

* fixes the problem described in comment 67.  The patch set extends
  AMD64Instr::CStore and AMD64Instr::CLoad to also handle 8- and 16- bit
  conditional stores and loads.  However, the emit_AMD64Instr clauses for
  these cases were not correct and still generating 64-bit transactions.  This
  fixes them.  That removes a bunch of incorrect results in regression tests,
  and crashing when running real programs.  The test case avx512-skx is still
  failing, though.

* [trivial fixes] fixes segfaults caused by insufficient alignment in test
  cases avx512-l1.c and avx512-l2.c

* [temporary] disables a few test cases in avx512-l1.c since they don't run on
  my hardware (Core i5-1135G7), even natively.
Comment 69 Tanya 2021-12-29 17:19:21 UTC
Created attachment 144938 [details]
Fix copyright notices on the new AVX-512 files

Patches attached in comment #58, comment #59, comment #60 and comment #61 had incorrect copyright notices. Attached a patch with a fix.
Comment 70 Julian Seward 2022-01-01 18:56:42 UTC
Created attachment 145020 [details]
Fix handling of no-mask reg-reg versions of VEXPAND* and VCOMPRESS*

Here's a bug fix for the VEXPAND and VCOMPRESS instructions, specifically for
the register-to-register, mask-free versions.  By "mask-free" I mean they do
not specify any of `{k1}` to `{k7}`.  (I think that makes the instructions
into trivial reg-to-reg copies, but that's irrelevant).  The bug is that the
generated IR acts as if `{k0}` had been specified, and so the result depends
on whatever value is in `k0` at the time.

I worry that there are potentially other places where the IR is generated
using `getKReg(mask)` when really it should be `mask ? getKReg(mask) :
mkU64(0)`, and that testing isn't catching these.  Not sure though.
Comment 71 Tanya 2022-01-12 13:53:19 UTC
Created attachment 145364 [details]
Update for AVX-512 Valgrind regression tests to spot differences between mask k0 and no mask

(In reply to Julian Seward from comment #70)
> I worry that there are potentially other places where the IR is generated
> using `getKReg(mask)` when really it should be `mask ? getKReg(mask) :
> mkU64(0)`, and that testing isn't catching these.  Not sure though.

Attached a patch to AVX-512 regression tests that catches this kind of errors by writing a random value in k0 before each test.
It did not detect any other affected instructions.
Comment 72 melven 2022-01-18 11:04:47 UTC
Small remark:

If valgrind is compiled explicitly for the skylake_avx512 architecture with GCC 11 (e.g. as done by https://spack.readthedocs.io on appropriate hardware), there is a kmovq instruction generated in vgpreload_memcheck-amd64-linux.so.

This triggers an "unrecognised instruction" error for any executable run with valgrind (memcheck):
```
vex amd64->IR: unhandled instruction bytes: 0xC4 0xE1 0xFB 0x92 0xC8 0x48 0x8D 0x5 0xC0 0x71
vex amd64->IR:   REX=0 REX.W=1 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
vex amd64->IR:   PFX.66=0 PFX.F2=1 PFX.F3=0
==3481== valgrind: Unrecognised instruction at address 0x4c629f4.
==3481==    at 0x4C629F4: stpcpy (vg_replace_strmem.c:1180)
...
```

Everything works of course fine, if one just compiles (generically) for arch=x86_64.
Comment 73 Tanya 2022-02-11 08:19:30 UTC
Created attachment 146572 [details]
Version 4: Part 2 of AVX-512 patch - auto-generated files
Comment 74 Tanya 2022-02-11 08:20:08 UTC
Created attachment 146573 [details]
Version 4: Part 4 of AVX-512 patch - (optional) Generator of AVX-512 .c files
Comment 75 Tanya 2022-02-11 08:24:45 UTC
Julian reported VPTESTMW and VPTESTMB instruction failures - these instructions read data with incorrect granularity (64 and 32 bits instead of 16 and 8 bits, respectively). The granularity has been specified in the master file incorrectly.
Attached an updated master file (also corrects granularity for VPTESTNMW and VPTESTNMB, and corrects exception types for several instructions) and the generated files ("Version 4: Part 2 of AVX-512 patch - auto-generated files").
Comment 76 Tom Hughes 2022-06-14 21:30:08 UTC
*** Bug 408140 has been marked as a duplicate of this bug. ***
Comment 77 Tom Hughes 2022-06-14 21:30:18 UTC
*** Bug 455279 has been marked as a duplicate of this bug. ***
Comment 78 Tom Hughes 2022-06-14 21:30:36 UTC
*** Bug 451837 has been marked as a duplicate of this bug. ***
Comment 79 Sam James 2022-08-03 22:17:47 UTC
Are Tanya's patches still pending review? Is there an outstanding known issue with them?
Comment 80 Paul Floyd 2022-09-26 08:32:24 UTC
*** Bug 458305 has been marked as a duplicate of this bug. ***
Comment 81 Paul Floyd 2022-09-26 08:34:56 UTC
*** Bug 458218 has been marked as a duplicate of this bug. ***
Comment 82 Tom Hughes 2022-11-22 17:02:28 UTC
*** Bug 462135 has been marked as a duplicate of this bug. ***
Comment 83 Tom Hughes 2022-12-15 20:40:51 UTC
*** Bug 463082 has been marked as a duplicate of this bug. ***
Comment 84 Patrick J. LoPresti 2022-12-15 23:33:33 UTC
(In reply to Sam James from comment #79)
> Are Tanya's patches still pending review? Is there an outstanding known
> issue with them?

Similar question... What are the outstanding tasks here, and do they have owners? Can I help?
Comment 85 Mark Wielaard 2023-04-20 11:42:21 UTC
*** Bug 468544 has been marked as a duplicate of this bug. ***
Comment 86 Mark Wielaard 2023-04-20 11:43:46 UTC
*** Bug 460203 has been marked as a duplicate of this bug. ***
Comment 87 Mark Wielaard 2023-04-20 11:50:59 UTC
*** Bug 450952 has been marked as a duplicate of this bug. ***
Comment 88 JojoR 2023-04-26 02:23:22 UTC
Excellent work, it's a long time :)

Is there any schedule for upstreaming about this feature ?
these patches looks like a prototype implementation ?

I opened another thread about RISC-V vector, some issues should be
common in VEX or plugin Memcheck, even more vector ISA generator :)
Any suggestions for these ? @Tanya @Julian Seward

BTW, is there anyone interested in ARM's SVE or RISC-V vector ?
See more details from https://bugs.kde.org/show_bug.cgi?id=468979
Comment 89 Tom Hughes 2023-05-09 09:02:28 UTC
*** Bug 417572 has been marked as a duplicate of this bug. ***
Comment 90 Tom Hughes 2023-05-09 09:02:58 UTC
*** Bug 339416 has been marked as a duplicate of this bug. ***
Comment 91 Tom Hughes 2023-05-17 06:05:14 UTC
*** Bug 469878 has been marked as a duplicate of this bug. ***
Comment 92 Tom Hughes 2023-05-31 14:43:25 UTC
*** Bug 470489 has been marked as a duplicate of this bug. ***
Comment 93 Tom Hughes 2024-01-30 17:08:14 UTC
*** Bug 480545 has been marked as a duplicate of this bug. ***
Comment 94 Paul Floyd 2024-02-23 17:52:22 UTC
*** Bug 481729 has been marked as a duplicate of this bug. ***
Comment 95 hmenke 2024-04-19 10:46:03 UTC
I wanted to give these patches a try, but I can't figure out which version of Valgrind these are supposed to apply to.
Comment 96 Paul Floyd 2024-05-17 04:59:06 UTC
*** Bug 487124 has been marked as a duplicate of this bug. ***
Comment 97 Tom Hughes 2024-06-26 12:34:54 UTC
*** Bug 489221 has been marked as a duplicate of this bug. ***
Comment 98 Mark Wielaard 2024-07-10 17:19:58 UTC
*** Bug 490009 has been marked as a duplicate of this bug. ***
Comment 99 Michał Dec 2024-10-09 20:49:09 UTC
I've just hit this bug and I can't properly work with AVX-512 instruction. I have to scp my entire projects to a machine that doesn't use AVX-512 at glibc level. This is very inconvenient and uncomfortable.
Comment 100 Paul Floyd 2024-10-10 04:22:32 UTC
(In reply to Michał Dec from comment #99)
> I've just hit this bug and I can't properly work with AVX-512 instruction. I
> have to scp my entire projects to a machine that doesn't use AVX-512 at
> glibc level. This is very inconvenient and uncomfortable.

This is a huge task and we have no one to work on it. That means it is more or less dead in the water.
Comment 101 Richard Jones 2024-10-10 07:35:01 UTC
(In reply to Michał Dec from comment #99)
> I've just hit this bug and I can't properly work with AVX-512 instruction. I
> have to scp my entire projects to a machine that doesn't use AVX-512 at
> glibc level. This is very inconvenient and uncomfortable.

You can compile with -mno-avx512f
Comment 102 Michał Dec 2024-10-11 04:31:09 UTC
> You can compile with -mno-avx512f

Would a minimum viable solution be to apply this only to glibc, and then reboot so that the whole system loads up the new glibc?
Comment 103 Mark Wielaard 2024-10-11 14:46:11 UTC
(In reply to Michał Dec from comment #102)
> > You can compile with -mno-avx512f
> 
> Would a minimum viable solution be to apply this only to glibc, and then
> reboot so that the whole system loads up the new glibc?

Normally glibc uses ifuncs which check whether avx512 is available. Since valgrind doesn't advertise avx512 being available glibc should normally not use it. Do you know why it does? How is your glibc build? What error do you see under valgrind?
Comment 104 Petr 2024-10-11 14:51:47 UTC
If you use a distribution such as Gentoo with AVX-512 enabled globally you basically cannot use valgrind as it doesn't matter what valgrind advertises - all the code is already using AVX-512. And this of course also happens when you use `-march=naive`, etc...

Still not sure why AVX-512 is so big deal. I mean is it the decoding or supporting wider operations? It's like extending AVX2 to use wider vectors and masking - that would cover maybe 95% of AVX-512.
Comment 105 Petr 2024-10-11 15:09:41 UTC
BTW can anyone clarify the following:

  - What happened to the work Tanya did / what happened to Tanya?
  - If Intel wanted to sponsor AVX-512 support and the one who was working on it lost interest or cannot do it anymore (Russia?); is Intel still willing to sponsor somebody else who would be able to do it?
  - If Intel is not interested anymore, is there any other company (or multiple companies) willing to sponsor the AVX-512 support?

I don't think this is about manpower, it's more about sponsoring the effort as AVX-512 is pretty comprehensive ISA.

What I personally like on valgrind is that it runs unmodified binaries - it can even run JIT generated code, which is amazing for my own use cases, but the lack of AVX-512 support is getting in the way more and more as a lot of existing software adds support for AVX-512. You can say "compile without AVX-512", but that means you cannot test AVX-512 code, which means valgrind cannot be used to find problems in software that supports AVX-512.
Comment 106 tanya_mineeva@bk.ru 2024-10-11 15:36:12 UTC
Hi, this is Tanya.

> What happened to the work Tanya did?
The patches I did are attached to the bug report, but they must be really outdated now.

> What happened to Tanya?
I could no longer contribute (Russia, you guessed it). I've relocated, but have also changed the project, unfortunately.

> If Intel wanted to sponsor AVX-512 support and the one who was working on it lost interest or cannot do it anymore (Russia?); is Intel still willing to sponsor somebody else who would be able to do it?
I'll try to reach out to the people who were previously interested in it, but so much have changed, I'm not sure I'll find the right people.
Comment 107 Michał Dec 2024-10-11 16:30:20 UTC
(In reply to Mark Wielaard from comment #103)
> Normally glibc uses ifuncs which check whether avx512 is available. Since
> valgrind doesn't advertise avx512 being available glibc should normally not
> use it. Do you know why it does? How is your glibc build? What error do you
> see under valgrind?

> Do you know why it does?

I don't know, that's why I'm asking. From what I understand in my next answer, I should disable multiarch and recompile glibc without AVX-512. But I don't wanna reboot for nothing. I have an AM5 system and I really hate rebooting it, it takes forever.

> How is your glibc build?

At the time of writing, this is the recipe for building the version of glibc considered stable on amd64: https://gitweb.gentoo.org/repo/gentoo.git/tree/sys-libs/glibc/glibc-2.39-r6.ebuild?id=9adf7dbf124f3810b77c28a04f7dd7995d238854
The flags I have enabled in my build are pretty common ones: cet multiarch ssp stack-realign

> What error do you see under valgrind?

```
vex amd64->IR: unhandled instruction bytes: 0x62 0xD2 0xFD 0x28 0x7C 0xC0 0x48 0x89 0xC6 0x31
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
==28914== valgrind: Unrecognised instruction at address 0x484f0cf.
==28914==    at 0x484F0CF: memset (vg_replace_strmem.c:1390)
==28914==    by 0x4A47F71: memset (string_fortified.h:59)
==28914==    by 0x4A47F71: CRYPTO_zalloc (mem.c:224)
==28914==    by 0x4A5A220: CRYPTO_THREAD_lock_new (threads_pthread.c:684)
==28914==    by 0x4B330FC: do_rand_init (rand_lib.c:51)
==28914==    by 0x4B330FC: do_rand_init_ossl_ (rand_lib.c:48)
==28914==    by 0x4DE340F: __pthread_once_slow (pthread_once.c:116)
==28914==    by 0x4DE3526: pthread_once@@GLIBC_2.34 (pthread_once.c:143)
==28914==    by 0x4A5A31C: CRYPTO_THREAD_run_once (threads_pthread.c:786)
==28914==    by 0x4B33B4C: RAND_get_rand_method (rand_lib.c:190)
==28914==    by 0x4B347DC: RAND_bytes_ex (rand_lib.c:368)
==28914==    by 0x10996B: (a function that calls RAND_bytes) (src.c:191)
==28914==    by 0x109427: main (src.c:516)
```

I've ran this program through valgrind on a machine that doesn't have AVX-512, and it runs fine. No leaks, not even a single complaint about uninitialized variables being read.
Comment 108 Paul Floyd 2024-10-12 05:06:58 UTC
(In reply to tanya_mineeva@bk.ru from comment #106)
> Hi, this is Tanya.
> 
> > What happened to the work Tanya did?
> The patches I did are attached to the bug report, but they must be really
> outdated now.
> 
> > What happened to Tanya?
> I could no longer contribute (Russia, you guessed it). I've relocated, but
> have also changed the project, unfortunately.
> 
> > If Intel wanted to sponsor AVX-512 support and the one who was working on it lost interest or cannot do it anymore (Russia?); is Intel still willing to sponsor somebody else who would be able to do it?
> I'll try to reach out to the people who were previously interested in it,
> but so much have changed, I'm not sure I'll find the right people.

Good to hear from you Tanya. It would be great if we could get some help from Intel again, though I understand Intel is having a tough time at the moment.
Comment 109 Paul Floyd 2024-10-12 05:07:37 UTC
(In reply to tanya_mineeva@bk.ru from comment #106)
> Hi, this is Tanya.
> 
> > What happened to the work Tanya did?
> The patches I did are attached to the bug report, but they must be really
> outdated now.
> 
> > What happened to Tanya?
> I could no longer contribute (Russia, you guessed it). I've relocated, but
> have also changed the project, unfortunately.
> 
> > If Intel wanted to sponsor AVX-512 support and the one who was working on it lost interest or cannot do it anymore (Russia?); is Intel still willing to sponsor somebody else who would be able to do it?
> I'll try to reach out to the people who were previously interested in it,
> but so much have changed, I'm not sure I'll find the right people.

Good to hear from you Tanya. It would be great if we could get some help from Intel again, though I understand Intel is having a tough time at the moment.
Comment 110 Paul Floyd 2024-10-12 05:13:06 UTC
(In reply to Petr from comment #104)
> If you use a distribution such as Gentoo with AVX-512 enabled globally you
> basically cannot use valgrind as it doesn't matter what valgrind advertises
> - all the code is already using AVX-512. And this of course also happens
> when you use `-march=naive`, etc...
> 
> Still not sure why AVX-512 is so big deal. I mean is it the decoding or
> supporting wider operations? It's like extending AVX2 to use wider vectors
> and masking - that would cover maybe 95% of AVX-512.

It’s everything. VEX isn’t my strong point, by my guess is that this would be one of the biggest changes to Valgrind for the last 15 years.
Comment 111 Paul Floyd 2024-10-12 05:24:26 UTC
(In reply to Petr from comment #105)
> BTW can anyone clarify the following:
> 
>   - What happened to the work Tanya did / what happened to Tanya?
>   - If Intel wanted to sponsor AVX-512 support and the one who was working
> on it lost interest or cannot do it anymore (Russia?); is Intel still
> willing to sponsor somebody else who would be able to do it?
>   - If Intel is not interested anymore, is there any other company (or
> multiple companies) willing to sponsor the AVX-512 support?
> 
> I don't think this is about manpower, it's more about sponsoring the effort
> as AVX-512 is pretty comprehensive ISA.

There are two things that would be required:
1. The actual development.
2. Getting it reviewed and accepted.

To complicate matters there are also arm and riscv vector extensions. I don’t want to see 3 incompatible implementations being added.
Comment 112 Olivier Huber 2024-10-14 18:28:52 UTC
(In reply to Michał Dec from comment #107)
> (In reply to Mark Wielaard from comment #103)
> > Normally glibc uses ifuncs which check whether avx512 is available. Since
> > valgrind doesn't advertise avx512 being available glibc should normally not
> > use it. Do you know why it does? How is your glibc build? What error do you
> > see under valgrind?
> 
> > Do you know why it does?
> 
> I don't know, that's why I'm asking. From what I understand in my next
> answer, I should disable multiarch and recompile glibc without AVX-512. But
> I don't wanna reboot for nothing. I have an AM5 system and I really hate
> rebooting it, it takes forever.
> 
> > How is your glibc build?
> 
> At the time of writing, this is the recipe for building the version of glibc
> considered stable on amd64:
> https://gitweb.gentoo.org/repo/gentoo.git/tree/sys-libs/glibc/glibc-2.39-r6.
> ebuild?id=9adf7dbf124f3810b77c28a04f7dd7995d238854
> The flags I have enabled in my build are pretty common ones: cet multiarch
> ssp stack-realign
> 
> > What error do you see under valgrind?
> 
> ```
> vex amd64->IR: unhandled instruction bytes: 0x62 0xD2 0xFD 0x28 0x7C 0xC0
> 0x48 0x89 0xC6 0x31
> vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
> vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
> vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
> ==28914== valgrind: Unrecognised instruction at address 0x484f0cf.
> ==28914==    at 0x484F0CF: memset (vg_replace_strmem.c:1390)
> ==28914==    by 0x4A47F71: memset (string_fortified.h:59)
> ==28914==    by 0x4A47F71: CRYPTO_zalloc (mem.c:224)
> ==28914==    by 0x4A5A220: CRYPTO_THREAD_lock_new (threads_pthread.c:684)
> ==28914==    by 0x4B330FC: do_rand_init (rand_lib.c:51)
> ==28914==    by 0x4B330FC: do_rand_init_ossl_ (rand_lib.c:48)
> ==28914==    by 0x4DE340F: __pthread_once_slow (pthread_once.c:116)
> ==28914==    by 0x4DE3526: pthread_once@@GLIBC_2.34 (pthread_once.c:143)
> ==28914==    by 0x4A5A31C: CRYPTO_THREAD_run_once (threads_pthread.c:786)
> ==28914==    by 0x4B33B4C: RAND_get_rand_method (rand_lib.c:190)
> ==28914==    by 0x4B347DC: RAND_bytes_ex (rand_lib.c:368)
> ==28914==    by 0x10996B: (a function that calls RAND_bytes) (src.c:191)
> ==28914==    by 0x109427: main (src.c:516)
>```

I am also on Gentoo and I have the same issue since I got my machine with an AMD Ryzen 7 7840U.

I didn't really want to rebuild glibc with AVX-512 disabled just for valgrind. I tried disabling AVX512 support by using `glibc.cpu.hwcaps`, to no avail.

The error I get is
===============================================================================
$ valgrind ls
==28786== Memcheck, a memory error detector
==28786== Copyright (C) 2002-2024, and GNU GPL'd, by Julian Seward et al.
==28786== Using Valgrind-3.23.1.GIT and LibVEX; rerun with -h for copyright info
==28786== Command: ls
==28786== 
vex amd64->IR: unhandled instruction bytes: 0x62 0xF1 0x7F 0x48 0x7F 0x84 0x24 0x30 0x0 0x0
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
==28786== valgrind: Unrecognised instruction at address 0x401d9cb.
==28786==    at 0x401D9CB: _dl_start (rtld.c:563)
==28786==    by 0x401C987: ??? (in /lib64/ld-linux-x86-64.so.2)
==28786== Your program just tried to execute an instruction that Valgrind
==28786== did not recognise.  There are two possible reasons for this.
==28786== 1. Your program has a bug and erroneously jumped to a non-code
==28786==    location.  If you are running Memcheck and you just saw a
==28786==    warning about a bad jump, it's probably your program's fault.
==28786== 2. The instruction is legitimate but Valgrind doesn't handle it,
==28786==    i.e. it's Valgrind's fault.  If you think this is the case or
==28786==    you are not sure, please let us know and we'll try to fix it.
==28786== Either way, Valgrind will now raise a SIGILL signal which will
==28786== probably kill your program.
==28786== 
==28786== Process terminating with default action of signal 4 (SIGILL)
==28786==  Illegal opcode at address 0x401D9CB
==28786==    at 0x401D9CB: _dl_start (rtld.c:563)
==28786==    by 0x401C987: ??? (in /lib64/ld-linux-x86-64.so.2)
==28786== 
==28786== HEAP SUMMARY:
==28786==     in use at exit: 0 bytes in 0 blocks
==28786==   total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==28786== 
==28786== All heap blocks were freed -- no leaks are possible
==28786== 
==28786== For lists of detected and suppressed errors, rerun with: -s
==28786== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Illegal instruction
===============================================================================

It's unclear that there is a way to not execute the instruction. The line reported by valgrind is https://github.com/bminor/glibc/blob/5c06c6e0b5078ffb0aa0c09bac79f086145e0897/elf/rtld.c#L563

Thanks for everyone working on this. Is it possible to develop a workaround to be able to execute valgrind on a system where the glibc code contains AVX512 instructions? It's easy to compile the code begin developed without AVX512, it's different to recompile the whole system just for the sake of using valgrind.
Comment 113 Paul Floyd 2024-10-14 19:30:41 UTC
(In reply to Olivier Huber from comment #112)

> It's unclear that there is a way to not execute the instruction. The line
> reported by valgrind is

The AVX512 instructions fall into this category

==28786== 2. The instruction is legitimate but Valgrind doesn't handle it,
==28786==    i.e. it's Valgrind's fault.  If you think this is the case or
==28786==    you are not sure, please let us know and we'll try to fix it.
==28786== Either way, Valgrind will now raise a SIGILL signal which will
==28786== probably kill your program.

"probably" here means "almost certainly".

To survive a SIGILL your application or library would have to do something like
1. call setjump to prepare for recovery
2. install a SIGILL handler
3. run the code that hits SIGILL
4. in the handler set some flag so different code paths will get used
5. in the handler longjump back to the recovery point.

I have only seen code like that once.
Comment 114 Thiago Macieira 2024-10-29 14:41:13 UTC
(In reply to tanya_mineeva@bk.ru from comment #106)
> > If Intel wanted to sponsor AVX-512 support and the one who was working on it lost interest or cannot do it anymore (Russia?); is Intel still willing to sponsor somebody else who would be able to do it?
> I'll try to reach out to the people who were previously interested in it,
> but so much have changed, I'm not sure I'll find the right people.

There's interest, but I don't know what level of investment there is. I'll try to reach out too (I work for Intel and I am affected by this too)
Comment 115 Thiago Macieira 2024-10-29 14:43:43 UTC
(In reply to Paul Floyd from comment #110)
> It’s everything. VEX isn’t my strong point, by my guess is that this would
> be one of the biggest changes to Valgrind for the last 15 years.

To be clear, it's the EVEX prefix that is the problem. Valgrind has supported VEX for a while. But since it hasn't supported EVEX, it has also not added support for FP16, which is a massive addition to the instruction set from two years ago.

And then things are going to get worse with the VEX2 prefix for APX in a couple of years. That is going to affect everything, not just the vector code.
Comment 116 Paul Floyd 2024-10-29 15:57:03 UTC
(In reply to Thiago Macieira from comment #115)
> (In reply to Paul Floyd from comment #110)
> > It’s everything. VEX isn’t my strong point, by my guess is that this would
> > be one of the biggest changes to Valgrind for the last 15 years.
> 
> To be clear, it's the EVEX prefix that is the problem. Valgrind has
> supported VEX for a while. But since it hasn't supported EVEX, it has also
> not added support for FP16, which is a massive addition to the instruction
> set from two years ago.
> 
> And then things are going to get worse with the VEX2 prefix for APX in a
> couple of years. That is going to affect everything, not just the vector
> code.

Thiago, I really appreciate any help that you can offer.

But PLEASE don't try to explain things like this to me.

When I said VEX I meant VEX. That's this one here
https://sourceware.org/git/?p=valgrind.git;a=tree;f=VEX;h=e080b4ada344869c7446d9bca8ab2aac832c2aaa;hb=HEAD
Possibly "Valgrind Emulation", but not "Vector Extensions". It's the largest part of Valgrind that does all of the CPU emulation.
Comment 117 Thiago Macieira 2024-10-29 16:40:51 UTC
I stand corrected. Thanks for pointing out.