Bug 485148 - vfmadd213ss instruction is instrumented incorrectly (the remaining part of the register is cleared instead of kept unmodified)
Summary: vfmadd213ss instruction is instrumented incorrectly (the remaining part of th...
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: general (show other bugs)
Version: 3.23 GIT
Platform: Other Linux
: NOR normal
Target Milestone: ---
Assignee: Julian Seward
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-04-06 19:23 UTC by Petr
Modified: 2024-04-14 19:56 UTC (History)
2 users (show)

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:


Attachments
Patch which remove the high lane explicitly to zero. (1.29 KB, patch)
2024-04-10 15:10 UTC, Bruno Lathuilière
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Petr 2024-04-06 19:23:34 UTC
I have found the following problem when instrumenting vfmadd213ss instruction.

Assembly:

```
.section .text {#0}
L1:                                         ; L1: void Func(uint64@rdi dstPtr, uint64@rsi src1Ptr, uint64@rdx src2Ptr, uint64@rcx src3Ptr)
vmovups xmm2, xmmword ptr [rsi]             ; vmovups src1Vec, xmmword ptr [src1Ptr]
vmovups xmm3, xmmword ptr [rdx]             ; vmovups src2Vec, xmmword ptr [src2Ptr]
vmovups xmm1, xmmword ptr [rcx]             ; vmovups src3Vec, xmmword ptr [src3Ptr]
vmovdqa xmm0, xmm2                          ; vmovdqa dstVec, src1Vec
vfmadd213ss xmm0, xmm3, xmm1                ; vfmadd213ss dstVec, src2Vec, src3Vec
vmovups xmmword ptr [rdi], xmm0             ; vmovups xmmword ptr [dstPtr], dstVec
vzeroupper
ret
```

This is actually JIT generated, but the intention is to test the generation of VFMADD213SS instruction, which is specified to store the scalar result in DST[31:0] and to keep the rest of the register unmodified. However, valgrind modifies the rest of the register.

The following test case fails:

```
  REASON: Operation 'v_madd_f32s' (variation 0) failed
      Input #1: {0.00000000000000000000, 0.00000000000000000000, 1274722.87500000000000000000, 4955372.50000000000000000000}
      Input #2: {5325978.00000000000000000000, 1.00000000000000000000, 0.00000000000000000000, 8228577.50000000000000000000}
      Input #3: {251207.64062500000000000000, 0.00000999999974737875, 2.00000000000000000000, -inf}

      // This is the expected result (the first source is the destination, the 3 remaining elements must be unmodified):
      Expected: {251207.64062500000000000000, 0.00000000000000000000, 1274722.87500000000000000000, 4955372.50000000000000000000}

      // This is the result of the instrumentation:
      Observed: {251207.64062500000000000000, 0.00000000000000000000, 0.00000000000000000000, 0.00000000000000000000}
```

I believe this is a valgrind bug as this test case only fails when running with valgrind - without valgrind the observed value matches the expected one.
Comment 1 Petr 2024-04-06 19:36:39 UTC
I can repro this on the latest master checkout as well: valgrind-3.23.0.GIT
Comment 2 Paul Floyd 2024-04-07 05:26:21 UTC
Can you share a testcase?
Comment 3 Petr 2024-04-07 07:43:54 UTC
Yeah, I'm not sorry for providing a C++ code.

Here is a very simple repro:

```
    #include <x86intrin.h>
    #include <stdio.h>

    static __attribute__((noinline)) void test_fma_ss(float dst[4], const float a[4], const float b[4], const float c[4]) {
      __m128 av = _mm_loadu_ps(a);
      __m128 bv = _mm_loadu_ps(b);
      __m128 cv = _mm_loadu_ps(c);

      __m128 dv = _mm_fmadd_ss(av, bv, cv);
      _mm_storeu_ps(dst, dv);
    }

    int main() {
      float a[4] = { 1, 2, 3, 4 };
      float b[4] = { 3, 11, 35, 1 };
      float c[4] = { -1, -2, -19, 0 };

      float dst[4];
      test_fma_ss(dst, a, b, c);

      printf("[%f %f %f %f]\n", dst[0], dst[1], dst[2], dst[3]);
      return 0;
    }
```

Which can be compiled as

    gcc valgrind_bug.cpp -O2 -mfma -mavx2 -o valgrind_bug

When executed without valgrind:
    ./valgrind_bug

It prints:

    [2.000000 2.000000 3.000000 4.000000]

When executed with valgrind:
    valgrind ./valgrind_bug

In prints:

==109113== Memcheck, a memory error detector
==109113== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==109113== Using Valgrind-3.23.0.GIT and LibVEX; rerun with -h for copyright info
==109113== Command: ./valgrind_bug
==109113== 
[2.000000 0.000000 0.000000 0.000000]
==109113== 
==109113== HEAP SUMMARY:
==109113==     in use at exit: 0 bytes in 0 blocks
==109113==   total heap usage: 1 allocs, 1 frees, 1,024 bytes allocated
==109113== 
==109113== All heap blocks were freed -- no leaks are possible
==109113== 
==109113== For lists of detected and suppressed errors, rerun with: -s
==109113== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

The result is different - only the first value is correct, the remaining 3 values were zeroed.
Comment 4 Petr 2024-04-07 07:44:38 UTC
LOL "I'm not sorry" -> "I'm sorry" - haha that looked funny :)
Comment 5 Paul Floyd 2024-04-07 09:45:27 UTC
If anything, I prefer C++.

For reference, here it is in godbolt https://godbolt.org/z/bM5brqWM6

And here is what I see in gdb with vgdb before and after single stepping the vfmadd132ss instruction

(gdb) p	$xmm0
$1 = {v4_float = {1, 2, 3, 4}, v2_double = {2.0000004731118679, 512.00012254714966}, v16_int8 = {0, 0, -128, 63, 0, 0, 0, 64, 0, 0, 64, 64, 0, 0, -128, 64}, v8_int16 = {0, 16256, 0, 16384, 0, 16448, 0, 16512}, v4_int32 = {1065353216, 1073741824,
    1077936128, 1082130432}, v2_int64 = {4611686019492741120, 4647714816524288000}, uint128 = 85735205748011485687928662073142149120}
(gdb) p $xmm1
$2 = {v4_float = {-1, -2, -19, 0}, v2_double = {-2.0000014267861843, 1.6047075103796218e-314}, v16_int8 = {0, 0, -128, -65, 0, 0, 0, -64, 0, 0, -104, -63, 0, 0, 0, 0}, v8_int16 = {0, -16512, 0, -16384, 0, -15976, 0, 0}, v4_int32 = {-1082130432,
    -1073741824, -1047003136, 0}, v2_int64 = {-4611686015214551040, 3247964160}, uint128 = 59914363633936079956933083136}
(gdb) p $xmm2
$3 = {v4_float = {3, 11, 35, 1}, v2_double = {1048576.2509765625, 0.0078125019222170522}, v16_int8 = {0, 0, 64, 64, 0, 0, 48, 65, 0, 0, 12, 66, 0, 0, -128, 63}, v8_int16 = {0, 16448, 0, 16688, 0, 16908, 0, 16256}, v4_int32 = {1077936128, 1093664768,
    1108082688, 1065353216}, v2_int64 = {4697254412425363456, 4575657222516506624}, uint128 = 84405977752782675692133152826159267840}
(gdb) si
(gdb) p $xmm0
$4 = {v4_float = {2, 0, 0, 0}, v2_double = {5.3049894774131808e-315, 0}, v16_int8 = {0, 0, 0, 64, 0 <repeats 12 times>}, v8_int16 = {0, 16384, 0, 0, 0, 0, 0, 0}, v4_int32 = {1073741824, 0, 0, 0}, v2_int64 = {1073741824, 0}, uint128 = 1073741824}
Comment 6 Petr 2024-04-07 10:10:36 UTC
I'm not sure I follow your last comment.

The best would be to just compile the code locally and to run it with and without valgrind - the results will be different, and  this is what I'm trying to point out. The instrumentation is incorrect in this case and it happens only with FMA - for example vpaddss is instrumented correctly (the remaining 3 elements are not zeroed).

I'm not sure following vgdb makes sense in this case if you cannot step over a single instruction in disassembly (if it's instrumented wrong you would have wrong results). You can step over in gdb without valgrind to have a comparison.
Comment 7 Paul Floyd 2024-04-07 10:13:17 UTC
I wanted to make sure that it is only a problem with vfmadd132ss .
Comment 8 Petr 2024-04-07 11:00:03 UTC
(In reply to Paul Floyd from comment #7)
> I wanted to make sure that it is only a problem with vfmadd132ss .

I didn't test others as this was the first test failing, but I would assume that the whole family of FMA SS/SD instructions would be affected.
Comment 9 Bruno Lathuilière 2024-04-10 15:10:19 UTC
Created attachment 168346 [details]
Patch which remove the high lane explicitly to zero.

It should solve the problem. I hope there are no bad side effects. As I did not see this bug with verrou, it mean I have to modify the verrou test suite.
Comment 10 Petr 2024-04-10 17:34:58 UTC
BTW are you sure about the patch?

Basically what SS operations do:

  DST[31..0] = result
  DST[127..32] = unchanged
  DST[vLen..128] = 0

So the high part of the YMM register should be zeroed, but the zeroing seems to be commented out. Similarly, SD operation would preserve the high part of XMM register, but would zero the high part of YMM register.
Comment 11 Paul Floyd 2024-04-10 18:48:56 UTC
Maybe this then

--- a/VEX/priv/guest_amd64_toIR.c
+++ b/VEX/priv/guest_amd64_toIR.c
@@ -27987,10 +27987,9 @@ static Long dis_FMA ( const VexAbiInfo* vbi, Prefix pfx, Long delta, UChar opc )
    for (i = 0; i < count; i++) {
       putYMMRegLane(rG, i, res[i]);
    }
-
    switch (vty) {
-      case Ity_F32:  putYMMRegLane32(rG, 1, mkU32(0)); /*fallthru*/
-      case Ity_F64:  putYMMRegLane64(rG, 1, mkU64(0)); /*fallthru*/
+      case Ity_F32:
+      case Ity_F64:
       case Ity_V128: putYMMRegLane128(rG, 1, mkV128(0)); /*fallthru*/
       case Ity_V256: break;
       default: vassert(0);
Comment 12 Petr 2024-04-10 21:27:30 UTC
That looks much better.

I will not be able to test it until Sunday I think, so I'm not able to confirm that this fixes all the issues, but I will test on Sunday with this patch.
Comment 13 Bruno Lathuilière 2024-04-12 17:07:13 UTC
I added in verrou ( branch bl/fmaAvxSse_llo ) test cases  for fma low lane only  (float and double) with SSE and AVX  registers (Thanks the Petr remark). For AVX, I was not able to write it with intrinsics so I use asm. If someone know how to do it, it could be slightly cleaner.
I test only the 213 version.

With these tests, I get failures without patch and with my patch. The patch of Paul Floyd works well. 

The bad news is my test cases can not be integrated directly in "none" test as I do not have the same test infrastructure.
Comment 14 Petr 2024-04-14 13:49:07 UTC
I can confirm the latest patch from Paul Floyd works for me. I have a test suite that should verify ALL cases and the tests pass with this patch.
Comment 15 Paul Floyd 2024-04-14 15:12:30 UTC
Can you share any of those tests?
Comment 16 Paul Floyd 2024-04-14 19:37:31 UTC
commit 219998aeb64e28391861006e9eb1ea4b96d69083 (HEAD -> master, origin/users/paulf/try-bug485148, origin/master, origin/HEAD, bug485148)
Author: Paul Floyd <pjfloyd@wanadoo.fr>
Date:   Sun Apr 14 17:59:30 2024 +0200

    Bug 485148 - vfmadd213ss instruction is instrumented incorrectly (the remaining part of the register is cleared instead of kept unmodified)
    
    Initial version contributed by Bruno Lathuili<C3><A8>re <bruno.lathuiliere@edf.fr>
    Initial test contributed by Petr <kobalicek.petr@gmail.com>
Comment 17 Petr 2024-04-14 19:56:45 UTC
Yeah, I have discovered this bug originally by running this test suite:

  - https://github.com/blend2d/blend2d/blob/aarch64_jit/src/blend2d/pipeline/jit/pipecompiler_test.cpp

However, this link will be dead once the aarch64_jit branch is merged into master.

One nice property of the test suite is that it should be able to test all FMA instruction variations.

My only remaining wish now is a proper AVX-512 support :)

Thanks a lot for merging the patch.