I have found the following problem when instrumenting vfmadd213ss instruction. Assembly: ``` .section .text {#0} L1: ; L1: void Func(uint64@rdi dstPtr, uint64@rsi src1Ptr, uint64@rdx src2Ptr, uint64@rcx src3Ptr) vmovups xmm2, xmmword ptr [rsi] ; vmovups src1Vec, xmmword ptr [src1Ptr] vmovups xmm3, xmmword ptr [rdx] ; vmovups src2Vec, xmmword ptr [src2Ptr] vmovups xmm1, xmmword ptr [rcx] ; vmovups src3Vec, xmmword ptr [src3Ptr] vmovdqa xmm0, xmm2 ; vmovdqa dstVec, src1Vec vfmadd213ss xmm0, xmm3, xmm1 ; vfmadd213ss dstVec, src2Vec, src3Vec vmovups xmmword ptr [rdi], xmm0 ; vmovups xmmword ptr [dstPtr], dstVec vzeroupper ret ``` This is actually JIT generated, but the intention is to test the generation of VFMADD213SS instruction, which is specified to store the scalar result in DST[31:0] and to keep the rest of the register unmodified. However, valgrind modifies the rest of the register. The following test case fails: ``` REASON: Operation 'v_madd_f32s' (variation 0) failed Input #1: {0.00000000000000000000, 0.00000000000000000000, 1274722.87500000000000000000, 4955372.50000000000000000000} Input #2: {5325978.00000000000000000000, 1.00000000000000000000, 0.00000000000000000000, 8228577.50000000000000000000} Input #3: {251207.64062500000000000000, 0.00000999999974737875, 2.00000000000000000000, -inf} // This is the expected result (the first source is the destination, the 3 remaining elements must be unmodified): Expected: {251207.64062500000000000000, 0.00000000000000000000, 1274722.87500000000000000000, 4955372.50000000000000000000} // This is the result of the instrumentation: Observed: {251207.64062500000000000000, 0.00000000000000000000, 0.00000000000000000000, 0.00000000000000000000} ``` I believe this is a valgrind bug as this test case only fails when running with valgrind - without valgrind the observed value matches the expected one.
I can repro this on the latest master checkout as well: valgrind-3.23.0.GIT
Can you share a testcase?
Yeah, I'm not sorry for providing a C++ code. Here is a very simple repro: ``` #include <x86intrin.h> #include <stdio.h> static __attribute__((noinline)) void test_fma_ss(float dst[4], const float a[4], const float b[4], const float c[4]) { __m128 av = _mm_loadu_ps(a); __m128 bv = _mm_loadu_ps(b); __m128 cv = _mm_loadu_ps(c); __m128 dv = _mm_fmadd_ss(av, bv, cv); _mm_storeu_ps(dst, dv); } int main() { float a[4] = { 1, 2, 3, 4 }; float b[4] = { 3, 11, 35, 1 }; float c[4] = { -1, -2, -19, 0 }; float dst[4]; test_fma_ss(dst, a, b, c); printf("[%f %f %f %f]\n", dst[0], dst[1], dst[2], dst[3]); return 0; } ``` Which can be compiled as gcc valgrind_bug.cpp -O2 -mfma -mavx2 -o valgrind_bug When executed without valgrind: ./valgrind_bug It prints: [2.000000 2.000000 3.000000 4.000000] When executed with valgrind: valgrind ./valgrind_bug In prints: ==109113== Memcheck, a memory error detector ==109113== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al. ==109113== Using Valgrind-3.23.0.GIT and LibVEX; rerun with -h for copyright info ==109113== Command: ./valgrind_bug ==109113== [2.000000 0.000000 0.000000 0.000000] ==109113== ==109113== HEAP SUMMARY: ==109113== in use at exit: 0 bytes in 0 blocks ==109113== total heap usage: 1 allocs, 1 frees, 1,024 bytes allocated ==109113== ==109113== All heap blocks were freed -- no leaks are possible ==109113== ==109113== For lists of detected and suppressed errors, rerun with: -s ==109113== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0) The result is different - only the first value is correct, the remaining 3 values were zeroed.
LOL "I'm not sorry" -> "I'm sorry" - haha that looked funny :)
If anything, I prefer C++. For reference, here it is in godbolt https://godbolt.org/z/bM5brqWM6 And here is what I see in gdb with vgdb before and after single stepping the vfmadd132ss instruction (gdb) p $xmm0 $1 = {v4_float = {1, 2, 3, 4}, v2_double = {2.0000004731118679, 512.00012254714966}, v16_int8 = {0, 0, -128, 63, 0, 0, 0, 64, 0, 0, 64, 64, 0, 0, -128, 64}, v8_int16 = {0, 16256, 0, 16384, 0, 16448, 0, 16512}, v4_int32 = {1065353216, 1073741824, 1077936128, 1082130432}, v2_int64 = {4611686019492741120, 4647714816524288000}, uint128 = 85735205748011485687928662073142149120} (gdb) p $xmm1 $2 = {v4_float = {-1, -2, -19, 0}, v2_double = {-2.0000014267861843, 1.6047075103796218e-314}, v16_int8 = {0, 0, -128, -65, 0, 0, 0, -64, 0, 0, -104, -63, 0, 0, 0, 0}, v8_int16 = {0, -16512, 0, -16384, 0, -15976, 0, 0}, v4_int32 = {-1082130432, -1073741824, -1047003136, 0}, v2_int64 = {-4611686015214551040, 3247964160}, uint128 = 59914363633936079956933083136} (gdb) p $xmm2 $3 = {v4_float = {3, 11, 35, 1}, v2_double = {1048576.2509765625, 0.0078125019222170522}, v16_int8 = {0, 0, 64, 64, 0, 0, 48, 65, 0, 0, 12, 66, 0, 0, -128, 63}, v8_int16 = {0, 16448, 0, 16688, 0, 16908, 0, 16256}, v4_int32 = {1077936128, 1093664768, 1108082688, 1065353216}, v2_int64 = {4697254412425363456, 4575657222516506624}, uint128 = 84405977752782675692133152826159267840} (gdb) si (gdb) p $xmm0 $4 = {v4_float = {2, 0, 0, 0}, v2_double = {5.3049894774131808e-315, 0}, v16_int8 = {0, 0, 0, 64, 0 <repeats 12 times>}, v8_int16 = {0, 16384, 0, 0, 0, 0, 0, 0}, v4_int32 = {1073741824, 0, 0, 0}, v2_int64 = {1073741824, 0}, uint128 = 1073741824}
I'm not sure I follow your last comment. The best would be to just compile the code locally and to run it with and without valgrind - the results will be different, and this is what I'm trying to point out. The instrumentation is incorrect in this case and it happens only with FMA - for example vpaddss is instrumented correctly (the remaining 3 elements are not zeroed). I'm not sure following vgdb makes sense in this case if you cannot step over a single instruction in disassembly (if it's instrumented wrong you would have wrong results). You can step over in gdb without valgrind to have a comparison.
I wanted to make sure that it is only a problem with vfmadd132ss .
(In reply to Paul Floyd from comment #7) > I wanted to make sure that it is only a problem with vfmadd132ss . I didn't test others as this was the first test failing, but I would assume that the whole family of FMA SS/SD instructions would be affected.
Created attachment 168346 [details] Patch which remove the high lane explicitly to zero. It should solve the problem. I hope there are no bad side effects. As I did not see this bug with verrou, it mean I have to modify the verrou test suite.
BTW are you sure about the patch? Basically what SS operations do: DST[31..0] = result DST[127..32] = unchanged DST[vLen..128] = 0 So the high part of the YMM register should be zeroed, but the zeroing seems to be commented out. Similarly, SD operation would preserve the high part of XMM register, but would zero the high part of YMM register.
Maybe this then --- a/VEX/priv/guest_amd64_toIR.c +++ b/VEX/priv/guest_amd64_toIR.c @@ -27987,10 +27987,9 @@ static Long dis_FMA ( const VexAbiInfo* vbi, Prefix pfx, Long delta, UChar opc ) for (i = 0; i < count; i++) { putYMMRegLane(rG, i, res[i]); } - switch (vty) { - case Ity_F32: putYMMRegLane32(rG, 1, mkU32(0)); /*fallthru*/ - case Ity_F64: putYMMRegLane64(rG, 1, mkU64(0)); /*fallthru*/ + case Ity_F32: + case Ity_F64: case Ity_V128: putYMMRegLane128(rG, 1, mkV128(0)); /*fallthru*/ case Ity_V256: break; default: vassert(0);
That looks much better. I will not be able to test it until Sunday I think, so I'm not able to confirm that this fixes all the issues, but I will test on Sunday with this patch.
I added in verrou ( branch bl/fmaAvxSse_llo ) test cases for fma low lane only (float and double) with SSE and AVX registers (Thanks the Petr remark). For AVX, I was not able to write it with intrinsics so I use asm. If someone know how to do it, it could be slightly cleaner. I test only the 213 version. With these tests, I get failures without patch and with my patch. The patch of Paul Floyd works well. The bad news is my test cases can not be integrated directly in "none" test as I do not have the same test infrastructure.
I can confirm the latest patch from Paul Floyd works for me. I have a test suite that should verify ALL cases and the tests pass with this patch.
Can you share any of those tests?
commit 219998aeb64e28391861006e9eb1ea4b96d69083 (HEAD -> master, origin/users/paulf/try-bug485148, origin/master, origin/HEAD, bug485148) Author: Paul Floyd <pjfloyd@wanadoo.fr> Date: Sun Apr 14 17:59:30 2024 +0200 Bug 485148 - vfmadd213ss instruction is instrumented incorrectly (the remaining part of the register is cleared instead of kept unmodified) Initial version contributed by Bruno Lathuili<C3><A8>re <bruno.lathuiliere@edf.fr> Initial test contributed by Petr <kobalicek.petr@gmail.com>
Yeah, I have discovered this bug originally by running this test suite: - https://github.com/blend2d/blend2d/blob/aarch64_jit/src/blend2d/pipeline/jit/pipecompiler_test.cpp However, this link will be dead once the aarch64_jit branch is merged into master. One nice property of the test suite is that it should be able to test all FMA instruction variations. My only remaining wish now is a proper AVX-512 support :) Thanks a lot for merging the patch.