Created attachment 165702 [details] fma patch Fused multiply-add is widely used by various deep learning related applications. This patch speeds up the runtime of such applications multiple times.
A similar suggestion came up in the Valgrind developers list recently. Posted by Bruno Lathuliere, developer of the verou tool for rounding error analysis. I haven't looked in detail at either. This looks a bit more portable as it doesn't use intrinsics.
Since my post on valgrind mailing list, I modify my patch to take into account the remarks of Tom Hugues. The pros of my patch ( https://github.com/edf-hpc/verrou/blob/master/valgrind.fma_amd64.diff ) are : - should take into account fma3 and fma4 (not tested on fma4 : as I do not have access to these architectures) - speed up F32 and F64 FMA The pros of this patch : - It remove the dirty call (it should be faster) I don't know if it is easy to have the better of both world.
Created attachment 167413 [details] Improve the previous patch to take into account double.
I was able to test my patch https://github.com/edf-hpc/verrou/blob/master/valgrind.fma_amd64.diff with fma4 hardware : it works. I also improve the patch of "notasas" to take into account double precision (cf attachment) and it is faster than the previous patch. It should not be difficult to merge both: if we have a chance to integrate it, I can do it.
Should we also support F16? Does this also work with the other permutatons 132 and 231? Lastly, do packed and scalar make any difference? This will need a regression test as well.
(In reply to Paul Floyd from comment #5) > Should we also support F16? No, there are no Iop_MAddF16 or IopMSubF16. And to my knowledge AVX512 is the only way to generate half floating-point operations. And AVX512 is not yet supported by valgrind. > Does this also work with the other permutations 132 and 231? The patch concern only the backend so we can choose the permutation. The frontend has to treat correctly all the permutations, this already tested by none/tests/fma.c. > Lastly, do packed and scalar make any difference? In the amd64 frontend the vectorized fma are unvectorized. So in the backend we do not have to treat packed version of fma. I think it would be nice to introduce new IOp in VEX : Iop_MAdd_F64x2 > This will need a regression test as well.
Sorry I did'nt finish my message (bad keyboard shorkey ...) I think we will need new IR to take into account vectorized fma : Iop_MAdd_F64x2, Iop_MAdd_F64x4, Iop_MAdd_F32x4, Iop_MAdd_F32x8, Iop_MSub_F64x2, Iop_MSub_F64x4, Iop_MSub_F32x4, Iop_MSub_F32x8 But this is outside the scope of this patch. > > This will need a regression test as well. For the proposed patch, we do not need new tests as we modify only the backend. The test fma.c and fma4.c provides already a good coverage. Do you see something missing? In term of tests, it could be interesting to test the patches on various amd64 architecture. Do you have continuous integration with several architectures?
The merge between the two last version is done: https://raw.githubusercontent.com/edf-hpc/verrou/bl/test_merge_fma/valgrind.fma_amd64.diff
On a distro compiled with gcc -march=x86-64-v3 this patch fixes: bug #463463 none/tests/amd64/fma bug #463458 memcheck/tests/vcpu_fnfns
Created attachment 168440 [details] valgrind.fma_arm64.diff This is the variant of the patch that I tested. It looks good to me. How should we credit this when applying?
For the attribution I would just put Patch contributed by Gražvydas Ignotas <notasas@gmail.com> and Bruno Lathuilière <bruno.lathuiliere@edf.fr> (with the right accents if possible)
commit a5693c1203c3a26443af13182a8082c2e9152f6c Author: Mark Wielaard <mark@klomp.org> Date: Sat Apr 13 14:33:19 2024 +0200 amd64: Implement VFMADD213 for Iop_MAddF32 and Iop_MAddF64 Speed up F32 and F64 FMA on amd64. Add priv/host_amd64_maddf.c implementing h_amd64_calc_MAddF32_fma4 and h_amd64_calc_MAddF64_fma4 to be used instead of the generic variants h_generic_calc_MAddF32 and h_generic_calc_MAddF64 when host has VEX_HWCAPS_AMD64_FMA4. Add fma3 and fma4 detection m_machine.c (machine_get_hwcaps). This patch also fixes the memcheck/tests/vcpu_fnfns and none/tests/amd64/fma testcases when run on a x86-64-v3 system. Patch contributed by Grazvydas Ignotas <notasas@gmail.com> and Bruno Lathuilière <bruno.lathuiliere@edf.fr> https://bugs.kde.org/show_bug.cgi?id=481127 https://bugs.kde.org/show_bug.cgi?id=463463 https://bugs.kde.org/show_bug.cgi?id=463458