Bug 481127 - [PATCH] amd64: Implement VFMADD213 for Iop_MAddF32
Summary: [PATCH] amd64: Implement VFMADD213 for Iop_MAddF32
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: vex
Version First Reported In: unspecified
Platform: Other
OS: All
Priority: NOR  Severity: normal
Target Milestone: ---
Assignee: Julian Seward
Reported: 2024-02-09 17:07 UTC by notasas
Modified: 2025-05-09 12:16 UTC (History)
CC List: 3 users

Attachments
fma patch (10.89 KB, patch)
2024-02-09 17:07 UTC, notasas
Details
Improve the previous patch to take into account double. (13.09 KB, patch)
2024-03-18 15:45 UTC, Bruno Lathuilière
Details
valgrind.fma_arm64.diff (20.24 KB, text/plain)
2024-04-13 00:24 UTC, Mark Wielaard
Details

Description notasas 2024-02-09 17:07:42 UTC
Created attachment 165702 [details]
fma patch

Fused multiply-add is widely used by various deep-learning-related applications. This patch makes such applications run several times faster under Valgrind.
Comment 1 Paul Floyd 2024-02-09 17:38:22 UTC
A similar suggestion came up on the Valgrind developers list recently, posted by Bruno Lathuilière, developer of the verrou tool for rounding-error analysis.

I haven't looked in detail at either. This looks a bit more portable as it doesn't use intrinsics.
Comment 2 Bruno Lathuilière 2024-02-16 21:38:23 UTC
Since my post on the Valgrind mailing list, I have modified my patch to take into account the remarks of Tom Hughes.
The pros of my patch ( https://github.com/edf-hpc/verrou/blob/master/valgrind.fma_amd64.diff ) are:
- it should handle both FMA3 and FMA4 (not tested on FMA4, as I do not have access to that hardware)
- it speeds up both F32 and F64 FMA

The pros of this patch are:
- it removes the dirty call (so it should be faster)

I don't know whether it is easy to get the best of both worlds.
Comment 3 Bruno Lathuilière 2024-03-18 15:45:38 UTC
Created attachment 167413 [details]
Improve the previous patch to take into account double.
Comment 4 Bruno Lathuilière 2024-03-18 15:52:40 UTC
I was able to test my patch https://github.com/edf-hpc/verrou/blob/master/valgrind.fma_amd64.diff with FMA4 hardware: it works.
I have also improved the patch from "notasas" to handle double precision (cf. attachment), and it is faster than the previous patch.

It should not be difficult to merge the two; if there is a chance of it being integrated, I can do the merge.
Comment 5 Paul Floyd 2024-03-19 06:18:24 UTC
Should we also support F16?

Does this also work with the other permutations, 132 and 231?

Lastly, do packed and scalar make any difference?

This will need a regression test as well.
Comment 6 Bruno Lathuilière 2024-03-25 12:37:05 UTC
(In reply to Paul Floyd from comment #5)
> Should we also support F16?
No; there is no Iop_MAddF16 or Iop_MSubF16.
To my knowledge AVX-512 is the only way to generate half-precision floating-point operations, and AVX-512 is not yet supported by Valgrind.

> Does this also work with the other permutations 132 and 231?
The patch concerns only the backend, so we are free to choose the permutation. The frontend has to handle all the permutations correctly; this is already tested by none/tests/fma.c.
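
For reference, the three encodings differ only in which operand supplies each factor and the addend. A minimal user-level sketch (not part of any of the patches; it assumes GCC inline assembly and an FMA3-capable CPU):

    #include <stdio.h>

    /* For vfmaddXYZss the digits name the operands: result = opX * opY + opZ,
       where op1 is the destination register and op2/op3 are the sources
       (Intel operand order). */

    static float fma132(float a, float b, float c)  /* a = a*c + b */
    {
        __asm__("vfmadd132ss %2, %1, %0" : "+x"(a) : "x"(b), "x"(c));
        return a;
    }

    static float fma213(float a, float b, float c)  /* a = b*a + c */
    {
        __asm__("vfmadd213ss %2, %1, %0" : "+x"(a) : "x"(b), "x"(c));
        return a;
    }

    static float fma231(float a, float b, float c)  /* a = b*c + a */
    {
        __asm__("vfmadd231ss %2, %1, %0" : "+x"(a) : "x"(b), "x"(c));
        return a;
    }

    int main(void)
    {
        printf("%g %g %g\n",
               fma132(2, 3, 5),   /* 2*5 + 3 = 13 */
               fma213(2, 3, 5),   /* 3*2 + 5 = 11 */
               fma231(2, 3, 5));  /* 3*5 + 2 = 17 */
        return 0;
    }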

> Lastly, do packed and scalar make any difference?
In the amd64 frontend the vectorized FMA instructions are broken down into scalar operations, so in the backend we do not have to handle the packed versions of FMA.
I think it would be nice to introduce new IR ops in VEX:
  Iop_MAdd_F64x2


> This will need a regression test as well.
Comment 7 Bruno Lathuilière 2024-03-25 12:47:42 UTC
Sorry, I didn't finish my message (bad keyboard shortcut ...)

I think we will need new IR ops to handle vectorized FMA:
  Iop_MAdd_F64x2, Iop_MAdd_F64x4, Iop_MAdd_F32x4, Iop_MAdd_F32x8, Iop_MSub_F64x2, Iop_MSub_F64x4, Iop_MSub_F32x4, Iop_MSub_F32x8
But this is outside the scope of this patch.
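
To make the proposal concrete, each of those ops would map onto one packed FMA instruction. A small user-level sketch using FMA3 intrinsics (assumes <immintrin.h> and building with -mfma; the Iop_MAdd_* names above are only a proposal, not existing IR):

    #include <immintrin.h>
    #include <stdio.h>

    int main(void)
    {
        /* Packed double-precision FMA: each lane computes a*b + c with a
           single rounding -- roughly what a hypothetical Iop_MAdd_F64x2
           would have to express in the IR. */
        __m128d a = _mm_set_pd(2.0, 3.0);   /* { lo = 3.0, hi = 2.0 } */
        __m128d b = _mm_set_pd(4.0, 5.0);   /* { lo = 5.0, hi = 4.0 } */
        __m128d c = _mm_set_pd(1.0, 1.0);
        __m128d r = _mm_fmadd_pd(a, b, c);  /* { 3*5+1, 2*4+1 } = { 16, 9 } */

        double out[2];
        _mm_storeu_pd(out, r);
        printf("%g %g\n", out[0], out[1]);  /* prints "16 9" */
        return 0;
    }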

> > This will need a regression test as well.
For the proposed patch we do not need new tests, as we modify only the backend. The existing tests fma.c and fma4.c already provide good coverage. Do you see anything missing?
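
As an illustration of the property such tests have to check, here is a minimal standalone C99 program (my own sketch, not taken from the test suite) in which the fused result differs from a separate multiply and add because of the single rounding:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* a*a = 1 + 2^-11 + 2^-24 exactly; a plain float multiply rounds the
           2^-24 term away (ties-to-even), while a fused multiply-add keeps it. */
        float a = 1.0f + 0x1.0p-12f;
        float b = 1.0f + 0x1.0p-11f;

        volatile float prod = a * a;        /* rounded product: 1 + 2^-11 */
        float unfused = prod - b;           /* 0.0f */
        float fused   = fmaf(a, a, -b);     /* 2^-24, thanks to single rounding */

        printf("unfused = %a\nfused   = %a\n", (double)unfused, (double)fused);
        return (unfused == 0.0f && fused == 0x1.0p-24f) ? 0 : 1;
    }

(Build with something like gcc -std=c99 -O2 test.c -lm; under round-to-nearest a correct FMA, whether in hardware or under Valgrind, prints 0x1p-24 for the fused case.)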

In terms of testing, it would be interesting to try the patches on various amd64 microarchitectures. Do you have continuous integration covering several of them?
Comment 8 Bruno Lathuilière 2024-03-25 13:04:27 UTC
The merge of the last two versions is done:
https://raw.githubusercontent.com/edf-hpc/verrou/bl/test_merge_fma/valgrind.fma_amd64.diff
Comment 9 Mark Wielaard 2024-04-12 23:36:49 UTC
On a distro compiled with gcc -march=x86-64-v3 this patch fixes:
bug #463463 none/tests/amd64/fma
bug #463458 memcheck/tests/vcpu_fnfns
Comment 10 Mark Wielaard 2024-04-13 00:24:38 UTC
Created attachment 168440 [details]
valgrind.fma_arm64.diff

This is the variant of the patch that I tested.
It looks good to me.
How should we credit this when applying?
Comment 11 Paul Floyd 2024-04-13 09:50:40 UTC
For the attribution I would just put

Patch contributed by Gražvydas Ignotas <notasas@gmail.com> and Bruno Lathuilière <bruno.lathuiliere@edf.fr>

(with the right accents if possible)
Comment 12 Mark Wielaard 2024-04-13 13:06:17 UTC
commit a5693c1203c3a26443af13182a8082c2e9152f6c
Author: Mark Wielaard <mark@klomp.org>
Date:   Sat Apr 13 14:33:19 2024 +0200

    amd64: Implement VFMADD213 for Iop_MAddF32 and Iop_MAddF64
    
    Speed up F32 and F64 FMA on amd64. Add priv/host_amd64_maddf.c
    implementing h_amd64_calc_MAddF32_fma4 and h_amd64_calc_MAddF64_fma4
    to be used instead of the generic variants h_generic_calc_MAddF32
    and h_generic_calc_MAddF64 when host has VEX_HWCAPS_AMD64_FMA4.
    Add fma3 and fma4 detection m_machine.c (machine_get_hwcaps).
    
    This patch also fixes the memcheck/tests/vcpu_fnfns and
    none/tests/amd64/fma testcases when run on a x86-64-v3 system.
    
    Patch contributed by Grazvydas Ignotas <notasas@gmail.com> and
    Bruno Lathuilière <bruno.lathuiliere@edf.fr>
    
    https://bugs.kde.org/show_bug.cgi?id=481127
    https://bugs.kde.org/show_bug.cgi?id=463463
    https://bugs.kde.org/show_bug.cgi?id=463458
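
For readers who want a feel for the approach, below is a rough sketch of an FMA-based scalar helper plus run-time feature detection. The names, signatures and use of GCC/Clang's __builtin_cpu_supports() are illustrative assumptions only; they are not the actual code in priv/host_amd64_maddf.c or m_machine.c.

    #include <stdio.h>

    /* Sketch: compute a*b + c with a single rounding using the FMA3
       instructions directly, instead of a generic software fallback.
       The real helpers (h_amd64_calc_MAddF32_fma4 etc.) and their
       calling conventions may differ. */
    static float calc_maddf32(float a, float b, float c)
    {
        __asm__("vfmadd213ss %2, %1, %0" : "+x"(a) : "x"(b), "x"(c));
        return a;                      /* b*a + c == a*b + c, fused */
    }

    static double calc_maddf64(double a, double b, double c)
    {
        __asm__("vfmadd213sd %2, %1, %0" : "+x"(a) : "x"(b), "x"(c));
        return a;
    }

    /* Dispatch in the spirit of machine_get_hwcaps(): only use the FMA
       helpers when the host CPU advertises the feature. */
    static int host_has_fma3(void) { return __builtin_cpu_supports("fma");  }
    static int host_has_fma4(void) { return __builtin_cpu_supports("fma4"); }

    int main(void)
    {
        if (host_has_fma3())
            printf("%g %g\n",
                   (double)calc_maddf32(2.0f, 3.0f, 4.0f),  /* 3*2+4 = 10 */
                   calc_maddf64(2.0, 3.0, 4.0));            /* 10 */
        else
            printf("no FMA3 on this host (FMA4: %d)\n", host_has_fma4());
        return 0;
    }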