Bug 136779 - Adding rounding mode support to Valgrind
Summary: Adding rounding mode support to Valgrind
Status: CONFIRMED
Alias: None
Product: valgrind
Classification: Developer tools
Component: general (show other bugs)
Version: unspecified
Platform: Fedora RPMs Linux
: NOR wishlist
Target Milestone: ---
Assignee: Julian Seward
URL:
Keywords:
: 302234 (view as bug list)
Depends on:
Blocks:
 
Reported: 2006-11-03 16:54 UTC by Daniel Russel
Modified: 2023-11-08 09:34 UTC (History)
6 users (show)

See Also:
Latest Commit:
Version Fixed In:


Attachments
amd64 (only) proper FP rounding implementation. Apply in the valgrind tree itself (48 bytes, patch)
2011-11-14 20:45 UTC, Thomas Rast
Details
amd64 (only) proper FP rounding implementation. Apply in the VEX subdirectory (34 bytes, patch)
2011-11-14 20:45 UTC, Thomas Rast
Details
Make FP rounding changes faster (valgrind part) (43 bytes, patch)
2011-11-14 20:46 UTC, Thomas Rast
Details
Make FP rounding changes faster (VEX part) (38 bytes, patch)
2011-11-14 20:47 UTC, Thomas Rast
Details
Proper FP rounding for amd64 (VEX part) (56.52 KB, patch)
2013-03-04 11:29 UTC, Thomas Rast
Details
Proper FP rounding for amd64 (memcheck part) (5.58 KB, text/plain)
2013-03-04 11:30 UTC, Thomas Rast
Details
Smarter FP rounding (VEX part) (11.49 KB, patch)
2013-03-04 11:30 UTC, Thomas Rast
Details
Smarter FP rounding (valgrind part) (2.18 KB, patch)
2013-03-04 11:31 UTC, Thomas Rast
Details
Proper FP rounding for amd64 (VEX part) (57.58 KB, patch)
2013-05-04 09:24 UTC, Thomas Rast
Details
Stub rounding mode support for x86 SSE instructions (33.44 KB, patch)
2013-05-04 09:25 UTC, Thomas Rast
Details
Smarter FP rounding (VEX part) (11.57 KB, patch)
2013-05-04 09:25 UTC, Thomas Rast
Details
AVX rounding mode support (45.67 KB, patch)
2013-05-04 09:27 UTC, Thomas Rast
Details
Proper FP rounding for amd64 (memcheck part) (5.58 KB, patch)
2013-05-04 09:27 UTC, Thomas Rast
Details
Smarter FP rounding (valgrind part) (2.17 KB, patch)
2013-05-04 09:28 UTC, Thomas Rast
Details
Add a simple rounding support test (6.84 KB, patch)
2013-05-04 09:28 UTC, Thomas Rast
Details
All vex-side changes merged into a single patch (131.80 KB, patch)
2013-05-08 09:18 UTC, Julian Seward
Details
All valgrind-side changes merged into a single patch (12.40 KB, patch)
2013-05-08 09:20 UTC, Julian Seward
Details
AMD64 rounding mode support (18.37 KB, patch)
2013-05-15 16:38 UTC, Thomas Rast
Details
VEX side of AMD64 rounding mode support (135.14 KB, patch)
2013-05-15 16:39 UTC, Thomas Rast
Details
AMD64 rounding mode support (67.52 KB, patch)
2013-05-17 14:48 UTC, Thomas Rast
Details
VEX side of AMD64 rounding mode support (135.19 KB, patch)
2013-05-17 14:50 UTC, Thomas Rast
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Daniel Russel 2006-11-03 16:54:08 UTC
Version:           3.1 (using KDE KDE 3.5.5)
Installed from:    Fedora RPMs
OS:                Linux

It would be really nice to having round mode support in valgrind. Our programs depend on using interval arithmetic (see http://www.boost.org/libs/numeric/interval/doc/interval.htm for a similar interval arithmetic library--written by the same people as what we use), which in turn requires that the floating point rounding mode be set to always round up. Since Valgrind doesn't support this (or at least the querying the round mode fails to report that it was set to round up), we can't use valgrind on our programs. Thanks.
Comment 1 Barak Atzmon-Simon 2010-02-11 01:47:10 UTC
*** This bug has been confirmed by popular vote. ***
Comment 2 Tom Hughes 2011-08-19 15:32:26 UTC
Modern VEX based valgrind versions have partial support for changing the rounding mode. As the comment in the code says:

   FPROUND[1:0] is the FPU's notional rounding mode, encoded as per
   the IRRoundingMode type (see libvex_ir.h).  This just happens to be
   the Intel encoding.  Note carefully, the rounding mode is only
   observed on float-to-int conversions, and on float-to-float
   rounding, but not for general float-to-float operations, which are
   always rounded-to-nearest.
Comment 3 Thomas Rast 2011-11-14 20:45:10 UTC
Created attachment 65676 [details]
amd64 (only) proper FP rounding implementation. Apply in the valgrind tree itself

note: some fixes compared to version on developers list
Comment 4 Thomas Rast 2011-11-14 20:45:54 UTC
Created attachment 65677 [details]
amd64 (only) proper FP rounding implementation. Apply in the VEX subdirectory
Comment 5 Thomas Rast 2011-11-14 20:46:52 UTC
Created attachment 65678 [details]
Make FP rounding changes faster (valgrind part)

As per Julian's suggestion, this uses the same approach as in the PPC backend to make rounding state changes less frequent and thus faster.
Comment 6 Thomas Rast 2011-11-14 20:47:14 UTC
Created attachment 65679 [details]
Make FP rounding changes faster (VEX part)

... note: this and the previous patch go on top of the first two.
Comment 7 Tom Hughes 2012-06-20 12:58:22 UTC
*** Bug 302234 has been marked as a duplicate of this bug. ***
Comment 8 Thomas Rast 2013-03-04 11:29:25 UTC
Created attachment 77733 [details]
Proper FP rounding for amd64 (VEX part)

Forward-port of the earlier version
Comment 9 Thomas Rast 2013-03-04 11:30:15 UTC
Created attachment 77734 [details]
Proper FP rounding for amd64 (memcheck part)
Comment 10 Thomas Rast 2013-03-04 11:30:51 UTC
Created attachment 77735 [details]
Smarter FP rounding (VEX part)
Comment 11 Thomas Rast 2013-03-04 11:31:26 UTC
Created attachment 77736 [details]
Smarter FP rounding (valgrind part)
Comment 12 Thomas Rast 2013-03-04 11:32:26 UTC
The four patches I just posted are the result of forward-porting the earlier changes to the pre-3.9 SVN codebase.
What can I do to move this forward?
Comment 13 Julian Seward 2013-03-04 13:34:27 UTC
(In reply to comment #12)
Thanks for the rebase.

> What can I do to move this forward?

Generate some numbers indicating how much of a performance hit we
are going to take as a result of this.
Comment 14 Thomas Rast 2013-03-04 21:22:13 UTC
(In reply to comment #13)
> Generate some numbers indicating how much of a performance hit we
> are going to take as a result of this.

VG's own perf test (--reps=3):
-- bigcode1 --
bigcode1 vg-upstream:0.15s  no: 2.1s (14.1x, -----)  me: 4.3s (28.8x, -----)
bigcode1 vg-round   :0.15s  no: 2.1s (14.3x, -1.4%)  me: 4.3s (28.9x, -0.5%)
-- bigcode2 --
bigcode2 vg-upstream:0.15s  no: 5.2s (34.9x, -----)  me:11.2s (74.9x, -----)
bigcode2 vg-round   :0.15s  no: 5.2s (34.8x,  0.2%)  me:11.2s (74.6x,  0.4%)
-- bz2 --
bz2      vg-upstream:0.99s  no: 3.1s ( 3.2x, -----)  me: 8.7s ( 8.8x, -----)
bz2      vg-round   :0.99s  no: 3.2s ( 3.2x, -1.6%)  me: 8.9s ( 9.0x, -2.4%)
-- fbench --
fbench   vg-upstream:0.31s  no: 1.3s ( 4.2x, -----)  me: 4.7s (15.2x, -----)
fbench   vg-round   :0.31s  no: 1.3s ( 4.3x, -3.9%)  me: 4.8s (15.6x, -2.5%)
-- ffbench --
ffbench  vg-upstream:0.27s  no: 1.2s ( 4.3x, -----)  me: 3.8s (14.2x, -----)
ffbench  vg-round   :0.27s  no: 1.2s ( 4.4x, -0.9%)  me: 3.9s (14.3x, -0.5%)
-- heap --
heap     vg-upstream:0.18s  no: 1.0s ( 5.8x, -----)  me: 7.3s (40.6x, -----)
heap     vg-round   :0.18s  no: 1.0s ( 5.7x,  1.0%)  me: 7.3s (40.7x, -0.3%)
-- heap_pdb4 --
heap_pdb4 vg-upstream:0.19s  no: 1.1s ( 5.8x, -----)  me:12.2s (63.9x, -----)
heap_pdb4 vg-round   :0.19s  no: 1.1s ( 5.9x, -0.9%)  me:11.9s (62.4x,  2.4%)
-- many-loss-records --
many-loss-records vg-upstream:0.01s  no: 0.3s (28.0x, -----)  me: 1.7s (172.0x, -----)
many-loss-records vg-round   :0.01s  no: 0.3s (27.0x,  3.6%)  me: 1.7s (169.0x,  1.7%)
-- many-xpts --
many-xpts vg-upstream:0.06s  no: 0.3s ( 5.8x, -----)  me: 2.3s (38.7x, -----)
many-xpts vg-round   :0.06s  no: 0.4s ( 6.2x, -5.7%)  me: 2.3s (38.8x, -0.4%)
-- sarp --
sarp     vg-upstream:0.03s  no: 0.3s ( 9.0x, -----)  me: 3.3s (109.7x, -----)
sarp     vg-round   :0.03s  no: 0.3s ( 9.0x,  0.0%)  me: 3.3s (111.3x, -1.5%)
-- tinycc --
tinycc   vg-upstream:0.24s  no: 1.8s ( 7.6x, -----)  me:11.6s (48.2x, -----)
tinycc   vg-round   :0.24s  no: 1.8s ( 7.7x, -0.5%)  me:11.6s (48.3x, -0.2%)

That kind of numbers?
I'd love to compare it on the actual programs we use it for, but they all use the CGAL library which heavily relies on rounding modes. I can't vouch that it takes the same code paths even in cases where it fortunately gives the same result. In one case where the result is "fixed" by the patches (and thus the paths taken definitely different), the runtime went from 15.378s to 15.637s (+1.6%).
Comment 15 Thomas Rast 2013-05-04 09:24:53 UTC
Created attachment 79684 [details]
Proper FP rounding for amd64 (VEX part)
Comment 16 Thomas Rast 2013-05-04 09:25:27 UTC
Created attachment 79685 [details]
Stub rounding mode support for x86 SSE instructions
Comment 17 Thomas Rast 2013-05-04 09:25:52 UTC
Created attachment 79686 [details]
Smarter FP rounding (VEX part)
Comment 18 Thomas Rast 2013-05-04 09:27:02 UTC
Created attachment 79687 [details]
AVX rounding mode support
Comment 19 Thomas Rast 2013-05-04 09:27:34 UTC
Created attachment 79688 [details]
Proper FP rounding for amd64 (memcheck part)
Comment 20 Thomas Rast 2013-05-04 09:28:02 UTC
Created attachment 79689 [details]
Smarter FP rounding (valgrind part)
Comment 21 Thomas Rast 2013-05-04 09:28:26 UTC
Created attachment 79690 [details]
Add a simple rounding support test
Comment 22 Julian Seward 2013-05-08 09:18:25 UTC
Created attachment 79774 [details]
All vex-side changes merged into a single patch
Comment 23 Julian Seward 2013-05-08 09:20:36 UTC
Created attachment 79775 [details]
All valgrind-side changes merged into a single patch

I merged all vex- and valgrind- side changes, reducing the number of patches
to 2, for easier review & logistics.
Comment 24 Julian Seward 2013-05-08 13:51:10 UTC
(In reply to comment #22)
> All vex-side changes merged into a single patch

Mostly looks OK.  There are some comments below about stylistic
issues, and a couple of other things, but nothing serious.

Let me ask, for the patch set as a whole:

(1) what platforms (architectures) have you tested it on?

(2) are you expecting other archs (ppc?  mips?) to work OK
    still, or will they need adjustment?




+++ priv/ir_defs.c      (working copy)
       case Iop_Rsqrte32x4:
-         UNARY(Ity_V128, Ity_V128);
+        BINARY(ity_RMode, Ity_V128, Ity_V128);

House style is no tabs; please use spaces instead, in the entire
patch-set.  Also, 3 space indents.



+++ priv/host_x86_isel.c        (working copy)
+      case Iop_32UtoV128: {
+         HReg      dst  = newVRegV(env);
+         X86AMode* esp0 = X86AMode_IR(0, hregX86_ESP());
+         X86RMI*   rmi  = iselIntExpr_RMI(env, e->Iex.Unop.arg);
+         addInstr(env, X86Instr_Push(rmi));
+        addInstr(env, X86Instr_SseLdzLO(4, dst, esp0));
+         add_to_esp(env, 4);
+         return dst;
+      }
+
+      case Iop_64UtoV128: {
+         HReg      rHi, rLo;
+         HReg      dst  = newVRegV(env);
+         X86AMode* esp0 = X86AMode_IR(0, hregX86_ESP());
+         iselInt64Expr(&rHi, &rLo, env, e->Iex.Unop.arg);
+         addInstr(env, X86Instr_Push(X86RMI_Reg(rHi)));
+         addInstr(env, X86Instr_Push(X86RMI_Reg(rLo)));
+        addInstr(env, X86Instr_SseLdzLO(8, dst, esp0));
+         add_to_esp(env, 8);
+         return dst;
+      }
+

Why did these get moved in the file?  They are unrelated to FP.  I
would prefer to have an absolutely minimal diff.



+++ priv/host_amd64_isel.c      (working copy)

+   vassert(typeOfIRExpr(env->type_env,mode) == Ity_I32);
+   vassert(env);
+
+   /* Do we need to do anything? */
+   if (!mode)
+          return;

The  "if (!mode)" is redundant, because if mode was NULL, the
first assertion would segfault.  Either check that mode is never
NULL, or put vassert(mode) before the existing two assertions.



+      case Iop_32UtoV128: {
+      case Iop_64UtoV128: {
+      case Iop_V256toV128_0:
+      case Iop_V256toV128_1: {

As with the x86 isel, I would prefer if these did not appear in the
diff.



+         HReg dst = newVRegV(env);
+        set_SSE_rounding_mode(env, triop->arg1);
+         addInstr(env, mk_vMOVsd_RR(argL, dst));

More tab-vs-space inconsistencies ..



+++ priv/guest_x86_toIR.c       (working copy)

+UInt dis_SSE_E_to_G_all_rnd ( UChar sorb, Int delta, const HChar* opname, IROp op )
(and various other places): Please keep line length to 80.



+++ priv/guest_amd64_toIR.c       (working copy)

-   assign( addV, binop(Iop_Add64Fx4, mkexpr(dV), mkexpr(sV)) );
-   assign( subV, binop(Iop_Sub64Fx4, mkexpr(dV), mkexpr(sV)) );
+   assign( addV, triop(Iop_Add64Fx4, get_sse_roundingmode(), mkexpr(dV), mkexpr(sV)) );
+   assign( subV, triop(Iop_Sub64Fx4, get_sse_roundingmode(), mkexpr(dV), mkexpr(sV)) );

I would prefer in these cases where get_sse_roundingmode() is called
twice, that you call it once, assign the result to an IRTemp, and
use the temp.  The IR optimiser (ir_opt.c) will do CSE and hence
common up the duplicate IR in some situations, but CSE is expensive
and I prefer not to rely on it happening in general.


-             uses_vvvv, vbi, pfx, delta, name, op, NULL, False, False);
+          uses_vvvv, vbi, pfx, delta, name, op, NULL, False, False);

-         delta = dis_SSE_E_to_G_all( vbi, pfx, delta, "divpd", Iop_Div64Fx2 );
+             delta = dis_SSE_E_to_G_all( vbi, pfx, delta, "divpd", Iop_Div64Fx2, get_sse_roundingmode() );

Please no whitespace changes ..
Comment 25 Julian Seward 2013-05-08 14:11:55 UTC
(In reply to comment #23)
> All valgrind-side changes merged into a single patch

+++ memcheck/mc_translate.c     (working copy)

+      case Iop_Add64Fx2:
+         return binary64Fx2(mce, vatom2, vatom3);
(and all the IROps where you have added a rounding mode):

This isn't really right, because it assumes that the first
argument -- the rounding mode, vatom1 -- is defined.  Or, more
specifically, ignores its definedness in the computation of the
definedness of the result.

If you look at an existing case that has a rounding mode field, eg
Iop_AddF64, we have:

      case Iop_AddF64:
      ...
         /* I32(rm) x F64/D64 x F64/D64 -> F64/D64 */
         return mkLazy3(mce, Ity_I64, vatom1, vatom2, vatom3);

so it takes into account the definedness of the rounding mode too.
Now, I do agree that it is very unlikely that the RM is undefined.
In such cases, the post-instrumentation IR optimisation pass will
remove all computation relating to vatom1 if it is known to be
constant (defined or undefined) at JIT time.

Unfortunately we can't simply use mkLazy3 in this case, because it
ignores the fact that the 2 SIMD lanes (in this example) are 
independent.  We'll need to do something better.  For the 64x2 cases
I guess we could do something along the lines of

   UifU(binary64Fx2(vatom2, vatom3), PCast-to-V128(vatom1))

For the 64F0x2, hmm.  Need to force the lower lane to be undefined
if the RM is undefined, but leave the upper lane unchanged.  Basically
do 

   UifU(binary64F0x2(vatom2, vatom3), pessimising-term)
   where
     pessimising-term = 64HLtoV128( <64 bits indicating "defined">,
                                    PCast-to-U64(vatom1) )

and then fix up ir_opt.c to fold out all the resulting junk (that
part at least is easy).

For some background on PCast and UifU, 
see http://valgrind.org/docs/memcheck2005.pdf



+++ none/tests/amd64/Makefile.am        (working copy)
+round_CFLAGS           = -frounding-math


What does -frounding-math do?  Is it necessary?  Will we need to
add a configure test to work with the oldest supported gcc (4.0.x) ?



+++ coregrind/m_dispatch/dispatch-amd64-linux.S (working copy)

+       /* no invariants are checked right now, but this is how the
+       error exit should look like */
+#if 0

Since there are no invariants to check now, just remove the #if 0'd
code and the comment.
Comment 26 Thomas Rast 2013-05-15 15:29:16 UTC
(In reply to comment #25)
> (In reply to comment #23)
> +++ memcheck/mc_translate.c     (working copy)
> 
> +      case Iop_Add64Fx2:
> +         return binary64Fx2(mce, vatom2, vatom3);
> (and all the IROps where you have added a rounding mode):
> 
> This isn't really right, because it assumes that the first
> argument -- the rounding mode, vatom1 -- is defined.  Or, more
> specifically, ignores its definedness in the computation of the
> definedness of the result. [...]

I stole your suggestions. To make sure that it actually at least emits valid IR, I copied the avx and avx2 tests into the memcheck tests dir too (so they will be run under memcheck).

> and then fix up ir_opt.c to fold out all the resulting junk (that
> part at least is easy).

As per IRC discussion -- it seems to work well with a simple change to handle Iop_Or* and Iop_And* folding for one constant (all bits 0 or 1) argument at all argument widths.

> +++ none/tests/amd64/Makefile.am        (working copy)
> +round_CFLAGS           = -frounding-math
> 
> What does -frounding-math do?  Is it necessary?  Will we need to
> add a configure test to work with the oldest supported gcc (4.0.x) ?

It's needed to prevent the compiler from rearranging or folding computations under the assumption that rounding mode is at the default.

Is there a simple way of figuring out what gcc 4.0 supported? The only thing I can find is a complaint about some issues with -frounding-math on 4.0.1, which leads me to believe that it was already supported at that point.
Comment 27 Thomas Rast 2013-05-15 16:36:57 UTC
The old patch assumed that anything float-ish that is not obviously unaffected by rounding (e.g. max) would need the rounding args. But it turns out that according to intel, the Recip and RSqrt families are unaffected. So this version does not carry around any rounding mode args for those.

(In reply to comment #24)
> Let me ask, for the patch set as a whole:
> 
> (1) what platforms (architectures) have you tested it on?

amd64 and x86, using a combined build on a sandy bridge CPU, so I have also tested AVX as far as the test coverage for that goes.

> (2) are you expecting other archs (ppc?  mips?) to work OK
>     still, or will they need adjustment?

Anything that uses the SSE-style opFxN/opF0xN instructions will need adapting, as those grew an extra argument. It will be immediately visible if there is any test coverage for those instructions, as the IR sanity checker will fail on the generated IR (wrong arity op), and later the instruction selector will fail to work with the IR (assuming it does a toplevel switch on the arity).

> +++ priv/host_x86_isel.c        (working copy)
> +      case Iop_32UtoV128: {
> +         HReg      dst  = newVRegV(env);
> +         X86AMode* esp0 = X86AMode_IR(0, hregX86_ESP());
> +         X86RMI*   rmi  = iselIntExpr_RMI(env, e->Iex.Unop.arg);
> +         addInstr(env, X86Instr_Push(rmi));
> +        addInstr(env, X86Instr_SseLdzLO(4, dst, esp0));
> +         add_to_esp(env, 4);
> +         return dst;
> +      }
> +
> +      case Iop_64UtoV128: {
> +         HReg      rHi, rLo;
> +         HReg      dst  = newVRegV(env);
> +         X86AMode* esp0 = X86AMode_IR(0, hregX86_ESP());
> +         iselInt64Expr(&rHi, &rLo, env, e->Iex.Unop.arg);
> +         addInstr(env, X86Instr_Push(X86RMI_Reg(rHi)));
> +         addInstr(env, X86Instr_Push(X86RMI_Reg(rLo)));
> +        addInstr(env, X86Instr_SseLdzLO(8, dst, esp0));
> +         add_to_esp(env, 8);
> +         return dst;
> +      }
> +
> 
> Why did these get moved in the file?  They are unrelated to FP.  I
> would prefer to have an absolutely minimal diff.

They didn't; some other blocks moved over the above, to fall under a different arm of the switch(arity), and the minimal -- if a bit unhelpful -- diff is the above.

It's actually "fixed" in the new patch because the fix on the part of RSqrt/Recip switched this change from a lot of movement to a lot of duplication.

> +   vassert(typeOfIRExpr(env->type_env,mode) == Ity_I32);
> +   vassert(env);
> +
> +   /* Do we need to do anything? */
> +   if (!mode)
> +          return;
> 
> The  "if (!mode)" is redundant, because if mode was NULL, the
> first assertion would segfault.  Either check that mode is never
> NULL, or put vassert(mode) before the existing two assertions.

Err, oops. vassert(mode) is actually the correct choice: the IR sanity checker does not let pass missing arguments, so it was never optional.

> +++ priv/guest_x86_toIR.c       (working copy)
> 
> +UInt dis_SSE_E_to_G_all_rnd ( UChar sorb, Int delta, const HChar* opname,
> IROp op )
> (and various other places): Please keep line length to 80.

Hopefully fixed. A small number of places remain that are just above 80, but where I figured it wasn't worth the break in "pace" of the code.
Comment 28 Thomas Rast 2013-05-15 16:38:53 UTC
Created attachment 79902 [details]
AMD64 rounding mode support
Comment 29 Thomas Rast 2013-05-15 16:39:29 UTC
Created attachment 79903 [details]
VEX side of AMD64 rounding mode support
Comment 30 Thomas Rast 2013-05-15 18:00:17 UTC
New perf numbers:

-- bigcode1 --
bigcode1 vg-upstream:0.10s  no: 1.4s (14.4x, -----)  me: 2.7s (27.2x, -----)
bigcode1 valgrind  :0.10s  no: 1.5s (14.6x, -1.4%)  me: 2.8s (27.8x, -2.2%)
-- bigcode2 --
bigcode2 vg-upstream:0.11s  no: 3.5s (31.4x, -----)  me: 7.2s (65.0x, -----)
bigcode2 valgrind  :0.11s  no: 3.4s (31.0x,  1.2%)  me: 7.1s (64.7x,  0.4%)
-- bz2 --
bz2      vg-upstream:0.65s  no: 2.3s ( 3.5x, -----)  me: 7.1s (10.9x, -----)
bz2      valgrind  :0.65s  no: 2.3s ( 3.5x,  0.4%)  me: 7.2s (11.1x, -1.7%)
-- fbench --
fbench   vg-upstream:0.22s  no: 0.9s ( 4.3x, -----)  me: 3.7s (17.0x, -----)
fbench   valgrind  :0.22s  no: 1.0s ( 4.4x, -1.1%)  me: 3.8s (17.3x, -1.6%)
-- ffbench --
ffbench  vg-upstream:0.20s  no: 0.8s ( 4.2x, -----)  me: 2.7s (13.3x, -----)
ffbench  valgrind  :0.20s  no: 0.8s ( 4.2x,  0.0%)  me: 2.7s (13.3x,  0.0%)
-- heap --
heap     vg-upstream:0.12s  no: 0.7s ( 5.8x, -----)  me: 4.8s (39.8x, -----)
heap     valgrind  :0.12s  no: 0.7s ( 6.1x, -4.3%)  me: 4.8s (39.6x,  0.6%)
-- heap_pdb4 --
heap_pdb4 vg-upstream:0.13s  no: 0.8s ( 5.8x, -----)  me: 7.0s (54.2x, -----)
heap_pdb4 valgrind  :0.13s  no: 0.8s ( 6.1x, -3.9%)  me: 7.1s (54.6x, -0.7%)
-- many-loss-records --
many-loss-records vg-upstream:0.01s  no: 0.2s (20.0x, -----)  me: 1.1s (110.0x, -----)
many-loss-records valgrind  :0.01s  no: 0.2s (19.0x,  5.0%)  me: 1.1s (110.0x,  0.0%)
-- many-xpts --
many-xpts vg-upstream:0.05s  no: 0.3s ( 5.8x, -----)  me: 1.7s (34.0x, -----)
many-xpts valgrind  :0.05s  no: 0.3s ( 5.6x,  3.4%)  me: 1.7s (33.6x,  1.2%)
-- sarp --
sarp     vg-upstream:0.02s  no: 0.2s (10.5x, -----)  me: 1.9s (93.5x, -----)
sarp     valgrind  :0.02s  no: 0.2s (10.5x,  0.0%)  me: 1.9s (93.5x,  0.0%)
-- tinycc --
tinycc   vg-upstream:0.16s  no: 1.5s ( 9.1x, -----)  me: 8.4s (52.7x, -----)
tinycc   valgrind  :0.16s  no: 1.5s ( 9.1x,  0.0%)  me: 8.4s (52.6x,  0.4%)
Comment 31 Thomas Rast 2013-05-15 18:50:08 UTC
(In reply to comment #27)
> (In reply to comment #24)
> > (2) are you expecting other archs (ppc?  mips?) to work OK
> >     still, or will they need adjustment?
> 
> Anything that uses the SSE-style opFxN/opF0xN instructions will need
> adapting, as those grew an extra argument. It will be immediately visible if
> there is any test coverage for those instructions, as the IR sanity checker
> will fail on the generated IR (wrong arity op), and later the instruction
> selector will fail to work with the IR (assuming it does a toplevel switch
> on the arity).

Some grepping actually shows that both PPC and ARM have grep hits for Add32F* and Mul32F* instructions. Since those change arity, these archs will break with these patches and need to be fixed. (Or we could think about splitting the ops into several versions.)

I find this especially odd since my impression was that the PPC implementation correctly rounds. Am I mistaken? Does PPC's Add32Fx4 not respect rounding mode? Or does the PPC valgrind correctly round *except for* SIMD FP?
Comment 32 Thomas Rast 2013-05-17 14:48:49 UTC
Created attachment 79934 [details]
AMD64 rounding mode support

Revised tests: this version has a more elaborate test written using the GCC __attribute__((vector_size(N))) to generate SIMD instructions. In this way exercises (I believe -- this is not verified automatically) all rounded Iop_*Fx* flavors that the amd64 backend uses. The same test should also readily adapt to other platforms.
Comment 33 Thomas Rast 2013-05-17 14:50:34 UTC
Created attachment 79935 [details]
VEX side of AMD64 rounding mode support

In the last version, the *32Fx8 assembly generator did not actually generate a rounding-mode setting instruction. This has been fixed.
Comment 34 Niklas Hambüchen 2018-11-13 20:20:41 UTC
Is there any chance this patch could be updated to the latest valgrind, and perhaps even upstreamed?

I'm facing issues with the cgal library, which needs rounding mode support.
Comment 35 Niklas Hambüchen 2018-11-14 03:05:28 UTC
I've given this forward-port a shot:

https://github.com/nh2/valgrind/compare/amd64-rounding-vex%5E%5E...amd64-rounding-vex?expand=1

Haven't tested it yet (but will try as part of https://github.com/CGAL/cgal/issues/3180#issuecomment-438518909).