Version: 3.1 (using KDE 3.5.5) Installed from: Fedora RPMs OS: Linux

It would be really nice to have rounding mode support in valgrind. Our programs depend on interval arithmetic (see http://www.boost.org/libs/numeric/interval/doc/interval.htm for a similar interval arithmetic library, written by the same people as the one we use), which in turn requires that the floating point rounding mode be set to always round up. Since Valgrind doesn't support this (or at least querying the rounding mode fails to report that it was set to round up), we can't use valgrind on our programs. Thanks.
*** This bug has been confirmed by popular vote. ***
Modern VEX-based valgrind versions have partial support for changing the rounding mode. As the comment in the code says:

  FPROUND[1:0] is the FPU's notional rounding mode, encoded as per the IRRoundingMode type (see libvex_ir.h). This just happens to be the Intel encoding. Note carefully, the rounding mode is only observed on float-to-int conversions, and on float-to-float rounding, but not for general float-to-float operations, which are always rounded-to-nearest.
Created attachment 65676 [details] amd64 (only) proper FP rounding implementation. Apply in the valgrind tree itself. Note: contains some fixes compared to the version on the developers list.
Created attachment 65677 [details] amd64 (only) proper FP rounding implementation. Apply in the VEX subdirectory
Created attachment 65678 [details] Make FP rounding changes faster (valgrind part) As per Julian's suggestion, this uses the same approach as in the PPC backend to make rounding state changes less frequent and thus faster.
Created attachment 65679 [details] Make FP rounding changes faster (VEX part) ... note: this and the previous patch go on top of the first two.
*** Bug 302234 has been marked as a duplicate of this bug. ***
Created attachment 77733 [details] Proper FP rounding for amd64 (VEX part) Forward-port of the earlier version
Created attachment 77734 [details] Proper FP rounding for amd64 (memcheck part)
Created attachment 77735 [details] Smarter FP rounding (VEX part)
Created attachment 77736 [details] Smarter FP rounding (valgrind part)
The four patches I just posted are the result of forward-porting the earlier changes to the pre-3.9 SVN codebase. What can I do to move this forward?
(In reply to comment #12) Thanks for the rebase. > What can I do to move this forward? Generate some numbers indicating how much of a performance hit we are going to take as a result of this.
(In reply to comment #13)
> Generate some numbers indicating how much of a performance hit we
> are going to take as a result of this.

VG's own perf test (--reps=3):

-- bigcode1 --
bigcode1 vg-upstream:0.15s no: 2.1s (14.1x, -----) me: 4.3s (28.8x, -----)
bigcode1 vg-round   :0.15s no: 2.1s (14.3x, -1.4%) me: 4.3s (28.9x, -0.5%)
-- bigcode2 --
bigcode2 vg-upstream:0.15s no: 5.2s (34.9x, -----) me:11.2s (74.9x, -----)
bigcode2 vg-round   :0.15s no: 5.2s (34.8x, 0.2%) me:11.2s (74.6x, 0.4%)
-- bz2 --
bz2 vg-upstream:0.99s no: 3.1s ( 3.2x, -----) me: 8.7s ( 8.8x, -----)
bz2 vg-round   :0.99s no: 3.2s ( 3.2x, -1.6%) me: 8.9s ( 9.0x, -2.4%)
-- fbench --
fbench vg-upstream:0.31s no: 1.3s ( 4.2x, -----) me: 4.7s (15.2x, -----)
fbench vg-round   :0.31s no: 1.3s ( 4.3x, -3.9%) me: 4.8s (15.6x, -2.5%)
-- ffbench --
ffbench vg-upstream:0.27s no: 1.2s ( 4.3x, -----) me: 3.8s (14.2x, -----)
ffbench vg-round   :0.27s no: 1.2s ( 4.4x, -0.9%) me: 3.9s (14.3x, -0.5%)
-- heap --
heap vg-upstream:0.18s no: 1.0s ( 5.8x, -----) me: 7.3s (40.6x, -----)
heap vg-round   :0.18s no: 1.0s ( 5.7x, 1.0%) me: 7.3s (40.7x, -0.3%)
-- heap_pdb4 --
heap_pdb4 vg-upstream:0.19s no: 1.1s ( 5.8x, -----) me:12.2s (63.9x, -----)
heap_pdb4 vg-round   :0.19s no: 1.1s ( 5.9x, -0.9%) me:11.9s (62.4x, 2.4%)
-- many-loss-records --
many-loss-records vg-upstream:0.01s no: 0.3s (28.0x, -----) me: 1.7s (172.0x, -----)
many-loss-records vg-round   :0.01s no: 0.3s (27.0x, 3.6%) me: 1.7s (169.0x, 1.7%)
-- many-xpts --
many-xpts vg-upstream:0.06s no: 0.3s ( 5.8x, -----) me: 2.3s (38.7x, -----)
many-xpts vg-round   :0.06s no: 0.4s ( 6.2x, -5.7%) me: 2.3s (38.8x, -0.4%)
-- sarp --
sarp vg-upstream:0.03s no: 0.3s ( 9.0x, -----) me: 3.3s (109.7x, -----)
sarp vg-round   :0.03s no: 0.3s ( 9.0x, 0.0%) me: 3.3s (111.3x, -1.5%)
-- tinycc --
tinycc vg-upstream:0.24s no: 1.8s ( 7.6x, -----) me:11.6s (48.2x, -----)
tinycc vg-round   :0.24s no: 1.8s ( 7.7x, -0.5%) me:11.6s (48.3x, -0.2%)

That kind of numbers?
I'd love to compare it on the actual programs we use it for, but they all use the CGAL library, which relies heavily on rounding modes. I can't vouch that it takes the same code paths even in the cases where it happens to give the same result. In one case where the result is "fixed" by the patches (and thus the paths taken are definitely different), the runtime went from 15.378s to 15.637s (+1.6%).
Created attachment 79684 [details] Proper FP rounding for amd64 (VEX part)
Created attachment 79685 [details] Stub rounding mode support for x86 SSE instructions
Created attachment 79686 [details] Smarter FP rounding (VEX part)
Created attachment 79687 [details] AVX rounding mode support
Created attachment 79688 [details] Proper FP rounding for amd64 (memcheck part)
Created attachment 79689 [details] Smarter FP rounding (valgrind part)
Created attachment 79690 [details] Add a simple rounding support test
Created attachment 79774 [details] All vex-side changes merged into a single patch
Created attachment 79775 [details] All valgrind-side changes merged into a single patch I merged all vex- and valgrind- side changes, reducing the number of patches to 2, for easier review & logistics.
(In reply to comment #22)
> All vex-side changes merged into a single patch

Mostly looks OK. There are some comments below about stylistic issues, and a couple of other things, but nothing serious.

Let me ask, for the patch set as a whole:

(1) what platforms (architectures) have you tested it on?
(2) are you expecting other archs (ppc? mips?) to work OK still, or will they need adjustment?

+++ priv/ir_defs.c (working copy)

      case Iop_Rsqrte32x4:
-        UNARY(Ity_V128, Ity_V128);
+        BINARY(ity_RMode, Ity_V128, Ity_V128);

House style is no tabs; please use spaces instead, in the entire patch-set. Also, 3 space indents.

+++ priv/host_x86_isel.c (working copy)

+   case Iop_32UtoV128: {
+      HReg dst = newVRegV(env);
+      X86AMode* esp0 = X86AMode_IR(0, hregX86_ESP());
+      X86RMI* rmi = iselIntExpr_RMI(env, e->Iex.Unop.arg);
+      addInstr(env, X86Instr_Push(rmi));
+      addInstr(env, X86Instr_SseLdzLO(4, dst, esp0));
+      add_to_esp(env, 4);
+      return dst;
+   }
+
+   case Iop_64UtoV128: {
+      HReg rHi, rLo;
+      HReg dst = newVRegV(env);
+      X86AMode* esp0 = X86AMode_IR(0, hregX86_ESP());
+      iselInt64Expr(&rHi, &rLo, env, e->Iex.Unop.arg);
+      addInstr(env, X86Instr_Push(X86RMI_Reg(rHi)));
+      addInstr(env, X86Instr_Push(X86RMI_Reg(rLo)));
+      addInstr(env, X86Instr_SseLdzLO(8, dst, esp0));
+      add_to_esp(env, 8);
+      return dst;
+   }

Why did these get moved in the file? They are unrelated to FP. I would prefer to have an absolutely minimal diff.

+++ priv/host_amd64_isel.c (working copy)

+   vassert(typeOfIRExpr(env->type_env,mode) == Ity_I32);
+   vassert(env);
+
+   /* Do we need to do anything? */
+   if (!mode)
+      return;

The "if (!mode)" is redundant, because if mode was NULL, the first assertion would segfault. Either check that mode is never NULL, or put vassert(mode) before the existing two assertions.

+   case Iop_32UtoV128: {
+   case Iop_64UtoV128: {
+   case Iop_V256toV128_0:
+   case Iop_V256toV128_1: {

As with the x86 isel, I would prefer if these did not appear in the diff.
+   HReg dst = newVRegV(env);
+   set_SSE_rounding_mode(env, triop->arg1);
+   addInstr(env, mk_vMOVsd_RR(argL, dst));

More tab-vs-space inconsistencies ..

+++ priv/guest_x86_toIR.c (working copy)

+UInt dis_SSE_E_to_G_all_rnd ( UChar sorb, Int delta, const HChar* opname,
                               IROp op )

(and various other places): Please keep line length to 80.

+++ priv/guest_amd64_toIR.c (working copy)

-      assign( addV, binop(Iop_Add64Fx4, mkexpr(dV), mkexpr(sV)) );
-      assign( subV, binop(Iop_Sub64Fx4, mkexpr(dV), mkexpr(sV)) );
+      assign( addV, triop(Iop_Add64Fx4, get_sse_roundingmode(), mkexpr(dV), mkexpr(sV)) );
+      assign( subV, triop(Iop_Sub64Fx4, get_sse_roundingmode(), mkexpr(dV), mkexpr(sV)) );

I would prefer in these cases where get_sse_roundingmode() is called twice, that you call it once, assign the result to an IRTemp, and use the temp. The IR optimiser (ir_opt.c) will do CSE and hence common up the duplicate IR in some situations, but CSE is expensive and I prefer not to rely on it happening in general.

-   uses_vvvv, vbi, pfx, delta, name, op, NULL, False, False);
+   uses_vvvv, vbi, pfx, delta, name, op, NULL, False, False);

-      delta = dis_SSE_E_to_G_all( vbi, pfx, delta, "divpd", Iop_Div64Fx2 );
+      delta = dis_SSE_E_to_G_all( vbi, pfx, delta, "divpd", Iop_Div64Fx2, get_sse_roundingmode() );

Please no whitespace changes ..
(In reply to comment #23)
> All valgrind-side changes merged into a single patch

+++ memcheck/mc_translate.c (working copy)

+      case Iop_Add64Fx2:
+         return binary64Fx2(mce, vatom2, vatom3);

(and all the IROps where you have added a rounding mode):

This isn't really right, because it assumes that the first argument -- the rounding mode, vatom1 -- is defined. Or, more specifically, ignores its definedness in the computation of the definedness of the result.

If you look at an existing case that has a rounding mode field, eg Iop_AddF64, we have:

      case Iop_AddF64:
         ...
         /* I32(rm) x F64/D64 x F64/D64 -> F64/D64 */
         return mkLazy3(mce, Ity_I64, vatom1, vatom2, vatom3);

so it takes into account the definedness of the rounding mode too.

Now, I do agree that it is very unlikely that the RM is undefined. In such cases, the post-instrumentation IR optimisation pass will remove all computation relating to vatom1 if it is known to be constant (defined or undefined) at JIT time.

Unfortunately we can't simply use mkLazy3 in this case, because it ignores the fact that the 2 SIMD lanes (in this example) are independent. We'll need to do something better. For the 64x2 cases I guess we could do something along the lines of

   UifU(binary64Fx2(vatom2, vatom3), PCast-to-V128(vatom1))

For the 64F0x2, hmm. Need to force the lower lane to be undefined if the RM is undefined, but leave the upper lane unchanged. Basically do

   UifU(binary64F0x2(vatom2, vatom3), pessimising-term)

where

   pessimising-term = 64HLtoV128( <64 bits indicating "defined">, PCast-to-U64(vatom1) )

and then fix up ir_opt.c to fold out all the resulting junk (that part at least is easy). For some background on PCast and UifU, see http://valgrind.org/docs/memcheck2005.pdf

+++ none/tests/amd64/Makefile.am (working copy)

+round_CFLAGS = -frounding-math

What does -frounding-math do? Is it necessary? Will we need to add a configure test to work with the oldest supported gcc (4.0.x)?
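To make the proposed definedness computation concrete, here is a standalone C model of the shadow-value algebra (illustrative only, not valgrind's actual helpers): a set bit means "undefined", UifU is a lanewise OR, and PCast collapses a word to all-undefined if any of its bits is undefined.

```c
#include <stdint.h>

/* PCast: collapse a 64-bit shadow word to all-ones ("fully undefined")
   if any bit in it is undefined, else all-zeroes ("fully defined"). */
static uint64_t pcast64(uint64_t vbits) { return vbits ? ~0ULL : 0ULL; }

/* UifU: a result bit is undefined if it is undefined in either input. */
static uint64_t uifu64(uint64_t a, uint64_t b) { return a | b; }

/* Shadow for a 2x64 SIMD op with a rounding-mode operand: lanewise UifU
   of the data operands, then the pcast'ed rounding-mode shadow is OR'd
   into every lane. */
void shadow_binary64Fx2(uint64_t out[2], uint64_t rm_vbits,
                        const uint64_t x[2], const uint64_t y[2])
{
    uint64_t rm = pcast64(rm_vbits);
    out[0] = uifu64(uifu64(x[0], y[0]), rm);
    out[1] = uifu64(uifu64(x[1], y[1]), rm);
}

/* Lower-lane-only (64F0x2) variant: the rounding mode taints only
   lane 0; lane 1 is a straight copy of the first operand's shadow. */
void shadow_binary64F0x2(uint64_t out[2], uint64_t rm_vbits,
                         const uint64_t x[2], const uint64_t y[2])
{
    out[0] = uifu64(uifu64(x[0], y[0]), pcast64(rm_vbits));
    out[1] = x[1];
}
```

An undefined rounding mode thus poisons both lanes in the Fx2 case, but only the low lane in the F0x2 case, matching the "pessimising-term" construction described above.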
+++ coregrind/m_dispatch/dispatch-amd64-linux.S (working copy)

+        /* no invariants are checked right now, but this is how the
+           error exit should look like */
+#if 0

Since there are no invariants to check now, just remove the #if 0'd code and the comment.
(In reply to comment #25)
> (In reply to comment #23)
> +++ memcheck/mc_translate.c (working copy)
>
> +      case Iop_Add64Fx2:
> +         return binary64Fx2(mce, vatom2, vatom3);
> (and all the IROps where you have added a rounding mode):
>
> This isn't really right, because it assumes that the first
> argument -- the rounding mode, vatom1 -- is defined. Or, more
> specifically, ignores its definedness in the computation of the
> definedness of the result. [...]

I stole your suggestions. To make sure that it actually at least emits valid IR, I copied the avx and avx2 tests into the memcheck tests dir too (so they will be run under memcheck).

> and then fix up ir_opt.c to fold out all the resulting junk (that
> part at least is easy).

As per IRC discussion -- it seems to work well with a simple change to handle Iop_Or* and Iop_And* folding for one constant (all bits 0 or 1) argument at all argument widths.

> +++ none/tests/amd64/Makefile.am (working copy)
> +round_CFLAGS = -frounding-math
>
> What does -frounding-math do? Is it necessary? Will we need to
> add a configure test to work with the oldest supported gcc (4.0.x)?

It's needed to prevent the compiler from rearranging or folding computations under the assumption that the rounding mode is at the default. Is there a simple way of figuring out what gcc 4.0 supported? The only thing I can find is a complaint about some issues with -frounding-math on 4.0.1, which leads me to believe that it was already supported at that point.
The old patch assumed that anything float-ish that is not obviously unaffected by rounding (e.g. max) would need the rounding args. But it turns out that according to Intel, the Recip and RSqrt families are unaffected. So this version does not carry around any rounding mode args for those.

(In reply to comment #24)
> Let me ask, for the patch set as a whole:
>
> (1) what platforms (architectures) have you tested it on?

amd64 and x86, using a combined build on a Sandy Bridge CPU, so I have also tested AVX as far as the test coverage for that goes.

> (2) are you expecting other archs (ppc? mips?) to work OK
> still, or will they need adjustment?

Anything that uses the SSE-style opFxN/opF0xN instructions will need adapting, as those grew an extra argument. It will be immediately visible if there is any test coverage for those instructions, as the IR sanity checker will fail on the generated IR (wrong arity op), and later the instruction selector will fail to work with the IR (assuming it does a toplevel switch on the arity).

> +++ priv/host_x86_isel.c (working copy)
> +   case Iop_32UtoV128: {
> +      HReg dst = newVRegV(env);
> +      X86AMode* esp0 = X86AMode_IR(0, hregX86_ESP());
> +      X86RMI* rmi = iselIntExpr_RMI(env, e->Iex.Unop.arg);
> +      addInstr(env, X86Instr_Push(rmi));
> +      addInstr(env, X86Instr_SseLdzLO(4, dst, esp0));
> +      add_to_esp(env, 4);
> +      return dst;
> +   }
> +
> +   case Iop_64UtoV128: {
> +      HReg rHi, rLo;
> +      HReg dst = newVRegV(env);
> +      X86AMode* esp0 = X86AMode_IR(0, hregX86_ESP());
> +      iselInt64Expr(&rHi, &rLo, env, e->Iex.Unop.arg);
> +      addInstr(env, X86Instr_Push(X86RMI_Reg(rHi)));
> +      addInstr(env, X86Instr_Push(X86RMI_Reg(rLo)));
> +      addInstr(env, X86Instr_SseLdzLO(8, dst, esp0));
> +      add_to_esp(env, 8);
> +      return dst;
> +   }
>
> Why did these get moved in the file? They are unrelated to FP. I
> would prefer to have an absolutely minimal diff.
They didn't; some other blocks moved over the above, to fall under a different arm of the switch(arity), and the minimal -- if a bit unhelpful -- diff is the above. It's actually "fixed" in the new patch because the fix on the part of RSqrt/Recip switched this change from a lot of movement to a lot of duplication.

> +   vassert(typeOfIRExpr(env->type_env,mode) == Ity_I32);
> +   vassert(env);
> +
> +   /* Do we need to do anything? */
> +   if (!mode)
> +      return;
>
> The "if (!mode)" is redundant, because if mode was NULL, the
> first assertion would segfault. Either check that mode is never
> NULL, or put vassert(mode) before the existing two assertions.

Err, oops. vassert(mode) is actually the correct choice: the IR sanity checker does not let missing arguments pass, so it was never optional.

> +++ priv/guest_x86_toIR.c (working copy)
>
> +UInt dis_SSE_E_to_G_all_rnd ( UChar sorb, Int delta, const HChar* opname,
>                                IROp op )
> (and various other places): Please keep line length to 80.

Hopefully fixed. A small number of places remain that are just above 80, but where I figured it wasn't worth the break in "pace" of the code.
Created attachment 79902 [details] AMD64 rounding mode support
Created attachment 79903 [details] VEX side of AMD64 rounding mode support
New perf numbers:

-- bigcode1 --
bigcode1 vg-upstream:0.10s no: 1.4s (14.4x, -----) me: 2.7s (27.2x, -----)
bigcode1 valgrind   :0.10s no: 1.5s (14.6x, -1.4%) me: 2.8s (27.8x, -2.2%)
-- bigcode2 --
bigcode2 vg-upstream:0.11s no: 3.5s (31.4x, -----) me: 7.2s (65.0x, -----)
bigcode2 valgrind   :0.11s no: 3.4s (31.0x, 1.2%) me: 7.1s (64.7x, 0.4%)
-- bz2 --
bz2 vg-upstream:0.65s no: 2.3s ( 3.5x, -----) me: 7.1s (10.9x, -----)
bz2 valgrind   :0.65s no: 2.3s ( 3.5x, 0.4%) me: 7.2s (11.1x, -1.7%)
-- fbench --
fbench vg-upstream:0.22s no: 0.9s ( 4.3x, -----) me: 3.7s (17.0x, -----)
fbench valgrind   :0.22s no: 1.0s ( 4.4x, -1.1%) me: 3.8s (17.3x, -1.6%)
-- ffbench --
ffbench vg-upstream:0.20s no: 0.8s ( 4.2x, -----) me: 2.7s (13.3x, -----)
ffbench valgrind   :0.20s no: 0.8s ( 4.2x, 0.0%) me: 2.7s (13.3x, 0.0%)
-- heap --
heap vg-upstream:0.12s no: 0.7s ( 5.8x, -----) me: 4.8s (39.8x, -----)
heap valgrind   :0.12s no: 0.7s ( 6.1x, -4.3%) me: 4.8s (39.6x, 0.6%)
-- heap_pdb4 --
heap_pdb4 vg-upstream:0.13s no: 0.8s ( 5.8x, -----) me: 7.0s (54.2x, -----)
heap_pdb4 valgrind   :0.13s no: 0.8s ( 6.1x, -3.9%) me: 7.1s (54.6x, -0.7%)
-- many-loss-records --
many-loss-records vg-upstream:0.01s no: 0.2s (20.0x, -----) me: 1.1s (110.0x, -----)
many-loss-records valgrind   :0.01s no: 0.2s (19.0x, 5.0%) me: 1.1s (110.0x, 0.0%)
-- many-xpts --
many-xpts vg-upstream:0.05s no: 0.3s ( 5.8x, -----) me: 1.7s (34.0x, -----)
many-xpts valgrind   :0.05s no: 0.3s ( 5.6x, 3.4%) me: 1.7s (33.6x, 1.2%)
-- sarp --
sarp vg-upstream:0.02s no: 0.2s (10.5x, -----) me: 1.9s (93.5x, -----)
sarp valgrind   :0.02s no: 0.2s (10.5x, 0.0%) me: 1.9s (93.5x, 0.0%)
-- tinycc --
tinycc vg-upstream:0.16s no: 1.5s ( 9.1x, -----) me: 8.4s (52.7x, -----)
tinycc valgrind   :0.16s no: 1.5s ( 9.1x, 0.0%) me: 8.4s (52.6x, 0.4%)
(In reply to comment #27)
> (In reply to comment #24)
> > (2) are you expecting other archs (ppc? mips?) to work OK
> > still, or will they need adjustment?
>
> Anything that uses the SSE-style opFxN/opF0xN instructions will need
> adapting, as those grew an extra argument. It will be immediately visible
> if there is any test coverage for those instructions, as the IR sanity
> checker will fail on the generated IR (wrong arity op), and later the
> instruction selector will fail to work with the IR (assuming it does a
> toplevel switch on the arity).

Some grepping actually shows that both PPC and ARM have hits for the Add32F* and Mul32F* ops. Since those change arity, these archs will break with these patches and need to be fixed. (Or we could think about splitting the ops into several versions.) I find this especially odd, since my impression was that the PPC implementation rounds correctly. Am I mistaken? Does PPC's Add32Fx4 not respect the rounding mode? Or does the PPC valgrind round correctly *except for* SIMD FP?
Created attachment 79934 [details] AMD64 rounding mode support Revised tests: this version has a more elaborate test written using the GCC __attribute__((vector_size(N))) extension to generate SIMD instructions. This way it exercises (I believe -- this is not verified automatically) all rounded Iop_*Fx* flavors that the amd64 backend uses. The same test should also readily adapt to other platforms.
Created attachment 79935 [details] VEX side of AMD64 rounding mode support In the last version, the *32Fx8 assembly generator did not actually generate a rounding-mode setting instruction. This has been fixed.
Is there any chance this patch could be updated to the latest valgrind, and perhaps even upstreamed? I'm facing issues with the cgal library, which needs rounding mode support.
I've given this forward-port a shot: https://github.com/nh2/valgrind/compare/amd64-rounding-vex%5E%5E...amd64-rounding-vex?expand=1 Haven't tested it yet (but will try as part of https://github.com/CGAL/cgal/issues/3180#issuecomment-438518909).