Summary: | 80-bit floats are not supported on x86 and x86-64 | ||
---|---|---|---|
Product: | [Developer tools] valgrind | Reporter: | Nicholas Nethercote <njn> |
Component: | general | Assignee: | Julian Seward <jseward> |
Status: | ASSIGNED --- | ||
Severity: | normal | CC: | abominable-snowman, baldrick, bangerth, bdavis9659, bugdal, chris, ianb, imbaczek, ivosh, m.doppler, mark, mfranc, pdbarnes, philippe.waroquiers, phma, risto.vanhanen, rudolf.hornig, sam, tom.vercauteren, tom, vincent-kde |
Priority: | HI | ||
Version: | 3.5 SVN | ||
Target Milestone: | --- | ||
Platform: | unspecified | ||
OS: | All | ||
Attachments: | demonstrate valgrind accuracy problem with libm acos() |
Description
Nicholas Nethercote
2009-06-26 02:37:16 UTC
*** Bug 121029 has been marked as a duplicate of this bug. ***

*** Bug 147241 has been marked as a duplicate of this bug. ***

*** Bug 117742 has been marked as a duplicate of this bug. ***

*** Bug 201670 has been marked as a duplicate of this bug. ***

The following test case shows a minimal portable code example for which limiting the precision of long doubles to 64 bits on a system that has 80-bit long doubles implies changing some standard semantics:

#include <cassert>
#include <limits>

int main()
{
    // std::numeric_limits<long double>::min() is typically close to
    // 3.3621e-4932
    assert( std::numeric_limits<long double>::min() > 0.0L );
}

This code only asserts within valgrind. More discussion can be found on the mailing list:
http://sourceforge.net/mailarchive/forum.php?thread_name=28392e8b0908120240lbf4c314qb4086e6c17731f7e%40mail.gmail.com&forum_name=valgrind-users

*** Bug 188984 has been marked as a duplicate of this bug. ***

(In reply to comment #6)
> *** Bug 188984 has been marked as a duplicate of this bug. ***
That bug has a small test case that is worth looking at.

Created attachment 43276 [details]
demonstrate valgrind accuracy problem with libm acos()
Even for programs that don't explicitly use variables of type 'long double' there are visible consequences of this omission.
On x86, glibc's acos(x) basically computes fpatan(fsqrt(1-x*x),x) (where fpatan and fsqrt are the x87 instructions). acos(x) = atan2(sqrt(1-x*x),x) is a trigonometric identity, and when the temporaries are stored with 'long double' precision it also gives accurate results for all 'double' arguments.
When the subexpression 1-x*x is computed with only "double" precision, acos() becomes very inaccurate for values close to 1. In the following example, 18 LSBs differ when running on valgrind.
$ g++ -O2 ac.cc
$ ./a.out .999999; valgrind --log-file=/dev/null ./a.out .999999
acos( 0.999999) -> 1.4142136802445865e-03 [0X1.72BA46065AF1CP-10]
acos( 0.999999) -> 1.4142136802524064e-03 [0X1.72BA460663BFBP-10]
differing LSBs ^^^^^
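
The loss described above can be reproduced without valgrind by forcing the temporaries to a chosen precision. What follows is only a sketch under stated assumptions, not glibc's code and not the attached ac.cc: `acos_via_atan2`, `precision_gap` and `kLongDoubleIsWider` are hypothetical names introduced here, and a visible gap is only expected on targets where long double is wider than double (such as the 80-bit x87 format).

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>

// True where long double carries more mantissa bits than double
// (for example the 80-bit x87 format). Hypothetical name.
const bool kLongDoubleIsWider = LDBL_MANT_DIG > DBL_MANT_DIG;

// Hypothetical helper (not part of glibc or valgrind): evaluates
// acos(x) = atan2(sqrt(1 - x*x), x) with temporaries of type T.
// The volatile stores force each intermediate to be rounded to T, so the
// compiler cannot keep extra precision or fuse 1 - x*x into an FMA.
template <typename T>
double acos_via_atan2(double x)
{
    assert(x >= -1.0 && x <= 1.0);
    volatile T xx = static_cast<T>(x) * static_cast<T>(x);
    volatile T one_minus_xx = static_cast<T>(1) - xx;
    T s = std::sqrt(static_cast<T>(one_minus_xx));
    return static_cast<double>(std::atan2(s, static_cast<T>(x)));
}

// Relative difference between the long double and double evaluations.
inline double precision_gap(double x)
{
    double wide = acos_via_atan2<long double>(x);
    double narrow = acos_via_atan2<double>(x);
    return std::fabs(wide - narrow) / std::fabs(wide);
}
```

Calling `precision_gap(0.999999)` compares the two evaluations for the argument used in the example above. With T = double the rounding error of x*x is magnified by the cancellation in 1 - x*x, which is the same mechanism that affects the x87 sequence once its temporaries are held in 64 bits.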
As per comment #0, adding support for 80-bit floats is low priority, because (1) AIUI the majority of floating point code is portable and restricts itself to 64-bit values, and (2) doing 80-bit support will soak up a considerable amount of engineering effort. So it's not an easy case to make, and we are already extremely resource-constrained w.r.t. development effort. If anyone wants to hack up a patch to do this I would be at least willing to review it and provide feedback.

Another possibility is to add support for a mode where 80 bit float operations are executed natively, i.e. valgrind does not try to track uninitialized bits etc in floats. Hopefully this would be simpler to implement. In my case this would be helpful because the problem I have is not that valgrind isn't catching use of uninitialized float values. The problem is that due to valgrind rounding to 64 bits, programs run under valgrind behave differently to programs run natively: if they run long enough different code paths are taken. It can occur that when run natively there is a memory error (such as use after free) that does not show up when run under valgrind because of this.

Just used a day to find this out the hard way. I am willing to use another if it gets fixed. So, can a know-nothing-about-how-valgrind-works implement support in a day? :)

Another instance/example of this bug hitting/confusing people can be found in the fedora bug tracker: https://bugzilla.redhat.com/show_bug.cgi?id=837650

*** Bug 130358 has been marked as a duplicate of this bug. ***

(In reply to comment #13)
> *** Bug 130358 has been marked as a duplicate of this bug. ***
This bug has a small Ada test case.
A fortran example: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46703
Note that the 80 bit floating format is popular in the fortran community (aka REAL*10).

I would like to ask that this issue be prioritized, as it's a show-stopper for using valgrind on programs linked with musl libc that perform floating point/decimal conversions with printf/scanf/strtod-family functions. musl always performs such computations in long double, and has very good reason to do so: on targets where FLT_EVAL_METHOD==2, it's impossible to utilize rounding with float or double in a predictable, portable manner, since the computations will actually take place in long double, and the results will be double-rounded (rounded twice). By using long double and the float.h LDBL_* macros describing the long double representation, we're able to give correct results with a single unified implementation on all targets where long double has IEEE semantics (which is a documented requirement for musl).

However, since valgrind changes the properties of long double not to match the compile-time properties reported by float.h, our (valid) assumptions are broken and the code does not behave as expected. I have not performed testing (presently, I believe there may still be some other obstacles to using valgrind with musl) to determine how badly the results are affected, but it's conceivable that they could be completely wrong (as opposed to just lacking precision or correct rounding).

I'd like to echo Rich Felker's comment #16: we also have code that is sensitive to the available precision. We use std::numeric_limits<long double>::epsilon () to condition on that, but, as Tom Vercauteren (comment #5) puts it, valgrind silently using 64-bit doubles means the semantics are wrong.

+20 votes

A question for my edification: if valgrind is silently replacing 80-bit floating point variables with 64-bits, doesn't that change the memory layout?
In the case of adjacent data members not 64 bit aligned, doesn't that change their offsets, and hence the validity of valgrind results? (It's checking some other code base, with 64-bit floats, not my code base, with 80-bit floats.) Or does it keep the 80-bit variable, but only do 64-bit arithmetic and function calls?

Obviously it can't change the memory layout. In theory it could write doubles padded with 2 zero/junk bytes, but of course that would also break a lot of software (anything inspecting the representation). As far as I know what it really does is simply perform the arithmetic in doubles, but load/store the correct ld80 representation when requested to.

I would actually like to get this fixed, at least for 64-bit processes. The resulting non-usefulness of the tools for some groups of users bothers me, but we are extremely resource constrained, and this is a significant amount of work.

Would anybody be willing to help out by writing a proper test program? It would need to exercise all the relevant x87 instructions individually, in such a way that it makes clear when the implementation of an instruction is correct (no accuracy loss) vs when it is incorrect. It would have to be fairly convincing, along the lines of none/tests/amd64/sse4-64.c, perhaps.

If such a test case did exist, I would be a lot more motivated to grapple with the compilation pipeline (VEX) aspects of the fix. Or, for that matter, to help out anybody who wanted to try fixing it themselves.

Do you have any objection if I forward this e-mail to the gfortran mailing list? There might be a user interested enough to help out in that group.
regards,
Bud Davis

________________________________
From: Julian Seward <jseward@acm.org>
To: bdavis9659@sbcglobal.net
Sent: Friday, May 9, 2014 7:36 AM
Subject: [valgrind] [Bug 197915] 80-bit floats are not supported on x86 and x86-64

https://bugs.kde.org/show_bug.cgi?id=197915

--- Comment #20 from Julian Seward <jseward@acm.org> ---
I would actually like to get this fixed, at least for 64-bit processes. The resulting non-usefulness of the tools for some groups of users bothers me, but we are extremely resource constrained, and this is a significant amount of work.

Would anybody be willing to help out by writing a proper test program? It would need to exercise all the relevant x87 instructions individually, in such a way that it makes clear when the implementation of an instruction is correct (no accuracy loss) vs when it is incorrect. It would have to be fairly convincing, along the lines of none/tests/amd64/sse4-64.c, perhaps.

If such a test case did exist, I would be a lot more motivated to grapple with the compilation pipeline (VEX) aspects of the fix. Or, for that matter, to help out anybody who wanted to try fixing it themselves.

(In reply to comment #21)
> Do you have any objection if I forward this e-mail to the gfortran mailing
> list ?
> There might be a user interested enough to help out in that group.
Valgrind bugzilla is publicly accessible, so no problem. Note that it is better to have the link to this bug forwarded, so that suggestions, help, and so on are recorded here.

Hi Julian,

Testing 80-bit x87 FPU instructions.

I would like to have a try at writing a basic version of this program. Is the reason for the 64-bit restriction because it limits the x87 instructions to their latest, and probably final, versions? Which is fine.
I propose to write a stand-alone program, in C (GCC 4.9) with some inline asm, that exercises all the FPU related instructions as documented in

Intel® 64 and IA-32 Architectures Software Developer’s Manual, February 2014

For each instruction the results would be compared to the known correct value as obtained from a prior run without valgrind. A range of values should be tried along with some edge cases such as subnormals, infinities and nan. Where applicable, each of the four rounding modes would be tested.

The documented FPU flags and the floating-point exceptions will be checked (but not the Protected Mode exceptions or the Real-address mode exceptions). In the first cut anyway, the exceptions would be checked by simply examining the status word.

The state of the FPU stack should be checked.

Let me know if you think this is useful and I'll make a start. It's quite a lot of work.

Regards
Jeremy

On 9 May 2014 12:36, Julian Seward <jseward@acm.org> wrote:
> https://bugs.kde.org/show_bug.cgi?id=197915
>
> --- Comment #20 from Julian Seward <jseward@acm.org> ---
> I would actually like to get this fixed, at least for 64-bit
> processes. The resulting non-usefulness of the tools for some groups
> of users bothers me, but we are extremely resource constrained, and
> this is a significant amount of work.
>
> Would anybody be willing to help out by writing a proper test program?
> It would need to exercise all the relevant x87 instructions
> individually, in such a way that it makes clear when the
> implementation of an instruction is correct (no accuracy loss) vs when
> it is incorrect. It would have to be fairly convincing, along the
> lines of none/tests/amd64/sse4-64.c, perhaps.
>
> If such a test case did exist, I would be a lot more motivated to
> grapple with the compilation pipeline (VEX) aspects of the fix. Or,
> for that matter, to help out anybody who wanted to try fixing it
> themselves.

Re: the 64-bit remarks, is this about doing the tests only on 64-bit, or about only supporting correct x87 fpu emulation on x86_64 (and not 32-bit x86)? To address our usage case (programs using musl libc) it's highly desirable for both 32-bit and 64-bit apps to work.

While I agree having instruction-level tests would be nice, I question whether it's really essential to a first effort at fixing this. There are plenty of existing x87 software implementations (libgcc, Linux kernel [before it was removed], qemu, dosbox, etc.) that could be used as a guide for implementation or even reused directly if there are no license problems. Tests could then be used to look for errors and tweak the behavior: first, high-level tests compiled from C sources testing floating point behavior, and later, instruction-level tests.

(In reply to comment #23)
> Is the reason for the 64-bit restriction because it limits the x87
> instructions to their latest, and probably final, versions?
AFAIK there only has ever been one version of the x87 instruction set. The restriction to 64 bit accuracy is because implementing 80 bit is a lot of hassle and it isn't necessary for the majority of (portable, IEEE754 compliant) code.

> I propose to write a stand-alone program, in C (GCC 4.9) with some inline
> asm, that exercises all the FPU related instructions as documented in
>
> Intel® 64 and IA-32 Architectures Software Developer’s Manual, February
> 2014
>
> For each instruction the results would be compared to the known correct
> value as obtained from a prior run without valgrind.
> A range of values should be tried along with some edge cases such as
> subnormals, infinities and nan. Where applicable, each of
> the four rounding modes would be tested.
The rounding modes are only applicable for integer-fp conversions and for fp-fp format conversions (maybe). For normal math (+, -, etc) they are ignored, at least on the x86 and x86_64 implementation.
> The documented FPU flags and the floating-point exceptions will be
> checked
Don't bother with checking exceptions. V doesn't simulate FP exceptions.

> Let me know if you think this is useful and I'll make a start. It's quite a
> lot of work.
It does sound useful. I would recommend you study some of the other test programs, especially the 64-bit SSE4 test program, to see the general style. In general, for each insn and each test case, you need to do: get the FPU in a known state (FINIT); load operands; do the instruction; dump the FPU state.

(In reply to comment #24)
> x86)? To address our usage case (programs using musl libc) it's highly
> desirable for both 32-bit and 64-bit apps to work.
I propose only to implement this for 64 bit apps. I regard 32 bit x86 as legacy and in maintenance mode only. Even now, the 32 bit coverage is far behind the 64 bit case.

> While I agree having instruction-level tests would be nice,
Not nice -- essential. If we compile from source, (1) there's no guarantee that gcc will produce exactly the instruction you want to test, and no other junk, and (2) there's no way to test instructions that gcc doesn't generate.

> There are plenty of existing x87 software implementations (libgcc,
> Linux kernel [before it was removed], qemu, dosbox, etc.) that could
> be used as a guide for implementation
What I'm looking for is a way to verify that the implementation is correct and remains correct in future. How to actually implement this stuff is not a problem.

On Mon, May 19, 2014 at 09:16:09AM +0000, Julian Seward wrote:
> I propose only to implement this for 64 bit apps. I regard 32 bit x86
> as legacy and in maintenance mode only. Even now, the 32 bit coverage
> is far behind the 64 bit case.

Since the exact same code should be usable for both, I don't see any motivation for omitting use of it on 32-bit machines. Usage of the x87 fpu is actually much more common in 32-bit code.
> > While I agree having instruction-level tests would be nice,
>
> Not nice -- essential. If we compile from source, (1) there's no
> guarantee that gcc will produce exactly the instruction you want
> to test, and no other junk, and (2) there's no way to test
> instructions that gcc doesn't generate.

I agree completely with points 1 and 2, but I don't think that makes it essential. Statistically all the important instructions are extremely likely to get coverage, and the ones that compilers don't generate are not relevant to the vast vast majority of real-world code.

> > There are plenty of existing x87 software implementations (libgcc,
> > Linux kernel [before it was removed], qemu, dosbox, etc.) that could
> > be used as a guide for implementation
>
> What I'm looking for is a way to verify that the implementation is
> correct and remains correct in future. How to actually implement this
> stuff is not a problem.

It would be hard to be worse than the implementation right now, which is just completely wrong. That's why I'm saying an incremental approach to testing is worth considering.

(In reply to comment #27)
> On Mon, May 19, 2014 at 09:16:09AM +0000, Julian Seward wrote:
> > I propose only to implement this for 64 bit apps. I regard 32 bit x86
> > as legacy and in maintenance mode only. Even now, the 32 bit coverage
> > is far behind the 64 bit case.
>
> Since the exact same code should be usable for both, I don't see any
> motivation for omitting use of it on 32-bit machines. Usage of the x87
> fpu is actually much more common in 32-bit code.
No, the same code will not be usable because the amd64 code has been completely refactored to support newer instruction encodings and the x86 code has not, so there are large differences between them, which means code cannot simply be copied from one to the other.

> > > While I agree having instruction-level tests would be nice,
> >
> > Not nice -- essential. If we compile from source, (1) there's no
> > guarantee that gcc will produce exactly the instruction you want
> > to test, and no other junk, and (2) there's no way to test
> > instructions that gcc doesn't generate.
>
> I agree completely with points 1 and 2, but I don't think that makes
> it essential. Statistically all the important instructions are
> extremely likely to get coverage, and the ones that compilers don't
> generate are not relevant to the vast vast majority of real-world
> code.
That's fine in theory, but in practice when somebody does hit one of those rare instructions we want it to fail hard, not just produce subtly wrong results - a bug which causes us to silently execute code incorrectly is at best a major pain to debug and at worst may never even get noticed.

> > > There are plenty of existing x87 software implementations (libgcc,
> > > Linux kernel [before it was removed], qemu, dosbox, etc.) that could
> > > be used as a guide for implementation
> >
> > What I'm looking for is a way to verify that the implementation is
> > correct and remains correct in future. How to actually implement this
> > stuff is not a problem.
>
> It would be hard to be worse than the implementation right now, which
> is just completely wrong. That's why I'm saying an incremental
> approach to testing is worth considering.
But a software emulation is not what valgrind does or is looking for - you seem to be very confused about what valgrind actually does.

What valgrind does is to decompile the code, instrument it, and then turn it back into native code. It doesn't emulate the decompiled code using a software FP library.

On Mon, May 19, 2014 at 12:57:55PM +0000, Tom Hughes wrote:
> > > I propose only to implement this for 64 bit apps. I regard 32 bit x86
> > > as legacy and in maintenance mode only. Even now, the 32 bit coverage
> > > is far behind the 64 bit case.
> >
> > Since the exact same code should be usable for both, I don't see any
> > motivation for omitting use of it on 32-bit machines. Usage of the x87
> > fpu is actually much more common in 32-bit code.
>
> No, the same code will not be usable because the amd64 code has been completely
> refactored to support newer instruction encodings and the x86 code has not so
> there are large differences between them which means code cannot simply be
> copied from one to the other.

Perhaps I misunderstand you, but my idea is that you would not be modifying the instruction decoding/handling code at all, but rather replacing the backend it calls for floating point. Even if it's all inline right now, the new correct 80-bit code would be large enough that it should probably be factored (at least at the source level, even if it's still macros/inlines) and bolting it onto both the 32-bit and 64-bit versions should not be significantly more work...

> > I agree completely, with points 1 and 2, but I don't think that makes
> > it essential. Statistically all the important instructions are
> > extremely likely to get coverage, and the ones that compilers don't
> > generate are not relevant to the vast vast majority of real-world
> > code.
>
> That's fine in theory, but in practice when somebody does hit one of those rare
> instructions we want it to fail hard, not just produce subtly wrong results - a
> bug which causes us to silently execute code incorrectly is at best a major
> pain to debug and at worst may never even get noticed.

Right now it's producing not-so-subtly wrong results, just silently doing the wrong thing. So in principle it wouldn't be any worse than the status quo. Of course someone could introduce an even bigger error somewhere but that's always a possibility.

> But a software emulation is not what valgrind does or is looking for - you seem
> to be very confused about what valgrind actually does.

Yes, somewhat.
I was under the impression that it was emulating the fpu with C code using doubles.

> What valgrind does is to decompile the code, instrument it, and then turn it
> back into native code. It doesn't emulate the decompiled code using a software
> FP library.

Then how is it throwing away proper 80-bit support? Just by generating gratuitous load/store at low precision? It seems like this should be much easier to fix...

> AFAIK there only has ever been one version of the x87 instruction set.

There have been some additions, the most recent being FISTTP which came in with sse3. The others (FCMOVE etc) came in for early Pentiums as did the accuracy and range improvements to the transcendentals.

GCC uses FISTTP where available for the float to int truncating cast, but I guess it's preferable to hand code it in asm, using cpuid to check for sse3 beforehand. The older changes must be present in 64-bit CPUs.

> Don't bother with checking exceptions. V doesn't simulate FP exceptions.

But does it set the cumulative exception flags in the status word?? If not the test code could simply zero those flags when it dumps the state.

The transcendentals, and load constants, are affected by rounding modes too.

> In general, for each insn and each test case, you
> need to do: get the FPU in a known state (FINIT); load operands;
> do the instruction; dump the FPU state.

Thanks. Yes, FNSAVE dumps it all, including the stack - and kindly does an FINIT afterwards to set up for the next instruction test.

When it's done, I'll send you:
fpu-64.c
fpu-64.stderr.exp /* empty */
fpu-64.stdout.exp
fpu-64.vgtest

Can I assume C99 by the way?

Regards,
Jeremy

On 19 May 2014 10:09, Julian Seward <jseward@acm.org> wrote:
> https://bugs.kde.org/show_bug.cgi?id=197915
>
> --- Comment #25 from Julian Seward <jseward@acm.org> ---
> (In reply to comment #23)
> > Is the reason for the 64-bit restriction because it limits the x87
> > instructions to their latest, and probably final, versions?
>
> AFAIK there only has ever been one version of the x87 instruction set.
> The restriction to 64 bit accuracy is because implementing 80 bit is a
> lot of hassle and it isn't necessary for the majority of (portable,
> IEEE754 compliant) code.
>
> > I propose to write a stand-alone program, in C (GCC 4.9) with some inline
> > asm, that exercises all the FPU related instructions as documented in
> >
> > Intel® 64 and IA-32 Architectures Software Developer’s Manual, February 2014
> >
> > For each instruction the results would be compared to the known correct
> > value as obtained from a prior run without valgrind.
> > A range of values should be tried along with some edge cases such as
> > subnormals, infinities and nan. Where applicable, each of
> > the four rounding modes would be tested.
>
> The rounding modes are only applicable for integer-fp conversions and
> for fp-fp format conversions (maybe). For normal math (+, -, etc)
> they are ignored, at least on the x86 and x86_64 implementation.
>
> > The documented FPU flags and the floating-point exceptions will be
> > checked
>
> Don't bother with checking exceptions. V doesn't simulate FP exceptions.
>
> > Let me know if you think this is useful and I'll make a start. It's quite a
> > lot of work.
>
> It does sound useful. I would recommend you study some of the other
> test programs, especially the 64-bit SSE4 test program, to see the
> general style. In general, for each insn and each test case, you
> need to do: get the FPU in a known state (FINIT); load operands;
> do the instruction; dump the FPU state.

(In reply to comment #30)
> GCC uses FISTTP where available for the float to int truncating cast,
> but I guess it's preferable to hand code it in asm,
Definitely. Better not to assume gcc will generate any given instruction.

> > Don't bother with checking exceptions. V doesn't simulate FP exceptions.
> But does it set the cumulative exception flags in the status word??
No. V doesn't have any awareness of FP exceptions. It's as if it lives in a world where such things don't exist.

> Can I assume C99 by the way?
Yes.

Ping. Today in #musl we had another user who was experiencing 1.2==atof("1.2") evaluating to false. After spending a while trying to diagnose it, it turned out they were running under valgrind. Is something blocking fixing this issue still?

(In reply to Rich Felker from comment #32)
> Is something blocking fixing this issue still?
Lack of skilled manpower :-)
Feel free to work this issue!

I just ran into this while debugging Bezitopo. I wrote the isTooCurly method and found that it takes about 100 times as long when compiled by gcc as when compiled by clang. So I ran it in Valgrind (callgrind) to see what's different. It produced bizarre output. I debugged the code in Valgrind and found that the line

precision=nextafterl(bigpart,2*bigpart)-bigpart;

in spiral.cpp produced 0 when bigpart is about 1e80. Dumping the variables internal to the cornu function showed numbers like 3.33236731613775325228e+4605 (garbage), but it appears unable to compute a number bigger than about 5e307.

Versions:
valgrind-3.15.0
Linux puma 5.3.0-7625-generic #27~1576774560~19.10~f432cd8-Ubuntu SMP Thu Dec 19 20:35:37 UTC x86_64 x86_64 x86_64 GNU/Linux
gcc (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008
clang version 9.0.0-2 (tags/RELEASE_900/final)
Eoan Ermine.

*** Bug 421262 has been marked as a duplicate of this bug. ***
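
Several reports in this thread (the numeric_limits<long double>::min() test case, the epsilon() remark and the nextafterl line) come down to the same question: does long double arithmetic at run time match what the headers promise? The following is a minimal sketch of a self-check a program could run at startup. `long_double_arithmetic_is_faithful` and `require_faithful_long_double` are hypothetical names, not valgrind or libc APIs. Going by the reports above, the checks are expected to fail under valgrind on x86/x86-64, but that outcome has not been verified here.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper, not a valgrind API: returns true if long double
// arithmetic at run time behaves as std::numeric_limits describes it.
// volatile keeps the compiler from folding the checks at compile time.
inline bool long_double_arithmetic_is_faithful()
{
    typedef std::numeric_limits<long double> lim;

    // Check 1: the smallest normal value must not collapse to zero.
    volatile long double tiny = lim::min();
    bool min_ok = tiny > 0.0L;

    // Check 2: 1 + epsilon must be distinguishable from 1.
    volatile long double one = 1.0L;
    volatile long double eps = lim::epsilon();
    volatile long double sum = one + eps;
    bool eps_ok = sum != one;

    // Check 3: the distance from 1e80 to the next representable value
    // must still be nonzero after the subtraction.
    volatile long double big = 1e80L;
    long double b = big;
    volatile long double gap = std::nextafter(b, 2 * b) - b;
    bool gap_ok = gap > 0.0L;

    return min_ok && eps_ok && gap_ok;
}

// Example use: stop early instead of silently taking different code paths.
inline void require_faithful_long_double()
{
    assert(long_double_arithmetic_is_faithful());
}
```

A program that depends on long double semantics could call `require_faithful_long_double()` at startup, so that a run with reduced precision aborts with a clear message rather than diverging later.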