Valgrind is valgrind-3.6.1 from OpenSuSE 11.4
Created attachment 65934 [details] Sample program
Valgrind is valgrind-3.6.1 from OpenSuSE 11.4
Compiler is gcc version 4.5.1 20101208 [gcc-4_5-branch revision 167585] (SUSE Linux)
Target: x86_64-suse-linux

Test program being attached.

welinder@sequoia:~> valgrind ./a.out 45
==23829== Memcheck, a memory error detector
==23829== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==23829== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
==23829== Command: ./a.out 45
==23829==
==23829== Conditional jump or move depends on uninitialised value(s)
==23829==    at 0x4E703C0: vfprintf (in /lib64/libc-2.11.3.so)
==23829==    by 0x4E74A3F: buffered_vfprintf (in /lib64/libc-2.11.3.so)
==23829==    by 0x4E6F85D: vfprintf (in /lib64/libc-2.11.3.so)
==23829==    by 0x4E7A287: fprintf (in /lib64/libc-2.11.3.so)
==23829==    by 0x4005B4: main (vvv.c:14)
[...]

The generated assembler code for line 14 is...

        .loc 1 14 0
        movq    stderr@GOTPCREL(%rip), %rax
        movq    (%rsp), %rdx
        movq    (%rax), %rdi
.LVL2:
        xorl    %eax, %eax
        addl    %edx, %edx
        sarl    $22, %edx
        call    fprintf@PLT

Wild guess as to cause: "addl %edx, %edx" doesn't track bit-level
initialization information correctly.  It is used here in place of
a one-bit shift.
(In reply to comment #2)
> Wild guess as to cause: "addl %edx, %edx" doesn't track bit-level
> initialization information correctly.  It is used here in place of
> a one-bit shift.

It is correct that valgrind does not track defined-ness at the bit
level through addition.  Any undefined input bit will cause all output
bits to be undefined.  That is the default behaviour.  It's a
performance / precision trade-off which works well in practice.

Whether the imprecision is the cause of what you observe, I do not
know.  You can apply the patch below and see whether it helps.  It
will use a more expensive (precise) propagation scheme for some (but
not all) operations.

Index: memcheck/mc_translate.c
===================================================================
--- memcheck/mc_translate.c	(revision 12348)
+++ memcheck/mc_translate.c	(working copy)
@@ -4944,7 +4944,7 @@
    }
-   mce.bogusLiterals = bogus;
+   mce.bogusLiterals = True;

    /* Copy verbatim any IR preamble preceding the first IMark */
Florian: what's weird about this is, it's a known problem, and there
are a bunch of folding rules in ir_opt.c (fold_Expr) that do
Add32(t,t) ==> t << 1 etc.  Yet it does not appear to have had any
effect here:

   0x80484BB:  movl 28(%esp,,),%eax
   ------ IMark(0x80484BB, 4, 0) ------
   PUT(60) = 0x80484BB:I32
   t18 = Add32(GET:I32(16),0x1C:I32)
   PUT(0) = LDle:I32(t18)

   0x80484BF:  addl %eax,%eax
   ------ IMark(0x80484BF, 2, 0) ------
   PUT(60) = 0x80484BF:I32
   t21 = GET:I32(0)
   t20 = GET:I32(0)
   t19 = Add32(t21,t20)
   PUT(32) = 0x3:I32
   PUT(36) = t21
   PUT(40) = t20
   PUT(44) = 0x0:I32
   PUT(0) = t19

becomes

   ------ IMark(0x80484BB, 4, 0) ------
   PUT(60) = 0x80484BB:I32
   t47 = LDle:I32(Add32(t5,0x1C:I32))

   ------ IMark(0x80484BF, 2, 0) ------
   t19 = Add32(t47,t47)

That should have been transformed into "t19 = Shl32(t47, 1)".  I have
no idea offhand why that didn't happen.  FTR, the relevant folding
rule is at ir_opt.c:1628.

Maybe it's a phase ordering problem, in the sense that the relevant
preconditions to make the folding rule work only appear too late in
the IR optimisation process.  One way to check that is to ask for an
extra, final run of the relevant folding pass (cprop_BB) by adding a
final call to it in cheap_transformations().  I suspect that won't
help though.

It would be good to get to the bottom of this.  It makes me nervous
when the IR optimiser doesn't work the way I expect.
(In reply to comment #4)
> Maybe it's a phase ordering problem,

No it's not.  The control flow in fold_Expr has issues :)
Cut'n'pasted from ir_opt.c:

-->      if (e->Iex.Binop.op == Iop_Add32
              || e->Iex.Binop.op == Iop_Or32
              || e->Iex.Binop.op == Iop_Max32U) {
            if (isZeroU32(e->Iex.Binop.arg2))
               e2 = e->Iex.Binop.arg1;
            else if (isZeroU32(e->Iex.Binop.arg1))
               e2 = e->Iex.Binop.arg2;
         } else

         /* Sub64(x,0) ==> x */
-->      if (e->Iex.Binop.op == Iop_Sub64 && isZeroU64(e->Iex.Binop.arg2)) {
            e2 = e->Iex.Binop.arg1;
         } else

         /* Add32(t,t) ==> t << 1.  Memcheck doesn't understand that
            x+x produces a defined least significant bit, and it seems
            simplest just to get rid of the problem by rewriting it out,
            since the opportunity to do so exists. */
-->      if (e->Iex.Binop.op == Iop_Add32

The three conditions (-->) appear chained, but they are not.  It's the
indentation that is fooling us.  In fact they are nested:

   if (e->Iex.Binop.op == Iop_Add32 ....) {
      ...
   } else if (e->Iex.Binop.op == Iop_Sub64 && isZeroU64(e->Iex.Binop.arg2)) {
      ...
   } else if (e->Iex.Binop.op == Iop_Add32 ...

Since the 1st condition is true we never do the optimization
Add32(t,t) ==> t << 1.

Stared at it for a while before I saw it.  I guess you only see what
you want to see.  We should rewrite the code and use a switch stmt
instead.
> Stared at it for a while before I saw it.  I guess you only see what
> you want to see.  We should rewrite the code and use a switch stmt
> instead.

Oh, well spotted.  Yep, I never noticed this before :-(  Basically a
stupidity in the pattern matching "logic".

If we don't want to rewrite this with a switch (which I suspect is a
bit more complex than it sounds) then another alternative is to
transform the current scheme

   if (pattern1 applies) {
      if (further checks on pattern1)
         e2 = replacement expression;
   }
   else if (pattern2 applies) {
      e2 = replacement expression;
   }

with

   if (pattern1 applies) {
      if (further checks on pattern1) {
         e2 = replacement expression;
         goto out;
      }
   }
   if (pattern2 applies) {
      e2 = replacement expression;
      goto out;
   }

where out: is a label near the end of the procedure.

Why this has failed is that we have "pattern1 applies == True" but
"further checks on pattern1 == False" (e->Iex.Binop.op == Iop_Add32,
but neither of the args is zero), which stops it trying later to match
pattern2 (Add32(x,x)).  The version with "goto out;" and no
else-chaining would avoid this problem.

Well, not bad.  This code has been in production use for .. 7 ish
years before anybody noticed the problem :-)
(In reply to comment #6)
> If we don't want to rewrite this with a switch (which I suspect is
> a bit more complex than it sounds) then another alternative is to
> transform the current scheme
> [snip]
> The version with "goto out;" and no else-chaining would
> avoid this problem.

Yes, it would.  And transforming the status quo into what you propose
would be more mechanical than the conversion to a switch stmt.

Using a switch has two advantages, though:

(1) All rules related to a certain IROp are grouped together in a
    single spot.  That is good for extensibility.  You only have to
    look in one place to see whether you already covered a certain
    pattern.  It's also easier to proofread and understand.

(2) It avoids repeated checking of the same condition.  If we
    transformed the snippet from comment #5 we'd be checking for Add32
    twice (if the first pattern does not evaluate to true).

Since the IROps we deal with here are encoded as contiguous integers,
the compiler can use a branch table to optimize the switch.  Should
help squeeze a cycle here and there.

I could cook up a patch using a switch, if you agree we should take
that route.
(In reply to comment #7) > I could cook up a patch using a switch, if you agree we should take > that route. Well, if you have enthusiasm, yes definitely. Obviously the switch thing is the right way to go -- my proposal in comment #6 is at best a half-hearted kludge. (random comment) One way to sanity check that at least as much folding is going on afterwards compared to before is to run --tool=none --stats=yes and look at these lines --13925-- transtab: new 3,165 (72,315 -> 470,579; ratio 65:10) [0 scs] and check that the number after the arrow (the size of generated code) is the same or smaller than when you started.
(In reply to comment #8)
> One way to sanity check that at least as much folding is going on
> afterwards compared to before is to run --tool=none --stats=yes and
> look at these lines
>
>   --13925-- transtab: new 3,165 (72,315 -> 470,579; ratio 65:10) [0 scs]
>
> and check that the number after the arrow (the size of generated code)
> is the same or smaller than when you started.

Thanks for the tip.  Below is a patch.  Here are the transtab results.
The number to the right of ==> is the new number and, as expected, it
is always smaller.

none:
-----
bz2:        3,883  (  110,915 ->    543,146 ==>    543,127)
bigcode1:  29,942  (  376,394 ->  2,472,272 ==>  2,470,386)
bigcode2: 114,315  (1,392,631 ->  9,267,107 ==>  9,259,596)
fbench:     2,309  (   50,316 ->    294,823 ==>    294,810)
ffbench:    1,987  (   41,529 ->    233,856 ==>    233,831)
heap:       1,908  (   39,245 ->    216,857 ==>    216,850)
sarp:       1,536  (   32,130 ->    176,665 ==>    176,658)
tinycc:     5,168  (  121,998 ->    669,416 ==>    669,393)

memcheck:
---------
bz2:        3,879  (  110,539 ->  1,471,728 ==>  1,471,618)
bigcode1:  30,104  (  379,444 ->  5,755,550 ==>  5,751,738)
bigcode2: 114,477  (1,395,681 -> 21,212,777 ==> 21,197,715)
fbench:     2,402  (   52,278 ->    795,091 ==>    794,997)
ffbench:    2,048  (   42,412 ->    637,931 ==>    637,827)
heap:       1,902  (   38,812 ->    581,547 ==>    581,485)
sarp:       1,754  (   36,172 ->    540,144 ==>    540,082)
tinycc:     4,766  (  110,618 ->  1,609,684 ==>  1,609,502)

I could observe the OP's problem on my machine as well, and with this
patch memcheck no longer reports anything.

The patch introduces a new function isZeroU which allows the grouping
of like IROps.  That makes sense because we're dealing with algebraic
transformations here where the width of the operands is irrelevant.
The patch is careful not to introduce any new patterns or to apply
patterns to new IROps.  That is something for a followup patch.

I've tested the patch on x86-linux with no new regressions.
Created attachment 68282 [details] rewrite algebraic transformations patch
Thanks for looking into this.

Just out of curiosity, why does And32 get checks for and(x,0) and
and(x,-1) when the other sizes do not?

And why is there an or(x,0) check, but no xor(x,0) check?

Ditto shl(0,x) vs shr(0,x).
(In reply to comment #10)
> Created an attachment (id=68282) [details]
> rewrite algebraic transformations

Excellent!  Thanks for doing that.  Please commit.
(In reply to comment #11)
> Just out of curiosity, why does And32 get checks for and(x,0) and
> and(x,-1) when the other sizes do not?
>
> And why is there a or(x,0), but no xor(x,0) check?
>
> Ditto shl(0,x) vs shr(0,x).

Because the addition of the folding rules was done on an ad-hoc basis,
by profiling a run (--profile-flags=10001000), looking for missing
foldings in the output, and adding rules.  That has the advantage
(from a safety point of view) of only adding rules when (in effect) a
test case for them is known to exist.  I'm sure there are missing
folding opportunities, but I'd also bet they are pretty rare.
(In reply to comment #12)
> (In reply to comment #10)
> > Created an attachment (id=68282) [details] [details]
> > rewrite algebraic transformations
>
> Excellent!  Thanks for doing that.  Please commit.

Done in VEX r2244 and valgrind r12359.
For some systems VEX r2244 and valgrind r12359 were not good enough
and the memcheck messages were still issued.  This is now fixed for
good in VEX r2246.  That's what we think, at least...

See also the thread "valgrind: r12359 - in trunk: . memcheck/tests" on
valgrind-developers for more info.