trippels@gcc2-power8 ~ % cat example.c
#include <string.h>

int main ()
{
  char str1[15];
  char str2[15];
  strcpy(str1, "abcdef");
  strcpy(str2, "ABCDEF");
  return strcmp(str1, str2);
}
trippels@gcc2-power8 ~ % gcc -O2 -g example.c
trippels@gcc2-power8 ~ % valgrind ./a.out
==28988== Memcheck, a memory error detector
==28988== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==28988== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info
==28988== Command: ./a.out
==28988==
==28988== Conditional jump or move depends on uninitialised value(s)
==28988==    at 0x100004E8: main (example.c:9)

A new strcmp optimization was added to gcc-7. Since then valgrind shows these bogus errors, so some kind of suppression would be needed.
Created attachment 108876 [details]
assembler code

Also happens with v3.13:

==142441== Memcheck, a memory error detector
==142441== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==142441== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==142441== Command: ./a.out
==142441==
==142441== Conditional jump or move depends on uninitialised value(s)
==142441==    at 0x10000508: main (example.c:9)

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80479#c10 for analysis.
This is caused by insufficiently precise definedness analysis for

  (1) count leading zero operations (Iop_Clz32, Iop_Clz64)
  (2) PPC integer compares (Iop_CmpORD family)
  (3) integer subtract

Here's an initial, hacky patch that fixes (1) and (2). Along with use of the flag --expensive-definedness-checks=yes, which fixes (3), it makes the test case run quiet.

Can you give it a try on something larger?
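(Note for reference: the Clz imprecision in (1) has an exact scalar characterisation. This is just a sketch of the idea, not the patch; clz64_result_is_defined is an invented name, and 1-bits in the V-bits word mean "undefined".)

#include <stdint.h>

/* Sketch: clz(value) cannot depend on the undefined bits exactly when
   some defined 1-bit lies above every undefined bit -- the leading-zero
   count is then fixed by that bit alone. */
static int clz64_result_is_defined(uint64_t value, uint64_t vbits)
{
   uint64_t defined_ones = value & ~vbits;
   if (defined_ones == 0)
      return vbits == 0;  /* value is all zeroes and fully defined */
   int top_defined_one = 63 - __builtin_clzll(defined_ones);
   int top_undefined   = vbits ? 63 - __builtin_clzll(vbits) : -1;
   return top_defined_one > top_undefined;
}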
Created attachment 108905 [details]
Initial attempt at a fix
Created attachment 108906 [details]
Initial attempt at a fix

Second attempt, attaching the correct patch this time.
(In reply to Julian Seward from comment #2)
> This is caused by insufficiently precise definedness analysis for
>   (1) count leading zero operations (Iop_Clz32, Iop_Clz64)
>   (2) PPC integer compares (Iop_CmpORD family)
>   (3) integer subtract
>
> Here's an initial, hacky patch that fixes (1) and (2). Along with
> use of the flag --expensive-definedness-checks=yes, which fixes (3),
> it makes the test case run quiet.
>
> Can you give it a try on something larger?

Unfortunately it doesn't work yet. I still get "Invalid read of size 4" and "Conditional jump or move depends on uninitialised value(s)" on many strcmp() invocations. The patch also corrupts memory, and gcc now ICEs. (The test case is running g++ under valgrind while compiling a small C++ program.)
Created attachment 108913 [details]
Next attempt at a fix

* fixes crashing (from my buggy implementation of cntlzd)
* removes all (I think) invalid read warnings (I reimplemented ldbrx)

It still incorrectly reports conditional branches on uninitialised data, so something is not right somewhere. Really, the way Valgrind handles integer conditional branches on POWER isn't good (too complex, hard to analyse) and should be reimplemented.
Thanks. The gcc crash is gone, but I still get lots of invalid read warnings.

However, the number of errors is much lower now:

From (valgrind trunk):
  ERROR SUMMARY: 89817 errors from 298 contexts
To (your patch and --expensive-definedness-checks=yes):
  ERROR SUMMARY: 6495 errors from 85 contexts

BTW valgrind trunk (without any patch) doesn't like --expensive-definedness-checks=yes:

==64698== Memcheck, a memory error detector
==64698== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==64698== Using Valgrind-3.14.0.GIT and LibVEX; rerun with -h for copyright info
==64698== Command: /home/trippels/gcc_test/usr/local/bin/../libexec/gcc/powerpc64le-unknown-linux-gnu/8.0.0/cc1plus -fpreprocessed bench.ii -quiet -dumpbase bench.cpp -auxbase bench -O2 -version -o bench.s
==64698==

vex: priv/host_ppc_isel.c:1531 (iselWordExpr_R_wrk): Assertion `0' failed.
vex storage: T total 666718712 bytes allocated
vex storage: P total 192 bytes allocated

valgrind: the 'impossible' happened:
   LibVEX called failure_exit().

host stacktrace:
==64698==    at 0x58084148: show_sched_status_wrk (m_libcassert.c:355)
==64698==    by 0x5808430F: report_and_quit (m_libcassert.c:426)
==64698==    by 0x5808458B: panic (m_libcassert.c:502)
==64698==    by 0x5808458B: vgPlain_core_panic_at (m_libcassert.c:507)
==64698==    by 0x580845DB: vgPlain_core_panic (m_libcassert.c:512)
==64698==    by 0x580B1987: failure_exit (m_translate.c:751)
==64698==    by 0x581CBA7B: vex_assert_fail (main_util.c:247)
==64698==    by 0x5825AC83: iselWordExpr_R_wrk (host_ppc_isel.c:1531)
==64698==    by 0x5825AC83: iselWordExpr_R (host_ppc_isel.c:1404)
==64698==    by 0x5825E48F: iselWordExpr_RH_wrk (host_ppc_isel.c:2756)
==64698==    by 0x5825E48F: iselWordExpr_RH (host_ppc_isel.c:2704)
==64698==    by 0x58259B6B: iselWordExpr_R_wrk (host_ppc_isel.c:1477)
==64698==    by 0x58259B6B: iselWordExpr_R (host_ppc_isel.c:1404)
==64698==    by 0x58259E6B: iselWordExpr_R_wrk (host_ppc_isel.c:1505)
==64698==    by 0x58259E6B: iselWordExpr_R (host_ppc_isel.c:1404)
==64698==    by 0x5825E48F: iselWordExpr_RH_wrk (host_ppc_isel.c:2756)
==64698==    by 0x5825E48F: iselWordExpr_RH (host_ppc_isel.c:2704)
==64698==    by 0x58259AFB: iselWordExpr_R_wrk (host_ppc_isel.c:1481)
==64698==    by 0x58259AFB: iselWordExpr_R (host_ppc_isel.c:1404)
==64698==    by 0x58258363: iselCondCode_wrk (host_ppc_isel.c:2965)
==64698==    by 0x5826AAF7: iselCondCode (host_ppc_isel.c:2915)
==64698==    by 0x5826AAF7: iselStmt (host_ppc_isel.c:6385)
==64698==    by 0x5826AAF7: iselSB_PPC (host_ppc_isel.c:6983)
==64698==    by 0x581C859F: libvex_BackEnd (main_main.c:1049)
==64698==    by 0x581C859F: LibVEX_Translate (main_main.c:1186)
==64698==    by 0x580B5323: vgPlain_translate (m_translate.c:1805)
==64698==    by 0x58110E9F: handle_chain_me (scheduler.c:1084)
==64698==    by 0x58113FFF: vgPlain_scheduler (scheduler.c:1428)
==64698==    by 0x5812F697: thread_wrapper (syswrap-linux.c:103)
==64698==    by 0x5812F697: run_a_thread_NORETURN (syswrap-linux.c:156)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable (lwpid 64698)
==64698==    at 0x10B22120: equal (hash-traits.h:221)
==64698==    by 0x10B22120: equal_keys (hash-map-traits.h:57)
==64698==    by 0x10B22120: equal (hash-map.h:44)
==64698==    by 0x10B22120: hash_table<hash_map<nofree_string_hash, opt_pass*, simple_hashmap_traits<default_hash_traits<nofree_string_hash>, opt_pass*> >::hash_entry, xcallocator>::find_with_hash(char const* const&, unsigned int) (hash-table.h:846)
==64698==    by 0x10B18C33: get (hash-map.h:161)
==64698==    by 0x10B18C33: gcc::pass_manager::register_pass_name(opt_pass*, char const*) (passes.c:864)
==64698==    by 0x10B19433: gcc::pass_manager::register_one_dump_file(opt_pass*) (passes.c:834)
==64698==    by 0x10B19567: gcc::pass_manager::register_dump_files(opt_pass*) (passes.c:846)
==64698==    by 0x10B195A7: gcc::pass_manager::register_dump_files(opt_pass*) (passes.c:849)
==64698==    by 0x10B195A7: gcc::pass_manager::register_dump_files(opt_pass*) (passes.c:849)
==64698==    by 0x10B1F7EF: gcc::pass_manager::pass_manager(gcc::context*) (passes.c:1615)
==64698==    by 0x101C220B: general_init (toplev.c:1168)
==64698==    by 0x101C220B: toplev::main(int, char**) (toplev.c:2148)
==64698==    by 0x101C51A7: main (main.c:39)
(In reply to Markus Trippelsdorf from comment #7)
> Thanks.
> The gcc crash is gone, but I still get lots of invalid read warnings.
>
> However, the number of errors is much lower now:
>
> From (valgrind trunk):
>   ERROR SUMMARY: 89817 errors from 298 contexts
> To (your patch and --expensive-definedness-checks=yes):
>   ERROR SUMMARY: 6495 errors from 85 contexts

I re-checked my fixes. I cannot figure out why V still reports undefined value errors now. Can you point me at the bit of the gcc source that generates this code? Maybe that has some further explanation of how this should work.

> I still get lots of invalid read warnings.

I get those too (see below). These are misaligned (non-8-byte-aligned) ldbrx instructions (I checked) as part of this inlined strcmp. Is the strcmp inlining expected to produce misaligned loads, or not?
Misaligned loads as referred to in comment 8:

==63134== Invalid read of size 8
==63134==    at 0xB163FE4: process_symtab (lto-plugin.c:905)
==63134==    by 0xB16D897: simple_object_elf_find_sections (simple-object-elf.c:570)
==63134==    by 0xB16BAF7: simple_object_find_sections (simple-object.c:194)
==63134==    by 0xB1645A7: claim_file_handler (lto-plugin.c:1009)
==63134==    by 0x1003A1BF: ??? (in /usr/bin/ld)
==63134==    by 0x10035F9F: ??? (in /usr/bin/ld)
==63134==  Address 0x4797413 is 243 bytes inside a block of size 249 alloc'd
==63134==    at 0x4083F40: malloc (vg_replace_malloc.c:299)
==63134==    by 0x41EC7FB: xmalloc (in /usr/lib64/libbfd-2.25.1-32.base.el7_4.1.so)
==63134==    by 0xB16D7CF: simple_object_elf_find_sections (simple-object-elf.c:535)
==63134==    by 0xB16BAF7: simple_object_find_sections (simple-object.c:194)
==63134==    by 0xB1645A7: claim_file_handler (lto-plugin.c:1009)
==63134==    by 0x1003A1BF: ??? (in /usr/bin/ld)
==63134==    by 0x10035F9F: ??? (in /usr/bin/ld)
==63134==
==63134== Invalid read of size 8
==63134==    at 0xB1639F4: process_offload_section (lto-plugin.c:952)
==63134==    by 0xB16D897: simple_object_elf_find_sections (simple-object-elf.c:570)
==63134==    by 0xB16BAF7: simple_object_find_sections (simple-object.c:194)
==63134==    by 0xB164757: claim_file_handler (lto-plugin.c:1022)
==63134==    by 0x1003A1BF: ??? (in /usr/bin/ld)
==63134==    by 0x10035F9F: ??? (in /usr/bin/ld)
==63134==  Address 0x4797c53 is 243 bytes inside a block of size 249 alloc'd
==63134==    at 0x4083F40: malloc (vg_replace_malloc.c:299)
==63134==    by 0x41EC7FB: xmalloc (in /usr/lib64/libbfd-2.25.1-32.base.el7_4.1.so)
==63134==    by 0xB16D7CF: simple_object_elf_find_sections (simple-object-elf.c:535)
==63134==    by 0xB16BAF7: simple_object_find_sections (simple-object.c:194)
==63134==    by 0xB164757: claim_file_handler (lto-plugin.c:1022)
==63134==    by 0x1003A1BF: ??? (in /usr/bin/ld)
==63134==    by 0x10035F9F: ??? (in /usr/bin/ld)
For some reason I cannot CC Aaron Sawdey, who wrote the PPC strcmp patch: https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=244598
(In reply to Markus Trippelsdorf from comment #10)
> For some reason I cannot CC Aaron Sawdey, who wrote the PPC strcmp patch:

I can't move this along without further input. Maybe Aaron doesn't have an account at the KDE bugzilla.

In particular I'd like to understand why there are misaligned loads happening.
(In reply to Julian Seward from comment #11)
> (In reply to Markus Trippelsdorf from comment #10)
> > For some reason I cannot CC Aaron Sawdey, who wrote the PPC strcmp patch:
>
> I can't move this along without further input. Maybe Aaron doesn't have
> an account at the KDE bugzilla.
>
> In particular I'd like to understand why there are misaligned loads
> happening.

As already mentioned above, Aaron describes his algorithm here:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80479#c10

Quote:
»If both strings have alignment > 8 we cannot inadvertently cross a page boundary doing 8B loads. For any argument that has smaller alignment, it emits a runtime check to see if the inline code would cross a 4k boundary. If so, the library function call is used instead of the inline code.«
The unaligned ldbrx are expected -- there is not much penalty for doing this on power7/power8, so for these short comparisons it's faster to just do the loads possibly unaligned rather than have the extra branches and code for a partial comparison to get one or both arguments aligned.

Targeting power8 we can generate overlapping unaligned ldbrx to avoid having to do multiple smaller loads or a shift/mask to compare something less than a doubleword at the end (see the sketch below). Power7 doesn't like that, so there we do the shift/mask.

Doing the full doubleword compare first means that in the equality case I can check for a zero byte with just one cmpb. In the case where the doublewords are not equal, two cmpb are needed: one to compare the string data and the other to check just one of them for a zero byte.

The code that generates this is in expand_strn_compare() in rs6000.c (gcc 7) or rs6000-string.c (gcc 8).
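(For reference, the overlapping-tail idea in scalar terms -- a hypothetical 12-byte compare, not compiler output; differs12 is an invented name.)

#include <stdint.h>
#include <string.h>

/* The second pair of loads starts at n - 8 = 4 and deliberately
   re-reads bytes 4..7, so two unaligned 8-byte loads cover all 12
   bytes with no shift/mask tail fixup. */
static int differs12(const unsigned char *a, const unsigned char *b)
{
   uint64_t a0, b0, a1, b1;
   memcpy(&a0, a, 8);      memcpy(&b0, b, 8);
   memcpy(&a1, a + 4, 8);  memcpy(&b1, b + 4, 8);  /* overlap: bytes 4..7 */
   return (a0 != b0) || (a1 != b1);
}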
Here's a program that computes the CR.{LT,GT,EQ} bits after "cmplw" using just AND, OR, XOR, NOT, SUB, SHL and SHR. For "cmplw" (the only case tested here so far), it produces the same results as the hardware.

Since we have exact V-bit interpretations of all of the abovementioned operations (in mc_translate.c), we should be able to mechanically derive an exact V-bit interpretation of Iop_CmpORD32U. If we do the same for Iop_CmpORD64U and their signed equivalents, we'll be a long way along the road to getting these false-positive problems with POWER condition codes fixed.

Performance will probably be pretty horrible. But I am hoping that there will be a lot of duplicated expressions in the resulting V-bit DAG, so if those are commoned up it might not be quite so bad.

------------------------

#include <stdio.h>
#include <assert.h>

typedef unsigned int UInt;

__attribute__((noinline))
UInt h_CmpORD32U ( UInt x, UInt y )
{
   UInt res;
   __asm__ __volatile__(
      ""
      "cmplw 0, %1, %2"  "\n\t"
      "mfcr  %0"
      : /*OUT*/ "=r"(res)
      : /*IN*/  "r"(x), "r"(y)
      : /*TRASH*/ "cc", "memory"
   );
   return (res >> 28) & 0xF;
}

__attribute__((noinline))
UInt s_CmpORD32U ( UInt x, UInt y )
{
#if 0
   if (x < y) return 8;
   if (x > y) return 4;
   return 2;
#else
   /* x-y unsignedly overflows ==> x < y */
   UInt x_minus_y_UO = (~x & y) | ((~(x ^ y)) & (x-y));
   x_minus_y_UO >>= 31;
   UInt y_minus_x_UO = (~y & x) | ((~(y ^ x)) & (y-x));
   y_minus_x_UO >>= 31;
   //UInt res = (x_minus_y_UO << 3) | (y_minus_x_UO << 2);
   //if (res == 0) res = 2;
   UInt x_eq_y = (x_minus_y_UO | y_minus_x_UO) ^ 1;
   UInt res = (x_minus_y_UO << 3) | (y_minus_x_UO << 2) | (x_eq_y << 1);
   return res;
#endif
}

const UInt vals[]
   = { 0x0, 0x1, 0x2,
       0x3ffffffeU, 0x3fffffffU, 0x40000000U, 0x40000001U, 0x40000002U,
       0x7ffffffeU, 0x7fffffffU, 0x80000000U, 0x80000001U, 0x80000002U,
       0xbffffffeU, 0xbfffffffU, 0xc0000000U, 0xc0000001U, 0xc0000002U,
       0xfffffffdU, 0xfffffffeU, 0xffffffffU };

int main ( void )
{
   UInt xi, yi;
   const UInt nElem = sizeof(vals) / sizeof(vals[0]);
   for (xi = 0; xi < nElem; xi++) {
      for (yi = 0; yi < nElem; yi++) {
         UInt x = vals[xi];
         UInt y = vals[yi];
         UInt resHU = h_CmpORD32U(x, y);
         UInt resSU = s_CmpORD32U(x, y);
         printf("%08x %08x -> %02x %02x\n", x, y, resHU, resSU);
         assert(resHU == resSU);
      }
   }
   return 0;
}
I have a patch which I've been using for investigating this. It reduces the noise level significantly, but doesn't remove it entirely. I'll post it in a following comment. In the meantime I have a small failing testcase and a question about acsawdey's strcmp implementation.

Here's the testcase:

#include <string.h>

__attribute__((noinline))
char* setup()
{
   return strndup("abcdef", 12);
}

int main (void)
{
   char* x = setup();
   return strcmp(x, "abcdef") == 0 ? 99 : 77;
}

|setup| is done out of line only so as to make the assembly for |main| easier to follow. Even with the abovementioned patch in place, Memcheck still reports a branch on undefined values in |main|, in the inlined |strcmp|. This is when compiled with gcc-7.3.0 at -O2.

I single-stepped through this with GDB attached to Valgrind, so I can look at both the register values and what Memcheck thinks their definedness state is, after every instruction. The important part of the trace follows. A "." means the instruction was executed. ".nt" is "not taken" and ".t" is "taken".

      0000000010000450 <main>:
.     10000450:  03 10 40 3c   lis     r2,4099
.     10000454:  00 81 42 38   addi    r2,r2,-32512
.     10000458:  a6 02 08 7c   mflr    r0
.     1000045c:  10 00 01 f8   std     r0,16(r1)
.     10000460:  a1 ff 21 f8   stdu    r1,-96(r1)
.     10000464:  25 03 00 48   bl      10000788 <setup+0x8>
.     10000468:  fe ff 82 3c   addis   r4,r2,-2
.     1000046c:  78 88 84 38   addi    r4,r4,-30600
.     10000470:  20 05 69 78   clrldi  r9,r3,52
.     10000474:  c0 0f a9 2f   cmpdi   cr7,r9,4032
.nt   10000478:  5c 01 9c 40   bge     cr7,100005d4 <main+0x184>
.     1000047c:  fe ff 42 3d   addis   r10,r2,-2
.     10000480:  28 1c 20 7d   ldbrx   r9,0,r3
.     10000484:  78 1b 68 7c   mr      r8,r3
.     10000488:  78 88 4a 39   addi    r10,r10,-30600
.     1000048c:  28 54 40 7d   ldbrx   r10,0,r10

At this point, we've loaded r10 from presumably a constant pool, and it contains "abcdef\0\0", all bytes defined. Also, we've loaded r9 from the block allocated by strndup. It just so happens that r9 also now holds the value "abcdef\0\0", but because that block is only 7 bytes long (as we expect), that last \0 is marked as undefined.

.     10000490:  51 48 6a 7c   subf.   r3,r10,r9

Now r3 contains zero, because r10 == r9, but the lowest 8 bits of r3 are marked as undefined, because the lowest 8 bits of r9 are undefined.

.t    10000494:  2c 00 82 41   beq     100004c0 <main+0x70>

********* At this beq, Memcheck issues an error. It believes -- correctly, I think -- that the branch depends on uninitialised data. Specifically, we are comparing "abcdef\0\0" with "abcdef\0<undefined>", and since the 7 defined bytes are identical, the branch actually depends on the undefined lowest byte of r9.

      10000498:  00 00 e0 38   li      r7,0
      1000049c:  f8 53 28 7d   cmpb    r8,r9,r10
      100004a0:  f8 3b 27 7d   cmpb    r7,r9,r7
      100004a4:  38 43 e8 7c   orc     r8,r7,r8
      100004a8:  74 00 08 7d   cntlzd  r8,r8
      100004ac:  08 00 08 39   addi    r8,r8,8
      100004b0:  30 46 23 79   rldcl   r3,r9,r8,56
      100004b4:  30 46 4a 79   rldcl   r10,r10,r8,56
      100004b8:  50 18 6a 7c   subf    r3,r10,r3
      100004bc:  20 01 00 48   b       100005dc <main+0x18c>
.     100004c0:  f8 1b 29 7d   cmpb    r9,r9,r3
.     100004c4:  00 00 a9 2f   cmpdi   cr7,r9,0
.t    100004c8:  14 01 9e 40   bne     cr7,100005dc <main+0x18c>

So two questions:

(1) Does the above analysis seem correct?

(2) If so, am I correct to understand that the branch on uninitialised data is intended, and that this is "fixed up" later on (perhaps beginning at 100004c0) so that the correct answer is nevertheless obtained?
Created attachment 111056 [details]
WIP patch, as described in comment 15
Julian,

The analysis is ok. The code that follows on either branch looks for a zero byte in the process of producing the final result. What happens after the subf./beq is that a cmpb with 0 is done to check that we are not going past the end of the string:

        beq 0,.L19
.L13:
        li 7,0
        cmpb 8,9,10
        cmpb 7,9,7
        orc 8,7,8
        cntlzd 8,8
        addi 8,8,8
        rldcl 3,9,8,56
        rldcl 10,10,8,56
        subf 3,10,3
        b .L10
.L19:
        cmpb 9,9,3
        cmpdi 7,9,0
        bne 7,.L10

If the beq is not taken, the cmpb/cmpb/orc sequence produces a result that has FF in the byte position of either the first difference or the first zero byte. Then this identified byte is extracted from both strings and subtracted to produce the final result.

We only need to look for zero in one string because: if one has a zero and the other does not, this will be a difference. If both have a zero byte in the same position, then comparing either one with all zeros will make that zero byte the one that is extracted and subtracted to produce the final result.

If the beq is taken, then r3 contains zero and is used to see if there is a zero byte in r9 (which must be equal to r10). If there is a zero byte, then the strings are equal and r3 is the result.
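(For reference, a scalar C model of the sequence Aaron describes -- just an illustration, not the compiler's code; cmpb_model and strcmp_tail are invented names.)

#include <stdint.h>

/* cmpb: 0xFF in each byte position where x and y have equal bytes. */
static uint64_t cmpb_model(uint64_t x, uint64_t y)
{
   uint64_t r = 0;
   for (int i = 0; i < 8; i++) {
      uint64_t m = 0xffULL << (8 * i);
      if ((x & m) == (y & m)) r |= m;
   }
   return r;
}

/* a and b are doublewords as loaded by ldbrx (most significant byte
   first). Precondition: they differ, or a contains a zero byte. */
static int strcmp_tail(uint64_t a, uint64_t b)
{
   uint64_t eq   = cmpb_model(a, b);   /* equal-byte mask           */
   uint64_t zero = cmpb_model(a, 0);   /* zero-byte mask of a       */
   uint64_t m    = zero | ~eq;         /* orc: first zero or diff   */
   unsigned sh   = (__builtin_clzll(m) + 8) & 63;   /* cntlzd; addi 8 */
   /* rldcl ...,56: rotate left by sh and keep the low byte. */
   uint64_t ba = sh ? ((a << sh) | (a >> (64 - sh))) & 0xff : a & 0xff;
   uint64_t bb = sh ? ((b << sh) | (b >> (64 - sh))) & 0xff : b & 0xff;
   return (int)ba - (int)bb;
}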
(In reply to Aaron Sawdey from comment #17)

Aaron,

Thanks for the details. It's not clear to me, though, what the answer to

  (2) If so, am I correct to understand that the branch on
      uninitialised data is intended,

is. Can you give me a yes/no answer on that?
Sorry if that was unclear. Yes, the branch on uninitialized data is intended and is caught and fixed by the cmpb with 0 that happens on both paths.
(In reply to Aaron Sawdey from comment #19)
> Yes, the branch on uninitialized data is intended

Thanks for the clarification. Unfortunately, the analysis framework has the deeply wired-in assumption that every conditional branch instruction is "important" for the final outcome of the program. So there's no easy way (AFAIK, at least) to fix it from the Memcheck side.

This isn't the first time I've seen this kind of thing -- a conditional branch on uninitialised data, followed by suitable fixups afterwards. I think gzip/zlib used to do this, but were subsequently modified so as to not do that. Despite some study of the problem I haven't come up with a way to solve it.

How practical is it for you to change the generated code so it doesn't do that?

There's some related stuff in the talk at https://fosdem.org/2018/schedule/event/debugging_tools_memcheck/. See in particular the second half of slide 13.
(In reply to Julian Seward from comment #16)
> Created attachment 111056 [details]
> WIP patch, as described in comment 15

I should comment that this patch will still report invalid reads at the ends of blocks (ends of strings in malloc'd storage) when they are misaligned but do not cross a page boundary, as described in comment 12. This would be easy enough to fix, if someone can get me a test case.

Is the page size in the code hardwired at 4096? Or is it 4K for 4K page systems, 64K for 64K page systems, etc?
It is always hardwired to avoid 4k boundary crossings.
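(Concretely, that's the guard visible as clrldi r9,rN,52 / cmpdi cr7,r9,4032 in the disassembly in this bug. A sketch of what it means -- must_call_library is an invented name, and the 64 reflects that the inline sequence may touch up to 64 bytes before falling back to the library call:)

#include <stdint.h>

/* Take the library call if loads starting at p might run into the
   next 4K page: 4032 = 4096 - 64. */
static int must_call_library(const char *p)
{
   return ((uintptr_t)p & 0xfff) >= 4032;
}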
Any progress on this? I am seeing more and more reports of this issue with GCC 8.1 on ppc64le.
Created attachment 115701 [details]
Small ppc64le binary with inlined string functions

Here is an example with some inlined string functions on Fedora 28 ppc64le:

$ cat foo.c
#include <string.h>
#include <stdio.h>

__attribute__ ((weak)) void
do_test (const char *left, const char *right)
{
  printf ("result: %d\n", strcmp (left, right));
}

int
main (void)
{
  do_test (strdup ("a"), strdup ("b"));
}

$ gcc --version | head -1
gcc (GCC) 8.1.1 20180712 (Red Hat 8.1.1-5)
$ gcc -O2 -g -o foo foo.c
$ valgrind -q ./foo 2>&1 | head -30
==10495== Invalid read of size 4
==10495==    at 0x10000790: do_test (foo.c:7)
==10495==    by 0x10000587: main (foo.c:13)
==10495==  Address 0x4310044 is 2 bytes after a block of size 2 alloc'd
==10495==    at 0x4093F6C: malloc (vg_replace_malloc.c:299)
==10495==    by 0x4196F63: strdup (in /usr/lib64/libc-2.27.so)
==10495==    by 0x10000563: main (foo.c:13)
==10495==
==10495== Invalid read of size 4
==10495==    at 0x10000794: do_test (foo.c:7)
==10495==    by 0x10000587: main (foo.c:13)
==10495==  Address 0x4310094 is 2 bytes after a block of size 2 alloc'd
==10495==    at 0x4093F6C: malloc (vg_replace_malloc.c:299)
==10495==    by 0x4196F63: strdup (in /usr/lib64/libc-2.27.so)
==10495==    by 0x10000577: main (foo.c:13)
==10495==
==10495== Conditional jump or move depends on uninitialised value(s)
==10495==    at 0x1000079C: do_test (foo.c:7)
==10495==    by 0x10000587: main (foo.c:13)
==10495==
==10495== Conditional jump or move depends on uninitialised value(s)
==10495==    at 0x4156044: vfprintf@@GLIBC_2.17 (in /usr/lib64/libc-2.27.so)
==10495==    by 0x415DED7: printf@@GLIBC_2.17 (in /usr/lib64/libc-2.27.so)
==10495==    by 0x100007D7: do_test (foo.c:7)
==10495==    by 0x10000587: main (foo.c:13)
==10495==
==10495== Use of uninitialised value of size 8
==10495==    at 0x41522E8: _itoa_word (in /usr/lib64/libc-2.27.so)
==10495==    by 0x41568B7: vfprintf@@GLIBC_2.17 (in /usr/lib64/libc-2.27.so)
==10495==    by 0x100007D7: do_test (foo.c:7)

objdump -d of do_test ():

0000000010000730 <do_test>:
    10000730:  02 10 40 3c   lis     r2,4098
    10000734:  00 7f 42 38   addi    r2,r2,32512
    10000738:  a6 02 08 7c   mflr    r0
    1000073c:  20 05 69 78   clrldi  r9,r3,52
    10000740:  c0 0f a9 2f   cmpdi   cr7,r9,4032
    10000744:  10 00 01 f8   std     r0,16(r1)
    10000748:  a1 ff 21 f8   stdu    r1,-96(r1)
    1000074c:  10 00 9c 40   bge     cr7,1000075c <do_test+0x2c>
    10000750:  20 05 89 78   clrldi  r9,r4,52
    10000754:  c0 0f a9 2f   cmpdi   cr7,r9,4032
    10000758:  38 00 9c 41   blt     cr7,10000790 <do_test+0x60>
    1000075c:  a5 fd ff 4b   bl      10000500 <00000039.plt_call.strcmp@@GLIBC_2.17>
    10000760:  18 00 41 e8   ld      r2,24(r1)
    10000764:  b4 07 64 7c   extsw   r4,r3
    10000768:  fe ff 62 3c   addis   r3,r2,-2
    1000076c:  98 8b 63 38   addi    r3,r3,-29800
    10000770:  51 fd ff 4b   bl      100004c0 <00000039.plt_call.printf@@GLIBC_2.17>
    10000774:  18 00 41 e8   ld      r2,24(r1)
    10000778:  60 00 21 38   addi    r1,r1,96
    1000077c:  10 00 01 e8   ld      r0,16(r1)
    10000780:  a6 03 08 7c   mtlr    r0
    10000784:  20 00 80 4e   blr
    10000788:  00 00 00 60   nop
    1000078c:  00 00 42 60   ori     r2,r2,0
    10000790:  28 1c 40 7d   ldbrx   r10,0,r3
    10000794:  28 24 00 7d   ldbrx   r8,0,r4
    10000798:  51 50 28 7d   subf.   r9,r8,r10
    1000079c:  54 00 82 41   beq     100007f0 <do_test+0xc0>
    100007a0:  00 00 20 39   li      r9,0
    100007a4:  f8 43 43 7d   cmpb    r3,r10,r8
    100007a8:  f8 4b 49 7d   cmpb    r9,r10,r9
    100007ac:  38 1b 23 7d   orc     r3,r9,r3
    100007b0:  74 00 63 7c   cntlzd  r3,r3
    100007b4:  08 00 63 38   addi    r3,r3,8
    100007b8:  30 1e 4a 79   rldcl   r10,r10,r3,56
    100007bc:  30 1e 03 79   rldcl   r3,r8,r3,56
    100007c0:  50 50 23 7d   subf    r9,r3,r10
    100007c4:  78 4b 23 7d   mr      r3,r9
    100007c8:  b4 07 64 7c   extsw   r4,r3
    100007cc:  fe ff 62 3c   addis   r3,r2,-2
    100007d0:  98 8b 63 38   addi    r3,r3,-29800
    100007d4:  ed fc ff 4b   bl      100004c0 <00000039.plt_call.printf@@GLIBC_2.17>
    100007d8:  18 00 41 e8   ld      r2,24(r1)
    100007dc:  60 00 21 38   addi    r1,r1,96
    100007e0:  10 00 01 e8   ld      r0,16(r1)
    100007e4:  a6 03 08 7c   mtlr    r0
    100007e8:  20 00 80 4e   blr
    100007ec:  00 00 42 60   ori     r2,r2,0
    100007f0:  f8 4b 4a 7d   cmpb    r10,r10,r9
    100007f4:  00 00 aa 2f   cmpdi   cr7,r10,0
    100007f8:  cc ff 9e 40   bne     cr7,100007c4 <do_test+0x94>
    100007fc:  08 00 23 39   addi    r9,r3,8
    10000800:  28 4c 40 7d   ldbrx   r10,0,r9
    10000804:  08 00 24 39   addi    r9,r4,8
    10000808:  28 4c 00 7d   ldbrx   r8,0,r9
    1000080c:  51 50 28 7d   subf.   r9,r8,r10
    10000810:  90 ff 82 40   bne     100007a0 <do_test+0x70>
    10000814:  f8 4b 4a 7d   cmpb    r10,r10,r9
    10000818:  00 00 aa 2f   cmpdi   cr7,r10,0
    1000081c:  a8 ff 9e 40   bne     cr7,100007c4 <do_test+0x94>
    10000820:  10 00 23 39   addi    r9,r3,16
    10000824:  28 4c 40 7d   ldbrx   r10,0,r9
    10000828:  10 00 24 39   addi    r9,r4,16
    1000082c:  28 4c 00 7d   ldbrx   r8,0,r9
    10000830:  51 50 28 7d   subf.   r9,r8,r10
    10000834:  6c ff 82 40   bne     100007a0 <do_test+0x70>
    10000838:  f8 4b 4a 7d   cmpb    r10,r10,r9
    1000083c:  00 00 aa 2f   cmpdi   cr7,r10,0
    10000840:  84 ff 9e 40   bne     cr7,100007c4 <do_test+0x94>
    10000844:  18 00 23 39   addi    r9,r3,24
    10000848:  28 4c 40 7d   ldbrx   r10,0,r9
    1000084c:  18 00 24 39   addi    r9,r4,24
    10000850:  28 4c 00 7d   ldbrx   r8,0,r9
    10000854:  51 50 28 7d   subf.   r9,r8,r10
    10000858:  48 ff 82 40   bne     100007a0 <do_test+0x70>
    1000085c:  f8 4b 4a 7d   cmpb    r10,r10,r9
    10000860:  00 00 aa 2f   cmpdi   cr7,r10,0
    10000864:  60 ff 9e 40   bne     cr7,100007c4 <do_test+0x94>
    10000868:  20 00 23 39   addi    r9,r3,32
    1000086c:  28 4c 40 7d   ldbrx   r10,0,r9
    10000870:  20 00 24 39   addi    r9,r4,32
    10000874:  28 4c 00 7d   ldbrx   r8,0,r9
    10000878:  51 50 28 7d   subf.   r9,r8,r10
    1000087c:  24 ff 82 40   bne     100007a0 <do_test+0x70>
    10000880:  f8 4b 4a 7d   cmpb    r10,r10,r9
    10000884:  00 00 aa 2f   cmpdi   cr7,r10,0
    10000888:  3c ff 9e 40   bne     cr7,100007c4 <do_test+0x94>
    1000088c:  28 00 23 39   addi    r9,r3,40
    10000890:  28 4c 40 7d   ldbrx   r10,0,r9
    10000894:  28 00 24 39   addi    r9,r4,40
    10000898:  28 4c 00 7d   ldbrx   r8,0,r9
    1000089c:  51 50 28 7d   subf.   r9,r8,r10
    100008a0:  00 ff 82 40   bne     100007a0 <do_test+0x70>
    100008a4:  f8 4b 4a 7d   cmpb    r10,r10,r9
    100008a8:  00 00 aa 2f   cmpdi   cr7,r10,0
    100008ac:  18 ff 9e 40   bne     cr7,100007c4 <do_test+0x94>
    100008b0:  30 00 23 39   addi    r9,r3,48
    100008b4:  28 4c 40 7d   ldbrx   r10,0,r9
    100008b8:  30 00 24 39   addi    r9,r4,48
    100008bc:  28 4c 00 7d   ldbrx   r8,0,r9
    100008c0:  51 50 28 7d   subf.   r9,r8,r10
    100008c4:  dc fe 82 40   bne     100007a0 <do_test+0x70>
    100008c8:  f8 4b 4a 7d   cmpb    r10,r10,r9
    100008cc:  00 00 aa 2f   cmpdi   cr7,r10,0
    100008d0:  f4 fe 9e 40   bne     cr7,100007c4 <do_test+0x94>
    100008d4:  38 00 23 39   addi    r9,r3,56
    100008d8:  28 4c 40 7d   ldbrx   r10,0,r9
    100008dc:  38 00 24 39   addi    r9,r4,56
    100008e0:  28 4c 00 7d   ldbrx   r8,0,r9
    100008e4:  51 50 28 7d   subf.   r9,r8,r10
    100008e8:  b8 fe 82 40   bne     100007a0 <do_test+0x70>
    100008ec:  f8 4b 4a 7d   cmpb    r10,r10,r9
    100008f0:  00 00 aa 2f   cmpdi   cr7,r10,0
    100008f4:  d0 fe 9e 40   bne     cr7,100007c4 <do_test+0x94>
    100008f8:  40 00 84 38   addi    r4,r4,64
    100008fc:  40 00 63 38   addi    r3,r3,64
    10000900:  5c fe ff 4b   b       1000075c <do_test+0x2c>
    10000904:  00 00 00 00   .long 0x0
    10000908:  00 00 00 01   .long 0x1000000
    1000090c:  80 00 00 00   .long 0x80
Created attachment 116201 [details]
strcmp test case, ppc64le gpr version
Created attachment 116202 [details]
strcmp test case, ppc64le p8 vsx version
Created attachment 116203 [details]
strcmp test case, ppc64le p9 vsx version
I've just attached asm generated by current gcc trunk for this test case:

#include <string.h>

int main ()
{
  char str1[15];
  char str2[15];
  strcpy(str1, "abcdef");
  strcpy(str2, "ABCDEF");
  return strcmp(str1, str2);
}

They are compiled with -mno-vsx (gpr version), -mcpu=power8 (vsx p8), and -mcpu=power9 (vsx p9).
Created attachment 116331 [details]
WIP patch, for testing

WIP patch. Tested on 64-bit P8 LE. This runs clean with the two P8 test cases in comments 25 and 26:

* strcmp test case, ppc64le gpr version
* strcmp test case, ppc64le p8 vsx version

I didn't test the P9 VSX version yet.

I think the patch is more accurate and also reasonably correct, but may assert in a couple of cases due to missing code generation cases, which I will fix. It's worth trying as-is on P8-LE though.

It applies on top of 1d42a625abd8b6d24186e6bda2b401ee24a118c4 (Philippe Waroquiers, Tue Nov 6 21:40:43 2018 +0100)
Created attachment 116337 [details]
Patch for wider testing

Here's a patch for wider testing. If it looks OK after testing, I'll divide it up into its various logically-independent pieces for landing. It looks OK to me in initial testing on:

  gcc135 (P9, LE64)
  gcc112 (P8, LE64)
  gcc110 (P7, BE64 and BE32)

One thing I changed, but couldn't test, is the implementation of vpopcntd on 32-bit targets. So this really needs testing on a BE32 platform that can do vpopcntd (gcc110 can't).
I tested this on a target that does support 32-bit (power8 BE) and it looked good. Various strcmp/memcmp expansion tests do not see errors from valgrind. There were no additional regtest fails between unpatched and patched builds. What I don't know is if the regtest is a sufficient test of vpopcntd.
Aaron, it sounds like you ran make regtest with and without the patch and didn't see any errors. The regtest includes a fairly good test of vpopcntd in test jm_vec_isa_2_07. So as long as you didn't see any additional errors, specifically in jm_vec_isa_2_07, you should be good to go.
I assume the testing was done against GCC9/svn trunk? If so, are there any specific GCC9 patches that could/should be backported to GCC8?
Carl,
yes, I ran regtest on power8 BE (which supports the 32- and 64-bit ABIs) with and without the patch and didn't see any differences.

Mark,
it's on my list to back port the gcc code gen change to gcc8. The patch involved is this one:
https://gcc.gnu.org/ml/gcc-patches/2018-10/msg01589.html
Fixed:

27fe22378da38424102c5292b782cacdd9d7b9e4
Add support for Iop_{Sar,Shr}8 on ppc. --expensive-definedness-checks=yes needs them.

cb5d7e047598bff6d0f1d707a70d9fb1a1c7f0e2
fold_Expr: transform PopCount64(And64(Add64(x,-1),Not64(x))) into CtzNat64(x).

81d9832226d6e3d1ee78ee3133189d7b520e7eea
ppc front end: use new IROps added in 42719898.

e221eca26be6b2396e3fcbf4117e630fc22e79f6
Add Memcheck support for IROps added in 42719898.

97d336b79e36f6c99d8b07f49ebc9b780e6df84e
Add ppc host-side isel and instruction support for IROps added in previous commit.

4271989815b5fc933c1e29bc75507c2726dc3738
Add some new IROps to support improved Memcheck analysis of strlen etc.

7f1dd9d5aec1f1fd4eb0ae3a311358a914f1d73f
get_otrack_shadow_offset_wrk for ppc32 and ppc64: add missing cases for XER_OV32, XER_CA32 and C_FPCC.
Seems only partially fixed? On power8, ppc64le, gcc 8.2, and current valgrind git, the following still fails:

#include <stdlib.h>
#include <string.h>

int main()
{
  char *foo = calloc(3, 1);
  return strcmp(foo, "a");
}

==32113== Memcheck, a memory error detector
==32113== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==32113== Using Valgrind-3.15.0.GIT and LibVEX; rerun with -h for copyright info
==32113== Command: ../valgrind-strcmp
==32113==
==32113== Invalid read of size 4
==32113==    at 0x1806F8: main (valgrind-strcmp.c:8)
==32113==  Address 0x4b30044 is 1 bytes after a block of size 3 alloc'd
==32113==    at 0x48973E8: calloc (vg_replace_malloc.c:752)
==32113==    by 0x1806DF: main (valgrind-strcmp.c:6)
==32113==
==32113== Conditional jump or move depends on uninitialised value(s)
==32113==    at 0x180700: main (valgrind-strcmp.c:8)
==32113==
==32113==
==32113== HEAP SUMMARY:
==32113==     in use at exit: 3 bytes in 1 blocks
==32113==   total heap usage: 1 allocs, 0 frees, 3 bytes allocated
==32113==
==32113== LEAK SUMMARY:
==32113==    definitely lost: 3 bytes in 1 blocks
==32113==    indirectly lost: 0 bytes in 0 blocks
==32113==      possibly lost: 0 bytes in 0 blocks
==32113==    still reachable: 0 bytes in 0 blocks
==32113==         suppressed: 0 bytes in 0 blocks
==32113== Rerun with --leak-check=full to see details of leaked memory
==32113==
==32113== For counts of detected and suppressed errors, rerun with: -v
==32113== Use --track-origins=yes to see where uninitialised values come from
==32113== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
The fix is in gcc 9 trunk. I'm working on the gcc 8 back port now, so it's definitely not in 8.2; hopefully it will be in 8.3 when that comes out.
The fix for this is checked in to gcc-8-branch svn revision 266578. I believe this will show up in gcc 8.3.
With that gcc backport https://gcc.gnu.org/ml/gcc-patches/2018-11/msg02161.html and the valgrind fixes we get rid of all the "Conditional jump or move depends on uninitialised value(s)" issues, but unfortunately we still have an issue with some of the "Invalid read of size 4" reports. The simplest example is the last C program:

#include <stdlib.h>
#include <string.h>

int main()
{
  char *foo = calloc(3, 1);
  return strcmp(foo, "a");
}

gcc -g -O2 -o t t.c
valgrind -q ./t
==31430== Invalid read of size 4
==31430==    at 0x10000510: main (t.c:8)
==31430==  Address 0x42e0044 is 1 bytes after a block of size 3 alloc'd
==31430==    at 0x40874C8: calloc (vg_replace_malloc.c:752)
==31430==    by 0x100004FF: main (t.c:6)
==31430==

The issue is the following ldbrx:

Dump of assembler code for function main:
   0x00000000100004e0 <+0>:     lis     r2,4098
   0x00000000100004e4 <+4>:     addi    r2,r2,32512
   0x00000000100004e8 <+8>:     mflr    r0
   0x00000000100004ec <+12>:    li      r4,1
   0x00000000100004f0 <+16>:    li      r3,3
   0x00000000100004f4 <+20>:    std     r0,16(r1)
   0x00000000100004f8 <+24>:    stdu    r1,-32(r1)
   0x00000000100004fc <+28>:    bl      0x10000480 <00000022.plt_call.calloc@@GLIBC_2.17>
   0x0000000010000500 <+32>:    ld      r2,24(r1)
   0x0000000010000504 <+36>:    addis   r4,r2,-2
   0x0000000010000508 <+40>:    li      r10,0
   0x000000001000050c <+44>:    addi    r4,r4,-30120
=> 0x0000000010000510 <+48>:    ldbrx   r7,0,r3
   0x0000000010000514 <+52>:    ldbrx   r8,0,r4
   0x0000000010000518 <+56>:    cmpb    r10,r7,r10
   0x000000001000051c <+60>:    cmpb    r9,r7,r8
   0x0000000010000520 <+64>:    orc.    r10,r10,r9
   0x0000000010000524 <+68>:    bne     0x10000548 <main+104>
   0x0000000010000528 <+72>:    addi    r9,r3,8
   0x000000001000052c <+76>:    ldbrx   r7,0,r9
   0x0000000010000530 <+80>:    addi    r9,r4,8
   0x0000000010000534 <+84>:    ldbrx   r8,0,r9
   0x0000000010000538 <+88>:    cmpb    r10,r7,r10
   0x000000001000053c <+92>:    cmpb    r9,r7,r8
   0x0000000010000540 <+96>:    orc.    r10,r10,r9
   0x0000000010000544 <+100>:   beq     0x10000570 <main+144>
   0x0000000010000548 <+104>:   cntlzd  r9,r10
   0x000000001000054c <+108>:   addi    r9,r9,8
   0x0000000010000550 <+112>:   rldcl   r3,r7,r9,56
   0x0000000010000554 <+116>:   rldcl   r9,r8,r9,56
   0x0000000010000558 <+120>:   subf    r3,r9,r3
   0x000000001000055c <+124>:   addi    r1,r1,32
   0x0000000010000560 <+128>:   extsw   r3,r3
   0x0000000010000564 <+132>:   ld      r0,16(r1)
   0x0000000010000568 <+136>:   mtlr    r0
   0x000000001000056c <+140>:   blr
   0x0000000010000570 <+144>:   addi    r9,r3,16
   0x0000000010000574 <+148>:   ldbrx   r7,0,r9
   0x0000000010000578 <+152>:   addi    r9,r4,16
   0x000000001000057c <+156>:   ldbrx   r8,0,r9
   0x0000000010000580 <+160>:   cmpb    r10,r7,r10
   0x0000000010000584 <+164>:   cmpb    r9,r7,r8
   0x0000000010000588 <+168>:   orc.    r10,r10,r9
   0x000000001000058c <+172>:   bne     0x10000548 <main+104>
   0x0000000010000590 <+176>:   addi    r9,r3,24
   0x0000000010000594 <+180>:   ldbrx   r7,0,r9
   0x0000000010000598 <+184>:   addi    r9,r4,24
   0x000000001000059c <+188>:   ldbrx   r8,0,r9
   0x00000000100005a0 <+192>:   cmpb    r10,r7,r10
   0x00000000100005a4 <+196>:   cmpb    r9,r7,r8
   0x00000000100005a8 <+200>:   orc.    r10,r10,r9
   0x00000000100005ac <+204>:   bne     0x10000548 <main+104>
   0x00000000100005b0 <+208>:   addi    r9,r3,32
   0x00000000100005b4 <+212>:   ldbrx   r7,0,r9
   0x00000000100005b8 <+216>:   addi    r9,r4,32
   0x00000000100005bc <+220>:   ldbrx   r8,0,r9
   0x00000000100005c0 <+224>:   cmpb    r10,r7,r10
   0x00000000100005c4 <+228>:   cmpb    r9,r7,r8
   0x00000000100005c8 <+232>:   orc.    r10,r10,r9
   0x00000000100005cc <+236>:   bne     0x10000548 <main+104>
   0x00000000100005d0 <+240>:   addi    r9,r3,40
   0x00000000100005d4 <+244>:   ldbrx   r7,0,r9
   0x00000000100005d8 <+248>:   addi    r9,r4,40
   0x00000000100005dc <+252>:   ldbrx   r8,0,r9
   0x00000000100005e0 <+256>:   cmpb    r10,r7,r10
   0x00000000100005e4 <+260>:   cmpb    r9,r7,r8
   0x00000000100005e8 <+264>:   orc.    r10,r10,r9
   0x00000000100005ec <+268>:   bne     0x10000548 <main+104>
   0x00000000100005f0 <+272>:   addi    r9,r3,48
   0x00000000100005f4 <+276>:   ldbrx   r7,0,r9
   0x00000000100005f8 <+280>:   addi    r9,r4,48
   0x00000000100005fc <+284>:   ldbrx   r8,0,r9
   0x0000000010000600 <+288>:   cmpb    r10,r7,r10
   0x0000000010000604 <+292>:   cmpb    r9,r7,r8
   0x0000000010000608 <+296>:   orc.    r10,r10,r9
   0x000000001000060c <+300>:   bne     0x10000548 <main+104>
   0x0000000010000610 <+304>:   addi    r9,r3,56
   0x0000000010000614 <+308>:   ldbrx   r7,0,r9
   0x0000000010000618 <+312>:   addi    r9,r4,56
   0x000000001000061c <+316>:   ldbrx   r8,0,r9
   0x0000000010000620 <+320>:   cmpb    r10,r7,r10
   0x0000000010000624 <+324>:   cmpb    r9,r7,r8
   0x0000000010000628 <+328>:   orc.    r10,r10,r9
   0x000000001000062c <+332>:   bne     0x10000548 <main+104>
   0x0000000010000630 <+336>:   addi    r4,r4,64
   0x0000000010000634 <+340>:   addi    r3,r3,64
   0x0000000010000638 <+344>:   bl      0x100004c0 <00000022.plt_call.strcmp@@GLIBC_2.17>
   0x000000001000063c <+348>:   ld      r2,24(r1)
   0x0000000010000640 <+352>:   b       0x1000055c <main+124>
   0x0000000010000644 <+356>:   .long 0x0
   0x0000000010000648 <+360>:   .long 0x1000000
   0x000000001000064c <+364>:   .long 0x80
Created attachment 116575 [details]
t binary testcase

The t binary from comment #39 to help with debugging.
Hoping that the JRS FIXME for ldbrx (Load Doubleword Byte-Reverse Indexed) was a hint that memcheck would be helped by a single doubleword load plus Iop_Reverse, I hacked up:

   DIP("ldbrx r%u,r%u,r%u\n", rD_addr, rA_addr, rB_addr);
   IRTemp dw1 = newTemp(Ity_I64);
   IRTemp dw2 = newTemp(Ity_I64);
   assign( dw1, load( Ity_I64, mkexpr(EA)) );
   assign( dw2, unop(Iop_Reverse8sIn64_x1, mkexpr(dw1)) );
   putIReg( rD_addr, mkexpr(dw2) );

But the ppc backend doesn't handle Iop_Reverse8sIn64_x1, so that didn't really go anywhere for now.
I think you're on the right track with ldbrx. The only difference between the code that gcc is generating and what is in glibc is ldbrx vs ld. Here is what is in glibc trunk:

	/* For short string up to 16 bytes, load both s1 and s2 using
	   unaligned dwords and compare.  */
	ld	r8,0(r3)
	ld	r10,0(r4)
	cmpb	r12,r8,r0
	cmpb	r11,r8,r10
	orc.	r9,r12,r11
	bne	cr0,L(different_nocmpb)

The glibc code has a couple more instructions afterwards, because we don't have a count-trailing-zeros instruction, which would make the most sense for the byte ordering that ld gives you on little endian.
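(For reference, the standard branch-free substitute for count-trailing-zeros is the popcount idiom below -- the same shape that the eventual valgrind fix folds back into a ctz, per the fold_Expr commit listed later in this bug. ctz64 is an invented name.)

#include <stdint.h>

/* ctz via popcount: (x - 1) turns the trailing zeros into ones and
   clears the lowest set bit; ANDing with ~x keeps exactly those
   trailing-zero positions. Precondition: x != 0. */
static unsigned ctz64(uint64_t x)
{
   return (unsigned)__builtin_popcountll((x - 1) & ~x);
}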
Created attachment 116586 [details]
PPC Iop_Reverse8sIn64_x1 implementation

This compiles and runs, but hasn't been tested at all (combine with the patch in comment #41).
Created attachment 116593 [details]
implement ldbrx as 64-bit load and Iop_Reverse8sIn64_x1

Real implementation of ldbrx as a 64-bit load plus Iop_Reverse8sIn64_x1. This works well and resolves the issue with (aligned) ldbrx instructions reading past a memory block. It has only been tested on a ppc64le power9 setup for now. Could someone who actually knows ppc64 review this implementation and say if it is good enough for other setups too?
Unfortunately even with the ldbrx patch (which seems an OK improvement in itself) we still have some issues.

ldbrx is also used on unaligned addresses, in which case even --partial-loads-ok=yes doesn't help us. For example this little program, which has a string table with some strings at odd addresses:

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

int
main (int argc, char **argv)
{
  char *strtable = malloc (5);
  // First element is the empty string.
  strtable[0] = '\0';
  // Second (and last) element is a short string.
  strcpy (&strtable[1], "abc");

  // What was the second string again?
  int equal = strcmp (argv[0], &strtable[1]);
  if (equal == 0)
    puts ("yes");
  else
    puts ("no");

  free (strtable);
  return 0;
}

gcc -Wall -g -O2 -o table table.c
valgrind -q ./table
==7902== Invalid read of size 8
==7902==    at 0x10000630: main (table.c:14)
==7902==  Address 0x42e0041 is 1 bytes inside a block of size 5 alloc'd
==7902==    at 0x408420C: malloc (vg_replace_malloc.c:299)
==7902==    by 0x100005A7: main (table.c:7)
==7902==
no

Dump of assembler code for function main:
   0x0000000010000580 <+0>:     lis     r2,4098
   0x0000000010000584 <+4>:     addi    r2,r2,32512
   0x0000000010000588 <+8>:     mflr    r0
   0x000000001000058c <+12>:    std     r30,-16(r1)
   0x0000000010000590 <+16>:    std     r31,-8(r1)
   0x0000000010000594 <+20>:    li      r3,5
   0x0000000010000598 <+24>:    mr      r30,r4
   0x000000001000059c <+28>:    std     r0,16(r1)
   0x00000000100005a0 <+32>:    stdu    r1,-48(r1)
   0x00000000100005a4 <+36>:    bl      0x10000560 <00000022.plt_call.malloc@@GLIBC_2.17>
   0x00000000100005a8 <+40>:    ld      r2,24(r1)
   0x00000000100005ac <+44>:    lis     r10,99
   0x00000000100005b0 <+48>:    li      r8,0
   0x00000000100005b4 <+52>:    ori     r10,r10,25185
   0x00000000100005b8 <+56>:    mr      r31,r3
   0x00000000100005bc <+60>:    ld      r3,0(r30)
   0x00000000100005c0 <+64>:    addi    r4,r31,1
   0x00000000100005c4 <+68>:    stb     r8,0(r31)
   0x00000000100005c8 <+72>:    stw     r10,1(r31)
   0x00000000100005cc <+76>:    clrldi  r9,r3,52
   0x00000000100005d0 <+80>:    cmpdi   cr7,r9,4032
   0x00000000100005d4 <+84>:    bge     cr7,0x100005e4 <main+100>
   0x00000000100005d8 <+88>:    clrldi  r9,r4,52
   0x00000000100005dc <+92>:    cmpdi   cr7,r9,4032
   0x00000000100005e0 <+96>:    blt     cr7,0x1000062c <main+172>
   0x00000000100005e4 <+100>:   bl      0x10000520 <00000022.plt_call.strcmp@@GLIBC_2.17>
   0x00000000100005e8 <+104>:   ld      r2,24(r1)
   0x00000000100005ec <+108>:   cmpwi   cr7,r3,0
   0x00000000100005f0 <+112>:   bne     cr7,0x1000074c <main+460>
   0x00000000100005f4 <+116>:   addis   r3,r2,-2
   0x00000000100005f8 <+120>:   addi    r3,r3,-29840
   0x00000000100005fc <+124>:   bl      0x10000540 <00000022.plt_call.puts@@GLIBC_2.17>
   0x0000000010000600 <+128>:   ld      r2,24(r1)
   0x0000000010000604 <+132>:   mr      r3,r31
   0x0000000010000608 <+136>:   bl      0x100004e0 <00000022.plt_call.free@@GLIBC_2.17>
   0x000000001000060c <+140>:   ld      r2,24(r1)
   0x0000000010000610 <+144>:   addi    r1,r1,48
   0x0000000010000614 <+148>:   li      r3,0
   0x0000000010000618 <+152>:   ld      r0,16(r1)
   0x000000001000061c <+156>:   ld      r30,-16(r1)
   0x0000000010000620 <+160>:   ld      r31,-8(r1)
   0x0000000010000624 <+164>:   mtlr    r0
   0x0000000010000628 <+168>:   blr
   0x000000001000062c <+172>:   ldbrx   r8,0,r3
=> 0x0000000010000630 <+176>:   ldbrx   r9,0,r4
   0x0000000010000634 <+180>:   li      r10,0
   0x0000000010000638 <+184>:   cmpb    r7,r8,r9
   0x000000001000063c <+188>:   cmpb    r10,r8,r10
   0x0000000010000640 <+192>:   orc.    r10,r10,r7
   0x0000000010000644 <+196>:   bne     0x10000734 <main+436>
   0x0000000010000648 <+200>:   addi    r9,r3,8
   0x000000001000064c <+204>:   ldbrx   r8,0,r9
   0x0000000010000650 <+208>:   addi    r9,r4,8
   0x0000000010000654 <+212>:   ldbrx   r9,0,r9
   0x0000000010000658 <+216>:   cmpb    r10,r8,r10
   0x000000001000065c <+220>:   cmpb    r7,r8,r9
   0x0000000010000660 <+224>:   orc.    r10,r10,r7
   0x0000000010000664 <+228>:   bne     0x10000734 <main+436>

This means we have to handle unaligned partial reads somehow. I am not sure we can. The code in memcheck/mc_main.c (mc_LOADVn_slow) dealing with partial reads seems to very much rely on at least word alignment.
(In reply to Mark Wielaard from comment #44)
> Created attachment 116593 [details]
> implement ldbrx as 64-bit load and Iop_Reverse8sIn64_x1
>
> Real implementation of ldbrx as a 64-bit load plus Iop_Reverse8sIn64_x1.

Regardless of the fact that there are still misaligned ldbrxs to deal with, I think this is good and should land. I would ask:

* for "case 0x214: // ldbrx (Load Doubleword Byte-Reverse Indexed)", please add a comment to say that dis_int_ldst_rev's caller guarantees to ask for this instruction to be decoded only in 64-bit mode

* for the backend expression, it's fine, except that for

    // r = (r & 0x00000000FFFFFFFF) << 32 | (r & 0xFFFFFFFF00000000) >> 32;

  there's no need to do the masking, since the shifts will remove all of the bits that just got masked out anyway. So we can skip the _LI, the NOT and the two ANDs.

* oh .. and I think the whole thing (insn selection for Iop_Reverse8sIn64_x1) needs to be guarded by the is-this-64-bit-mode boolean (env->mode64 or something like that), since we don't want to be emitting this bit of code in 32-bit mode.
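(To spell out the second point in scalar terms: the shifts already discard everything the ANDs would mask off, so the two forms are equivalent. swap_halves is an invented name.)

#include <stdint.h>

static uint64_t swap_halves(uint64_t r)
{
   /* same value as ((r & 0x00000000FFFFFFFF) << 32)
                  | ((r & 0xFFFFFFFF00000000) >> 32) */
   return (r << 32) | (r >> 32);
}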
(In reply to Mark Wielaard from comment #45)
> Unfortunately even with the ldbrx patch (which seems an OK improvement in
> itself) we still have some issues.
>
> ldbrx is also used on unaligned addresses, in which case even
> --partial-loads-ok=yes doesn't help us.
> [..]
> This means we have to handle unaligned partial reads somehow.
> I am not sure we can. The code in memcheck/mc_main.c (mc_LOADVn_slow)
> dealing with partial reads seems to very much rely on at least word
> alignment.

That doesn't worry me so much. I'm sure we can come up with some ppc64le-only special-casing in mc_LOADVn_slow to deal with it.

What worries me more is the base compiler pipeline (ignoring Memcheck). We'll have a misaligned ldbrx going into the pipeline, which will eventually get re-emitted as a plain vanilla 64-bit integer load (ld). Will that work properly for a misaligned address? Or will it trap (sigbus)? Or will it silently zero out the three lowest address bits before performing the load? This is only going to work correctly if it correctly handles the misalignment.
Mark:

I took your patch "implement ldbrx as 64-bit load and Iop_Reverse8sIn64_x1" in comment 44, attachment 116593 [details], and applied it to current mainline:

commit f2387ed7bed64571197b44c1f8abae4d907bce1b
Author: Mark Wielaard <mark@klomp.org>
Date:   Fri Nov 30 15:23:06 2018 -0500

    Implement ppc64 ldbrx as 64-bit load and Iop_Reverse8sIn64_x1.

commit 206e81e8add574bbb38c11105005decedabc2f4a
Author: Mark Wielaard <mark@klomp.org>
Date:   Sun Dec 2 12:39:27 2018 +0100

    Fix tsan_unittest.cpp compile error with older compilers.

    Older compilers (g++ 4.8.5) don't like '>>':
    error: ‘>>’ should be ‘> >’ within a nested template argument list.
    Add an extra space.

Then I compiled and installed valgrind, and ran your string program from comment 45 on P7, P8 BE, P8 LE and P9 as follows:

  gcc -o mark_test ../mark_test.c
  valgrind ./mark_test

On Power 7, Power 8 BE, Power 8 LE and Power 9 LE, running the test gives:

valgrind ./mark_test
==18373== Memcheck, a memory error detector
==18373== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==18373== Using Valgrind-3.15.0.GIT and LibVEX; rerun with -h for copyright info
==18373== Command: ./mark_test
==18373==
no
==18373==
==18373== HEAP SUMMARY:
==18373==     in use at exit: 0 bytes in 0 blocks
==18373==   total heap usage: 1 allocs, 1 frees, 5 bytes allocated
==18373==
==18373== All heap blocks were freed -- no leaks are possible
==18373==
==18373== For counts of detected and suppressed errors, rerun with: -v
==18373== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Running natively on all of these systems gives:

./mark_test
no

I read thru the patch; it looks fine. I do think there might be an easier/faster way to implement the ldbrx instruction. Your patch does the following:

+   assign( dw1, load(Ity_I64, mkexpr(EA)) );
+   assign( dw2, unop(Iop_Reverse8sIn64_x1, mkexpr(dw1)) );
+   putIReg( rD_addr, mkexpr(dw2) );

Note, load(Ity_I64, mkexpr(EA)) does an endian-aware load:

/* This generates a normal (non load-linked) load. */
static IRExpr* load ( IRType ty, IRExpr* addr )
{
   if (host_endness == VexEndnessBE)
      return IRExpr_Load(Iend_BE, ty, addr);
   else
      return IRExpr_Load(Iend_LE, ty, addr);
}

I think if you were to call IRExpr_Load to do the load as follows:

   /* Load the value with the bytes reversed by doing a BE load on an LE
      machine and a LE load on a BE machine. */
   if (host_endness == VexEndnessBE)
      assign( dw1, IRExpr_Load(Iend_LE, Ity_I64, mkexpr(EA)));
   else
      assign( dw1, IRExpr_Load(Iend_BE, Ity_I64, mkexpr(EA)));

   putIReg( rD_addr, mkexpr(dw1) );

you wouldn't actually need to do the byte reverse with Iop_Reverse8sIn64_x1, as the value would be loaded with the bytes reversed already. That said, having support for Iop_Reverse8sIn64_x1 is nice to have around for future use. :-)

Anyway, I did work thru the logic and coding for the Iop_Reverse8sIn64_x1 and it all looks correct.

Carl Love
We did discuss this a bit more on irc (#valgrind irc.freenode.net) and there are a couple more issues that need to be resolved to make all this work correctly.

lxvd2x and lxvb16x have a similar issue, where the load is split up in the frontend into smaller loads, which makes the partial-loads-ok logic ineffective for memcheck. These also need to be rewritten to do full loads plus IR to manipulate the data, so that memcheck can work effectively. The IRExpr_Load expression carries an IREndness expressing the endianness of the load, which should be helpful in the frontend to express some of these instructions.

Unaligned (normal) load instructions do seem undefined. The power backend does handle unaligned vector (128-bit) loads (in VEX/priv/host_ppc_isel.c iselVecExpr_wrk), but not unaligned non-vector loads. So given that unaligned loads are an issue, we need some solution for those (if ldbrx is ultimately translated into normal loads) in the ppc backend.

memcheck's mc_LOADVn_slow needs to be adapted to deal with unaligned loads (on ppc only?). And I assume mc_LOADV_128_or_256_slow would need something similar for unaligned vector loads (assuming lxvd2x and lxvb16x are allowed to do unaligned loads).

There was also a question from the glibc hackers whether these vector instructions should be emitted at all, given some power errata that talked about memory/cache corruption:
https://gcc.gnu.org/ml/gcc/2018-12/msg00010.html
But it looks like that is not a real concern, so we should act as if gcc will emit these instructions for new inlined str[n]cmp code.
Created attachment 116688 [details]
aligned and unaligned ldbrx, lxvd2x and lxvb16x testcases

These are the assembly files generated by gcc8 with the new strcmp inlined code.

The aligned code shows a short string being compared where the instruction is split up into multiple loads (by the frontend!), one of which memcheck thinks is completely past addressable memory. The unaligned code shows the strcmp using one of these doubleword or vector instructions on an unaligned memory address, which also confuses memcheck.

The following is what current valgrind produces on a power9 ppc64le setup:

# for a in aligned unaligned; do for i in ldbrx lxvd2x lxvb16x; do echo $a-$i; gcc -o $a-$i $a-$i.s; valgrind -q ./$a-$i > /dev/null; done; done
aligned-ldbrx
==7890== Invalid read of size 4
==7890==    at 0x1000061C: main (in /root/v-ppc64-stcmp/aligned-ldbrx)
==7890==  Address 0x42e0044 is 0 bytes after a block of size 4 alloc'd
==7890==    at 0x408420C: malloc (vg_replace_malloc.c:299)
==7890==    by 0x100005A7: main (in /root/v-ppc64-stcmp/aligned-ldbrx)
==7890==
aligned-lxvd2x
==7895== Invalid read of size 8
==7895==    at 0x10000628: main (in /root/v-ppc64-stcmp/aligned-lxvd2x)
==7895==  Address 0x42e0048 is 4 bytes after a block of size 4 alloc'd
==7895==    at 0x408420C: malloc (vg_replace_malloc.c:299)
==7895==    by 0x100005A7: main (in /root/v-ppc64-stcmp/aligned-lxvd2x)
==7895==
==7895== Use of uninitialised value of size 8
==7895==    at 0x4050794: _vgnU_freeres (vg_preloaded.c:68)
==7895==    by 0x4114A03: __run_exit_handlers (in /usr/lib64/power9/libc-2.28.so)
==7895==    by 0x4114A97: exit (in /usr/lib64/power9/libc-2.28.so)
==7895==    by 0x40F417F: (below main) (in /usr/lib64/power9/libc-2.28.so)
==7895==
==7895== Use of uninitialised value of size 8
==7895==    at 0x40505C4: ??? (in /usr/lib64/valgrind/vgpreload_core-ppc64le-linux.so)
==7895==    by 0x4114A03: __run_exit_handlers (in /usr/lib64/power9/libc-2.28.so)
==7895==    by 0x4114A97: exit (in /usr/lib64/power9/libc-2.28.so)
==7895==    by 0x40F417F: (below main) (in /usr/lib64/power9/libc-2.28.so)
==7895==
aligned-lxvb16x
==7900== Invalid read of size 1
==7900==    at 0x10000628: main (in /root/v-ppc64-stcmp/aligned-lxvb16x)
==7900==  Address 0x42e0044 is 0 bytes after a block of size 4 alloc'd
==7900==    at 0x408420C: malloc (vg_replace_malloc.c:299)
==7900==    by 0x100005A7: main (in /root/v-ppc64-stcmp/aligned-lxvb16x)
==7900==
unaligned-ldbrx
==7905== Invalid read of size 4
==7905==    at 0x10000630: main (in /root/v-ppc64-stcmp/unaligned-ldbrx)
==7905==  Address 0x42e0045 is 0 bytes after a block of size 5 alloc'd
==7905==    at 0x408420C: malloc (vg_replace_malloc.c:299)
==7905==    by 0x100005A7: main (in /root/v-ppc64-stcmp/unaligned-ldbrx)
==7905==
unaligned-lxvd2x
==7910== Invalid read of size 8
==7910==    at 0x10000630: main (in /root/v-ppc64-stcmp/unaligned-lxvd2x)
==7910==  Address 0x42e0041 is 1 bytes inside a block of size 5 alloc'd
==7910==    at 0x408420C: malloc (vg_replace_malloc.c:299)
==7910==    by 0x100005A7: main (in /root/v-ppc64-stcmp/unaligned-lxvd2x)
==7910==
unaligned-lxvb16x
==7915== Invalid read of size 1
==7915==    at 0x10000630: main (in /root/v-ppc64-stcmp/unaligned-lxvb16x)
==7915==  Address 0x42e0045 is 0 bytes after a block of size 5 alloc'd
==7915==    at 0x408420C: malloc (vg_replace_malloc.c:299)
==7915==    by 0x100005A7: main (in /root/v-ppc64-stcmp/unaligned-lxvb16x)
==7915==
(In reply to Carl Love from comment #48)
> I think if you were to call IRExpr_Load to do the load as follows:
>
>    /* Load the value with the bytes reversed by doing a BE load on an LE
>       machine and a LE load on a BE machine. */
>    if (host_endness == VexEndnessBE)
>       assign( dw1, IRExpr_Load(Iend_LE, Ity_I64, mkexpr(EA)));
>    else
>       assign( dw1, IRExpr_Load(Iend_BE, Ity_I64, mkexpr(EA)));
>
>    putIReg( rD_addr, mkexpr(dw1) );
>
> you wouldn't actually need to do the byte reverse with Iop_Reverse8sIn64_x1,
> as the value would be loaded with the bytes reversed already.

I tried this, but seem to hit the following in the host_ppc_isel.c backend:

static HReg iselWordExpr_R_wrk ( ISelEnv* env, const IRExpr* e,
                                 IREndness IEndianess )
[...]
   /* --------- LOAD --------- */
   case Iex_Load: {
      HReg r_dst;
      PPCAMode* am_addr;
      if (e->Iex.Load.end != IEndianess)
         goto irreducible;

Either I am misunderstanding how the IEndianess is passed around, or it isn't actually possible to have a load of different endianness from the arch endianness.
(In reply to Mark Wielaard from comment #51)
> I tried this, but seem to hit the following in the host_ppc_isel.c backend:
> [...]
> Either I am misunderstanding how the IEndianess is passed around, or it
> isn't actually possible to have a load of different endianness from the
> arch endianness.

To clarify: there are two options here:

(1) in the front end, generate an IR load with the same endianness as the host, and then byte-reverse the vector in the IR.

(2) in the front end, generate an IR load with the opposite endianness to the host.

The effect of (1) is that the load will arrive and be handled at the back end no problem, and then the code generated from the swap-endianness IR will .. well .. swap the endianness of the value.

Whereas with (2), an opposite-endianness-to-the-host IR load travels through to the back end. At that point your options are:

(2a) directly generate an opposite-endian load, or
(2b) generate a same-endian load, followed by code to swap the vector around.

So: (2a) is the most direct route. (1) and (2b) are essentially the same, with the difference that (1) does the rearrangement in IR whereas (2b) does the rearrangement at the POWER instruction level (in the backend).

Personally I'd vote for (1) because (a) it's easier and (b) I am not sure that Memcheck will work correctly when presented with opposite-from-the-host endian loads/stores.
Created attachment 116742 [details]
Option1: Implement ldbrx as 64-bit load and Iop_Reverse8sIn64_x1

(In reply to Julian Seward from comment #52)
> Personally I'd vote for (1) because (a) it's easier and (b) I am not sure
> that Memcheck will work correctly when presented with opposite-from-the-host
> endian loads/stores.

Let's go with option 1. That is the code that this patch implements. Changes from the previous version are:

1) Added a comment that there is an earlier check that ldbrx is only used in 64-bit mode.
2) Added a comment explaining what we could do if we went with option 2.
3) Added a vassert(mode64) for the usage of Iop_Reverse8sIn64_x1 in host_ppc_isel.c.
4) Simplified the last part of Iop_Reverse8sIn64_x1 by removing the unnecessary masking, as suggested by Julian.

This makes the ldbrx-aligned testcase work as expected.
To make the ldbrx-unaligned testcase work we can use something like the following:

diff --git a/memcheck/mc_main.c b/memcheck/mc_main.c
index 3ef7cb913..9a7b04c9c 100644
--- a/memcheck/mc_main.c
+++ b/memcheck/mc_main.c
@@ -1508,6 +1508,10 @@ ULong mc_LOADVn_slow ( Addr a, SizeT nBits, Bool bigendian )
 #  if defined(VGA_mips64) && defined(VGABI_N32)
    if (szB == VG_WORDSIZE * 2 && VG_IS_WORD_ALIGNED(a)
        && n_addrs_bad < VG_WORDSIZE * 2)
+#  elif defined(VGA_ppc64be) || defined(VGA_ppc64le)
+   /* Unaligned loads of words on the same page are OK. */
+   if (szB == VG_WORDSIZE && ((a & 0xfff) <= 0x1000 - szB)
+       && n_addrs_bad < VG_WORDSIZE)
 #  else
    if (szB == VG_WORDSIZE && VG_IS_WORD_ALIGNED(a)
        && n_addrs_bad < VG_WORDSIZE)

The question is whether the extra check for the word not crossing a page boundary is valid. Aaron, are there any alignment constraints for the usage of the ldbrx instruction?

(This would also need new exp files for the memcheck/tests/partial_load_ok and partial_load_dflt tests, which check that unaligned word loads produce a warning.)
Mark,

No alignment constraints on ldbrx; not having to check the alignment is what
makes this whole thing work. If I had to do runtime alignment checks it would
be too much code to generate inline.
Created attachment 116790 [details]
memcheck: Allow unaligned loads of words on ppc64[le]

(In reply to Aaron Sawdey from comment #55)
> No alignment constraints on ldbrx; not having to check the alignment is
> what makes this whole thing work. If I had to do runtime alignment checks
> it would be too much code to generate inline.

Thanks. In that case I propose this patch, which allows any unaligned
word-sized load on ppc64[le] and adjusts the partial-loads testcases.
Created attachment 116801 [details]
Implement ppc64 lxvd2x as 128-bit load with double word swap for ppc64le.

This fixes the partial-load issues seen with lxvd2x by replacing the two
separate 64-bit loads with one 128-bit load, with the double words swapped
afterwards if necessary (on ppc64le).
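In IR terms the approach is roughly the following. This is a hypothetical
sketch using the IROps and helpers named in this thread (load, putVSReg,
host_endness), not the attached patch itself:

  /* Sketch: one whole 128-bit load, then swap the two doublewords on
     little-endian hosts, so memcheck sees a single 16-byte access
     rather than two independent 8-byte ones. */
  IRTemp vec = newTemp(Ity_V128);
  assign( vec, load(Ity_V128, mkexpr(EA)) );
  if (host_endness == VexEndnessLE)
     putVSReg( XT, binop(Iop_64HLtoV128,
                         unop(Iop_V128to64,   mkexpr(vec)),
                         unop(Iop_V128HIto64, mkexpr(vec))) );
  else
     putVSReg( XT, mkexpr(vec) );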
Created attachment 116808 [details]
Allow unaligned loads of 128-bit vectors on ppc64[le]

lxvd2x might generate an unaligned 128-bit vector load; allow those for
partial loads on ppc64[le].
The last two patches get rid of the invalid loads for the power8 VSX inlined
strcmp sequences, but still leave us with conditional jumps depending on
undefined values. The issue is that we will hit the following:

  10000624:  99 1e 20 7c   lxvd2x  vs33,0,r3
  10000628:  99 fe 00 7c   lxvd2x  vs32,0,r31
  1000062c:  8c 03 a0 11   vspltisw v13,0
  10000630:  00 00 40 39   li      r10,0
  10000634:  06 00 81 11   vcmpequb v12,v1,v0
  10000638:  06 68 01 10   vcmpequb v0,v1,v13
  1000063c:  57 65 00 f0   xxlorc  vs32,vs32,vs44
  10000640:  06 6c 20 10   vcmpequb. v1,v0,v13
  10000644:  78 00 98 40   bge     cr6,100006bc <main+0x13c>

The bge depends on the cr6 flag, which is set by the vcmpequb. comparing two
possibly not completely defined vectors. This is done in set_AV_CR6() in
guest_ppc_toIR.c:

  /* Set the CR6 flags following an AltiVec compare operation.
   * NOTE: This also works for VSX single-precision compares.
   */
  static void set_AV_CR6 ( IRExpr* result, Bool test_all_ones )

Called as

  set_AV_CR6( mkexpr(vD), True );

after interpreting the vcmpequb as

  assign( vD, binop(Iop_CmpEQ8x16, mkexpr(vA), mkexpr(vB)) );
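For intuition, CR6 after a dot-form AltiVec compare like vcmpequb. encodes
"all lanes true" and "no lanes true". A scalar model of that computation
(an illustration only, not the VEX code) looks like this; the "all lanes
true" half is the test against an all-ones word that Memcheck struggles to
instrument precisely when some lanes are undefined:

  /* Model of the CR6 bits set by vcmpequb., per the Power ISA:
     CR6[0] = all 16 result bytes are 0xff (every lane compared equal),
     CR6[2] = all 16 result bytes are 0x00 (no lane compared equal). */
  #include <stdint.h>

  void model_cr6(const uint8_t res[16], int *all_true, int *all_false) {
     int ones = 1, zeros = 1;
     for (int i = 0; i < 16; i++) {
        if (res[i] != 0xff) ones  = 0;  /* some lane failed the compare */
        if (res[i] != 0x00) zeros = 0;  /* some lane passed the compare */
     }
     *all_true  = ones;
     *all_false = zeros;
  }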
Created attachment 116818 [details]
Implement ppc64 lxvb16x as 128-bit vector load with reversed double words

This fixes the partial-load issues seen with lxvb16x by replacing the small
loads with one 128-bit load and using Iop_Reverse8sIn64_x1 to reverse the
bytes in the double words; on ppc64le the double words are swapped
afterwards.

Tested on power9 ppc64le. Not tested on big-endian ppc64 since I couldn't
find a power9 big-endian ppc64 install.
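Sketched at the IR level, the description above amounts to something like the
following. As before this is a hypothetical illustration assembled from the
IROps named in this thread, not the attached patch:

  /* Sketch: whole 128-bit load, byte-reverse each doubleword with
     Iop_Reverse8sIn64_x1, and swap the doublewords on ppc64le. */
  IRTemp vec = newTemp(Ity_V128);
  IRTemp hi  = newTemp(Ity_I64);
  IRTemp lo  = newTemp(Ity_I64);
  assign( vec, load(Ity_V128, mkexpr(EA)) );
  assign( hi, unop(Iop_Reverse8sIn64_x1,
                   unop(Iop_V128HIto64, mkexpr(vec))) );
  assign( lo, unop(Iop_Reverse8sIn64_x1,
                   unop(Iop_V128to64,   mkexpr(vec))) );
  if (host_endness == VexEndnessLE)
     putVSReg( XT, binop(Iop_64HLtoV128, mkexpr(lo), mkexpr(hi)) );
  else
     putVSReg( XT, binop(Iop_64HLtoV128, mkexpr(hi), mkexpr(lo)) );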
The lxvb16x patch does resolve the invalid-load issues, but does not
completely resolve the "conditional jump depends on undefined value" issues.
This happens again when using the CR6 condition to branch:

   0x000000001000062c <+172>:  lxvb16x vs33,0,r3
   0x0000000010000630 <+176>:  lxvb16x vs45,0,r4
   0x0000000010000634 <+180>:  vcmpnezb. v0,v1,v13
=> 0x0000000010000638 <+184>:  bne     cr6,0x10000690 <main+272>
Created attachment 116827 [details]
Possible fix for the comment 61 problems

This patch:

* changes set_AV_CR6 so that it does scalar comparisons against zero,
  rather than sometimes against an all-ones word. This is something
  that Memcheck can instrument exactly.

* in Memcheck, requests expensive instrumentation of Iop_Cmp{EQ,NE}64
  by default on ppc64le.

Note, this is all untested! In particular it would be good to verify
that the CR6 results are unchanged.
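The idea behind the first change can be sketched at the IR level as follows.
Instead of asking "does the vector equal all-ones?", invert it and ask "is
the result zero?", which a compare against zero (expensively instrumented)
handles exactly. This is a hypothetical illustration of the principle, not
the attached patch:

  /* Sketch: test "all lanes true" as NOT(v) == 0 rather than comparing
     v against an all-ones constant.  Both 64-bit halves are folded with
     Or64 so a single compare-against-zero remains. */
  IRTemp inv = newTemp(Ity_V128);
  assign( inv, unop(Iop_NotV128, mkexpr(vD)) );
  IRExpr* all_true =
     binop(Iop_CmpEQ64,
           binop(Iop_Or64,
                 unop(Iop_V128HIto64, mkexpr(inv)),
                 unop(Iop_V128to64,   mkexpr(inv))),
           mkU64(0));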
(In reply to Julian Seward from comment #62)
> Created attachment 116827 [details]
> Possible fix for the comment 61 problems
> [...]
> Note, this is all untested! In particular it would be good to verify
> that the CR6 results are unchanged.

That definitely fixes the issue. I tested on both power9 and power8 with a
gcc and glibc build with the new str[n]cmp inline patch plus the following
fixes from this bug:

- Implement ppc64 ldbrx as 64-bit load and Iop_Reverse8sIn64_x1.
- memcheck: Allow unaligned loads of words on ppc64[le].
- Implement ppc64 lxvd2x as 128-bit load with double word swap for ppc64le.
- memcheck: Allow unaligned loads of 128bit vectors on ppc64[le].
- Implement ppc64 lxvb16x as 128-bit vector load with reversed double words.

And it seems to resolve all the gcc strcmp inline issues I was seeing before.

Both setups were ppc64le. I haven't yet tested on ppc64 or ppc32.

I do see bug #401827 and bug #401828, but they are unrelated.

Also memcheck/tests/undef_malloc_args (stderr) fails. That one does seem to
be caused by earlier patches from this bug.

And occasionally I see "Use of uninitialised value of size 8 in
_vgnU_freeres (vg_preloaded.c:68)", which is mysterious. It goes away with
--run-libc-freeres=no --run-cxx-freeres=no, but I haven't found out yet what
is causing it. It is as if sometimes we are unable to see that the weak
symbol _ZN9__gnu_cxx9__freeresEv (__gnu_cxx::__freeres) is initialized. Very
odd, since it normally works except in some cases. But it is very likely
unrelated to this bug.
Created attachment 116861 [details]
Possible fix for the memcheck/tests/undef_malloc_args failure

(In reply to Mark from comment #63)
> Also memcheck/tests/undef_malloc_args (stderr) fails. That one does seem to
> be caused by earlier patches from this bug.

This isn't a great fix, but it does now cause the output for the test to be
the same on ppc64le-linux and amd64-linux. So it might be good enough, if it
produces consistent results on all other targets too.
(In reply to Julian Seward from comment #64)
> Created attachment 116861 [details]
> Possible fix for the memcheck/tests/undef_malloc_args failure
>
> This isn't a great fix, but it does now cause the output for the test to
> be the same on ppc64le-linux and amd64-linux. So it might be good enough,
> if it produces consistent results on all other targets too.

It also seems to work on ppc64-linux, ppc32-linux and x86-linux.
(In reply to Mark Wielaard from comment #63)
> And occasionally I see "Use of uninitialised value of size 8 in
> _vgnU_freeres (vg_preloaded.c:68)", which is mysterious. It goes away with
> --run-libc-freeres=no --run-cxx-freeres=no. [...] But it is very likely
> unrelated to this bug.

This was bug #402006, which has now been fixed as commit
be7a73004583aab5d4c97cf55276ca58d5b3090b "Mark helper regs defined in
final_tidyup before freeres_wrapper call."

Thanks to Julian, who spotted the issue almost immediately.
It looks like this issue is now solved with some of the uncommitted patches
attached to this bug. To get a bit more testing I backported the following
commits to the Fedora 29 package:

commit 7f1dd9d5a: valgrind-3.14.0-get_otrack_shadow_offset_wrk-ppc.patch
commit 427198981: valgrind-3.14.0-new-strlen-IROps.patch
commit 97d336b79: valgrind-3.14.0-ppc-instr-new-IROps.patch
commit e221eca26: valgrind-3.14.0-memcheck-new-IROps.patch
commit 81d983222: valgrind-3.14.0-ppc-frontend-new-IROps.patch
commit cb5d7e047: valgrind-3.14.0-transform-popcount64-ctznat64.patch
commit 27fe22378: valgrind-3.14.0-enable-ppc-Iop_Sar_Shr8.patch

PR402006: valgrind-3.14.0-final_tidyup.patch

Plus the following uncommitted proposed patches from this bug:

attachment 116742 [details]: valgrind-3.14.0-ppc64-ldbrx.patch
attachment 116790 [details]: valgrind-3.14.0-ppc64-unaligned-words.patch
attachment 116801 [details]: valgrind-3.14.0-ppc64-lxvd2x.patch
attachment 116808 [details]: valgrind-3.14.0-ppc64-unaligned-vecs.patch
attachment 116818 [details]: valgrind-3.14.0-ppc64-lxvb16x.patch
attachment 116827 [details]: valgrind-3.14.0-set_AV_CR6.patch
attachment 116861 [details]: valgrind-3.14.0-undef_malloc_args.patch

Fedora packages:
https://koji.fedoraproject.org/koji/buildinfo?buildID=1171900
Copr packages:
https://copr.fedorainfracloud.org/coprs/mjw/valgrind-3.14.0/

Unfortunately neither (build) system has a new enough gcc compiler, so the
test results don't look great. But maybe they help some people with testing.

Any opinions on which of the 7 uncommitted patches should get in or need
more testing/work?
(In reply to Mark Wielaard from comment #67)
> It looks like this issue is now solved with some of the uncommitted patches
> attached to this bug. To get a bit more testing I backported the following
> commits to the Fedora 29 package:
> [...]

Here are the actual patches:
https://src.fedoraproject.org/rpms/valgrind/tree/master
I believe this bug is finally resolved with the push of the following 7
commits to master:

commit 3af8e12b0d49dc87cd26258131ebd60c9b587c74
Author: Julian Seward <jseward@acm.org>
Date:   Wed Dec 12 13:55:01 2018 +0100

    Fix memcheck/tests/undef_malloc_args failure.

    Try harder to trigger a memcheck error if a value is (partially)
    undefined.

commit 01f1936b1238a3bbeaf2821b61ecb36ba0ae15b1
Author: Julian Seward <jseward@acm.org>
Date:   Mon Dec 10 17:18:20 2018 +0100

    Adjust ppc set_AV_CR6 computation to help Memcheck instrumentation.

    * changes set_AV_CR6 so that it does scalar comparisons against zero,
      rather than sometimes against an all-ones word. This is something
      that Memcheck can instrument exactly.

    * in Memcheck, requests expensive instrumentation of Iop_Cmp{EQ,NE}64
      by default on ppc64le.

    https://bugs.kde.org/show_bug.cgi?id=386945#c62

commit 3ef4b2c780ea76a80ae5beaa63c1cb1d6530988b
Author: Mark Wielaard <mark@klomp.org>
Date:   Sun Dec 9 23:25:05 2018 +0100

    Implement ppc64 lxvb16x as 128-bit vector load with reversed double
    words.

    This makes it possible for memcheck to know which part of the 128bit
    vector is defined, even if the load is partly beyond an addressable
    block.

    Partially resolves bug 386945.

commit 8d12697b157bb50ad98467b565b3a7a39097bf31
Author: Mark Wielaard <mark@klomp.org>
Date:   Sun Dec 9 14:26:39 2018 +0100

    memcheck: Allow unaligned loads of 128bit vectors on ppc64[le].

    On powerpc partial unaligned loads of vectors from partially invalid
    addresses are OK and could be generated by our translation of lxvd2x.

    Adjust partial_load memcheck tests to allow partial loads of 16 byte
    vectors on powerpc64.

    Part of resolving bug #386945.

commit 98a73de1c0c83918a63e736b62d428bc2f98c943
Author: Mark Wielaard <mark@klomp.org>
Date:   Sun Dec 9 00:55:42 2018 +0100

    Implement ppc64 lxvd2x as 128-bit load with double word swap for
    ppc64le.

    This makes it possible for memcheck to know which part of the 128bit
    vector is defined, even if the load is partly beyond an addressable
    block.

    Partially resolves bug 386945.

commit 5ecdecdcd3efecadf149dcd6276140e833b7f8e6
Author: Mark Wielaard <mark@klomp.org>
Date:   Sat Dec 8 13:47:43 2018 -0500

    memcheck: Allow unaligned loads of words on ppc64[le].

    On powerpc partial unaligned loads of words from partially invalid
    addresses are OK and could be generated by our translation of ldbrx.

    Adjust partial_load memcheck tests to allow partial loads of words on
    powerpc64.

    Part of resolving bug #386945.

commit 0ed17bc9f675c59cf1e93caa4290e0b21b84c0a0
Author: Mark Wielaard <mark@klomp.org>
Date:   Fri Dec 7 10:42:22 2018 -0500

    Implement ppc64 ldbrx as 64-bit load and Iop_Reverse8sIn64_x1.

    This makes it possible for memcheck to analyse the new gcc strcmp
    inlined code correctly even if the ldbrx load is partly beyond an
    addressable block.

    Partially resolves bug 386945.

Thanks to everybody for testing and feedback.

Note that you still need the latest gcc-8-branch (and to rebuild at least
glibc with it) on ppc64[le] to have a setup that valgrind memcheck fully
understands.