Bug 386945 - Bogus memcheck errors on ppc64(le) when using strcmp() with gcc-7
Summary: Bogus memcheck errors on ppc64(le) when using strcmp() with gcc-7
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: memcheck
Version: 3.14 SVN
Platform: Other Linux
Priority: NOR  Severity: normal
Target Milestone: ---
Assignee: Julian Seward
URL: https://gcc.gnu.org/bugzilla/show_bug...
Keywords:
Depends on:
Blocks:
 
Reported: 2017-11-15 13:29 UTC by Markus Trippelsdorf
Modified: 2018-12-20 21:55 UTC
5 users

See Also:
Latest Commit:
Version Fixed In:


Attachments
assembler code (2.18 KB, text/x-tex)
2017-11-15 13:53 UTC, Markus Trippelsdorf
Details
Initial attempt at a fix (1.59 KB, patch)
2017-11-16 19:49 UTC, Julian Seward
Details
Initial attempt at a fix (9.11 KB, patch)
2017-11-16 19:52 UTC, Julian Seward
Details
Next attempt at a fix (11.74 KB, patch)
2017-11-17 11:18 UTC, Julian Seward
Details
WIP patch, as described in comment 15 (28.01 KB, patch)
2018-02-27 17:14 UTC, Julian Seward
Details
Small ppc64le binary with inlined string functions (71.80 KB, application/octet-stream)
2018-10-17 10:23 UTC, Mark Wielaard
Details
strcmp test case, ppc64le gpr version (2.09 KB, text/x-tex)
2018-11-09 15:33 UTC, Aaron Sawdey
Details
strcmp test case, ppc64le p8 vsx version (1.87 KB, text/plain)
2018-11-09 15:34 UTC, Aaron Sawdey
Details
strcmp test case, ppc64le p9 vsx version (1.60 KB, text/plain)
2018-11-09 15:34 UTC, Aaron Sawdey
Details
WIP patch, for testing (43.83 KB, patch)
2018-11-15 17:48 UTC, Julian Seward
Details
Patch for wider testing (50.37 KB, patch)
2018-11-16 09:16 UTC, Julian Seward
Details
t binary testcase (72.34 KB, application/octet-stream)
2018-11-29 22:41 UTC, Mark Wielaard
Details
PPC Iop_Reverse8sIn64_x1 implementation (3.17 KB, text/plain)
2018-11-30 14:21 UTC, Mark Wielaard
Details
implement ldbrx as 64-bit load and Iop_Reverse8sIn64_x1 (5.39 KB, text/plain)
2018-11-30 21:21 UTC, Mark Wielaard
Details
aligned and unaligned ldbrx, lxvd2x and lxvb16x testcases (2.04 KB, application/octet-stream)
2018-12-05 16:42 UTC, Mark Wielaard
Details
Option1: Implement ldbrx as 64-bit load and Iop_Reverse8sIn64_x1 (6.17 KB, text/plain)
2018-12-07 22:23 UTC, Mark Wielaard
Details
memcheck: Allow unaligned loads of words on ppc64[le] (4.38 KB, text/plain)
2018-12-08 21:40 UTC, Mark Wielaard
Details
Implement ppc64 lxvd2x as 128-bit load with double word swap for ppc64le. (1.84 KB, text/plain)
2018-12-09 11:39 UTC, Mark Wielaard
Details
Allow unaligned loads of 128bit vectors on ppc64[le] (1.18 KB, text/plain)
2018-12-09 13:30 UTC, Mark Wielaard
Details
Implement ppc64 lxvb16x as 128-bit vector load with reversed double words (3.52 KB, text/plain)
2018-12-09 22:45 UTC, Mark Wielaard
Details
Possible fix for the comment 61 problems (5.37 KB, patch)
2018-12-10 11:12 UTC, Julian Seward
Details
Possible fix for the memcheck/tests/undef_malloc_args failure (3.39 KB, patch)
2018-12-11 19:04 UTC, Julian Seward
Details

Description Markus Trippelsdorf 2017-11-15 13:29:51 UTC
trippels@gcc2-power8 ~ % cat example.c
#include <string.h>

int main ()
{
   char str1[15];
   char str2[15];
   strcpy(str1, "abcdef");
   strcpy(str2, "ABCDEF");
   return strcmp(str1, str2);
}

trippels@gcc2-power8 ~ % gcc -O2 -g example.c
trippels@gcc2-power8 ~ % valgrind ./a.out
==28988== Memcheck, a memory error detector
==28988== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==28988== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info
==28988== Command: ./a.out
==28988== 
==28988== Conditional jump or move depends on uninitialised value(s)
==28988==    at 0x100004E8: main (example.c:9)

A new strcmp optimization was added in gcc-7. Since then valgrind
has shown these bogus errors, so some kind of suppression would be needed.
Comment 1 Markus Trippelsdorf 2017-11-15 13:53:44 UTC
Created attachment 108876 [details]
assembler code

Also happens with v3.13:

==142441== Memcheck, a memory error detector
==142441== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==142441== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==142441== Command: ./a.out
==142441== 
==142441== Conditional jump or move depends on uninitialised value(s)
==142441==    at 0x10000508: main (example.c:9)

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80479#c10 for analysis.
Comment 2 Julian Seward 2017-11-16 19:49:07 UTC
This is caused by insufficiently precise definedness analysis for
(1) count leading zero operations (Iop_Clz32, Iop_Clz64)
(2) PPC integer compares (Iop_CmpORD family)
(3) integer subtract

Here's an initial, hacky patch that fixes (1) and (2).  Along with
use of the flag --expensive-definedness-checks=yes, which fixes (3),
it makes the test case run quietly.

Can you give it a try on something larger?
Comment 3 Julian Seward 2017-11-16 19:49:52 UTC
Created attachment 108905 [details]
Initial attempt at a fix
Comment 4 Julian Seward 2017-11-16 19:52:23 UTC
Created attachment 108906 [details]
Initial attempt at a fix

Second attempt, attaching the correct patch this time.
Comment 5 Markus Trippelsdorf 2017-11-16 21:24:40 UTC
(In reply to Julian Seward from comment #2)
> This is caused by insufficiently precise definedness analysis for
> (1) count leading zero operations (Iop_Clz32, Iop_Clz64)
> (2) PPC integer compares (Iop_CmpORD family)
> (3) integer subtract
> 
> Here's an initial, hacky patch that fixes (1) and (2).  Along with
> use of the flag --expensive-definedness-checks=yes, which fixes (3),
> it makes the test case run quietly.
> 
> Can you give it a try on something larger?

Unfortunately it doesn't work yet.
I still get "Invalid read of size 4" and "Conditional jump or move depends on uninitialised value(s)" on many strcmp() invocations.
Plus the patch corrupts memory and gcc ICEs now.
(Testcase is running g++ under valgrind when compiling a small C++ program.)
Comment 6 Julian Seward 2017-11-17 11:18:29 UTC
Created attachment 108913 [details]
Next attempt at a fix

* fixes crashing (from my buggy implementation of cntlzd)
* removes all (I think) invalid read warnings (I reimplemented ldbrx)

Still incorrectly reports conditional branches on uninit data.  Something
is not right somewhere.

Really, Valgrind's handling of integer conditional branches on POWER
isn't good (too complex, hard to analyse) and should be reimplemented.
Comment 7 Markus Trippelsdorf 2017-11-17 12:15:16 UTC
Thanks.
The gcc crash is gone, but I still get lots of invalid read warnings.

However, the number of errors is much lower now:

From (valgrind trunk):
 ERROR SUMMARY: 89817 errors from 298 contexts 
To (your patch and --expensive-definedness-checks=yes): 
 ERROR SUMMARY: 6495 errors from 85 contexts

BTW valgrind trunk (without any patch) doesn't like --expensive-definedness-checks=yes:

==64698== Memcheck, a memory error detector
==64698== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==64698== Using Valgrind-3.14.0.GIT and LibVEX; rerun with -h for copyright info
==64698== Command: /home/trippels/gcc_test/usr/local/bin/../libexec/gcc/powerpc64le-unknown-linux-gnu/8.0.0/cc1plus -fpreprocessed bench.ii -quiet -dumpbase bench.cpp -auxbase ben
ch -O2 -version -o bench.s
==64698== 

vex: priv/host_ppc_isel.c:1531 (iselWordExpr_R_wrk): Assertion `0' failed.
vex storage: T total 666718712 bytes allocated
vex storage: P total 192 bytes allocated

valgrind: the 'impossible' happened:
   LibVEX called failure_exit().

host stacktrace:
==64698==    at 0x58084148: show_sched_status_wrk (m_libcassert.c:355)
==64698==    by 0x5808430F: report_and_quit (m_libcassert.c:426)
==64698==    by 0x5808458B: panic (m_libcassert.c:502)
==64698==    by 0x5808458B: vgPlain_core_panic_at (m_libcassert.c:507)
==64698==    by 0x580845DB: vgPlain_core_panic (m_libcassert.c:512)
==64698==    by 0x580B1987: failure_exit (m_translate.c:751)
==64698==    by 0x581CBA7B: vex_assert_fail (main_util.c:247)
==64698==    by 0x5825AC83: iselWordExpr_R_wrk (host_ppc_isel.c:1531)
==64698==    by 0x5825AC83: iselWordExpr_R (host_ppc_isel.c:1404)
==64698==    by 0x5825E48F: iselWordExpr_RH_wrk (host_ppc_isel.c:2756)
==64698==    by 0x5825E48F: iselWordExpr_RH (host_ppc_isel.c:2704)
==64698==    by 0x58259B6B: iselWordExpr_R_wrk (host_ppc_isel.c:1477)
==64698==    by 0x58259B6B: iselWordExpr_R (host_ppc_isel.c:1404)
==64698==    by 0x58259E6B: iselWordExpr_R_wrk (host_ppc_isel.c:1505)
==64698==    by 0x58259E6B: iselWordExpr_R (host_ppc_isel.c:1404)
==64698==    by 0x5825E48F: iselWordExpr_RH_wrk (host_ppc_isel.c:2756)
==64698==    by 0x5825E48F: iselWordExpr_RH (host_ppc_isel.c:2704)
==64698==    by 0x58259AFB: iselWordExpr_R_wrk (host_ppc_isel.c:1481)
==64698==    by 0x58259AFB: iselWordExpr_R (host_ppc_isel.c:1404)
==64698==    by 0x58258363: iselCondCode_wrk (host_ppc_isel.c:2965)
==64698==    by 0x5826AAF7: iselCondCode (host_ppc_isel.c:2915)
==64698==    by 0x5826AAF7: iselStmt (host_ppc_isel.c:6385)
==64698==    by 0x5826AAF7: iselSB_PPC (host_ppc_isel.c:6983)
==64698==    by 0x581C859F: libvex_BackEnd (main_main.c:1049)
==64698==    by 0x581C859F: LibVEX_Translate (main_main.c:1186)
==64698==    by 0x580B5323: vgPlain_translate (m_translate.c:1805)
==64698==    by 0x58110E9F: handle_chain_me (scheduler.c:1084)
==64698==    by 0x58113FFF: vgPlain_scheduler (scheduler.c:1428)
==64698==    by 0x5812F697: thread_wrapper (syswrap-linux.c:103)
==64698==    by 0x5812F697: run_a_thread_NORETURN (syswrap-linux.c:156)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable (lwpid 64698)
==64698==    at 0x10B22120: equal (hash-traits.h:221)
==64698==    by 0x10B22120: equal_keys (hash-map-traits.h:57)
==64698==    by 0x10B22120: equal (hash-map.h:44)
==64698==    by 0x10B22120: hash_table<hash_map<nofree_string_hash, opt_pass*, simple_hashmap_traits<default_hash_traits<nofree_string_hash>, opt_pass*> >::hash_entry, xcallocator
>::find_with_hash(char const* const&, unsigned int) (hash-table.h:846)
==64698==    by 0x10B18C33: get (hash-map.h:161)
==64698==    by 0x10B18C33: gcc::pass_manager::register_pass_name(opt_pass*, char const*) (passes.c:864)
==64698==    by 0x10B19433: gcc::pass_manager::register_one_dump_file(opt_pass*) (passes.c:834)
==64698==    by 0x10B19567: gcc::pass_manager::register_dump_files(opt_pass*) (passes.c:846)
==64698==    by 0x10B195A7: gcc::pass_manager::register_dump_files(opt_pass*) (passes.c:849)
==64698==    by 0x10B195A7: gcc::pass_manager::register_dump_files(opt_pass*) (passes.c:849)
==64698==    by 0x10B1F7EF: gcc::pass_manager::pass_manager(gcc::context*) (passes.c:1615)
==64698==    by 0x101C220B: general_init (toplev.c:1168)
==64698==    by 0x101C220B: toplev::main(int, char**) (toplev.c:2148)
==64698==    by 0x101C51A7: main (main.c:39)
Comment 8 Julian Seward 2017-11-17 13:49:29 UTC
(In reply to Markus Trippelsdorf from comment #7)
> Thanks.
> The gcc crash is gone, but I still get lots of invalid read warnings.
> 
> However the amount of errors is much lower now:
> 
> From (valgrind trunk):
>  ERROR SUMMARY: 89817 errors from 298 contexts 
> To (your patch and --expensive-definedness-checks=yes): 
>  ERROR SUMMARY: 6495 errors from 85 contexts

I re-checked my fixes.  I cannot figure out why V still reports undefined
value errors now.  Can you point me at the bit of the gcc source that generates
this code?  Maybe that has some further explanation of how this should work.

> I still get lots of invalid read warnings.

I get those too (see below).  These are misaligned (non-8-byte-aligned)
ldbrx instructions (I checked) as part of this inlined strcmp.  Is the
strcmp-inlining expected to produce misaligned loads, or not?
Comment 9 Julian Seward 2017-11-17 14:05:53 UTC
Misaligned loads as referred to in comment 8:

==63134== Invalid read of size 8
==63134==    at 0xB163FE4: process_symtab (lto-plugin.c:905)
==63134==    by 0xB16D897: simple_object_elf_find_sections (simple-object-elf.c:570)
==63134==    by 0xB16BAF7: simple_object_find_sections (simple-object.c:194)
==63134==    by 0xB1645A7: claim_file_handler (lto-plugin.c:1009)
==63134==    by 0x1003A1BF: ??? (in /usr/bin/ld)
==63134==    by 0x10035F9F: ??? (in /usr/bin/ld)
==63134==  Address 0x4797413 is 243 bytes inside a block of size 249 alloc'd
==63134==    at 0x4083F40: malloc (vg_replace_malloc.c:299)
==63134==    by 0x41EC7FB: xmalloc (in /usr/lib64/libbfd-2.25.1-32.base.el7_4.1.so)
==63134==    by 0xB16D7CF: simple_object_elf_find_sections (simple-object-elf.c:535)
==63134==    by 0xB16BAF7: simple_object_find_sections (simple-object.c:194)
==63134==    by 0xB1645A7: claim_file_handler (lto-plugin.c:1009)
==63134==    by 0x1003A1BF: ??? (in /usr/bin/ld)
==63134==    by 0x10035F9F: ??? (in /usr/bin/ld)

==63134== 
==63134== Invalid read of size 8
==63134==    at 0xB1639F4: process_offload_section (lto-plugin.c:952)
==63134==    by 0xB16D897: simple_object_elf_find_sections (simple-object-elf.c:570)
==63134==    by 0xB16BAF7: simple_object_find_sections (simple-object.c:194)
==63134==    by 0xB164757: claim_file_handler (lto-plugin.c:1022)
==63134==    by 0x1003A1BF: ??? (in /usr/bin/ld)
==63134==    by 0x10035F9F: ??? (in /usr/bin/ld)
==63134==  Address 0x4797c53 is 243 bytes inside a block of size 249 alloc'd
==63134==    at 0x4083F40: malloc (vg_replace_malloc.c:299)
==63134==    by 0x41EC7FB: xmalloc (in /usr/lib64/libbfd-2.25.1-32.base.el7_4.1.so)
==63134==    by 0xB16D7CF: simple_object_elf_find_sections (simple-object-elf.c:535)
==63134==    by 0xB16BAF7: simple_object_find_sections (simple-object.c:194)
==63134==    by 0xB164757: claim_file_handler (lto-plugin.c:1022)
==63134==    by 0x1003A1BF: ??? (in /usr/bin/ld)
==63134==    by 0x10035F9F: ??? (in /usr/bin/ld)
Comment 10 Markus Trippelsdorf 2017-11-17 14:40:36 UTC
For some reason I cannot CC Aaron Sawdey, who wrote the PPC strcmp patch:
https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=244598
Comment 11 Julian Seward 2017-11-17 20:40:29 UTC
(In reply to Markus Trippelsdorf from comment #10)
> For some reason I cannot CC Aaron Sawdey, who wrote the PPC strcmp patch:

I can't move this along without further input.  Maybe Aaron doesn't have
an account at the KDE bugzilla.

In particular I'd like to understand why there are misaligned loads happening.
Comment 12 Markus Trippelsdorf 2017-11-18 06:31:41 UTC
(In reply to Julian Seward from comment #11)
> (In reply to Markus Trippelsdorf from comment #10)
> > For some reason I cannot CC Aaron Sawdey, who wrote the PPC strcmp patch:
> 
> I can't move this along without further input.  Maybe Aaron doesn't have
> an account at the KDE bugzilla.
> 
> In particular I'd like to understand why there are misaligned loads
> happening.

As already mentioned above, Aaron describes his algorithm here:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80479#c10
Quote: »If both strings have alignment > 8 we cannot inadvertently cross a page boundary doing 8B loads. For any argument that has smaller alignment, it emits a runtime check to see if the inline code would cross a 4k boundary. If so, the library function call is used instead of the inline code.«
Comment 13 Aaron Sawdey 2017-11-18 13:51:54 UTC
The unaligned ldbrx instructions are expected -- there is little penalty for them on power7/power8, so for these short comparisons it is faster to just do the possibly-unaligned load than to add the extra branches and code for a partial comparison to get one or both arguments aligned.

Targeting power8 we can generate overlapping unaligned ldbrx to avoid having to do multiple smaller loads or a shift/mask to compare something less than a doubleword at the end. Power7 doesn't like that, so there we do the shift/mask.

Doing the full doubleword compare first means that in the equality case I can check for a zero byte with just one cmpb. In the case where the doublewords are not equal, two cmpb instructions are needed: one to compare the string data and the other to check just one of them for a zero byte.

The code that generates this is in expand_strn_compare() in rs6000.c (gcc 7) or rs6000-string.c (gcc 8).
Comment 14 Julian Seward 2018-01-05 15:43:40 UTC
Here's a program that computes the CR.{LT,GT,EQ} bits after "cmplw"
using just AND, OR, XOR, NOT, SUB, SHL and SHR.  For "cmplw" (the
only case tested here so far), it produces the same results as the
hardware.

Since we have exact V-bit interpretations of all of the abovementioned
operations (in mc_translate.c), we should be able to mechanically
derive an exact V-bit interpretation of Iop_CmpORD32U.  If we do
the same for Iop_CmpORD64U and their signed equivalents, we'll be
a long way along the road to getting these false-positive problems
with POWER condition codes fixed.

Performance will probably be pretty horrible.  But I am hoping that
there will be a lot of duplicated expressions in the resulting V-bit
DAG, so if those are commoned up it might not be quite so bad.

------------------------

#include <stdio.h>
#include <assert.h>

typedef  unsigned int  UInt;


__attribute__((noinline))
UInt h_CmpORD32U ( UInt x, UInt y )
{
   UInt res;
   __asm__ __volatile__(
   ""      "cmplw 0, %1, %2"
   "\n\t"  "mfcr  %0"
   : /*OUT*/ "=r"(res) : /*IN*/ "r"(x), "r"(y) : /*TRASH*/ "cc", "memory"
   );
   return (res >> 28) & 0xF;
}

__attribute__((noinline))
UInt s_CmpORD32U ( UInt x, UInt y )
{
#if 0
   if (x < y) return 8;
   if (x > y) return 4;
   return 2;
#else
   /*     x-y unsignedly overflows
      ==> x < y 
   */
   UInt x_minus_y_UO = (~x & y) | ((~(x ^ y)) & (x-y));
   x_minus_y_UO >>= 31;

   UInt y_minus_x_UO = (~y & x) | ((~(y ^ x)) & (y-x));
   y_minus_x_UO >>= 31;

   //UInt res = (x_minus_y_UO << 3) | (y_minus_x_UO << 2);
   //if (res == 0) res = 2;

   UInt x_eq_y = (x_minus_y_UO | y_minus_x_UO) ^ 1;
   UInt res = (x_minus_y_UO << 3) | (y_minus_x_UO << 2) | (x_eq_y << 1);

   return res;
#endif
}

const UInt vals[] = 
   { 0x0,
     0x1,
     0x2,

     0x3ffffffeU,
     0x3fffffffU,
     0x40000000U,
     0x40000001U,
     0x40000002U,

     0x7ffffffeU,
     0x7fffffffU,
     0x80000000U,
     0x80000001U,
     0x80000002U,

     0xbffffffeU,
     0xbfffffffU,
     0xc0000000U,
     0xc0000001U,
     0xc0000002U,

     0xfffffffdU,
     0xfffffffeU,
     0xffffffffU
   };

int main ( void )
{
   UInt xi, yi;
   const UInt nElem = sizeof(vals) / sizeof(vals[0]);
   for (xi = 0; xi < nElem; xi++) {
      for (yi = 0; yi < nElem; yi++) {
         UInt x = vals[xi];
         UInt y = vals[yi];
         UInt resHU = h_CmpORD32U(x, y);
         UInt resSU = s_CmpORD32U(x, y);
         printf("%08x %08x -> %02x %02x\n", x, y, resHU, resSU);
         assert(resHU == resSU);
      }
   }
   return 0;
}
Comment 15 Julian Seward 2018-02-27 16:21:51 UTC
I have a patch which I've been using for investigating this.  It reduces the
noise level significantly, but doesn't remove it entirely.  I'll post it in
a following comment.  In the meantime I have a small failing testcase and a
question about acsawdey's strcmp implementation.  Here's the testcase:

  #include <string.h>

  __attribute__((noinline))
  char* setup() { return strndup("abcdef", 12); }

  int main (void) {
     char* x = setup();
     return strcmp(x, "abcdef") == 0 ? 99 : 77;
  }

|setup| is done out of line only so as to make the assembly for |main|
easier to follow.

Even with the abovementioned patch in place, Memcheck still reports a branch
on undefined values in |main|, in the inlined |strcmp|.  This is when
compiled with gcc-7.3.0 at -O2.

I single-stepped through this with GDB attached to Valgrind, so I can look
at both the register values and what Memcheck thinks their definedness state
is, after every instruction.  The important part of the trace follows.  A
"." means the instruction was executed. ".nt" is "not taken" and ".t" is
"taken".

0000000010000450 <main>:
.    10000450:  03 10 40 3c     lis     r2,4099
.    10000454:  00 81 42 38     addi    r2,r2,-32512
.    10000458:  a6 02 08 7c     mflr    r0
.    1000045c:  10 00 01 f8     std     r0,16(r1)
.    10000460:  a1 ff 21 f8     stdu    r1,-96(r1)
.    10000464:  25 03 00 48     bl      10000788 <setup+0x8>
.    10000468:  fe ff 82 3c     addis   r4,r2,-2
.    1000046c:  78 88 84 38     addi    r4,r4,-30600
.    10000470:  20 05 69 78     clrldi  r9,r3,52
.    10000474:  c0 0f a9 2f     cmpdi   cr7,r9,4032
.nt  10000478:  5c 01 9c 40     bge     cr7,100005d4 <main+0x184>
.    1000047c:  fe ff 42 3d     addis   r10,r2,-2
.    10000480:  28 1c 20 7d     ldbrx   r9,0,r3
.    10000484:  78 1b 68 7c     mr      r8,r3
.    10000488:  78 88 4a 39     addi    r10,r10,-30600
.    1000048c:  28 54 40 7d     ldbrx   r10,0,r10

At this point, we've loaded r10, presumably from a constant pool, and it
contains "abcdef\0\0", all bytes defined.  Also, we've loaded r9 from the
block allocated by strndup.  It just so happens that r9 also now holds the
value "abcdef\0\0", but because that block is only 7 bytes long (as we
expect), that last \0 is marked as undefined.

.    10000490:  51 48 6a 7c     subf.   r3,r10,r9

Now r3 contains zero, because r10 == r9, but the lowest 8 bits of r3 are
marked as undefined, because the lowest 8 bits of r9 are undefined.

.t   10000494:  2c 00 82 41     beq     100004c0 <main+0x70> *********

At this beq, Memcheck issues as error.  It believes -- correctly, I think --
that the branch depends on uninitialised data.  Specifically, we are
comparing "abcdef\0\0" with "abcdef\0<undefined>", and since the 7 defined
bytes are identical, the branch actually depends on the undefined lowest
byte of r9.

    10000498:   00 00 e0 38     li      r7,0
    1000049c:   f8 53 28 7d     cmpb    r8,r9,r10
    100004a0:   f8 3b 27 7d     cmpb    r7,r9,r7
    100004a4:   38 43 e8 7c     orc     r8,r7,r8
    100004a8:   74 00 08 7d     cntlzd  r8,r8
    100004ac:   08 00 08 39     addi    r8,r8,8
    100004b0:   30 46 23 79     rldcl   r3,r9,r8,56
    100004b4:   30 46 4a 79     rldcl   r10,r10,r8,56
    100004b8:   50 18 6a 7c     subf    r3,r10,r3
    100004bc:   20 01 00 48     b       100005dc <main+0x18c>
.    100004c0:  f8 1b 29 7d     cmpb    r9,r9,r3
.    100004c4:  00 00 a9 2f     cmpdi   cr7,r9,0
.t   100004c8:  14 01 9e 40     bne     cr7,100005dc <main+0x18c>

So two questions:

(1) Does the above analysis seem correct?

(2) If so, am I correct to understand that the branch on uninitialised data
    is intended, and that this is "fixed up" later on (perhaps beginning at
    100004c0) so that the correct answer is nevertheless obtained?
Comment 16 Julian Seward 2018-02-27 17:14:26 UTC
Created attachment 111056 [details]
WIP patch, as described in comment 15
Comment 17 Aaron Sawdey 2018-02-27 17:26:08 UTC
Julian,
 The analysis is ok. The code that follows on either branch looks for a zero byte in the process of producing the final result. 

What happens after the subf./beq is that a cmpb with 0 is done to check that we are not going past the end of the string:

        beq 0,.L19
.L13:
        li 7,0
        cmpb 8,9,10
        cmpb 7,9,7
        orc 8,7,8
        cntlzd 8,8
        addi 8,8,8
        rldcl 3,9,8,56
        rldcl 10,10,8,56
        subf 3,10,3
        b .L10
.L19:
        cmpb 9,9,3
        cmpdi 7,9,0
        bne 7,.L10

If the beq is not taken, the cmpb/cmpb/orc sequence produces a result that has FF in the byte position of either the first difference, or the first zero byte. Then this identified byte is extracted from both strings and subtracted to produce the final result.

We only need to look for a zero in one string: if one has a zero and the other does not, that will show up as a difference. If both have a zero byte in the same position, then comparing either one with all zeros will make that zero byte the one that is extracted and subtracted to produce the final result.

If the beq is taken, then r3 contains zero and is used to see if there is a zero byte in r9 (which must be equal to r10). If there is a zero byte, then the strings are equal and r3 is the result.
Comment 18 Julian Seward 2018-02-27 17:41:40 UTC
(In reply to Aaron Sawdey from comment #17)

Aaron,

Thanks for the details.  It's not clear though to me what the answer
to 

(2) If so, am I correct to understand that the branch on uninitialised data
    is intended,

is.  Can you give me a yes/no answer on that?
Comment 19 Aaron Sawdey 2018-02-27 17:48:53 UTC
Sorry if that was unclear. Yes, the branch on uninitialized data is intended and is caught and fixed by the cmpb with 0 that happens on both paths.
Comment 20 Julian Seward 2018-02-27 18:06:48 UTC
(In reply to Aaron Sawdey from comment #19)
> Yes, the branch on uninitialized data is intended

Thanks for the clarification.

Unfortunately, the analysis framework has the deeply wired-in assumption
that every conditional branch instruction is "important" for the final
outcome of the program.  So there's no easy way (AFAIK, at least) to
fix it from the Memcheck side.

This isn't the first time I've seen this kind of thing -- a conditional 
branch on uninitialised data, followed by suitable fixups afterwards.
I think gzip/zlib used to do this, but were subsequently modified so as
to not do that.

Despite some study of the problem I haven't come up with a way to solve
it.  How practical is it for you to change the generated code so it
doesn't do that?

There's some related stuff in the talk at 
https://fosdem.org/2018/schedule/event/debugging_tools_memcheck/.
See in particular the second half of slide 13.
Comment 21 Julian Seward 2018-02-28 12:12:20 UTC
(In reply to Julian Seward from comment #16)
> Created attachment 111056 [details]
> WIP patch, as described in comment 15

I should comment that this patch will still report invalid reads at
the end of blocks (ends of strings in malloc'd storage), when they
are misaligned but do not cross a page boundary, as described in
comment 12.  This would be easy enough to fix, if someone can get
me a test case.

Is the page size in the code hardwired at 4096?  Or is it 4K for
4K page systems, 64K for 64K page systems, etc?
Comment 22 Aaron Sawdey 2018-02-28 12:14:24 UTC
It is always hardwired to avoid 4k boundary crossings.
Comment 23 Mark Wielaard 2018-10-17 10:04:44 UTC
Any progress on this?
I am seeing more and more reports of this issue with GCC 8.1 on ppc64le.
Comment 24 Mark Wielaard 2018-10-17 10:23:33 UTC
Created attachment 115701 [details]
Small ppc64le binary with inlined string functions

Here is an example with some inlined string functions on Fedora 28 ppc64le:

$ cat foo.c
#include <string.h>
#include <stdio.h>

__attribute__ ((weak)) void
do_test (const char *left, const char *right)
{
  printf ("result: %d\n", strcmp (left, right));
}

int
main (void)
{
  do_test (strdup ("a"), strdup ("b"));
}

$ gcc --version | head -1
gcc (GCC) 8.1.1 20180712 (Red Hat 8.1.1-5)
$ gcc -O2 -g -o foo foo.c

$ valgrind -q ./foo 2>&1 | head -30
==10495== Invalid read of size 4
==10495==    at 0x10000790: do_test (foo.c:7)
==10495==    by 0x10000587: main (foo.c:13)
==10495==  Address 0x4310044 is 2 bytes after a block of size 2 alloc'd
==10495==    at 0x4093F6C: malloc (vg_replace_malloc.c:299)
==10495==    by 0x4196F63: strdup (in /usr/lib64/libc-2.27.so)
==10495==    by 0x10000563: main (foo.c:13)
==10495== 
==10495== Invalid read of size 4
==10495==    at 0x10000794: do_test (foo.c:7)
==10495==    by 0x10000587: main (foo.c:13)
==10495==  Address 0x4310094 is 2 bytes after a block of size 2 alloc'd
==10495==    at 0x4093F6C: malloc (vg_replace_malloc.c:299)
==10495==    by 0x4196F63: strdup (in /usr/lib64/libc-2.27.so)
==10495==    by 0x10000577: main (foo.c:13)
==10495== 
==10495== Conditional jump or move depends on uninitialised value(s)
==10495==    at 0x1000079C: do_test (foo.c:7)
==10495==    by 0x10000587: main (foo.c:13)
==10495== 
==10495== Conditional jump or move depends on uninitialised value(s)
==10495==    at 0x4156044: vfprintf@@GLIBC_2.17 (in /usr/lib64/libc-2.27.so)
==10495==    by 0x415DED7: printf@@GLIBC_2.17 (in /usr/lib64/libc-2.27.so)
==10495==    by 0x100007D7: do_test (foo.c:7)
==10495==    by 0x10000587: main (foo.c:13)
==10495== 
==10495== Use of uninitialised value of size 8
==10495==    at 0x41522E8: _itoa_word (in /usr/lib64/libc-2.27.so)
==10495==    by 0x41568B7: vfprintf@@GLIBC_2.17 (in /usr/lib64/libc-2.27.so)
==10495==    by 0x100007D7: do_test (foo.c:7)

objdump -d of do_test ():

0000000010000730 <do_test>:
    10000730:	02 10 40 3c 	lis     r2,4098
    10000734:	00 7f 42 38 	addi    r2,r2,32512
    10000738:	a6 02 08 7c 	mflr    r0
    1000073c:	20 05 69 78 	clrldi  r9,r3,52
    10000740:	c0 0f a9 2f 	cmpdi   cr7,r9,4032
    10000744:	10 00 01 f8 	std     r0,16(r1)
    10000748:	a1 ff 21 f8 	stdu    r1,-96(r1)
    1000074c:	10 00 9c 40 	bge     cr7,1000075c <do_test+0x2c>
    10000750:	20 05 89 78 	clrldi  r9,r4,52
    10000754:	c0 0f a9 2f 	cmpdi   cr7,r9,4032
    10000758:	38 00 9c 41 	blt     cr7,10000790 <do_test+0x60>
    1000075c:	a5 fd ff 4b 	bl      10000500 <00000039.plt_call.strcmp@@GLIBC_2.17>
    10000760:	18 00 41 e8 	ld      r2,24(r1)
    10000764:	b4 07 64 7c 	extsw   r4,r3
    10000768:	fe ff 62 3c 	addis   r3,r2,-2
    1000076c:	98 8b 63 38 	addi    r3,r3,-29800
    10000770:	51 fd ff 4b 	bl      100004c0 <00000039.plt_call.printf@@GLIBC_2.17>
    10000774:	18 00 41 e8 	ld      r2,24(r1)
    10000778:	60 00 21 38 	addi    r1,r1,96
    1000077c:	10 00 01 e8 	ld      r0,16(r1)
    10000780:	a6 03 08 7c 	mtlr    r0
    10000784:	20 00 80 4e 	blr
    10000788:	00 00 00 60 	nop
    1000078c:	00 00 42 60 	ori     r2,r2,0
    10000790:	28 1c 40 7d 	ldbrx   r10,0,r3
    10000794:	28 24 00 7d 	ldbrx   r8,0,r4
    10000798:	51 50 28 7d 	subf.   r9,r8,r10
    1000079c:	54 00 82 41 	beq     100007f0 <do_test+0xc0>
    100007a0:	00 00 20 39 	li      r9,0
    100007a4:	f8 43 43 7d 	cmpb    r3,r10,r8
    100007a8:	f8 4b 49 7d 	cmpb    r9,r10,r9
    100007ac:	38 1b 23 7d 	orc     r3,r9,r3
    100007b0:	74 00 63 7c 	cntlzd  r3,r3
    100007b4:	08 00 63 38 	addi    r3,r3,8
    100007b8:	30 1e 4a 79 	rldcl   r10,r10,r3,56
    100007bc:	30 1e 03 79 	rldcl   r3,r8,r3,56
    100007c0:	50 50 23 7d 	subf    r9,r3,r10
    100007c4:	78 4b 23 7d 	mr      r3,r9
    100007c8:	b4 07 64 7c 	extsw   r4,r3
    100007cc:	fe ff 62 3c 	addis   r3,r2,-2
    100007d0:	98 8b 63 38 	addi    r3,r3,-29800
    100007d4:	ed fc ff 4b 	bl      100004c0 <00000039.plt_call.printf@@GLIBC_2.17>
    100007d8:	18 00 41 e8 	ld      r2,24(r1)
    100007dc:	60 00 21 38 	addi    r1,r1,96
    100007e0:	10 00 01 e8 	ld      r0,16(r1)
    100007e4:	a6 03 08 7c 	mtlr    r0
    100007e8:	20 00 80 4e 	blr
    100007ec:	00 00 42 60 	ori     r2,r2,0
    100007f0:	f8 4b 4a 7d 	cmpb    r10,r10,r9
    100007f4:	00 00 aa 2f 	cmpdi   cr7,r10,0
    100007f8:	cc ff 9e 40 	bne     cr7,100007c4 <do_test+0x94>
    100007fc:	08 00 23 39 	addi    r9,r3,8
    10000800:	28 4c 40 7d 	ldbrx   r10,0,r9
    10000804:	08 00 24 39 	addi    r9,r4,8
    10000808:	28 4c 00 7d 	ldbrx   r8,0,r9
    1000080c:	51 50 28 7d 	subf.   r9,r8,r10
    10000810:	90 ff 82 40 	bne     100007a0 <do_test+0x70>
    10000814:	f8 4b 4a 7d 	cmpb    r10,r10,r9
    10000818:	00 00 aa 2f 	cmpdi   cr7,r10,0
    1000081c:	a8 ff 9e 40 	bne     cr7,100007c4 <do_test+0x94>
    10000820:	10 00 23 39 	addi    r9,r3,16
    10000824:	28 4c 40 7d 	ldbrx   r10,0,r9
    10000828:	10 00 24 39 	addi    r9,r4,16
    1000082c:	28 4c 00 7d 	ldbrx   r8,0,r9
    10000830:	51 50 28 7d 	subf.   r9,r8,r10
    10000834:	6c ff 82 40 	bne     100007a0 <do_test+0x70>
    10000838:	f8 4b 4a 7d 	cmpb    r10,r10,r9
    1000083c:	00 00 aa 2f 	cmpdi   cr7,r10,0
    10000840:	84 ff 9e 40 	bne     cr7,100007c4 <do_test+0x94>
    10000844:	18 00 23 39 	addi    r9,r3,24
    10000848:	28 4c 40 7d 	ldbrx   r10,0,r9
    1000084c:	18 00 24 39 	addi    r9,r4,24
    10000850:	28 4c 00 7d 	ldbrx   r8,0,r9
    10000854:	51 50 28 7d 	subf.   r9,r8,r10
    10000858:	48 ff 82 40 	bne     100007a0 <do_test+0x70>
    1000085c:	f8 4b 4a 7d 	cmpb    r10,r10,r9
    10000860:	00 00 aa 2f 	cmpdi   cr7,r10,0
    10000864:	60 ff 9e 40 	bne     cr7,100007c4 <do_test+0x94>
    10000868:	20 00 23 39 	addi    r9,r3,32
    1000086c:	28 4c 40 7d 	ldbrx   r10,0,r9
    10000870:	20 00 24 39 	addi    r9,r4,32
    10000874:	28 4c 00 7d 	ldbrx   r8,0,r9
    10000878:	51 50 28 7d 	subf.   r9,r8,r10
    1000087c:	24 ff 82 40 	bne     100007a0 <do_test+0x70>
    10000880:	f8 4b 4a 7d 	cmpb    r10,r10,r9
    10000884:	00 00 aa 2f 	cmpdi   cr7,r10,0
    10000888:	3c ff 9e 40 	bne     cr7,100007c4 <do_test+0x94>
    1000088c:	28 00 23 39 	addi    r9,r3,40
    10000890:	28 4c 40 7d 	ldbrx   r10,0,r9
    10000894:	28 00 24 39 	addi    r9,r4,40
    10000898:	28 4c 00 7d 	ldbrx   r8,0,r9
    1000089c:	51 50 28 7d 	subf.   r9,r8,r10
    100008a0:	00 ff 82 40 	bne     100007a0 <do_test+0x70>
    100008a4:	f8 4b 4a 7d 	cmpb    r10,r10,r9
    100008a8:	00 00 aa 2f 	cmpdi   cr7,r10,0
    100008ac:	18 ff 9e 40 	bne     cr7,100007c4 <do_test+0x94>
    100008b0:	30 00 23 39 	addi    r9,r3,48
    100008b4:	28 4c 40 7d 	ldbrx   r10,0,r9
    100008b8:	30 00 24 39 	addi    r9,r4,48
    100008bc:	28 4c 00 7d 	ldbrx   r8,0,r9
    100008c0:	51 50 28 7d 	subf.   r9,r8,r10
    100008c4:	dc fe 82 40 	bne     100007a0 <do_test+0x70>
    100008c8:	f8 4b 4a 7d 	cmpb    r10,r10,r9
    100008cc:	00 00 aa 2f 	cmpdi   cr7,r10,0
    100008d0:	f4 fe 9e 40 	bne     cr7,100007c4 <do_test+0x94>
    100008d4:	38 00 23 39 	addi    r9,r3,56
    100008d8:	28 4c 40 7d 	ldbrx   r10,0,r9
    100008dc:	38 00 24 39 	addi    r9,r4,56
    100008e0:	28 4c 00 7d 	ldbrx   r8,0,r9
    100008e4:	51 50 28 7d 	subf.   r9,r8,r10
    100008e8:	b8 fe 82 40 	bne     100007a0 <do_test+0x70>
    100008ec:	f8 4b 4a 7d 	cmpb    r10,r10,r9
    100008f0:	00 00 aa 2f 	cmpdi   cr7,r10,0
    100008f4:	d0 fe 9e 40 	bne     cr7,100007c4 <do_test+0x94>
    100008f8:	40 00 84 38 	addi    r4,r4,64
    100008fc:	40 00 63 38 	addi    r3,r3,64
    10000900:	5c fe ff 4b 	b       1000075c <do_test+0x2c>
    10000904:	00 00 00 00 	.long 0x0
    10000908:	00 00 00 01 	.long 0x1000000
    1000090c:	80 00 00 00 	.long 0x80
Comment 25 Aaron Sawdey 2018-11-09 15:33:23 UTC
Created attachment 116201 [details]
strcmp test case, ppc64le gpr version
Comment 26 Aaron Sawdey 2018-11-09 15:34:08 UTC
Created attachment 116202 [details]
strcmp test case, ppc64le p8 vsx version
Comment 27 Aaron Sawdey 2018-11-09 15:34:38 UTC
Created attachment 116203 [details]
strcmp test case, ppc64le p9 vsx version
Comment 28 Aaron Sawdey 2018-11-09 15:37:26 UTC
I've just attached asm generated by current gcc trunk for this test case:

#include <string.h>

int main ()
{
   char str1[15];
   char str2[15];
   strcpy(str1, "abcdef");
   strcpy(str2, "ABCDEF");
   return strcmp(str1, str2);
}

They are compiled with -mno-vsx (gpr version), -mcpu=power8 (vsx p8), and -mcpu=power9 (vsx p9).
Comment 29 Julian Seward 2018-11-15 17:48:20 UTC
Created attachment 116331 [details]
WIP patch, for testing

WIP patch.  Tested on 64-bit P8 LE.  This runs clean with 
the two P8 test cases in comments 25 and 26:

* strcmp test case, ppc64le gpr version
* strcmp test case, ppc64le p8 vsx version

I didn't test the P9 VSX version yet.  I think the patch is more accurate
and also reasonably correct, but may assert in a couple of cases due to
missing code generation cases, which I will fix.  It's worth trying as-is
on P8-LE though.  It applies on top of 1d42a625abd8b6d24186e6bda2b401ee24a118c4
(Philippe Waroquiers, Tue Nov 6 21:40:43 2018 +0100)
Comment 30 Julian Seward 2018-11-16 09:16:52 UTC
Created attachment 116337 [details]
Patch for wider testing

Here's a patch for wider testing.  If it looks OK after testing, I'll
divide it up into its various logically-independent pieces for landing.

It looks OK to me in initial testing on:

  gcc135 (P9, LE64)
  gcc112 (P8, LE64)
  gcc110 (P7, BE64 and BE32)

One thing I changed, but couldn't test, is the implementation of vpopcntd
on 32 bit targets.  So this really needs testing on a BE32 platform that can
do vpopcntd (gcc110 can't).
Comment 31 Aaron Sawdey 2018-11-16 18:11:45 UTC
I tested this on a target that does support 32-bit (power8 BE) and it looked good. Various strcmp/memcmp expansion tests do not see errors from valgrind. There were no additional regtest fails between unpatched and patched builds. What I don't know is if the regtest is a sufficient test of vpopcntd.
Comment 32 Carl Love 2018-11-19 15:54:54 UTC
Aaron, it sounds like you ran make regtest with and without the patch and didn't see any errors.  The regtest includes a fairly good test of vpopcntd in test jm_vec_isa_2_07.  So as long as you didn't see any additional errors, specifically in jm_vec_isa_2_07, you should be good to go.
Comment 33 Mark Wielaard 2018-11-19 16:50:11 UTC
I assume the testing was done against GCC9/svn trunk? If so, are there any specific GCC9 patches that could/should be backported to GCC8?
Comment 34 Aaron Sawdey 2018-11-19 17:12:17 UTC
Carl, yes I ran regtest on power8 BE (which supports 32 and 64 bit abi) with and without the patch and didn't see any differences.

Mark, it's on my list to backport the gcc code-gen change to gcc8. The patch involved is this one:

https://gcc.gnu.org/ml/gcc-patches/2018-10/msg01589.html
Comment 35 Julian Seward 2018-11-20 11:16:57 UTC
Fixed:

27fe22378da38424102c5292b782cacdd9d7b9e4
  Add support for Iop_{Sar,Shr}8 on ppc.  --expensive-definedness-checks=yes
  needs them.

cb5d7e047598bff6d0f1d707a70d9fb1a1c7f0e2
  fold_Expr: transform PopCount64(And64(Add64(x,-1),Not64(x))) into CtzNat64(x).
    
81d9832226d6e3d1ee78ee3133189d7b520e7eea
  ppc front end: use new IROps added in 42719898.

e221eca26be6b2396e3fcbf4117e630fc22e79f6
  Add Memcheck support for IROps added in 42719898.

97d336b79e36f6c99d8b07f49ebc9b780e6df84e
  Add ppc host-side isel and instruction support for IROps added in previous
  commit.

4271989815b5fc933c1e29bc75507c2726dc3738
  Add some new IROps to support improved Memcheck analysis of strlen etc.

7f1dd9d5aec1f1fd4eb0ae3a311358a914f1d73f
  get_otrack_shadow_offset_wrk for ppc32 and ppc64: add missing cases for
  XER_OV32, XER_CA32 and C_FPCC.
Comment 36 curaga 2018-11-27 10:38:36 UTC
Seems only partially fixed? On power8, ppc64le, gcc 8.2, and current valgrind git, the following still fails:

#include <stdlib.h>
#include <string.h>

int main() {

        char *foo = calloc(3, 1);

        return strcmp(foo, "a");
}
==32113== Memcheck, a memory error detector
==32113== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==32113== Using Valgrind-3.15.0.GIT and LibVEX; rerun with -h for copyright info
==32113== Command: ../valgrind-strcmp
==32113==
==32113== Invalid read of size 4
==32113==    at 0x1806F8: main (valgrind-strcmp.c:8)
==32113==  Address 0x4b30044 is 1 bytes after a block of size 3 alloc'd
==32113==    at 0x48973E8: calloc (vg_replace_malloc.c:752)
==32113==    by 0x1806DF: main (valgrind-strcmp.c:6)
==32113==
==32113== Conditional jump or move depends on uninitialised value(s)
==32113==    at 0x180700: main (valgrind-strcmp.c:8)
==32113==
==32113==
==32113== HEAP SUMMARY:
==32113==     in use at exit: 3 bytes in 1 blocks
==32113==   total heap usage: 1 allocs, 0 frees, 3 bytes allocated
==32113==
==32113== LEAK SUMMARY:
==32113==    definitely lost: 3 bytes in 1 blocks
==32113==    indirectly lost: 0 bytes in 0 blocks
==32113==      possibly lost: 0 bytes in 0 blocks
==32113==    still reachable: 0 bytes in 0 blocks
==32113==         suppressed: 0 bytes in 0 blocks
==32113== Rerun with --leak-check=full to see details of leaked memory
==32113==
==32113== For counts of detected and suppressed errors, rerun with: -v
==32113== Use --track-origins=yes to see where uninitialised values come from
==32113== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
Comment 37 Aaron Sawdey 2018-11-27 12:24:59 UTC
The fix is in gcc 9 trunk. I'm working on the gcc 8 backport now, so it's definitely not in 8.2; hopefully it will be in 8.3 when that comes out.
Comment 38 Aaron Sawdey 2018-11-28 19:37:17 UTC
The fix for this is checked in to the gcc-8-branch as svn revision 266578. I believe this will show up in gcc 8.3.
Comment 39 Mark Wielaard 2018-11-29 22:36:18 UTC
With that gcc backport https://gcc.gnu.org/ml/gcc-patches/2018-11/msg02161.html and the valgrind fixes we get rid of all the "Conditional jump or move depends on uninitialised value(s)" issues, but unfortunately we still have an issue with some of the "Invalid read of size 4" errors.

The simplest example is the last C program:

#include <stdlib.h>
#include <string.h>

int main() {

        char *foo = calloc(3, 1);

        return strcmp(foo, "a");
}

gcc -g -O2 -o t t.c

valgrind -q ./t
==31430== Invalid read of size 4
==31430==    at 0x10000510: main (t.c:8)
==31430==  Address 0x42e0044 is 1 bytes after a block of size 3 alloc'd
==31430==    at 0x40874C8: calloc (vg_replace_malloc.c:752)
==31430==    by 0x100004FF: main (t.c:6)
==31430== 

The issue is the following ldbrx:

Dump of assembler code for function main:
   0x00000000100004e0 <+0>:	lis     r2,4098
   0x00000000100004e4 <+4>:	addi    r2,r2,32512
   0x00000000100004e8 <+8>:	mflr    r0
   0x00000000100004ec <+12>:	li      r4,1
   0x00000000100004f0 <+16>:	li      r3,3
   0x00000000100004f4 <+20>:	std     r0,16(r1)
   0x00000000100004f8 <+24>:	stdu    r1,-32(r1)
   0x00000000100004fc <+28>:	bl      0x10000480 <00000022.plt_call.calloc@@GLIBC_2.17>
   0x0000000010000500 <+32>:	ld      r2,24(r1)
   0x0000000010000504 <+36>:	addis   r4,r2,-2
   0x0000000010000508 <+40>:	li      r10,0
   0x000000001000050c <+44>:	addi    r4,r4,-30120
=> 0x0000000010000510 <+48>:	ldbrx   r7,0,r3
   0x0000000010000514 <+52>:	ldbrx   r8,0,r4
   0x0000000010000518 <+56>:	cmpb    r10,r7,r10
   0x000000001000051c <+60>:	cmpb    r9,r7,r8
   0x0000000010000520 <+64>:	orc.    r10,r10,r9
   0x0000000010000524 <+68>:	bne     0x10000548 <main+104>
   0x0000000010000528 <+72>:	addi    r9,r3,8
   0x000000001000052c <+76>:	ldbrx   r7,0,r9
   0x0000000010000530 <+80>:	addi    r9,r4,8
   0x0000000010000534 <+84>:	ldbrx   r8,0,r9
   0x0000000010000538 <+88>:	cmpb    r10,r7,r10
   0x000000001000053c <+92>:	cmpb    r9,r7,r8
   0x0000000010000540 <+96>:	orc.    r10,r10,r9
   0x0000000010000544 <+100>:	beq     0x10000570 <main+144>
   0x0000000010000548 <+104>:	cntlzd  r9,r10
   0x000000001000054c <+108>:	addi    r9,r9,8
   0x0000000010000550 <+112>:	rldcl   r3,r7,r9,56
   0x0000000010000554 <+116>:	rldcl   r9,r8,r9,56
   0x0000000010000558 <+120>:	subf    r3,r9,r3
   0x000000001000055c <+124>:	addi    r1,r1,32
   0x0000000010000560 <+128>:	extsw   r3,r3
   0x0000000010000564 <+132>:	ld      r0,16(r1)
   0x0000000010000568 <+136>:	mtlr    r0
   0x000000001000056c <+140>:	blr
   0x0000000010000570 <+144>:	addi    r9,r3,16
   0x0000000010000574 <+148>:	ldbrx   r7,0,r9
   0x0000000010000578 <+152>:	addi    r9,r4,16
   0x000000001000057c <+156>:	ldbrx   r8,0,r9
   0x0000000010000580 <+160>:	cmpb    r10,r7,r10
   0x0000000010000584 <+164>:	cmpb    r9,r7,r8
   0x0000000010000588 <+168>:	orc.    r10,r10,r9
   0x000000001000058c <+172>:	bne     0x10000548 <main+104>
   0x0000000010000590 <+176>:	addi    r9,r3,24
   0x0000000010000594 <+180>:	ldbrx   r7,0,r9
   0x0000000010000598 <+184>:	addi    r9,r4,24
   0x000000001000059c <+188>:	ldbrx   r8,0,r9
   0x00000000100005a0 <+192>:	cmpb    r10,r7,r10
   0x00000000100005a4 <+196>:	cmpb    r9,r7,r8
   0x00000000100005a8 <+200>:	orc.    r10,r10,r9
   0x00000000100005ac <+204>:	bne     0x10000548 <main+104>
   0x00000000100005b0 <+208>:	addi    r9,r3,32
   0x00000000100005b4 <+212>:	ldbrx   r7,0,r9
   0x00000000100005b8 <+216>:	addi    r9,r4,32
   0x00000000100005bc <+220>:	ldbrx   r8,0,r9
   0x00000000100005c0 <+224>:	cmpb    r10,r7,r10
   0x00000000100005c4 <+228>:	cmpb    r9,r7,r8
   0x00000000100005c8 <+232>:	orc.    r10,r10,r9
   0x00000000100005cc <+236>:	bne     0x10000548 <main+104>
   0x00000000100005d0 <+240>:	addi    r9,r3,40
   0x00000000100005d4 <+244>:	ldbrx   r7,0,r9
   0x00000000100005d8 <+248>:	addi    r9,r4,40
   0x00000000100005dc <+252>:	ldbrx   r8,0,r9
   0x00000000100005e0 <+256>:	cmpb    r10,r7,r10
   0x00000000100005e4 <+260>:	cmpb    r9,r7,r8
   0x00000000100005e8 <+264>:	orc.    r10,r10,r9
   0x00000000100005ec <+268>:	bne     0x10000548 <main+104>
   0x00000000100005f0 <+272>:	addi    r9,r3,48
   0x00000000100005f4 <+276>:	ldbrx   r7,0,r9
   0x00000000100005f8 <+280>:	addi    r9,r4,48
   0x00000000100005fc <+284>:	ldbrx   r8,0,r9
   0x0000000010000600 <+288>:	cmpb    r10,r7,r10
   0x0000000010000604 <+292>:	cmpb    r9,r7,r8
   0x0000000010000608 <+296>:	orc.    r10,r10,r9
   0x000000001000060c <+300>:	bne     0x10000548 <main+104>
   0x0000000010000610 <+304>:	addi    r9,r3,56
   0x0000000010000614 <+308>:	ldbrx   r7,0,r9
   0x0000000010000618 <+312>:	addi    r9,r4,56
   0x000000001000061c <+316>:	ldbrx   r8,0,r9
   0x0000000010000620 <+320>:	cmpb    r10,r7,r10
   0x0000000010000624 <+324>:	cmpb    r9,r7,r8
   0x0000000010000628 <+328>:	orc.    r10,r10,r9
   0x000000001000062c <+332>:	bne     0x10000548 <main+104>
   0x0000000010000630 <+336>:	addi    r4,r4,64
   0x0000000010000634 <+340>:	addi    r3,r3,64
   0x0000000010000638 <+344>:	bl      0x100004c0 <00000022.plt_call.strcmp@@GLIBC_2.17>
   0x000000001000063c <+348>:	ld      r2,24(r1)
   0x0000000010000640 <+352>:	b       0x1000055c <main+124>
   0x0000000010000644 <+356>:	.long 0x0
   0x0000000010000648 <+360>:	.long 0x1000000
   0x000000001000064c <+364>:	.long 0x80
Comment 40 Mark Wielaard 2018-11-29 22:41:51 UTC
Created attachment 116575 [details]
t binary testcase

The t binary from comment #39 to help with debugging.
Comment 41 Mark Wielaard 2018-11-29 23:55:06 UTC
Hoping that the JRS FIXME for ldbrx (Load Doubleword Byte-Reverse Indexed) was a hint that memcheck would be helped by a single double-word load plus an Iop_Reverse, I hacked up:

         DIP("ldbrx r%u,r%u,r%u\n", rD_addr, rA_addr, rB_addr);
         IRTemp dw1 = newTemp(Ity_I64);
         IRTemp dw2 = newTemp(Ity_I64);

         assign( dw1, load( Ity_I64, mkexpr(EA)) );
         assign( dw2, unop(Iop_Reverse8sIn64_x1, mkexpr(dw1)) );
         putIReg( rD_addr, mkexpr(dw2) );

But the ppc backend doesn't handle Iop_Reverse8sIn64_x1, so that didn't really go anywhere for now.
Comment 42 Aaron Sawdey 2018-11-30 03:02:13 UTC
I think you're on the right track with ldbrx. The only difference between the code that gcc is generating and what is in glibc is ldbrx vs ld, here is what is in glibc trunk:

        /* For short string up to 16 bytes, load both s1 and s2 using
           unaligned dwords and compare.  */
        ld      r8,0(r3)
        ld      r10,0(r4)
        cmpb    r12,r8,r0
        cmpb    r11,r8,r10
        orc.    r9,r12,r11
        bne     cr0,L(different_nocmpb)

The glibc code has a couple more instructions afterwards because we don't have a count trailing zeros instruction that would make the most sense for the byte ordering that ld gives you on little endian.
Comment 43 Mark Wielaard 2018-11-30 14:21:20 UTC
Created attachment 116586 [details]
PPC Iop_Reverse8sIn64_x1 implementation

This compiles and runs, but hasn't been tested at all (combine with the patch in comment #41)
Comment 44 Mark Wielaard 2018-11-30 21:21:50 UTC
Created attachment 116593 [details]
implement ldbrx as 64-bit load and Iop_Reverse8sIn64_x1

Real implementation of ldbrx as 64-bit load and Iop_Reverse8sIn64_x1.

This works well and resolves the issue with (aligned) ldbrx instructions reading past a memory block.

It has only been tested on a ppc64le power9 setup for now.
Could someone who actually knows ppc64 review this implementation and say if it is good enough for other setups too?
Comment 45 Mark Wielaard 2018-11-30 21:45:07 UTC
Unfortunately even with the ldbrx patch (which seems an OK improvement in itself) we still have some issues.

ldbrx is also used on unaligned addresses. In which case even --partial-loads-ok=yes doesn't help us.

For example this little program which has a string table with some strings at odd addresses:

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

int main (int argc, char **argv)
{
  char *strtable = malloc (5);
  // First element is the empty string.
  strtable[0] = '\0';
  // Second (and last) element is a short string.
  strcpy (&strtable[1], "abc");

  // What was the second string again?
  int equal = strcmp (argv[0], &strtable[1]);
  if (equal == 0)
    puts ("yes");
  else
    puts ("no");

  free (strtable);
  return 0;
}

gcc -Wall -g -O2 -o table table.c
valgrind -q ./table
==7902== Invalid read of size 8
==7902==    at 0x10000630: main (table.c:14)
==7902==  Address 0x42e0041 is 1 bytes inside a block of size 5 alloc'd
==7902==    at 0x408420C: malloc (vg_replace_malloc.c:299)
==7902==    by 0x100005A7: main (table.c:7)
==7902== 
no

Dump of assembler code for function main:
   0x0000000010000580 <+0>:     lis     r2,4098
   0x0000000010000584 <+4>:     addi    r2,r2,32512
   0x0000000010000588 <+8>:     mflr    r0
   0x000000001000058c <+12>:    std     r30,-16(r1)
   0x0000000010000590 <+16>:    std     r31,-8(r1)
   0x0000000010000594 <+20>:    li      r3,5
   0x0000000010000598 <+24>:    mr      r30,r4
   0x000000001000059c <+28>:    std     r0,16(r1)
   0x00000000100005a0 <+32>:    stdu    r1,-48(r1)
   0x00000000100005a4 <+36>:    bl      0x10000560 <00000022.plt_call.malloc@@GLIBC_2.17>
   0x00000000100005a8 <+40>:    ld      r2,24(r1)
   0x00000000100005ac <+44>:    lis     r10,99
   0x00000000100005b0 <+48>:    li      r8,0
   0x00000000100005b4 <+52>:    ori     r10,r10,25185
   0x00000000100005b8 <+56>:    mr      r31,r3
   0x00000000100005bc <+60>:    ld      r3,0(r30)
   0x00000000100005c0 <+64>:    addi    r4,r31,1
   0x00000000100005c4 <+68>:    stb     r8,0(r31)
   0x00000000100005c8 <+72>:    stw     r10,1(r31)
   0x00000000100005cc <+76>:    clrldi  r9,r3,52
   0x00000000100005d0 <+80>:    cmpdi   cr7,r9,4032
   0x00000000100005d4 <+84>:    bge     cr7,0x100005e4 <main+100>
   0x00000000100005d8 <+88>:    clrldi  r9,r4,52
   0x00000000100005dc <+92>:    cmpdi   cr7,r9,4032
   0x00000000100005e0 <+96>:    blt     cr7,0x1000062c <main+172>
   0x00000000100005e4 <+100>:   bl      0x10000520 <00000022.plt_call.strcmp@@GLIBC_2.17>
   0x00000000100005e8 <+104>:   ld      r2,24(r1)
   0x00000000100005ec <+108>:   cmpwi   cr7,r3,0
   0x00000000100005f0 <+112>:   bne     cr7,0x1000074c <main+460>
   0x00000000100005f4 <+116>:   addis   r3,r2,-2
   0x00000000100005f8 <+120>:   addi    r3,r3,-29840
   0x00000000100005fc <+124>:   bl      0x10000540 <00000022.plt_call.puts@@GLIBC_2.17>
   0x0000000010000600 <+128>:   ld      r2,24(r1)
   0x0000000010000604 <+132>:   mr      r3,r31
   0x0000000010000608 <+136>:   bl      0x100004e0 <00000022.plt_call.free@@GLIBC_2.17>
   0x000000001000060c <+140>:   ld      r2,24(r1)
   0x0000000010000610 <+144>:   addi    r1,r1,48
   0x0000000010000614 <+148>:   li      r3,0
   0x0000000010000618 <+152>:   ld      r0,16(r1)
   0x000000001000061c <+156>:   ld      r30,-16(r1)
   0x0000000010000620 <+160>:   ld      r31,-8(r1)
   0x0000000010000624 <+164>:   mtlr    r0
   0x0000000010000628 <+168>:   blr
   0x000000001000062c <+172>:   ldbrx   r8,0,r3
=> 0x0000000010000630 <+176>:   ldbrx   r9,0,r4
   0x0000000010000634 <+180>:   li      r10,0
   0x0000000010000638 <+184>:   cmpb    r7,r8,r9
   0x000000001000063c <+188>:   cmpb    r10,r8,r10
   0x0000000010000640 <+192>:   orc.    r10,r10,r7
   0x0000000010000644 <+196>:   bne     0x10000734 <main+436>
   0x0000000010000648 <+200>:   addi    r9,r3,8
   0x000000001000064c <+204>:   ldbrx   r8,0,r9
   0x0000000010000650 <+208>:   addi    r9,r4,8
   0x0000000010000654 <+212>:   ldbrx   r9,0,r9
   0x0000000010000658 <+216>:   cmpb    r10,r8,r10
   0x000000001000065c <+220>:   cmpb    r7,r8,r9
   0x0000000010000660 <+224>:   orc.    r10,r10,r7
   0x0000000010000664 <+228>:   bne     0x10000734 <main+436>

This means we have to handle unaligned partial reads somehow.
I am not sure we can. The code in memcheck/mc_main.c (mc_LOADVn_slow) dealing with partial reads seems to very much rely on at least word alignment.
Comment 46 Julian Seward 2018-12-02 16:04:34 UTC
(In reply to Mark Wielaard from comment #44)
> Created attachment 116593 [details]
> implement ldbrx as 64-bit load and Iop_Reverse8sIn64_x1
> 
> Real implementation of ldbrx as 64-bit load and Iop_Reverse8sIn64_x1.

Regardless of the fact that there are still misaligned ldbrxs to deal with,
I think this is good and should land.  I would ask:

* for "case 0x214: // ldbrx (Load Doubleword Byte-Reverse Indexed)"
  please add a comment to say that dis_int_ldst_rev's caller guarantees
  to ask for this instruction to be decoded only in 64-bit mode

* for the backend expression, it's fine, except that for
  // r = (r & 0x00000000FFFFFFFF) << 32 | (r & 0xFFFFFFFF00000000) >> 32;
  there's no need to do the masking since the shifts will remove all of
  the bits that just got masked out anyway.  So we can skip the _LI, the NOT
  and the two ANDs.

* oh .. and I think the whole thing (insn selection for Iop_Reverse8sIn64_x1)
  needs to be guarded by the is-this-64-bit-mode boolean (env->mode64 or
  something like that) since we don't want to be emitting this bit of code in
  32 bit mode.
Comment 47 Julian Seward 2018-12-02 16:13:24 UTC
(In reply to Mark Wielaard from comment #45)
> Unfortunately even with the ldbrx patch (which seems an OK improvement in
> itself) we still have some issues.
> 
> ldbrx is also used on unaligned addresses. In which case even
> --partial-loads-ok=yes doesn't help us.
> [..]
> This means we have to handle unaligned partial reads somehow.
> I am not sure we can. The code in memcheck/mc_main.c (mc_LOADVn_slow)
> dealing with partial reads seems to very much rely on at least word
> alignment.

That doesn't worry me so much.  I'm sure we can come up with some ppc64le-only
special-casing in mc_LOADVn_slow to deal with it.

What worries me more is the base compiler pipeline (ignoring Memcheck).  We'll
have a misaligned ldbrx going into the pipeline, which will eventually get re-emitted
as a plain, vanilla, etc, 64-bit integer load (ld).  Will that work properly for a
misaligned address?  Or will it trap (SIGBUS)?  Or will it silently zero out the
three lowest address bits before performing the load?  This is only going to work
if the re-emitted load handles the misalignment correctly.
Comment 48 Carl Love 2018-12-03 18:47:23 UTC
Mark:

I took your patch " implement ldbrx as 64-bit load and Iop_Reverse8sIn64_x1" in comment 44, attachment 116593 [details] and applied it to current mainline. 

commit f2387ed7bed64571197b44c1f8abae4d907bce1b
Author: Mark Wielaard <mark@klomp.org>
Date:   Fri Nov 30 15:23:06 2018 -0500

    Implement ppc64 ldbrx as 64-bit load and Iop_Reverse8sIn64_x1.

commit 206e81e8add574bbb38c11105005decedabc2f4a
Author: Mark Wielaard <mark@klomp.org>
Date:   Sun Dec 2 12:39:27 2018 +0100

    Fix tsan_unittest.cpp compile error with older compilers.
    
    Older compilers (g++ 4.8.5) don't like '>>':
      error: ‘>>’ should be ‘> >’ within a nested template argument list.
    Add an extra space.

Then compiled and installed valgrind.

I then ran your string program in comment 45 on P7, P8 BE, P8 LE, P9 as follows:
 
  gcc -o mark_test ../mark_test.c   
  valgrind ./mark_test

On Power 7, Power 8 BE, Power 8 LE, Power 9 LE running the test:

  valgrind ./mark_test
  ==18373== Memcheck, a memory error detector
  ==18373== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
  ==18373== Using Valgrind-3.15.0.GIT and LibVEX; rerun with -h for copyright   info
  ==18373== Command: ./mark_test
  ==18373== 
  no
  ==18373== 
  ==18373== HEAP SUMMARY:
  ==18373==     in use at exit: 0 bytes in 0 blocks
  ==18373==   total heap usage: 1 allocs, 1 frees, 5 bytes allocated
  ==18373== 
  ==18373== All heap blocks were freed -- no leaks are possible
  ==18373== 
  ==18373== For counts of detected and suppressed errors, rerun with: -v
  ==18373== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Running natively on all of these systems gives:

   ./mark_test
   no


I read through the patch; it looks fine.  I do think there might be an easier/faster way to implement the ldbrx instruction.

Your patch does the following:

+         assign( dw1, load(Ity_I64, mkexpr(EA)) );
+         assign( dw2, unop(Iop_Reverse8sIn64_x1, mkexpr(dw1)) );
+         putIReg( rD_addr, mkexpr(dw2) );

Note, load(Ity_I64, mkexpr(EA)) does an endian aware load:

  /* This generates a normal (non load-linked) load. */                           
  static IRExpr* load ( IRType ty, IRExpr* addr )                                 
  {                                                                               
     if (host_endness == VexEndnessBE)                                            
        return IRExpr_Load(Iend_BE, ty, addr);                                    
     else                                                                         
        return IRExpr_Load(Iend_LE, ty, addr);                                    
  }                                

I think if you were to call IRExpr_Load to do the load as follows:

  /* Load the value with the bytes reversed by doing a BE load on an LE machine
     and a LE load on a BE machine.  */  
  if (host_endness == VexEndnessBE)                                            
     assign( dw1, IRExpr_Load(Iend_LE, Ity_I64, mkexpr(EA)));                                    
  else                                                                         
     assign( dw1, IRExpr_Load(Iend_BE, Ity_I64, mkexpr(EA))); 
  
  putIReg( rD_addr, mkexpr(dw1) );

You wouldn't actually need to do the byte reverse with Iop_Reverse8sIn64_x1, as the value would be loaded with the bytes already reversed.  That said, support for Iop_Reverse8sIn64_x1 is nice to have around for future use.  :-)

Anyway, I did work through the logic and coding for the Iop_Reverse8sIn64_x1 and it all looks correct.
 
                     Carl Love
Comment 49 Mark Wielaard 2018-12-05 13:45:41 UTC
We did discuss this a bit more on irc (#valgrind irc.freenode.net) and there are a couple more issues that need to be resolved to make all this work correctly.

lxvd2x and lxvb16x have a similar issue where the load is split up in
the frontend into smaller loads, which makes the partial-load-ok logic
ineffective for memcheck. These also need to be rewritten to do full loads
plus IR to manipulate the data so that memcheck can work effectively.

The IRExpr_Load expression carries an IREndness field expressing the endianness of the load, which should be helpful in the frontend for expressing some of these instructions.

Unaligned (normal) load instructions do seem to be left undefined. The power backend does seem to handle unaligned vector (128-bit) loads (in VEX/priv/host_ppc_isel.c, iselVecExpr_wrk), but not unaligned non-vector loads. So given that unaligned loads are an issue, we need some solution for them in the ppc backend (if ldbrx instructions are ultimately translated into normal loads).

memcheck mc_LOADVn_slow needs to be adapted to deal with unaligned loads (on ppc only?). And I assume mc_LOADV_128_or_256_slow would need something similar for unaligned vector loads (assuming lxvd2x and lxvb16x are allowed to do unaligned loads).

There was also a question from the glibc hackers whether these vector instructions should be emitted at all, given some power errata that talked about memory/cache corruption: https://gcc.gnu.org/ml/gcc/2018-12/msg00010.html but it looks like that is not a real concern, so we should act as if gcc will emit these instructions for new inlined str[n]cmp code.
Comment 50 Mark Wielaard 2018-12-05 16:42:23 UTC
Created attachment 116688 [details]
aligned and unaligned ldbrx, lxvd2x and lxvb16x testcases

These are the assembly files generated by gcc8 with the new strcmp inlined code.

The aligned code shows a short string being compared where the instruction is split up into multiple loads (by the frontend!), one of which memcheck will think is completely past addressable memory.

The unaligned code shows the strcmp using one of these double word or vector instructions on an unaligned memory address, which also confuses memcheck.

The following is what current valgrind produces on a power9 ppc64le setup:

# for a in aligned unaligned; do for i in ldbrx lxvd2x lxvb16x; do echo $a-$i; gcc -o $a-$i $a-$i.s; valgrind -q ./$a-$i > /dev/null; done; done
aligned-ldbrx
==7890== Invalid read of size 4
==7890==    at 0x1000061C: main (in /root/v-ppc64-stcmp/aligned-ldbrx)
==7890==  Address 0x42e0044 is 0 bytes after a block of size 4 alloc'd
==7890==    at 0x408420C: malloc (vg_replace_malloc.c:299)
==7890==    by 0x100005A7: main (in /root/v-ppc64-stcmp/aligned-ldbrx)
==7890== 
aligned-lxvd2x
==7895== Invalid read of size 8
==7895==    at 0x10000628: main (in /root/v-ppc64-stcmp/aligned-lxvd2x)
==7895==  Address 0x42e0048 is 4 bytes after a block of size 4 alloc'd
==7895==    at 0x408420C: malloc (vg_replace_malloc.c:299)
==7895==    by 0x100005A7: main (in /root/v-ppc64-stcmp/aligned-lxvd2x)
==7895== 
==7895== Use of uninitialised value of size 8
==7895==    at 0x4050794: _vgnU_freeres (vg_preloaded.c:68)
==7895==    by 0x4114A03: __run_exit_handlers (in /usr/lib64/power9/libc-2.28.so)
==7895==    by 0x4114A97: exit (in /usr/lib64/power9/libc-2.28.so)
==7895==    by 0x40F417F: (below main) (in /usr/lib64/power9/libc-2.28.so)
==7895== 
==7895== Use of uninitialised value of size 8
==7895==    at 0x40505C4: ??? (in /usr/lib64/valgrind/vgpreload_core-ppc64le-linux.so)
==7895==    by 0x4114A03: __run_exit_handlers (in /usr/lib64/power9/libc-2.28.so)
==7895==    by 0x4114A97: exit (in /usr/lib64/power9/libc-2.28.so)
==7895==    by 0x40F417F: (below main) (in /usr/lib64/power9/libc-2.28.so)
==7895== 
aligned-lxvb16x
==7900== Invalid read of size 1
==7900==    at 0x10000628: main (in /root/v-ppc64-stcmp/aligned-lxvb16x)
==7900==  Address 0x42e0044 is 0 bytes after a block of size 4 alloc'd
==7900==    at 0x408420C: malloc (vg_replace_malloc.c:299)
==7900==    by 0x100005A7: main (in /root/v-ppc64-stcmp/aligned-lxvb16x)
==7900== 
unaligned-ldbrx
==7905== Invalid read of size 4
==7905==    at 0x10000630: main (in /root/v-ppc64-stcmp/unaligned-ldbrx)
==7905==  Address 0x42e0045 is 0 bytes after a block of size 5 alloc'd
==7905==    at 0x408420C: malloc (vg_replace_malloc.c:299)
==7905==    by 0x100005A7: main (in /root/v-ppc64-stcmp/unaligned-ldbrx)
==7905== 
unaligned-lxvd2x
==7910== Invalid read of size 8
==7910==    at 0x10000630: main (in /root/v-ppc64-stcmp/unaligned-lxvd2x)
==7910==  Address 0x42e0041 is 1 bytes inside a block of size 5 alloc'd
==7910==    at 0x408420C: malloc (vg_replace_malloc.c:299)
==7910==    by 0x100005A7: main (in /root/v-ppc64-stcmp/unaligned-lxvd2x)
==7910== 
unaligned-lxvb16x
==7915== Invalid read of size 1
==7915==    at 0x10000630: main (in /root/v-ppc64-stcmp/unaligned-lxvb16x)
==7915==  Address 0x42e0045 is 0 bytes after a block of size 5 alloc'd
==7915==    at 0x408420C: malloc (vg_replace_malloc.c:299)
==7915==    by 0x100005A7: main (in /root/v-ppc64-stcmp/unaligned-lxvb16x)
==7915==
Comment 51 Mark Wielaard 2018-12-07 16:22:29 UTC
(In reply to Carl Love from comment #48)
> I think if you were to call IRExpr_Load to do the load as follows:
> 
>   /* Load the value with the bytes reversed by doing a BE load on an LE
> machine
>      and a LE load on a BE machine.  */  
>   if (host_endness == VexEndnessBE)                                         
> 
>      assign( dw1, IRExpr_Load(Iend_LE, Ity_I64, mkexpr(EA)));               
> 
>   else                                                                      
> 
>      assign( dw1, IRExpr_Load(Iend_BE, Ity_I64, mkexpr(EA))); 
>   
>   putIReg( rD_addr, mkexpr(dw1) );
> 
> You wouldn't actually need to do the byte reverse with Iop_Reverse8sIn64_x1
> as the value would be loaded with the bytes reversed already.

I tried this, but I seem to hit the following in the host_ppc_isel.c backend:

 static HReg iselWordExpr_R_wrk ( ISelEnv* env, const IRExpr* e,
                                 IREndness IEndianess )

[...]
   /* --------- LOAD --------- */
   case Iex_Load: {
      HReg      r_dst;
      PPCAMode* am_addr;
      if (e->Iex.Load.end != IEndianess)
         goto irreducible;

Either I am misunderstanding how the IEndianess is passed around, or it isn't actually possible to generate a load whose endianness differs from the host's.
Comment 52 Julian Seward 2018-12-07 16:46:00 UTC
(In reply to Mark Wielaard from comment #51)

> I tried this, but seem to hit the following in the host_ppc_isel.c backend:
> 
>  static HReg iselWordExpr_R_wrk ( ISelEnv* env, const IRExpr* e,
>                                  IREndness IEndianess )
> 
> [...]
>    /* --------- LOAD --------- */
>    case Iex_Load: {
>       HReg      r_dst;
>       PPCAMode* am_addr;
>       if (e->Iex.Load.end != IEndianess)
>          goto irreducible;
> 
> Either I am misunderstanding how the IEndianess is passed around, or it
> isn't actually possible to have a different endian load from the arch
> endianness.

To clarify: there are two options here:

(1) in the front end, generate an IR load with the same endianness as the
    host, and then byte-reverse the vector in the IR.

(2) in the front end, generate an IR load with the opposite endianness to
    the host.

The effect of (1) is that the load will arrive and be handled at the
back end no problem, and then the code generated from the swap-endianness
IR will .. well .. swap the endianness of the value.

Whereas with (2), an opposite-endianness-to-the-host IR load travels 
through to the back end.  At that point your options are:

(2a) directly generate an opposite-endian load, or,
(2b) generate a same-endian load, followed by code to swap the vector around.

So: (2a) is the most direct route.  (1) and (2b) are essentially the same, 
with the difference that (1) does the rearrangement in IR whereas (2b) does
the rearrangement at the POWER instruction level (in the backend).

Personally I'd vote for (1) because (a) it's easier and (b) I am not sure
that Memcheck will work correctly when presented with opposite-from-the-host
endian loads/stores.
Comment 53 Mark Wielaard 2018-12-07 22:23:25 UTC
Created attachment 116742 [details]
Option1: Implement ldbrx as 64-bit load and Iop_Reverse8sIn64_x1

(In reply to Julian Seward from comment #52)
> Personally I'd vote for (1) because (a) it's easier and (b) I am not sure
> that Memcheck will work correctly when presented with opposite-from-the-host
> endian loads/stores.

Let's go with option 1. That is what this patch implements.

Changes from the previous version are:
1) Added a comment that there is an earlier check that ldbrx is only used in 64-bit mode.
2) Added a comment explaining what we could do if we went with option 2.
3) Added a vassert(mode64) for the usage of Iop_Reverse8sIn64_x1 in host_ppc_isel.c.
4) Simplified the last part of Iop_Reverse8sIn64_x1 by removing the unnecessary masking, as suggested by Julian.

This makes the ldbrx-aligned testcase work as expected.
Comment 54 Mark Wielaard 2018-12-07 22:37:31 UTC
To make the ldbrx-unaligned testcase work we can use something like the following:

diff --git a/memcheck/mc_main.c b/memcheck/mc_main.c
index 3ef7cb913..9a7b04c9c 100644
--- a/memcheck/mc_main.c
+++ b/memcheck/mc_main.c
@@ -1508,6 +1508,10 @@ ULong mc_LOADVn_slow ( Addr a, SizeT nBits, Bool bigendian )
 #  if defined(VGA_mips64) && defined(VGABI_N32)
    if (szB == VG_WORDSIZE * 2 && VG_IS_WORD_ALIGNED(a)
        && n_addrs_bad < VG_WORDSIZE * 2)
+#elif defined(VGA_ppc64be) || defined(VGA_ppc64le)
+   /* Unaligned loads of words on the same page are OK. */
+   if (szB == VG_WORDSIZE && ((a & 0xfff) <= 0x1000 - szB)
+       && n_addrs_bad < VG_WORDSIZE)
 #  else
    if (szB == VG_WORDSIZE && VG_IS_WORD_ALIGNED(a)
        && n_addrs_bad < VG_WORDSIZE)

The question is whether the extra check for the word not crossing a page boundary is valid.

Aaron, are there any alignment constraints on the use of the ldbrx instruction?

(this would also need new exp files for the memcheck/tests/partial_load_ok and partial_load_dflt tests that check that unaligned word loads produce a warning)
Comment 55 Aaron Sawdey 2018-12-08 00:03:37 UTC
Mark,
 No alignment constraints on ldbrx; not having to check the alignment is what makes this whole thing work. If I had to do runtime alignment checks, it would be too much code to generate inline.
Comment 56 Mark Wielaard 2018-12-08 21:40:58 UTC
Created attachment 116790 [details]
memcheck: Allow unaligned loads of words on ppc64[le]

(In reply to Aaron Sawdey from comment #55)
>  No alignment constraints on ldbrx, not having to check the alignment is
> what makes this whole thing work. If I had to do runtime alignment checks it
> would be too much code to generate inline.

Thanks. In that case I propose this patch which allows any unaligned word sized load on ppc64[le] and adjusts the partial-loads testcases.
Comment 57 Mark Wielaard 2018-12-09 11:39:54 UTC
Created attachment 116801 [details]
Implement ppc64 lxvd2x as 128-bit load with double word swap for ppc64le.

This fixes the partial-load issues seen with lxvd2x by replacing the separate 64-bit loads with one 128-bit load, with the double words swapped afterwards if necessary (on ppc64le).
Comment 58 Mark Wielaard 2018-12-09 13:30:47 UTC
Created attachment 116808 [details]
Allow unaligned loads of 128bit vectors on ppc64[le]

lxvd2x might generate an unaligned 128-bit vector load; allow those for partial loads on ppc64[le].
Comment 59 Mark Wielaard 2018-12-09 13:55:51 UTC
The last two patches get rid of the invalid loads for the power8 vsx inlined strcmp sequences, but still leave us with conditional jumps depending on undefined values.

The issue is that we will hit the following:

    10000624:   99 1e 20 7c     lxvd2x  vs33,0,r3
    10000628:   99 fe 00 7c     lxvd2x  vs32,0,r31
    1000062c:   8c 03 a0 11     vspltisw v13,0
    10000630:   00 00 40 39     li      r10,0
    10000634:   06 00 81 11     vcmpequb v12,v1,v0
    10000638:   06 68 01 10     vcmpequb v0,v1,v13
    1000063c:   57 65 00 f0     xxlorc  vs32,vs32,vs44
    10000640:   06 6c 20 10     vcmpequb. v1,v0,v13
    10000644:   78 00 98 40     bge     cr6,100006bc <main+0x13c>

The bge depends on the cr6 field, which is set from the result of the vcmpequb. on two possibly incompletely defined vectors.

This is done in set_AV_CR6 () in guest_ppc_toIR.c:

/* Set the CR6 flags following an AltiVec compare operation.
 * NOTE: This also works for VSX single-precision compares.
 * */
static void set_AV_CR6 ( IRExpr* result, Bool test_all_ones )

It is called as set_AV_CR6( mkexpr(vD), True ); after the vcmpequb. has been translated as assign( vD, binop(Iop_CmpEQ8x16, mkexpr(vA), mkexpr(vB)) );
Comment 60 Mark Wielaard 2018-12-09 22:45:40 UTC
Created attachment 116818 [details]
Implement ppc64 lxvb16x as 128-bit vector load with reversed double words

This fixes the partial-load issues seen with lxvb16x by replacing the small loads with one 128-bit load and using Iop_Reverse8sIn64_x1 to reverse the bytes in the double words; on ppc64le the double words are swapped afterwards.

Tested on power9 ppc64le.

Not tested on big-endian ppc64 since I couldn't find a power9 big-endian ppc64 install.
Comment 61 Mark Wielaard 2018-12-09 22:55:24 UTC
The lxvb16x patch does resolve the invalid-load issues, but does not completely resolve the "conditional jump depends on undefined value" issues. This happens again when the CR6 condition is used to branch:

   0x000000001000062c <+172>:   lxvb16x vs33,0,r3
   0x0000000010000630 <+176>:   lxvb16x vs45,0,r4
   0x0000000010000634 <+180>:   vcmpnezb. v0,v1,v13
=> 0x0000000010000638 <+184>:   bne     cr6,0x10000690 <main+272>
Comment 62 Julian Seward 2018-12-10 11:12:54 UTC
Created attachment 116827 [details]
Possible fix for the comment 61 problems

This patch:

* changes set_AV_CR6 so that it does scalar comparisons against zero,
  rather than sometimes against an all-ones word.  This is something
  that Memcheck can instrument exactly.

* in Memcheck, requests expensive instrumentation of Iop_Cmp{EQ,NE}64
  by default on ppc64le.

Note, this is all untested!  In particular it would be good to verify
that the CR6 results are unchanged.
Comment 63 Mark Wielaard 2018-12-10 22:26:41 UTC
(In reply to Julian Seward from comment #62)
> Created attachment 116827 [details]
> Possible fix for the comment 61 problems
> 
> This patch:
> 
> * changes set_AV_CR6 so that it does scalar comparisons against zero,
>   rather than sometimes against an all-ones word.  This is something
>   that Memcheck can instrument exactly.
> 
> * in Memcheck, requests expensive instrumentation of Iop_Cmp{EQ,NE}64
>   by default on ppc64le.
> 
> Note, this is all untested!  In particular it would be good to verify
> that the CR6 results are unchanged.

That definitely fixes the issue. I tested on both power9 and power8 with gcc and glibc built with the new str[n]cmp inline patch plus the following fixes from this bug:

- Implement ppc64 ldbrx as 64-bit load and Iop_Reverse8sIn64_x1.
- memcheck: Allow unaligned loads of words on ppc64[le].
- Implement ppc64 lxvd2x as 128-bit load with double word swap for ppc64le.
- memcheck: Allow unaligned loads of 128bit vectors on ppc64[le].
- Implement ppc64 lxvb16x as 128-bit vector load with reversed double words.

And it seems to resolve all gcc strcmp inline issues I was seeing before.
Both setups were ppc64le. I haven't yet tested on ppc64 or ppc32.

I do see bug #401827 and bug #401828 but they are unrelated.

Also memcheck/tests/undef_malloc_args (stderr) fails. That one does seem to be caused by earlier patches from this bug.

And occasionally I see Use of uninitialised value of size 8 in _vgnU_freeres (vg_preloaded.c:68) which is mysterious. It goes away with --run-libc-freeres=no --run-cxx-freeres=no. It seems unrelated to this issue. But I haven't found out yet what is causing it. It is as if sometimes we are unable to see that the weak symbol _ZN9__gnu_cxx9__freeresEv (__gnu_cxx::__freeres) is initialized. Very odd, since it normally works except in some cases. But it is very likely unrelated to this bug.
Comment 64 Julian Seward 2018-12-11 19:04:30 UTC
Created attachment 116861 [details]
Possible fix for the memcheck/tests/undef_malloc_args failure

(In reply to Mark from comment #63)

> Also memcheck/tests/undef_malloc_args (stderr) fails. That one does seem
> caused by earlier patches from this bug.

This isn't a great fix, but it does now cause the output for the test to be
the same on ppc64le-linux and amd64-linux.  So it might be good enough, if it
produces consistent results on all other targets too.
Comment 65 Mark Wielaard 2018-12-12 13:06:10 UTC
(In reply to Julian Seward from comment #64)
> Created attachment 116861 [details]
> Possible fix for the memcheck/tests/undef_malloc_args failure
> 
> This isn't a great fix, but it does now cause the output for the test to be
> the same on ppc64le-linux and amd64-linux.  So it might be good enough, if it
> produces consistent results on all other targets too.

It seems to also work on ppc64-linux, ppc32-linux and x86-linux.
Comment 66 Mark Wielaard 2018-12-12 13:35:43 UTC
(In reply to Mark Wielaard from comment #63)
> And occasionally I see Use of uninitialised value of size 8 in _vgnU_freeres
> (vg_preloaded.c:68) which is mysterious. It goes away with
> --run-libc-freeres=no --run-cxx-freeres=no. It seems unrelated to this
> issue. But I haven't found out yet what is causing it. It is as if sometimes
> we are unable to see that the weak symbol _ZN9__gnu_cxx9__freeresEv
> (__gnu_cxx::__freeres) is initialized. Very odd, since it normally works
> except in some cases. But it is very likely unrelated to this bug.

This was bug #402006 which has been fixed now as commit be7a73004583aab5d4c97cf55276ca58d5b3090b "Mark helper regs defined in final_tidyup before freeres_wrapper call."

Thanks to Julian, who spotted the issue almost immediately.
Comment 67 Mark Wielaard 2018-12-13 22:08:36 UTC
It looks like this issue is solved now with some of the still-uncommitted patches attached to this bug. To get a bit more testing I backported the following commits to the Fedora 29 package:

commit 7f1dd9d5a: valgrind-3.14.0-get_otrack_shadow_offset_wrk-ppc.patch
commit 427198981: valgrind-3.14.0-new-strlen-IROps.patch
commit 97d336b79: valgrind-3.14.0-ppc-instr-new-IROps.patch
commit e221eca26: valgrind-3.14.0-memcheck-new-IROps.patch
commit 81d983222: valgrind-3.14.0-ppc-frontend-new-IROps.patch
commit cb5d7e047: valgrind-3.14.0-transform-popcount64-ctznat64.patch
commit 27fe22378: valgrind-3.14.0-enable-ppc-Iop_Sar_Shr8.patch

PR402006: valgrind-3.14.0-final_tidyup.patch

Plus the following uncommitted proposed patches from this bug:

attachment 116742 [details]: valgrind-3.14.0-ppc64-ldbrx.patch
attachment 116790 [details]: valgrind-3.14.0-ppc64-unaligned-words.patch
attachment 116801 [details]: valgrind-3.14.0-ppc64-lxvd2x.patch
attachment 116808 [details]: valgrind-3.14.0-ppc64-unaligned-vecs.patch
attachment 116818 [details]: valgrind-3.14.0-ppc64-lxvb16x.patch
attachment 116827 [details]: valgrind-3.14.0-set_AV_CR6.patch
attachment 116861 [details]: valgrind-3.14.0-undef_malloc_args.patch

Fedora packages: https://koji.fedoraproject.org/koji/buildinfo?buildID=1171900
Copr packages: https://copr.fedorainfracloud.org/coprs/mjw/valgrind-3.14.0/

Unfortunately neither (build) system has a new enough gcc compiler, and the test results don't look great. But maybe they help some people with testing.

Any opinions on which of the 7 uncommitted patches should get in or need more testing/work?
Comment 68 Mark Wielaard 2018-12-13 22:10:57 UTC
(In reply to Mark Wielaard from comment #67)
> It looks like this issue is solved now with some of the uncommitted patches
> to this bug. To get a bit more testing I backported the following commits to
> the fedora 29 package:
> 
> commit 7f1dd9d5a: valgrind-3.14.0-get_otrack_shadow_offset_wrk-ppc.patch
> commit 427198981: valgrind-3.14.0-new-strlen-IROps.patch
> commit 97d336b79: valgrind-3.14.0-ppc-instr-new-IROps.patch
> commit e221eca26: valgrind-3.14.0-memcheck-new-IROps.patch
> commit 81d983222: valgrind-3.14.0-ppc-frontend-new-IROps.patch
> commit cb5d7e047: valgrind-3.14.0-transform-popcount64-ctznat64.patch
> commit 27fe22378: valgrind-3.14.0-enable-ppc-Iop_Sar_Shr8.patch
> 
> PR402006: valgrind-3.14.0-final_tidyup.patch
> 
> Plus the following uncommitted proposed patches from this bug:
> 
> attachment 116742 [details]: valgrind-3.14.0-ppc64-ldbrx.patch
> attachment 116790 [details]: valgrind-3.14.0-ppc64-unaligned-words.patch
> attachment 116801 [details]: valgrind-3.14.0-ppc64-lxvd2x.patch
> attachment 116808 [details]: valgrind-3.14.0-ppc64-unaligned-vecs.patch
> attachment 116818 [details]: valgrind-3.14.0-ppc64-lxvb16x.patch
> attachment 116827 [details]: valgrind-3.14.0-set_AV_CR6.patch
> attachment 116861 [details]: valgrind-3.14.0-undef_malloc_args.patch

Here are the actual patches:
https://src.fedoraproject.org/rpms/valgrind/tree/master
Comment 69 Mark Wielaard 2018-12-20 21:55:31 UTC
I believe this bug is finally resolved with the push of the following 7 commits to master:

commit 3af8e12b0d49dc87cd26258131ebd60c9b587c74
Author: Julian Seward <jseward@acm.org>
Date:   Wed Dec 12 13:55:01 2018 +0100

    Fix memcheck/tests/undef_malloc_args failure.
    
    Try harder to trigger a memcheck error if a value is (partially) undefined.

commit 01f1936b1238a3bbeaf2821b61ecb36ba0ae15b1
Author: Julian Seward <jseward@acm.org>
Date:   Mon Dec 10 17:18:20 2018 +0100

    Adjust ppc set_AV_CR6 computation to help Memcheck instrumentation.
    
    * changes set_AV_CR6 so that it does scalar comparisons against zero,
      rather than sometimes against an all-ones word.  This is something
      that Memcheck can instrument exactly.
    
    * in Memcheck, requests expensive instrumentation of Iop_Cmp{EQ,NE}64
      by default on ppc64le.
    
    https://bugs.kde.org/show_bug.cgi?id=386945#c62

commit 3ef4b2c780ea76a80ae5beaa63c1cb1d6530988b
Author: Mark Wielaard <mark@klomp.org>
Date:   Sun Dec 9 23:25:05 2018 +0100

    Implement ppc64 lxvb16x as 128-bit vector load with reversed double words.
    
    This makes it possible for memcheck to know which part of the 128bit
    vector is defined, even if the load is partly beyond an addressable block.
    
    Partially resolves bug 386945.

commit 8d12697b157bb50ad98467b565b3a7a39097bf31
Author: Mark Wielaard <mark@klomp.org>
Date:   Sun Dec 9 14:26:39 2018 +0100

    memcheck: Allow unaligned loads of 128bit vectors on ppc64[le].
    
    On powerpc partial unaligned loads of vectors from partially invalid
    addresses are OK and could be generated by our translation of lxvd2x.
    
    Adjust partial_load memcheck tests to allow partial loads of 16 byte
    vectors on powerpc64.
    
    Part of resolving bug #386945.

commit 98a73de1c0c83918a63e736b62d428bc2f98c943
Author: Mark Wielaard <mark@klomp.org>
Date:   Sun Dec 9 00:55:42 2018 +0100

    Implement ppc64 lxvd2x as 128-bit load with double word swap for ppc64le.
    
    This makes it possible for memcheck to know which part of the 128bit
    vector is defined, even if the load is partly beyond an addressable block.
    
    Partially resolves bug 386945.

commit 5ecdecdcd3efecadf149dcd6276140e833b7f8e6
Author: Mark Wielaard <mark@klomp.org>
Date:   Sat Dec 8 13:47:43 2018 -0500

    memcheck: Allow unaligned loads of words on ppc64[le].
    
    On powerpc partial unaligned loads of words from partially invalid
    addresses are OK and could be generated by our translation of ldbrx.
    
    Adjust partial_load memcheck tests to allow partial loads of words
    on powerpc64.
    
    Part of resolving bug #386945.

commit 0ed17bc9f675c59cf1e93caa4290e0b21b84c0a0
Author: Mark Wielaard <mark@klomp.org>
Date:   Fri Dec 7 10:42:22 2018 -0500

    Implement ppc64 ldbrx as 64-bit load and Iop_Reverse8sIn64_x1.
    
    This makes it possible for memcheck to analyse the new gcc strcmp
    inlined code correctly even if the ldbrx load is partly beyond an
    addressable block.
    
    Partially resolves bug 386945.

Thanks to everybody for testing and feedback.

Note that you still need the latest gcc-8-branch (and to rebuild at least glibc with it) on ppc64[le] to have a setup that valgrind memcheck fully understands.