Bug 385409 - s390x: z13 vector integer instructions not implemented
Summary: s390x: z13 vector integer instructions not implemented
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: vex (show other bugs)
Version: unspecified
Platform: Other Linux
: NOR normal
Target Milestone: ---
Assignee: Julian Seward
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-10-05 17:54 UTC by Andreas Arnez
Modified: 2018-09-26 17:34 UTC (History)
5 users (show)

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:


Attachments
Vector integer & string z13 support for s390 z13 (758.65 KB, patch)
2018-02-04 21:17 UTC, Vadim Barkov
Details
Vector integer & string z13 support for s390 z13 (761.34 KB, patch)
2018-07-25 18:37 UTC, Vadim Barkov
Details
More tests for VTM instruction (15.93 KB, patch)
2018-07-26 13:45 UTC, Vadim Barkov
Details
Test case diff with vtm failure (1018 bytes, patch)
2018-07-26 14:53 UTC, Andreas Arnez
Details
Fix vector register number calculation (777 bytes, patch)
2018-07-26 18:26 UTC, Andreas Arnez
Details
Vector & string z13 implementation (tests) (553.93 KB, patch)
2018-07-29 23:56 UTC, Vadim Barkov
Details
Vector & string z13 implementation (code) (172.29 KB, patch)
2018-07-29 23:57 UTC, Vadim Barkov
Details
Fix strict aliasing violation (1.92 KB, patch)
2018-07-31 13:35 UTC, Andreas Arnez
Details
Use guarded loads in the guest (2.13 KB, patch)
2018-08-05 13:21 UTC, Florian Weimer
Details
Implement early exit in s390_vr_loadWithLength (2.33 KB, patch)
2018-08-05 17:18 UTC, Florian Weimer
Details
New VLL implementation (4.07 KB, patch)
2018-08-06 13:37 UTC, Vadim Barkov
Details
simple z13 executable with strlen inlined (10.49 KB, application/octet-stream)
2018-08-06 16:19 UTC, Mark Wielaard
Details
New VFENE Implementation (9.51 KB, patch)
2018-08-07 17:01 UTC, Vadim Barkov
Details
Fix vector register move insns (1.46 KB, patch)
2018-08-16 11:07 UTC, Andreas Arnez
Details
Refactoring (1.39 MB, patch)
2018-08-18 22:35 UTC, Vadim Barkov
Details
Refactoring (45.70 KB, application/mbox)
2018-08-18 23:10 UTC, Vadim Barkov
Details
Implement VLL with aligned loads (4.78 KB, patch)
2018-09-13 13:20 UTC, Andreas Arnez
Details
Implement VLL/VLBB with aligned loads (6.43 KB, patch)
2018-09-14 18:19 UTC, Andreas Arnez
Details
s390x: Vector integer and string instruction support (192.46 KB, patch)
2018-09-24 18:41 UTC, Andreas Arnez
Details
s390x: Vector string and insn support -- tests (556.25 KB, patch)
2018-09-24 18:43 UTC, Andreas Arnez
Details
Fixes based on Julian's comment #71 (45.94 KB, patch)
2018-09-25 17:29 UTC, Andreas Arnez
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Andreas Arnez 2017-10-05 17:54:45 UTC
Valgrind currently lacks support for the z/Architecture vector integer instructions introduced with z13.  These are documented in the z/Architecture Principles of Operation, Eleventh Edition (March, 2015), chapter 22: "Vector Integer Instructions".
Comment 1 Vadim Barkov 2018-02-04 21:17:33 UTC
Created attachment 110339 [details]
Vector integer & string z13 support for s390 z13

Contains the implementation of all vector integer (and string) instructions for z13.
Comment 2 Christian Borntraeger 2018-07-12 13:59:00 UTC
I have not checked if these patches still apply, but for the general patches consider these patches acked from my s390x perspective.
Comment 3 Vadim Barkov 2018-07-21 13:55:53 UTC
(In reply to Christian Borntraeger from comment #2)
> I have not checked if these patches still apply, but for the general patches
> consider these patches acked from my s390x perspective.

I've ensured that this patched can be applied on master.
Comment 4 Vadim Barkov 2018-07-25 18:37:13 UTC
Created attachment 114122 [details]
Vector integer & string z13 support for s390 z13

Changes:

Now this patch can be applied on master (on d44563c49e55f47016e23591f708c7aa57f7a098 or later commit)
Comment 5 Andreas Arnez 2018-07-26 11:25:32 UTC
(In reply to Vadim Barkov from comment #4)
> Now this patch can be applied on master (on
> d44563c49e55f47016e23591f708c7aa57f7a098 or later commit)

Good!  Tried the patch, and it looks good, with one exception: the test case output for the VTM instruction in vector_integer.stdout.out shows a diff on my system:

-  r_result = 0000000000000001
+  r_result = 0000000000000003

This happens for all four of the vtm iterations, exactly with the same diff.  Haven't investigated this any further yet.
Comment 6 Vadim Barkov 2018-07-26 13:45:48 UTC
Created attachment 114145 [details]
More tests for VTM instruction

Changes:

Added tests for all possible result of VTM instruction.
Comment 7 Vadim Barkov 2018-07-26 13:53:45 UTC
Andreas,

I can't reproduce you test results so I added tests for every possible result of VTM instruction. Could you apply patch from attachment 114145 [details] and write the results to me?

I need more information to reproduce the bug. Tell me please:
 - CPU version (z13 or z14?)
 - GNU/Linux distributive that you use (Fedora, OpenSuse, etc)
 - compiler version ($ gcc --version)
Comment 8 Andreas Arnez 2018-07-26 14:52:19 UTC
(In reply to Vadim Barkov from comment #7)
> Andreas,
> 
> I can't reproduce you test results so I added tests for every possible
> result of VTM instruction. Could you apply patch from attachment 114145 [details]
> [details] and write the results to me?
Sure.  The diff looks similar.  I'll attach it.

> I need more information to reproduce the bug. Tell me please:
>  - CPU version (z13 or z14?)
z14
>  - GNU/Linux distributive that you use (Fedora, OpenSuse, etc)
Fedora 28
>  - compiler version ($ gcc --version)
gcc 8.0.1
Comment 9 Andreas Arnez 2018-07-26 14:53:46 UTC
Created attachment 114150 [details]
Test case diff with vtm failure
Comment 10 Andreas Arnez 2018-07-26 18:26:52 UTC
Created attachment 114152 [details]
Fix vector register number calculation

Independent from the vtm issue above, I noticed another problem with the current vector register logic.  Whenever decoding an instruction that references one of the upper 16 vector registers, the register number may be calculated incorrectly.  This is fixed.
Comment 11 Julian Seward 2018-07-27 15:03:02 UTC
Vadim, Andreas, please can you split this into two patches:
* one with all the tests
* one with the final bug-fixed implementation
so it's easier to review.
Comment 12 Vadim Barkov 2018-07-27 22:59:29 UTC
Andreas, I've reproduced the bug.

This occurs only when compiling valgrind with gcc-8 (I've used "gcc (SUSE Linux) 8.1.1 20180719") and with default optimisation flags (I guess it is "-O2").
I've compiled with same compiler and "-O0" and the problem dissapeared. 
When I've used gcc-7 compiler optimisation flag doesn't matter (all tests pass).

So I guess there is new wrong optimisation bug in gcc-8. I'll try to resolve somehow this problem in code.
Comment 13 Vadim Barkov 2018-07-27 23:47:48 UTC
I've made the union in s390x_dirtyhelper_vec_op volatile and the bug magically dissapeared.

Fixed.

Now I'll rearrange all stuff into two patches (tests & code) and submit here.
Comment 14 Julian Seward 2018-07-28 08:13:56 UTC
(In reply to Vadim Barkov from comment #13)
> I've made the union in s390x_dirtyhelper_vec_op volatile and the bug
> magically dissapeared.
> 
> Fixed.

No.  Just hidden.

More likely is that there's some violation of C "defined behaviour"
that the compiler is exploiting whilst optimising.  Or some type
error.  I would prefer that you investigated this more and found the
root cause of the problem.
Comment 15 Vadim Barkov 2018-07-29 23:56:41 UTC
Created attachment 114206 [details]
Vector & string z13 implementation (tests)
Comment 16 Vadim Barkov 2018-07-29 23:57:26 UTC
Created attachment 114207 [details]
Vector & string z13 implementation (code)
Comment 17 Vadim Barkov 2018-07-30 00:00:35 UTC
(In reply to Julian Seward from comment #14)

Okay, I've refixed that issue without volatile (replaced UChar[6] array with ULong in union).

Please, check the final version out. It includes the Andreas' patch and some tests for it.
Comment 18 Andreas Arnez 2018-07-31 11:55:11 UTC
(In reply to Julian Seward from comment #14)
> More likely is that there's some violation of C "defined behaviour"
> that the compiler is exploiting whilst optimising.  Or some type
> error.  I would prefer that you investigated this more and found the
> root cause of the problem.
Right.  The main problem seems to be caused by the inline assembly, where only the first byte of the_insn was declared as input:
  [insn] "m" (the_insn.bytes[0])
The whole insn is of course needed at this time.  So this should be changed to
  [insn] "m" (the_insn)
Another suspicous construct is the type cast
  (const s390x_vec_op_details_t*) &details
where 'details' is a ULong.  I believe this violates the strict aliasing rule.
Comment 19 Ivo Raisr 2018-07-31 12:02:00 UTC
(In reply to Andreas Arnez from comment #18)

> Another suspicous construct is the type cast
>   (const s390x_vec_op_details_t*) &details
> where 'details' is a ULong.  I believe this violates the strict aliasing rule.

Indeed. In short words strict aliasing says you are not allowed to view the same piece of memory through lenses of different types *other* than 'char *'.
I debugged once a nasty problem in SPEC2017 CPU due to a strict aliasing violation combined with LTO.

The question is also whether strict aliasing is enabled for this compilation unit...
Comment 20 Julian Seward 2018-07-31 12:14:06 UTC
(In reply to Ivo Raisr from comment #19)
> (In reply to Andreas Arnez from comment #18)
> 
> > Another suspicous construct is the type cast
> >   (const s390x_vec_op_details_t*) &details
> > where 'details' is a ULong.  I believe this violates the strict aliasing rule.

Yes.  I would much prefer not to have such violations in tree.

> The question is also whether strict aliasing is enabled for this compilation
> unit...

To tell the truth, we have built with -fno-strict-aliasing for about a
decade now :-/  But I still would prefer not to have such casting.
Comment 21 Andreas Arnez 2018-07-31 12:37:00 UTC
(In reply to Julian Seward from comment #20)
> To tell the truth, we have built with -fno-strict-aliasing for about a
> decade now :-/  But I still would prefer not to have such casting.
Hm, for guest_s390_helpers.c I get a GCC command line that contains
  ... -fno-strict-aliasing -fno-builtin  -fomit-frame-pointer  -Wbad-function-cast -fstrict-aliasing ...
Which means that strict aliasing is disabled and then re-enabled.  Not sure if that's intentional, though.
Comment 22 Andreas Arnez 2018-07-31 13:35:50 UTC
Created attachment 114230 [details]
Fix strict aliasing violation

This fix applies to Vadim's new version.  It also undoes the change from UChar[6] to ULong, because that change was unnecessary and I prefer the original approach.  (A 6-byte character array seems more appropriate for representing an s390x instruction, and the original approach avoids pointer conversions.)
Comment 23 Andreas Arnez 2018-08-03 12:23:13 UTC
Julian, is this OK now?
Comment 24 Andreas Arnez 2018-08-03 15:51:25 UTC
In the meantime I did some more testing, and one thing I found is that VFENE also marks bytes in the input vectors beyond the zero-terminator as read.  This leads to lots of false positives with real workloads.  There might be similar issues with the other vector string instructions.
Comment 25 Florian Weimer 2018-08-05 11:09:56 UTC
I applied the patches in attachment 114207 [details] and attachment 114230 [details] on top of the master branch (commit b9cfb2d15413d16f330878938af3d6fa1617f8b4).  I get a crash in the inline expansion of strlen in ld.so of a z13-enabled glibc build, around here:

==18364== Invalid read of size 1
==18364==    at 0x401892E: _dl_sysdep_start (dl-sysdep.c:236)                                                
==18364==    by 0x40188A5: frob_brk (dl-sysdep.c:36)
==18364==    by 0x40188A5: _dl_sysdep_start (dl-sysdep.c:227)                                                
==18364==  Address 0x1fff001000 is not stack'd, malloc'd or (recently) free'd                                

   0x0000000004018914 <+860>:   lghi    %r3,0
   0x0000000004018918 <+864>:   risbg   %r1,%r2,60,191,0                                                     
   0x000000000401891e <+870>:   je      0x401897c <_dl_sysdep_start+964>                                     
   0x0000000004018922 <+874>:   lghi    %r4,15
   0x0000000004018926 <+878>:   aghi    %r3,16
   0x000000000401892a <+882>:   sgr     %r4,%r1
=> 0x000000000401892e <+886>:   vll     %v0,%r4,0(%r2)
   0x0000000004018934 <+892>:   vfenezbs        %v0,%v0,%v0                                                  
   0x000000000401893a <+898>:   je      0x4018952 <_dl_sysdep_start+922>                                     
   0x000000000401893e <+902>:   vl      %v0,0(%r3,%r2)
   0x0000000004018944 <+908>:   aghi    %r3,16
   0x0000000004018948 <+912>:   vfenezbs        %v0,%v0,%v0                                                  
   0x000000000401894e <+918>:   jne     0x401893e <_dl_sysdep_start+902>                                     
   0x0000000004018952 <+922>:   vlgvb   %r1,%v0,7
   0x0000000004018958 <+928>:   llgcr   %r1,%r1
   0x000000000401895c <+932>:   cgr     %r1,%r4

(gdb) print/x $r2
$1 = 0x1fff000ffb
(gdb) print (char *)$r2
$17 = 0x1fff000ffb "z13"
(gdb) print _rtld_local_ro._dl_platform
$18 = 0x1fff000ffb "z13"
(gdb) print _rtld_local_ro._dl_platform == $r2
$19 = 1

So strlen is called with the correct value of GLRO(dl_platform), it seems, but the emulation does not like it.

Are there additional patches I need?
Comment 26 Florian Weimer 2018-08-05 11:30:33 UTC
The SIGSEGV happens here in the generated code:

(gdb) disas 0x00000010038c9230,+20                                                                           
Dump of assembler code from 0x10038c9230 to 0x10038c9244:
   0x00000010038c9230:  lghi    %r5,8
   0x00000010038c9234:  lg      %r3,2792(%r13)
=> 0x00000010038c923a:  ic      %r4,0(%r3)
   0x00000010038c923e:  lg      %r3,2520(%r13)
End of assembler dump.
(gdb) print/x $r3
$1 = 0x1fff001000
(gdb) 

Looks like the load limit for VLL does not work.  Consider this data from the valgrind emulation:

(gdb) print $r4
$22 = 4
(gdb) print/x $r2 + $r4
$24 = 0x1fff000fff

So the highest-indexed byte is still on the same page, but it is the last byte on the page.
Comment 27 Florian Weimer 2018-08-05 11:50:42 UTC
Hopefully someone can see what is wrong with the generated IR:

vll      %v0,%r4,0(%r2)
              ------ IMark(0x401892E, 6, 0) ------
              t6 = Add64(0x0:I64,GET:I64(592))
              t7 = GET:I32(612)
              PUT(64) = SetElem8x16(GET:V128(64),0x0:I8,ITE(CmpLE32U(0x0:I32,t7),LDbe:I8(Add64(t6,0x0:I64)),0x0:I8))
              PUT(64) = SetElem8x16(GET:V128(64),0x1:I8,ITE(CmpLE32U(0x1:I32,t7),LDbe:I8(Add64(t6,0x1:I64)),0x0:I8))
              PUT(64) = SetElem8x16(GET:V128(64),0x2:I8,ITE(CmpLE32U(0x2:I32,t7),LDbe:I8(Add64(t6,0x2:I64)),0x0:I8))
              PUT(64) = SetElem8x16(GET:V128(64),0x3:I8,ITE(CmpLE32U(0x3:I32,t7),LDbe:I8(Add64(t6,0x3:I64)),0x0:I8))
              PUT(64) = SetElem8x16(GET:V128(64),0x4:I8,ITE(CmpLE32U(0x4:I32,t7),LDbe:I8(Add64(t6,0x4:I64)),0x0:I8))
              PUT(64) = SetElem8x16(GET:V128(64),0x5:I8,ITE(CmpLE32U(0x5:I32,t7),LDbe:I8(Add64(t6,0x5:I64)),0x0:I8))
              PUT(64) = SetElem8x16(GET:V128(64),0x6:I8,ITE(CmpLE32U(0x6:I32,t7),LDbe:I8(Add64(t6,0x6:I64)),0x0:I8))
              PUT(64) = SetElem8x16(GET:V128(64),0x7:I8,ITE(CmpLE32U(0x7:I32,t7),LDbe:I8(Add64(t6,0x7:I64)),0x0:I8))
              PUT(64) = SetElem8x16(GET:V128(64),0x8:I8,ITE(CmpLE32U(0x8:I32,t7),LDbe:I8(Add64(t6,0x8:I64)),0x0:I8))
              PUT(64) = SetElem8x16(GET:V128(64),0x9:I8,ITE(CmpLE32U(0x9:I32,t7),LDbe:I8(Add64(t6,0x9:I64)),0x0:I8))
              PUT(64) = SetElem8x16(GET:V128(64),0xA:I8,ITE(CmpLE32U(0xA:I32,t7),LDbe:I8(Add64(t6,0xA:I64)),0x0:I8))
              PUT(64) = SetElem8x16(GET:V128(64),0xB:I8,ITE(CmpLE32U(0xB:I32,t7),LDbe:I8(Add64(t6,0xB:I64)),0x0:I8))
              PUT(64) = SetElem8x16(GET:V128(64),0xC:I8,ITE(CmpLE32U(0xC:I32,t7),LDbe:I8(Add64(t6,0xC:I64)),0x0:I8))
              PUT(64) = SetElem8x16(GET:V128(64),0xD:I8,ITE(CmpLE32U(0xD:I32,t7),LDbe:I8(Add64(t6,0xD:I64)),0x0:I8))
              PUT(64) = SetElem8x16(GET:V128(64),0xE:I8,ITE(CmpLE32U(0xE:I32,t7),LDbe:I8(Add64(t6,0xE:I64)),0x0:I8))
              PUT(64) = SetElem8x16(GET:V128(64),0xF:I8,ITE(CmpLE32U(0xF:I32,t7),LDbe:I8(Add64(t6,0xF:I64)),0x0:I8))
              PUT(720) = 0x4018934:I64
Comment 28 Florian Weimer 2018-08-05 12:05:10 UTC
Found it:

      /* A ternary if-then-else operator.  It returns iftrue if cond is                                      
         nonzero, iffalse otherwise.  Note that it is STRICT, ie. both                                       
         iftrue and iffalse are evaluated in all cases.                                                      
                                                                                                             
         ppIRExpr output: ITE(<cond>,<iftrue>,<iffalse>),                                                    
                         eg. ITE(t6,t7,t8)                                                                   
      */

So we need to emit a guard before the loads that go beyond the load limit.
Comment 29 Florian Weimer 2018-08-05 13:21:58 UTC
Created attachment 114302 [details]
Use guarded loads in the guest

This patch changes the guest code to use guarded loads.

However, this insufficient because the isel implementation does not handle Ist_LoadG.  I don't know how to fix this.  s390_insn_cond_move does not perform the U8 → U32 zero extension, so it is not a good fit here (and even Ist_LoadG  with ILGop_8Uto32 is a bit off in this context, we would need ILGop_Ident8).
Comment 30 Florian Weimer 2018-08-05 17:18:12 UTC
Created attachment 114305 [details]
Implement early exit in s390_vr_loadWithLength

Updated patch.  This seems to fix the VLL issue.  Limited testing only.

However, these VLL instructions are typically part of the GCC inline expansion of strlen, and the result is used with VFENEZBS.  Since VLL is used to over-load bytes until the next boundary, the vector result is partially undefined.  VFENEZBS is implemented with a host assist only, though, so we get tons of warnings from inline copies of strlen:

==52238== Conditional jump or move depends on uninitialised value(s)
==52238==    at 0x401B202: strdup (strdup.c:41)
==52238==    by 0x40090AF: _dl_map_object (dl-load.c:2173)

As a result, running just /bin/true on a z13-compiled distribution results in lots of errors:

==52238== ERROR SUMMARY: 7969 errors from 450 contexts (suppressed: 0 from 0)
Comment 31 Julian Seward 2018-08-06 07:14:38 UTC
(In reply to Florian Weimer from comment #29)
> Created attachment 114302 [details]
> Use guarded loads in the guest
> 
> This patch changes the guest code to use guarded loads.
> 
> However, this insufficient because the isel implementation does not handle
> Ist_LoadG.  I don't know how to fix this.

Look at how the amd64 back end does it.  It basically is going to be a
completely normal unconditional load that has a conditional branch 
in front of it.

> s390_insn_cond_move does not perform the U8 → U32 zero extension, so it is
> not a good fit here (and even Ist_LoadG  with ILGop_8Uto32 is a bit off in
> this context, we would need ILGop_Ident8).

This doesn't matter very much.  Presumably s390 has an instruction that
loads 8 bits from memory and zero extends the loaded value out to 64 bits.
Then you can use the ILGop_8Uto64 descriptor in your conditional load.  The
narrowing operations that you will have to use afterwards (64to8, or 32to8)
become no-ops in the generated code, so there's no performance disadvantage
to using an 8bits-to-64bits-and-back-to-8bits scheme.
Comment 32 Julian Seward 2018-08-06 07:17:52 UTC
(In reply to Florian Weimer from comment #30)
> Created attachment 114305 [details]
> Implement early exit in s390_vr_loadWithLength
> 
> Updated patch.  This seems to fix the VLL issue.  Limited testing only.

It would be better to implement guarded loads and use that scheme, so that
we don't have to end the superblock every time we see VLL.

> However, these VLL instructions are typically part of the GCC inline
> expansion of strlen, and the result is used with VFENEZBS.  Since VLL is
> used to over-load bytes until the next boundary, the vector result is
> partially undefined.  VFENEZBS is implemented with a host assist only,
> though, so we get tons of warnings from inline copies of strlen:

What does VFENEZBS do?  I imagine this will need to be fixed by making
a translation of it that is more Memcheck-friendly, and/or adjusting
Memcheck to deal with the underlying IROps better.
Comment 33 Julian Seward 2018-08-06 07:19:21 UTC
(In reply to Andreas Arnez from comment #23)
> Julian, is this OK now?

Considering comment 30, alas, no.  Let's try and get the Memcheck problems
with this fixed first.
Comment 34 Vadim Barkov 2018-08-06 08:17:53 UTC
(In reply to Julian Seward from comment #32)
> What does VFENEZBS do? 

VFENE V1, V2, V3 -- vector find element not equal
It elementwise compares V2 and V3. If an unequal element is found, its index is stored in byte 7th (counting from 0th byte) of V1. If not, the value 16 is stored in 7th byte (counting from 0th byte) of V1. All other bytes of V1 are zeroed.

VFENEZBS:
B -- element size is 1 byte
Z -- also compares elements of V2 with zero
S -- set condition code with result of comparation

So,
"VFENEZBS %%v0, %%v0, %%v0" just searchs the first zero byte in %%v0, stores its  index (or 16 if not found) and indicates if it was found in Conditional Code (CC).
Comment 35 Vadim Barkov 2018-08-06 10:53:09 UTC
(In reply to Florian Weimer from comment #30)

Are you sure that VLL is the reason of the bug with strlen? If so, I believe that the conditional load implementation is the most natural. I'll add support for conditional load (and store) in the host part of s390x VEX. It shouldn't be hard (I've already implemented conditional dirty helpers in "vector basic" patch)
Comment 36 Julian Seward 2018-08-06 10:56:14 UTC
(In reply to Vadim Barkov from comment #35)
> (In reply to Florian Weimer from comment #30)
> 
> Are you sure that VLL is the reason of the bug with strlen? 

Exactly which bug are you referring to?  If you mean the problem that
Florian describes in comment 25 to comment 28, then I agree with his
analysis -- I think it is correct.
Comment 37 Vadim Barkov 2018-08-06 13:37:33 UTC
Created attachment 114327 [details]
New VLL implementation

Based on Florian's patch with guarded loads (attachment 114302 [details])

I've just added host part to his code and replaced Iop_SetElem8x16 to direct  write.

This patch passes tests but additional testing is needed.
Comment 38 Vadim Barkov 2018-08-06 13:49:01 UTC
(In reply to Florian Weimer from comment #30)

Floarian,

I can't reproduce the problem that you describe. Could you take my patch (attachment 114327 [details]) and test against your executable?
Comment 39 Florian Weimer 2018-08-06 14:19:31 UTC
(In reply to Vadim Barkov from comment #38)
> (In reply to Florian Weimer from comment #30)
> 
> Floarian,
> 
> I can't reproduce the problem that you describe. Could you take my patch
> (attachment 114327 [details]) and test against your executable?

Do you have GCC 7 installed?  Then you can build glibc like this:

git clone --depth 1 git://sourceware.org/git/glibc.git
cd glibc
mkdir build
cd build
../configure --prefix=/usr CFLAGS="-O2 -g -march=z13 -mtune=z13"
make

(Parallel make is supported.)

After that, if valgrind is on PATH, you can run /bin/true like this:

./testrun.sh --tool=valgrind /bin/true

This should report no issues.
Comment 40 Mark Wielaard 2018-08-06 16:19:03 UTC
Created attachment 114334 [details]
simple z13 executable with strlen inlined

Here is a hopefully simpler reproducer for the following program compiled with GCC 8.2.1 with gcc -march=z13 -mtune=z14 -Wall -g -O2 -o t t.c

# cat t.c 
#include <string.h>
#include <stdlib.h>

int
main (int argc, char **argv)
{
  int status = argc > 1;
  if (status)
    {
      char *str = strdup (argv[1]);
      int len1 = strlen (argv[1]);
      int len2 = strlen (str);
      status &= len1 < 8 && len2 > 4;
      free (str);
    }
  
  exit (!status);
}

# gcc -march=z13 -mtune=z14 -Wall -g -O2 -o t t.c
# ./vg-in-place -q ./t hello
==6001== Conditional jump or move depends on uninitialised value(s)
==6001==    at 0x10005E8: main (t.c:12)
==6001== 
==6001== Conditional jump or move depends on uninitialised value(s)
==6001==    at 0x1000610: main (t.c:12)
==6001== 
==6001== Conditional jump or move depends on uninitialised value(s)
==6001==    at 0x1000614: main (t.c:12)
==6001== 
==6001== Conditional jump or move depends on uninitialised value(s)
==6001==    at 0x100062C: main (t.c:13)
==6001== 

# ./vg-in-place --vgdb-error=0 ./t hello

# gdb ./t
0x00000000040013c0 in _start () from /lib/ld64.so.1
(gdb) c
Continuing.

Program received signal SIGTRAP, Trace/breakpoint trap.
0x00000000010005e8 in main (argc=<optimized out>, argv=0x1ffefffcb8) at t.c:12
12	      int len2 = strlen (str);
(gdb) disassemble 
Dump of assembler code for function main:
   0x0000000001000528 <+0>:	stmg	%r12,%r15,96(%r15)
   0x000000000100052e <+6>:	lay	%r15,-160(%r15)
   0x0000000001000534 <+12>:	cijh	%r2,1,0x1000550 <main+40>
   0x000000000100053a <+18>:	lhi	%r12,0
   0x000000000100053e <+22>:	lr	%r2,%r12
   0x0000000001000540 <+24>:	xilf	%r2,1
   0x0000000001000546 <+30>:	lgfr	%r2,%r2
   0x000000000100054a <+34>:	brasl	%r14,0x10004c8 <exit@plt>
   0x0000000001000550 <+40>:	lgr	%r12,%r3
   0x0000000001000554 <+44>:	lg	%r2,8(%r3)
   0x000000000100055a <+50>:	brasl	%r14,0x1000508 <__strdup@plt>
   0x0000000001000560 <+56>:	lg	%r3,8(%r12)
   0x0000000001000566 <+62>:	lghi	%r4,0
   0x000000000100056a <+66>:	risbg	%r1,%r3,60,191,0
   0x0000000001000570 <+72>:	je	0x100059a <main+114>
   0x0000000001000574 <+76>:	lghi	%r5,15
   0x0000000001000578 <+80>:	sgr	%r5,%r1
   0x000000000100057c <+84>:	vll	%v0,%r5,0(%r3)
   0x0000000001000582 <+90>:	aghi	%r4,16
   0x0000000001000586 <+94>:	vfenezbs	%v0,%v0,%v0
   0x000000000100058c <+100>:	je	0x10005a2 <main+122>
   0x0000000001000590 <+104>:	vl	%v0,0(%r4,%r3)
   0x0000000001000596 <+110>:	j	0x1000582 <main+90>
   0x000000000100059a <+114>:	lghi	%r5,15
   0x000000000100059e <+118>:	j	0x1000590 <main+104>
   0x00000000010005a2 <+122>:	vlgvb	%r1,%v0,7
   0x00000000010005a8 <+128>:	llgcr	%r1,%r1
   0x00000000010005ac <+132>:	cgr	%r1,%r5
   0x00000000010005b0 <+136>:	la	%r5,1(%r5)
   0x00000000010005b4 <+140>:	locgrh	%r4,%r5
   0x00000000010005b8 <+144>:	jh	0x100059a <main+114>
   0x00000000010005bc <+148>:	lay	%r5,-16(%r4,%r1)
   0x00000000010005c2 <+154>:	lghi	%r4,0
   0x00000000010005c6 <+158>:	risbg	%r1,%r2,60,191,0
   0x00000000010005cc <+164>:	je	0x10005f6 <main+206>
   0x00000000010005d0 <+168>:	lghi	%r3,15
   0x00000000010005d4 <+172>:	sgr	%r3,%r1
   0x00000000010005d8 <+176>:	vll	%v0,%r3,0(%r2)
   0x00000000010005de <+182>:	aghi	%r4,16
   0x00000000010005e2 <+186>:	vfenezbs	%v0,%v0,%v0
=> 0x00000000010005e8 <+192>:	je	0x10005fe <main+214>
   0x00000000010005ec <+196>:	vl	%v0,0(%r4,%r2)
   0x00000000010005f2 <+202>:	j	0x10005de <main+182>
   0x00000000010005f6 <+206>:	lghi	%r3,15
   0x00000000010005fa <+210>:	j	0x10005ec <main+196>
   0x00000000010005fe <+214>:	vlgvb	%r1,%v0,7
   0x0000000001000604 <+220>:	llgcr	%r1,%r1
   0x0000000001000608 <+224>:	cgr	%r1,%r3
   0x000000000100060c <+228>:	la	%r3,1(%r3)
   0x0000000001000610 <+232>:	locgrh	%r4,%r3
   0x0000000001000614 <+236>:	jh	0x10005f6 <main+206>
   0x0000000001000618 <+240>:	lay	%r1,-16(%r4,%r1)
   0x000000000100061e <+246>:	cijh	%r5,7,0x100063c <main+276>
   0x0000000001000624 <+252>:	chi	%r1,4
   0x0000000001000628 <+256>:	lhi	%r12,0
   0x000000000100062c <+260>:	lochih	%r12,1
   0x0000000001000632 <+266>:	brasl	%r14,0x10004a8 <free@plt>
   0x0000000001000638 <+272>:	j	0x100053e <main+22>
   0x000000000100063c <+276>:	lhi	%r12,0
   0x0000000001000640 <+280>:	j	0x1000632 <main+266>
End of assembler dump.
Comment 41 Vadim Barkov 2018-08-07 17:01:29 UTC
Created attachment 114361 [details]
New VFENE Implementation

Fixed the strlen issue via new VFENE implementation
Comment 42 Vadim Barkov 2018-08-07 17:06:33 UTC
Good news! I've solved the strlen inlining issue.

The old VFENE implementation used the dirty helper, which marked second and third argument fully read (16 bytes). This is semantically wrong because this instruction reads its arguments byte by byte until needed element will be found (or all bytes are read).

The new implementation is correct from this perspective.

Now I am going to review other string instructions to this bug and fix them too.
Comment 43 Vadim Barkov 2018-08-07 17:26:46 UTC
Generally speaking I think that warning in inline strlen is correct because the VFENE specification says:

If ZS (zero search) flag is set, then each element in the second operand is compared for equality with zero.

So, if we have this instruction:

vfenezbs %v0, %v0, %v0

then all bytes of %v0 (second operand) are read in all cases (if the needed element found or not)

So, if some bytes of %v0 are uninitialized for some reason, then the resulting condition code of VFENEZBS instruction is undefined.

Anyway, the  VFENE implementation is rewritten because in general case (when the second and third operand are different) the third operand is not read fully.

What do you think about it?
Comment 44 Julian Seward 2018-08-07 18:49:29 UTC
(In reply to Vadim Barkov from comment #43)

> What do you think about it?

Honestly, I do not understand.  You need to explain much more clearly:

* what these instructions actually do

* what your implementation strategy is

* what the problem cases are

I have the impression from comment 34 that vfenezbs %v0, %v0, %v0 finds
the index of the first zero in the operand and writes that somewhere in
the result register.  (But why do they all have to to be v0 ?).  We have
solved such problems in the past for amd64 and so it would help if you
could relate your work to the amd64 implementation.

For amd64, we generate the sequence

  t1 = CmpEQ8x16(vec, zero-vector)
  t2 = pmovmskb(t1), which moves one bit from each lane into t2
  t3 = count-leading-zeroes(t2)

Because the count-leading-zeroes operation can be done in a way in
which the result is defined even if there are undefined bits to the
right of the leftmost 1 bit, this means that the resulting t3 value
will be defined if the input vector consisted of defined non-zero
bytes terminated by a defined-zero, even if the bytes after that
are undefined.
Comment 45 Vadim Barkov 2018-08-07 21:36:09 UTC
(In reply to Julian Seward from comment #44)

The new VFENE implementation has no problems. I just want to point you that in my opinion the memchecks' "Conditional jump or move depends on uninitialised value(s)" error for "VFENEZB %v0, %v0, %v0" instruction (which is used in inline ) can be considered as correct error (from some point of view it is not false positive).

Let's assume that we have a vector register %v0:
0xaa11223344556677, 0x8800???????????? -- the contents of %v0
"?" sign means that that bytes are uninitialized/undefined for some reason (e.g. 
 because of "vector load with length" VLL instruction)

We have "VFENEZBS %v0, %v0, %v0" instruction in code. It searchs for the zero byte in %v0, writes the index of zero byte in byte 7 of %v0 and upgrades condition code (CC, used for branches)
The result of execution is:
0x000000000000009, 0x0000000000000000 -- the new contents of %v0

On the one hand this result is memcheck clean (there is issues like "Conditional jump or move depends on uninitialised value(s)") because the CPU works using this algorithm:

int index;
for(index = 0; index < 16; index++) {
   if(v0[index] == 0)
      break;
}
return index;

But on the other hand the S390x specification ("z/Archtecture. Principles of Operation", SA22-7832-10) says that VFENE with ZS flag set (if mnemonic of operation contains the letter "Z") reads all bytes of second operand (in our example from %v0). So we get situation when instruction reads uninitialized bytes.
Is the result of this instruction detemined in the terms of valgrind's terminology? That's my question.
Comment 46 Vadim Barkov 2018-08-07 21:37:52 UTC
Since the issue is solved, is it ready to merge? Or additional testing is needed?
Comment 47 Julian Seward 2018-08-08 04:42:39 UTC
(In reply to Vadim Barkov from comment #46)
> Since the issue is solved,

I don't consider it to be solved.

You didn't answer my questions in comment 44:

* what your implementation strategy is

* about the relationship to the amd64 implementation.

I don't believe that what you have will generate correct definedness
results for strlen-like inputs with defined bytes up to a defined zero
and undefined bytes after that.  It might be correct, but since there is
no description of the implementation, there's no way to assess that.

It would also help if you had a set of test cases to demonstrate that

* the vfene sequence applied to string-end sequences of the form DDDD0UUUU
  generates a defined index value

* and that when applied to sequences of the form DDDDUUUU without a defined
  zero, it generates and undefined index value
Comment 48 Florian Weimer 2018-08-08 16:17:29 UTC
With the current set of patches, I still see failures for the following glibc string function tests:

string/test-memcmp
string/test-strcasecmp
string/test-strcmp
string/test-strncasecmp
string/test-strncmp

You can build those tests using:

make subdirs=string check

These failures happen with --tool=none as well.

Furthermore, with the system binutils on a non-z13 distribution (not recompiled for z13), I get this when running readelf:

./testrun.sh --tool=valgrind /usr/bin/readelf -s /bin/true
…
*** buffer overflow detected ***: /usr/bin/readelf terminated

Also happens with --tool=none as well.

This could the same issue as above, or a different one.
Comment 49 Vadim Barkov 2018-08-14 20:21:25 UTC
(In reply to Florian Weimer from comment #48)
> With the current set of patches, I still see failures for the following
> glibc string function tests:
> 
> string/test-memcmp
> string/test-strcasecmp
> string/test-strcmp
> string/test-strncasecmp
> string/test-strncmp

This issue occurs when I've compiled glibc with gcc-8. There is no failures when I use gcc-7 (gcc (SUSE Linux) 7.3.1 20180323).(In reply to Florian Weimer from comment #48)
> With the current set of patches, I still see failures for the following
> glibc string function tests:
> 
> string/test-memcmp
> string/test-strcasecmp
> string/test-strcmp
> string/test-strncasecmp
> string/test-strncmp

I confirm this failures when compiling glibc with gcc-8. When I use gcc-7 (gcc (SUSE Linux) 7.3.1 20180323), it's fine.
Comment 50 Andreas Arnez 2018-08-15 15:06:41 UTC
Vadim, thanks for the reworks of VLL and VFENE.  These help quite considerably, and the false positives we've seen before are gone.  Now I'm debugging an internal error that seems related to the z13 patches, but not necessarily to VLL and VFENE ("valgrind: the 'impossible' happened:   LibVEX called failure_exit()").  I'll let you know when I find out more about its cause.
In the meantime I noticed that we have another fundamental problem that comes with GCC's inlined strlen:  The last vector load (VL, or with length = VLL) may read past an allocated buffer, causing an "invalid read".  But of course reading beyond the buffer is actually OK, as long as no page boundary is crossed and the data beyond the buffer is ignored.  I'm not sure how to get rid of these false positives.  Any advice?
Comment 51 Andreas Arnez 2018-08-16 11:07:03 UTC
Created attachment 114453 [details]
Fix vector register move insns

This fixes the internal error I've mentioned in comment #50.
Comment 52 Vadim Barkov 2018-08-18 22:35:53 UTC
Created attachment 114484 [details]
Refactoring

Refactoring

Simplified code for most of vector integer instructions (and some vector basic ones). Mostly I've replaced a lot of switches with IROp arrays (it's shorter and nicer to read).
Comment 53 Vadim Barkov 2018-08-18 23:10:21 UTC
Created attachment 114485 [details]
Refactoring

Fixed wrong line endings
Comment 54 Julian Seward 2018-08-20 17:08:52 UTC
(In reply to Andreas Arnez from comment #50)
> In the meantime I noticed that we have another fundamental problem that
> comes with GCC's inlined strlen:  The last vector load (VL, or with length =
> VLL) may read past an allocated buffer, causing an "invalid read".  But of
> course reading beyond the buffer is actually OK, as long as no page boundary
> is crossed and the data beyond the buffer is ignored.  I'm not sure how to
> get rid of these false positives.  Any advice?

Memcheck's "partial-loads-OK" heuristic, which is now enabled by default,
should remove these complaints.  So the question is, why isn't it?  See
mc_LOADV_128_or_256_slow in mc_main.c.

To be clear, what this heuristic allows is:

* a 16 byte load, with 16 byte alignment, which loads memory over a
  transition over a buffer end, will not report an error

* instead, the bytes loaded from the area inside the buffer are 
  marked with the definedness of the loaded data.  The bytes
  corresponding to the area beyond the end of the buffer are marked
  as undefined.

* Hence, provided the loaded data after the terminating zero is
  ignored, there will be no error.

It would help a lot to have a proper answers to the questions I asked in
comments 44 and 47.  To be blunt, I am not happy to have this stuff land until
there's a proper discussion of the implementation strategy for these vector
loads.  It is not enough that programs using these vector load instructions
run correctly.  It is also necessary that they describe the data flow
accurately enough that Memcheck is not going to produce a lot of false
positives on strlen-like code.  But so far I am not convinced that that is the
case.
Comment 55 Vadim Barkov 2018-08-31 00:18:33 UTC
(In reply to Julian Seward from comment #47)
> (In reply to Vadim Barkov from comment #46)
> > Since the issue is solved,
> 
> I don't consider it to be solved.
> 
> You didn't answer my questions in comment 44:
> 
> * what your implementation strategy is
>

The VFENE instruction is implemented in this way:
/* VFENE %%v1, %%v2, %%v3 */
i = 0;
while(i < 16) {
    elem1 = GetElem(V128(v2), i)
    elem2 = GetElem(V128(v3), i)
    if(elem1 != elem2)
        break;
    i++
}
  v1 = 64HLto128(U64(i), U64(0))

VFENEZ also breaks the loop if (elem1 == 0)

I think this code should generate correct valgrind warnings ()
> * about the relationship to the amd64 implementation.
>
I am not sure that understand it.

PS
I am sorry for late response.
Comment 56 Andreas Arnez 2018-09-13 11:25:35 UTC
(In reply to Julian Seward from comment #54)
> [...]
> Memcheck's "partial-loads-OK" heuristic, which is now enabled by default,
> should remove these complaints.  So the question is, why isn't it?  See
> mc_LOADV_128_or_256_slow in mc_main.c.
This heuristic doesn't currently apply because VLL is implemented with up to 16 single-byte loads.  I'll attach a patch that performs a single 16-byte load instead.  Unless the VLL crosses a 16-byte boundary, that load will aligned as well.
Comment 57 Andreas Arnez 2018-09-13 13:20:22 UTC
Created attachment 114932 [details]
Implement VLL with aligned loads

This fixes complaints from memcheck in cases where VLL is used for strlen, strcpy, etc.  This is achieved by implementing a VLL that stays within a 16-byte aligned chunk with an aligned load instead.  Then memcheck will apply its "partial-loads-OK" heuristic and suppress complaints about reading past the allocated string.
One problem still remains with VLBB (vector load to block boundary), which assures that no page boundary is crossed and is often used for loading the initial bytes of a string.  VLBB may be safely used to load from an unaligned address, but memcheck's heuristic does not apply in this case.  Splitting the load into two aligned 16-byte loads wouldn't help either, because the second load would not necessarily overlap with the allocated string at all.
I wonder whether memcheck's heuristic could be adjusted such that it allows for unaligned 16-byte loads as long as no page boundary is crossed?  Or are there any other suggestions?
Comment 58 Julian Seward 2018-09-14 13:39:32 UTC
(In reply to Andreas Arnez from comment #57)
> Created attachment 114932 [details]
> Implement VLL with aligned loads

> Subject: [PATCH] s390x: Implement VLL/VLBB with aligned loads

Looks ok to me.  OK to land with the following changes:

* Add a short comment before the uses of ShlV128/ShlV128 explaining
  that the shift value can never be > 127.

* you use a construction "mkexpr(mktemp(type, expr))" which I am not
  sure what the purpose is.  For expressions that will get used more
  than once, you should bind them to an IRTemp, otherwise multiple uses
  of the expression will lead to it being code-generated multiple times.
  Eg instead of

  IRExpr *zeroed = mkexpr(mktemp(Ity_I64,
                                 binop(Iop_Sub64, mkU64(15), maxIdx)));

  do

  IRTemp zeroed = newTemp(Ity_I64);
  assign(zeroed, binop(Iop_Sub64, mkU64(15), maxIdx));

  and then use mkexpr(zeroed) instead of simply |zeroed| at all the
  use points.  This forces Vex to compute |zeroed| into a register
  and use that register multiple times.  (Vex can fix this stuff itself
  by doing CSE later, but that's expensive and so is only sometimes done).

  I see that |offset|, |zeroed| and |back| are all used multiple times,
  so please bind them to temps as above.
Comment 59 Andreas Arnez 2018-09-14 14:11:22 UTC
(In reply to Julian Seward from comment #58)
> * you use a construction "mkexpr(mktemp(type, expr))" which I am not
>   sure what the purpose is.  For expressions that will get used more
>   than once, you should bind them to an IRTemp, otherwise multiple uses
>   of the expression will lead to it being code-generated multiple times.
Actually, using an IRTemp is the purpose of this construction.
"mktemp(type, expr)" is just a shortcut for "temp = newTemp(type)"
followed by "assign(temp, expr)".  But since this obviously doesn't help
readability, I'll update the patch to avoid mktemp().
Comment 60 Andreas Arnez 2018-09-14 18:19:05 UTC
Created attachment 114958 [details]
Implement VLL/VLBB with aligned loads

I believe this addresses all of Julian's comments.  In addition this version also adjusts the criteria (on s390x only) for applying the partial-loads-ok exemption in mc_LOADV_128_or_256_slow such that it accepts unaligned 16-byte loads as long as no page boundary is crossed.  This should fix all known issues with VLL and VLBB.
Comment 61 Andreas Arnez 2018-09-21 15:52:55 UTC
Is this OK now?
Comment 62 Julian Seward 2018-09-21 17:02:50 UTC
(In reply to Andreas Arnez from comment #60)
> Created attachment 114958 [details]
> Implement VLL/VLBB with aligned loads

Yes.  Good.  That looks fine to me.  Thank you for redoing it.
Comment 63 Andreas Arnez 2018-09-21 17:10:36 UTC
(In reply to Julian Seward from comment #62)
> Yes.  Good.  That looks fine to me.  Thank you for redoing it.
Great!  Can the patches attached to this Bugzilla go upstream then?
Comment 64 Mark Wielaard 2018-09-22 07:25:27 UTC
Which attached patches (there are 7 now) and in which order should they be applied?
Comment 65 Mark Wielaard 2018-09-22 07:43:26 UTC
(In reply to Mark Wielaard from comment #64)
> Which attached patches (there are 7 now) and in which order should they be
> applied?

The answer to that is (it is actually 8 patches) and they should all be applied in order. Should some of them be combined before being committed to git?

I tried this, plus the trapping instruction patch from bug #396839, on a system that has a glibc build with gcc -march=z13. Previously we would die very quickly inside ld.so. But with this patch at least valgrind /bin/true does work as expected.
Comment 66 Andreas Arnez 2018-09-24 07:43:55 UTC
(In reply to Mark Wielaard from comment #65)
> The answer to that is (it is actually 8 patches) and they should all be
> applied in order.
Correct.
> Should some of them be combined before being committed to
> git?
I don't have a strong opinion on this.  Since all of these are required to fix this Bugzilla, we could combine them all.  Or we could separate Vadim's patches from mine, to distinguish authors.  Or we could just leave them separate, to document our struggle ;-).  What do you prefer?
> I tried this, plus the trapping instruction patch from bug #396839, on a
> system that has a glibc build with gcc -march=z13. Previously we would die
> very quickly inside ld.so. But with this patch at least valgrind /bin/true
> does work as expected.
Great!  Thanks for testing.
Comment 67 Vadim Barkov 2018-09-24 16:40:22 UTC
(In reply to Andreas Arnez from comment #66)
> (In reply to Mark Wielaard from comment #65)
> > The answer to that is (it is actually 8 patches) and they should all be
> > applied in order.
> Correct.
> > Should some of them be combined before being committed to
> > git?
> I don't have a strong opinion on this.  Since all of these are required to
> fix this Bugzilla, we could combine them all.  Or we could separate Vadim's
> patches from mine, to distinguish authors.  Or we could just leave them
> separate, to document our struggle ;-).  What do you prefer?

To be honest I don't care if these patches will be applied separate or combined.
Comment 68 Andreas Arnez 2018-09-24 18:41:11 UTC
Created attachment 115209 [details]
s390x: Vector integer and string instruction support

As discussed on IRC with Julian, here's a merged patch (without the tests) for easier review.  I also made some formatting improvements (reduced overlong lines, fixed indentation, fixed two spelling mistakes).  Otherwise it should be exactly the same as combining all the active patches above.
Comment 69 Andreas Arnez 2018-09-24 18:43:04 UTC
Created attachment 115210 [details]
s390x: Vector string and insn support -- tests
Comment 70 Julian Seward 2018-09-25 10:38:31 UTC
(In reply to Andreas Arnez from comment #69)
> Created attachment 115210 [details]
> s390x: Vector string and insn support -- tests

This is fine.  OK to land.
Comment 71 Julian Seward 2018-09-25 10:45:31 UTC
(In reply to Andreas Arnez from comment #68)
> Created attachment 115209 [details]
> s390x: Vector integer and string instruction support

Thank you for finishing this up.  OK to land provided the stuff below is fixed
(none of it is a big deal) and the test in the next para passes.

It is also important to check that all new or modified test cases run
on Memcheck without asserting or otherwise failing.  The regression test
driver will only check them with the "none" tool since the tests are inside
none/tests/s390.  You can do this manually by running the relevant tests
directly on memcheck (just run "by hand").


--- a/VEX/priv/guest_s390_defs.h
+++ b/VEX/priv/guest_s390_defs.h

+/* Arguments of s390x_dirtyhelper_vec_op(...) which are packed into one
+   ULong variable.
+ */
+typedef union {
..
+} s390x_vec_op_details_t;

add STATIC_ASSERT(sizeof(s390x_vec_op_details_t) == sizeof(ULong))


--- a/VEX/priv/guest_s390_helpers.c
+++ b/VEX/priv/guest_s390_helpers.c

+static void
+s390_cc_set_expr(IRExpr* cc)
+{
+   vassert(typeOfIRExpr(irsb->tyenv, cc) == Ity_I64);
+
+   s390_cc_thunk_fill(mkU64(S390_CC_OP_SET), cc, mkU64(0), mkU64(0));
+}

Similar to conversation last week this is again a case where, if |cc| is
used more than once (maybe in the callee of this function) then code for it
will get generated to evaluate it multiple times, and it's quite likely that
those duplicates will not get removed during IR optimisation.  It would be
safer to pass around an IRTemp and use |mkexpr(cc)| instead.

If this is a big amount of work, don't worry; but if it's small and easy
to fix then it would be good to do so now.


+static IRExpr*
+s390_V128_compareLT128x1(IRExpr* arg1, IRExpr* arg2, Bool allow_equal)

Just a sanity check: is the logic with |allow_equal| here correct?
I am not saying it isn't, but I think a second look would not be a bad idea.


+/* Permorms "arg1 + arg2 + carry_out_bit(arg1 + arg2)".

"Performs"


+   if (UNLIKELY(vex_traceflags & VEX_TRACE_FE))
+      s390_disasm(ENC5(MNM, VR, UDXB, VR, UINT), mnm, v1, d2, 0, b2, v3, m4);

and many similar s390_disasm calls: have you checked that s390_disasm can
print all of these new insns?  One way to try is to run your test suites
manually with --trace-flags=10000000 --trace-notbelow=0.  You'll get gigabytes
of output, but you can at least see if easily if s390_disasm aborts, or prints
some kind of "I can't handle this" message.


+   const IROp ops[] = { Iop_InterleaveHI8x16, Iop_InterleaveHI16x8,
+                        Iop_InterleaveHI32x4, Iop_InterleaveHI64x2 };
+   vassert(m4 < sizeof(ops));
+   put_vr_qw(v1, binop(ops[m4], get_vr_qw(v2), get_vr_qw(v3)));

That assertion is surely not right, or at least it assumes that sizeof(IROp)
== 1, which I think isn't true.  Shouldn't it be "m4 <
sizeof(ops)/sizeof(ops[0]))" -- viz, the traditional idiom for "the number of
elements in this array" ?

Please fix, also any other similar that you can find in this diff.  I saw it
also in the following, and maybe there are more?

s390_irgen_VMRH(UChar v1, UChar v2, UChar v3, UChar m4)
s390_irgen_VMRL(UChar v1, UChar v2, UChar v3, UChar m4)
s390_irgen_VPK(UChar v1, UChar v2, UChar v3, UChar m4)
s390_irgen_VUPH(UChar v1, UChar v2, UChar m3)
s390_irgen_VUPLH(UChar v1, UChar v2, UChar m3)
s390_irgen_VUPL(UChar v1, UChar v2, UChar m3)
s390_irgen_VUPLL(UChar v1, UChar v2, UChar m3)
s390_irgen_VPKS(UChar v1, UChar v2, UChar v3, UChar m4, UChar m5)
Also
s390_irgen_VMO(UChar v1, UChar v2, UChar v3, UChar m4) and its friends.


static const HChar *
 s390_irgen_VSEL(UChar v1, UChar v2, UChar v3, UChar v4)
 {
-   IRExpr* vA = get_vr_qw(v3);
-   IRExpr* vB = get_vr_qw(v2);
-   IRExpr* vC = get_vr_qw(v4);
-
-   /* result = (vA & ~vC) | (vB & vC) */
-   put_vr_qw(v1,
-             binop(Iop_OrV128,
-                   binop(Iop_AndV128, vA, unop(Iop_NotV128, vC)),
-                   binop(Iop_AndV128, vB, vC)
-                  )
-            );
+   IRExpr* vIfTrue = get_vr_qw(v2);
+   IRExpr* vIfFalse = get_vr_qw(v3);
+   IRExpr* vCond = get_vr_qw(v4);
+
+   put_vr_qw(v1, s390_V128_bitwiseITE(vCond, vIfTrue, vIfFalse));
    return "vsel";
 }

This kind of thing potentially also suffers from the IRExpr* duplication
problem, assuming that s390_V128_bitwiseITE uses vCond twice.  You could fix
it directly in s390_V128_bitwiseITE by allocating a new temp, binding
(assign()-ing) vCond to that and then using the temp twice (if it doesn't
already do that).


--- a/VEX/priv/host_s390_defs.h
+++ b/VEX/priv/host_s390_defs.h

+   S390_VEC_COUNT_LEADING_ZEROEZ,
+   S390_VEC_COUNT_TRAILING_ZEROEZ,

Really, _ZEROEZ and not _ZEROES ?  Can we use the traditional spelling, unless
ZEROEZ is really what was intended?


--- a/VEX/priv/host_s390_isel.c
+++ b/VEX/priv/host_s390_isel.c

+#define IRCONST_IS_EQUAL_U8(arg, val) \
+   ( ((arg)->tag == Iex_Const) && ((arg)->Iex.Const.con->Ico.U8 == (val)) )

This misses out (or assumes) the check that |arg| has type Ity_I8.  This might
work, but it's somewhat dangerous in the case of programming mistakes.  Please
change it to be

  ((arg)->tag == Iex_Const)
  && ((arg)->Iex.Const.tag == Ico_U8)  <--- the missing check
  && ((arg)->Iex.Const.con->Ico.U8 == (val))


-      s390_unop_t vec_op = 0;
+      UInt vec_op = 0;

Is it necessary to "weaken" the type of vec_op here?


+      case Iop_CmpNEZ128x1: {
+         IRExpr* lowPartNEZ = IRExpr_Unop(Iop_CmpNEZ64,
+                                          IRExpr_Unop(Iop_V128to64, arg));
+         IRExpr* highPartNEZ = IRExpr_Unop(Iop_CmpNEZ64,
+                                           IRExpr_Unop(Iop_V128HIto64, arg));
+         reg1 = s390_isel_int_expr(env,
+                                   IRExpr_Binop(Iop_Or64,
+                                                IRExpr_Unop(Iop_1Sto64, lowPartNEZ),
+                                                IRExpr_Unop(Iop_1Sto64, highPartNEZ)
+                                               )
+                                  );
+
+         dst = newVRegV(env);
+         addInstr(env, s390_insn_vec_binop(size, S390_VEC_INIT_FROM_GPRS,
+                                           dst, reg1, reg1));
+
+         return dst;
+      }

This is (I think!) correct but it's also very inefficient.  We're given a
128 bit value |arg| and we want to make |dst| be 0..(126)..0 if |arg| is
|all| zeroes and 1..(126)..1 if arg is all ones.

Better would be like this

   IRExpr* low64     = IRExpr_Unop(Iop_V128to64, arg);
   IRExpr* high64    = IRExpr_Unop(Iop_V128HIto64, arg);
   IRExpr* both      = IRExpr_Binop(Iop_Or64, low64, high64);
   IRExpr* anyNonZ   = IRExpr_Unop(Iop_CmpNEZ64, both);
   IRExpr* anyNonZ64 = IRExpr_Unop(Iop_1Sto64, anyNonZ);
   reg1 = s390_isel_int_expr(env, anyNonZ64);

Even then, it's pretty much a hack to create s390 code by first generating
*even more* IR and then handing that off to some other part of the instruction
selector.  It also means that the tree for |arg| will get instruction-selected
twice.

Much better would just be to generate the relevant s390 insns directly.  But
fixing that is beyond the scope of this round of reviewing/bug fixing.


+      case Iop_PwAddL8Ux16: {
+         /* There is no such instruction. We have to emulate it. */
+         dst = s390_isel_vec_expr(env,
+                                  IRExpr_Binop(Iop_Add16x8,
+                                               IRExpr_Binop(Iop_InterleaveEvenLanes8x16,
+                                                            IRExpr_Const(IRConst_V128(0x00)),
+                                                            arg),
+                                               IRExpr_Binop(Iop_InterleaveOddLanes8x16,
+                                                            IRExpr_Const(IRConst_V128(0x00)),
+                                                            arg)
+                                              )
+                                 );
+
+         return dst;
+      }

ditto (re "pretty much a hack".)  At least, please format to fit inside 80
cols.

Please write the arguments to IRConst_V128 here as 0x0000.  This is 
consistent with the style of other use of it and emphasises the fact
that the argument is a 16-bit value (one bit for each byte of a 
16 byte vector).


+      case Iop_PwAddL16Ux8:
+         if ( (arg->tag == Iex_Unop) && (arg->Iex.Unop.op == Iop_PwAddL8Ux16) )
+         {
+            size = 1;
+            arg = arg->Iex.Unop.arg;
+         }
+         else
+         {
+            size = 2;
+         }
+         vec_op = S390_VEC_PWSUM_W;
+         goto Iop_Pairwise_wrk;

Please change to use the house style, not so verbose:

+      case Iop_PwAddL16Ux8:
+         if (arg->tag == Iex_Unop && arg->Iex.Unop.op == Iop_PwAddL8Ux16) {
+            size = 1;
+            arg = arg->Iex.Unop.arg;
+         } else {
+            size = 2;
+         }
+         vec_op = S390_VEC_PWSUM_W;
+         goto Iop_Pairwise_wrk;
Comment 72 Andreas Arnez 2018-09-25 17:25:42 UTC
(In reply to Julian Seward from comment #71)
> Thank you for finishing this up.  OK to land provided the stuff below
> is fixed (none of it is a big deal) and the test in the next para
> passes.
Thanks for reviewing!

> It is also important to check that all new or modified test cases run
> on Memcheck without asserting or otherwise failing.  The regression
> test driver will only check them with the "none" tool since the tests
> are inside none/tests/s390.  You can do this manually by running the
> relevant tests directly on memcheck (just run "by hand").
Done.

> [...]
> +typedef union {
> ..
> +} s390x_vec_op_details_t;
> 
> add STATIC_ASSERT(sizeof(s390x_vec_op_details_t) == sizeof(ULong))
Done.

> +static void
> +s390_cc_set_expr(IRExpr* cc)
> +{
> +   vassert(typeOfIRExpr(irsb->tyenv, cc) == Ity_I64);
> +
> +   s390_cc_thunk_fill(mkU64(S390_CC_OP_SET), cc, mkU64(0), mkU64(0));
> +}
> 
> Similar to conversation last week this is again a case where, if |cc|
> is used more than once (maybe in the callee of this function) then
> code for it will get generated to evaluate it multiple times, and it's
> quite likely that those duplicates will not get removed during IR
> optimisation.  It would be safer to pass around an IRTemp and use
> |mkexpr(cc)| instead.
Done.  Most of the time there was an IRTemp already, so it wasn't a real
problem.  But I agree we're on the safe side this way.  Also renamed the
function to s390_cc_set and the previously named s390_cc_set to
s390_cc_set_val.

> +static IRExpr*
> +s390_V128_compareLT128x1(IRExpr* arg1, IRExpr* arg2, Bool allow_equal)
> 
> Just a sanity check: is the logic with |allow_equal| here correct?  I
> am not saying it isn't, but I think a second look would not be a bad
> idea.
Yes, the logic looks correct to me: `allow_equal' is checked when the
high halves are equal, because in that case the operands might be fully
equal.  Otherwise they certainly differ and `allow_equal' doesn't
matter.

> +/* Permorms "arg1 + arg2 + carry_out_bit(arg1 + arg2)".
> 
> "Performs"
Done.

> [...] have you checked that s390_disasm can print all of these new
> insns?  [...]
I haven't checked whether all insns can be printed correctly.  The ones
I looked at were OK.  I will look into that more thoroughly tomorrow.

> +   const IROp ops[] = { Iop_InterleaveHI8x16, Iop_InterleaveHI16x8,
> +                        Iop_InterleaveHI32x4, Iop_InterleaveHI64x2 };
> +   vassert(m4 < sizeof(ops));
> +   put_vr_qw(v1, binop(ops[m4], get_vr_qw(v2), get_vr_qw(v3)));
> 
> That assertion is surely not right [...]
> 
> Please fix, also any other similar that you can find in this diff.
> [...]
Hm, yeah, this is broken -- luckily it doesn't impact the execution of
correct code.  Fixed all occurrences I found.

> static const HChar *
>  s390_irgen_VSEL(UChar v1, UChar v2, UChar v3, UChar v4)
>  {
> -   IRExpr* vA = get_vr_qw(v3);
> -   IRExpr* vB = get_vr_qw(v2);
> -   IRExpr* vC = get_vr_qw(v4);
> -
> -   /* result = (vA & ~vC) | (vB & vC) */
> -   put_vr_qw(v1,
> -             binop(Iop_OrV128,
> -                   binop(Iop_AndV128, vA, unop(Iop_NotV128, vC)),
> -                   binop(Iop_AndV128, vB, vC)
> -                  )
> -            );
> +   IRExpr* vIfTrue = get_vr_qw(v2);
> +   IRExpr* vIfFalse = get_vr_qw(v3);
> +   IRExpr* vCond = get_vr_qw(v4);
> +
> +   put_vr_qw(v1, s390_V128_bitwiseITE(vCond, vIfTrue, vIfFalse));
>     return "vsel";
>  }
> 
> This kind of thing potentially also suffers from the IRExpr*
> duplication problem, assuming that s390_V128_bitwiseITE uses vCond
> twice.  You could fix it directly in s390_V128_bitwiseITE by
> allocating a new temp, binding (assign()-ing) vCond to that and then
> using the temp twice (if it doesn't already do that).
Done.

> --- a/VEX/priv/host_s390_defs.h
> +++ b/VEX/priv/host_s390_defs.h
> 
> +   S390_VEC_COUNT_LEADING_ZEROEZ,
> +   S390_VEC_COUNT_TRAILING_ZEROEZ,
> 
> Really, _ZEROEZ and not _ZEROES ?  Can we use the traditional
> spelling, unless ZEROEZ is really what was intended?
Hehe, these were cool names though ;-).  Anyhow -- fixed.

> --- a/VEX/priv/host_s390_isel.c
> +++ b/VEX/priv/host_s390_isel.c
> 
> +#define IRCONST_IS_EQUAL_U8(arg, val) \
> +   ( ((arg)->tag == Iex_Const) && ((arg)->Iex.Const.con->Ico.U8 == (val)) )
> 
> This misses out (or assumes) the check that |arg| has type Ity_I8.
> This might work, but it's somewhat dangerous in the case of
> programming mistakes.  Please change it to be
> 
>   ((arg)->tag == Iex_Const)
>   && ((arg)->Iex.Const.tag == Ico_U8)  <--- the missing check
>   && ((arg)->Iex.Const.con->Ico.U8 == (val))
Done.

> -      s390_unop_t vec_op = 0;
> +      UInt vec_op = 0;
> 
> Is it necessary to "weaken" the type of vec_op here?
Reverted, and also removed the dubious initialization.  The compiler
doesn't complain, so the weakening probably wasn't necessary to please
the compiler, and I haven't found any other reason.

> +      case Iop_CmpNEZ128x1: {
> [...]
> +      }
> 
> This is (I think!) correct but it's also very inefficient.  We're
> given a 128 bit value |arg| and we want to make |dst| be 0..(126)..0
> if |arg| is |all| zeroes and 1..(126)..1 if arg is all ones.
> 
> Better would be like this
> 
>    IRExpr* low64     = IRExpr_Unop(Iop_V128to64, arg);
>    IRExpr* high64    = IRExpr_Unop(Iop_V128HIto64, arg);
>    IRExpr* both      = IRExpr_Binop(Iop_Or64, low64, high64);
>    IRExpr* anyNonZ   = IRExpr_Unop(Iop_CmpNEZ64, both);
>    IRExpr* anyNonZ64 = IRExpr_Unop(Iop_1Sto64, anyNonZ);
>    reg1 = s390_isel_int_expr(env, anyNonZ64);
Done.

> 
> Even then, it's pretty much a hack to create s390 code by first
> generating *even more* IR and then handing that off to some other part
> of the instruction selector.  It also means that the tree for |arg|
> will get instruction-selected twice.
> 
> Much better would just be to generate the relevant s390 insns
> directly.  But fixing that is beyond the scope of this round of
> reviewing/bug fixing.
Yeah, I didn't spend time on that yet.

> +      case Iop_PwAddL8Ux16: {
> [...]
> +      }
> 
> ditto (re "pretty much a hack".)  At least, please format to fit
> inside 80 cols.
Done.

> Please write the arguments to IRConst_V128 here as 0x0000.  This is
> consistent with the style of other use of it and emphasises the fact
> that the argument is a 16-bit value (one bit for each byte of a 16
> byte vector).
Done.

> +      case Iop_PwAddL16Ux8:
> [...]
> 
> Please change to use the house style, not so verbose:
> 
> +      case Iop_PwAddL16Ux8:
> +         if (arg->tag == Iex_Unop && arg->Iex.Unop.op == Iop_PwAddL8Ux16) {
> +            size = 1;
> +            arg = arg->Iex.Unop.arg;
> +         } else {
> +            size = 2;
> +         }
> +         vec_op = S390_VEC_PWSUM_W;
> +         goto Iop_Pairwise_wrk;
Done.
Comment 73 Andreas Arnez 2018-09-25 17:29:34 UTC
Created attachment 115231 [details]
Fixes based on Julian's comment #71

For reference, here's what I changed based on the comments above.  I'll look more into the disassembler tomorrow and commit the patches by end of tomorrow if no more findings appear.
Comment 74 Julian Seward 2018-09-26 11:12:41 UTC
(In reply to Andreas Arnez from comment #73)
> Created attachment 115231 [details]
> Fixes based on Julian's comment #71
> 
> For reference, here's what I changed based on the comments above.  I'll look
> more into the disassembler tomorrow and commit the patches by end of
> tomorrow if no more findings appear.

Looks fine to me.  Just two minor nits:

--- a/VEX/priv/host_s390_isel.c
+++ b/VEX/priv/host_s390_isel.c

-      UInt vec_op = 0;
+      s390_unop_t vec_op;

Myself, I would have left the redundant initialiser there.  I like
initialisation :-)

+         IRExpr *low64     = IRExpr_Unop(Iop_V128to64, arg);
+         IRExpr *high64    = IRExpr_Unop(Iop_V128HIto64, arg);
+         IRExpr *both      = IRExpr_Binop(Iop_Or64, low64, high64);
+         IRExpr *anyNonZ   = IRExpr_Unop(Iop_CmpNEZ64, both);
+         IRExpr *anyNonZ64 = IRExpr_Unop(Iop_1Sto64, anyNonZ);

Can you put the * next to the type?  Yeah, I know it's technically
wrong/misleading etc.  But in the tree we use "T* v" (non-ugly, but
misleading) when there's only one variable, and "T *v1, *v2" in the
case where we have to be Technically Correct :-)
Comment 75 Andreas Arnez 2018-09-26 15:54:12 UTC
(In reply to Julian Seward from comment #74)
> Looks fine to me.  Just two minor nits:
> 
> --- a/VEX/priv/host_s390_isel.c
> +++ b/VEX/priv/host_s390_isel.c
> 
> -      UInt vec_op = 0;
> +      s390_unop_t vec_op;
> 
> Myself, I would have left the redundant initialiser there.  I like
> initialisation :-)
Hm, I'd really prefer to drop this initialization, because (1) zero translates to S390_ZERO_EXTEND_8, which I don't consider a useful initialization value in this case and (2) the initialization would prevent the compiler from identifying code paths where the variable may be used before assigned to.

> 
> +         IRExpr *low64     = IRExpr_Unop(Iop_V128to64, arg);
> +         IRExpr *high64    = IRExpr_Unop(Iop_V128HIto64, arg);
> +         IRExpr *both      = IRExpr_Binop(Iop_Or64, low64, high64);
> +         IRExpr *anyNonZ   = IRExpr_Unop(Iop_CmpNEZ64, both);
> +         IRExpr *anyNonZ64 = IRExpr_Unop(Iop_1Sto64, anyNonZ);
> 
> Can you put the * next to the type?  Yeah, I know it's technically
> wrong/misleading etc.  But in the tree we use "T* v" (non-ugly, but
> misleading) when there's only one variable, and "T *v1, *v2" in the
> case where we have to be Technically Correct :-)
I can surely do that.  The reason I put the * next to the name is because the vast majority of "IRExpr *" declarations in this file is written like that.
Comment 76 Julian Seward 2018-09-26 16:06:09 UTC
(In reply to Andreas Arnez from comment #75)
> (In reply to Julian Seward from comment #74)
> > Looks fine to me.  Just two minor nits:

Ok, no problem; I don't have a strong preference for my suggestions.
Leave it as it is, then.
Comment 77 Andreas Arnez 2018-09-26 16:31:30 UTC
Regarding the question whether the disassembler works for the new insns,
I've experimented a bit with --trace-flags=10000000.  It turned out that
the vector_integer test case crashed with this option, because when
ppIROp() in ir_defs.c was called to print the names of Iop_Avg64Ux2 and
Iop_Avg64Sx2, it ran into the `default' case instead and called vpanic.

Thus I added cases for these two operations to ppIROp().  While at it, I
also added handling for Ijk_SigFPE to ppIRJumpKind().

With these changes the tracing works, and skimming over the disassembled
insns didn't reveal any obvious problems.  I'm going to push the patch
with these changes now.
Comment 78 Andreas Arnez 2018-09-26 17:34:02 UTC
Committed and pushed.