Valgrind currently lacks support for the z/Architecture vector integer instructions introduced with z13. These are documented in the z/Architecture Principles of Operation, Eleventh Edition (March, 2015), chapter 22: "Vector Integer Instructions".
Created attachment 110339 [details] Vector integer & string z13 support for s390 z13 Contains the implementation of all vector integer (and string) instructions for z13.
I have not checked if these patches still apply, but for the general patches consider these patches acked from my s390x perspective.
(In reply to Christian Borntraeger from comment #2) > I have not checked if these patches still apply, but for the general patches > consider these patches acked from my s390x perspective. I've ensured that this patched can be applied on master.
Created attachment 114122 [details] Vector integer & string z13 support for s390 z13 Changes: Now this patch can be applied on master (on d44563c49e55f47016e23591f708c7aa57f7a098 or later commit)
(In reply to Vadim Barkov from comment #4) > Now this patch can be applied on master (on > d44563c49e55f47016e23591f708c7aa57f7a098 or later commit) Good! Tried the patch, and it looks good, with one exception: the test case output for the VTM instruction in vector_integer.stdout.out shows a diff on my system: - r_result = 0000000000000001 + r_result = 0000000000000003 This happens for all four of the vtm iterations, exactly with the same diff. Haven't investigated this any further yet.
Created attachment 114145 [details] More tests for VTM instruction Changes: Added tests for all possible result of VTM instruction.
Andreas, I can't reproduce you test results so I added tests for every possible result of VTM instruction. Could you apply patch from attachment 114145 [details] and write the results to me? I need more information to reproduce the bug. Tell me please: - CPU version (z13 or z14?) - GNU/Linux distributive that you use (Fedora, OpenSuse, etc) - compiler version ($ gcc --version)
(In reply to Vadim Barkov from comment #7) > Andreas, > > I can't reproduce you test results so I added tests for every possible > result of VTM instruction. Could you apply patch from attachment 114145 [details] > [details] and write the results to me? Sure. The diff looks similar. I'll attach it. > I need more information to reproduce the bug. Tell me please: > - CPU version (z13 or z14?) z14 > - GNU/Linux distributive that you use (Fedora, OpenSuse, etc) Fedora 28 > - compiler version ($ gcc --version) gcc 8.0.1
Created attachment 114150 [details] Test case diff with vtm failure
Created attachment 114152 [details] Fix vector register number calculation Independent from the vtm issue above, I noticed another problem with the current vector register logic. Whenever decoding an instruction that references one of the upper 16 vector registers, the register number may be calculated incorrectly. This is fixed.
Vadim, Andreas, please can you split this into two patches: * one with all the tests * one with the final bug-fixed implementation so it's easier to review.
Andreas, I've reproduced the bug. This occurs only when compiling valgrind with gcc-8 (I've used "gcc (SUSE Linux) 8.1.1 20180719") and with default optimisation flags (I guess it is "-O2"). I've compiled with same compiler and "-O0" and the problem dissapeared. When I've used gcc-7 compiler optimisation flag doesn't matter (all tests pass). So I guess there is new wrong optimisation bug in gcc-8. I'll try to resolve somehow this problem in code.
I've made the union in s390x_dirtyhelper_vec_op volatile and the bug magically dissapeared. Fixed. Now I'll rearrange all stuff into two patches (tests & code) and submit here.
(In reply to Vadim Barkov from comment #13) > I've made the union in s390x_dirtyhelper_vec_op volatile and the bug > magically dissapeared. > > Fixed. No. Just hidden. More likely is that there's some violation of C "defined behaviour" that the compiler is exploiting whilst optimising. Or some type error. I would prefer that you investigated this more and found the root cause of the problem.
Created attachment 114206 [details] Vector & string z13 implementation (tests)
Created attachment 114207 [details] Vector & string z13 implementation (code)
(In reply to Julian Seward from comment #14) Okay, I've refixed that issue without volatile (replaced UChar[6] array with ULong in union). Please, check the final version out. It includes the Andreas' patch and some tests for it.
(In reply to Julian Seward from comment #14) > More likely is that there's some violation of C "defined behaviour" > that the compiler is exploiting whilst optimising. Or some type > error. I would prefer that you investigated this more and found the > root cause of the problem. Right. The main problem seems to be caused by the inline assembly, where only the first byte of the_insn was declared as input: [insn] "m" (the_insn.bytes[0]) The whole insn is of course needed at this time. So this should be changed to [insn] "m" (the_insn) Another suspicous construct is the type cast (const s390x_vec_op_details_t*) &details where 'details' is a ULong. I believe this violates the strict aliasing rule.
(In reply to Andreas Arnez from comment #18) > Another suspicous construct is the type cast > (const s390x_vec_op_details_t*) &details > where 'details' is a ULong. I believe this violates the strict aliasing rule. Indeed. In short words strict aliasing says you are not allowed to view the same piece of memory through lenses of different types *other* than 'char *'. I debugged once a nasty problem in SPEC2017 CPU due to a strict aliasing violation combined with LTO. The question is also whether strict aliasing is enabled for this compilation unit...
(In reply to Ivo Raisr from comment #19) > (In reply to Andreas Arnez from comment #18) > > > Another suspicous construct is the type cast > > (const s390x_vec_op_details_t*) &details > > where 'details' is a ULong. I believe this violates the strict aliasing rule. Yes. I would much prefer not to have such violations in tree. > The question is also whether strict aliasing is enabled for this compilation > unit... To tell the truth, we have built with -fno-strict-aliasing for about a decade now :-/ But I still would prefer not to have such casting.
(In reply to Julian Seward from comment #20) > To tell the truth, we have built with -fno-strict-aliasing for about a > decade now :-/ But I still would prefer not to have such casting. Hm, for guest_s390_helpers.c I get a GCC command line that contains ... -fno-strict-aliasing -fno-builtin -fomit-frame-pointer -Wbad-function-cast -fstrict-aliasing ... Which means that strict aliasing is disabled and then re-enabled. Not sure if that's intentional, though.
Created attachment 114230 [details] Fix strict aliasing violation This fix applies to Vadim's new version. It also undoes the change from UChar[6] to ULong, because that change was unnecessary and I prefer the original approach. (A 6-byte character array seems more appropriate for representing an s390x instruction, and the original approach avoids pointer conversions.)
Julian, is this OK now?
In the meantime I did some more testing, and one thing I found is that VFENE also marks bytes in the input vectors beyond the zero-terminator as read. This leads to lots of false positives with real workloads. There might be similar issues with the other vector string instructions.
I applied the patches in attachment 114207 [details] and attachment 114230 [details] on top of the master branch (commit b9cfb2d15413d16f330878938af3d6fa1617f8b4). I get a crash in the inline expansion of strlen in ld.so of a z13-enabled glibc build, around here: ==18364== Invalid read of size 1 ==18364== at 0x401892E: _dl_sysdep_start (dl-sysdep.c:236) ==18364== by 0x40188A5: frob_brk (dl-sysdep.c:36) ==18364== by 0x40188A5: _dl_sysdep_start (dl-sysdep.c:227) ==18364== Address 0x1fff001000 is not stack'd, malloc'd or (recently) free'd 0x0000000004018914 <+860>: lghi %r3,0 0x0000000004018918 <+864>: risbg %r1,%r2,60,191,0 0x000000000401891e <+870>: je 0x401897c <_dl_sysdep_start+964> 0x0000000004018922 <+874>: lghi %r4,15 0x0000000004018926 <+878>: aghi %r3,16 0x000000000401892a <+882>: sgr %r4,%r1 => 0x000000000401892e <+886>: vll %v0,%r4,0(%r2) 0x0000000004018934 <+892>: vfenezbs %v0,%v0,%v0 0x000000000401893a <+898>: je 0x4018952 <_dl_sysdep_start+922> 0x000000000401893e <+902>: vl %v0,0(%r3,%r2) 0x0000000004018944 <+908>: aghi %r3,16 0x0000000004018948 <+912>: vfenezbs %v0,%v0,%v0 0x000000000401894e <+918>: jne 0x401893e <_dl_sysdep_start+902> 0x0000000004018952 <+922>: vlgvb %r1,%v0,7 0x0000000004018958 <+928>: llgcr %r1,%r1 0x000000000401895c <+932>: cgr %r1,%r4 (gdb) print/x $r2 $1 = 0x1fff000ffb (gdb) print (char *)$r2 $17 = 0x1fff000ffb "z13" (gdb) print _rtld_local_ro._dl_platform $18 = 0x1fff000ffb "z13" (gdb) print _rtld_local_ro._dl_platform == $r2 $19 = 1 So strlen is called with the correct value of GLRO(dl_platform), it seems, but the emulation does not like it. Are there additional patches I need?
The SIGSEGV happens here in the generated code: (gdb) disas 0x00000010038c9230,+20 Dump of assembler code from 0x10038c9230 to 0x10038c9244: 0x00000010038c9230: lghi %r5,8 0x00000010038c9234: lg %r3,2792(%r13) => 0x00000010038c923a: ic %r4,0(%r3) 0x00000010038c923e: lg %r3,2520(%r13) End of assembler dump. (gdb) print/x $r3 $1 = 0x1fff001000 (gdb) Looks like the load limit for VLL does not work. Consider this data from the valgrind emulation: (gdb) print $r4 $22 = 4 (gdb) print/x $r2 + $r4 $24 = 0x1fff000fff So the highest-indexed byte is still on the same page, but it is the last byte on the page.
Hopefully someone can see what is wrong with the generated IR: vll %v0,%r4,0(%r2) ------ IMark(0x401892E, 6, 0) ------ t6 = Add64(0x0:I64,GET:I64(592)) t7 = GET:I32(612) PUT(64) = SetElem8x16(GET:V128(64),0x0:I8,ITE(CmpLE32U(0x0:I32,t7),LDbe:I8(Add64(t6,0x0:I64)),0x0:I8)) PUT(64) = SetElem8x16(GET:V128(64),0x1:I8,ITE(CmpLE32U(0x1:I32,t7),LDbe:I8(Add64(t6,0x1:I64)),0x0:I8)) PUT(64) = SetElem8x16(GET:V128(64),0x2:I8,ITE(CmpLE32U(0x2:I32,t7),LDbe:I8(Add64(t6,0x2:I64)),0x0:I8)) PUT(64) = SetElem8x16(GET:V128(64),0x3:I8,ITE(CmpLE32U(0x3:I32,t7),LDbe:I8(Add64(t6,0x3:I64)),0x0:I8)) PUT(64) = SetElem8x16(GET:V128(64),0x4:I8,ITE(CmpLE32U(0x4:I32,t7),LDbe:I8(Add64(t6,0x4:I64)),0x0:I8)) PUT(64) = SetElem8x16(GET:V128(64),0x5:I8,ITE(CmpLE32U(0x5:I32,t7),LDbe:I8(Add64(t6,0x5:I64)),0x0:I8)) PUT(64) = SetElem8x16(GET:V128(64),0x6:I8,ITE(CmpLE32U(0x6:I32,t7),LDbe:I8(Add64(t6,0x6:I64)),0x0:I8)) PUT(64) = SetElem8x16(GET:V128(64),0x7:I8,ITE(CmpLE32U(0x7:I32,t7),LDbe:I8(Add64(t6,0x7:I64)),0x0:I8)) PUT(64) = SetElem8x16(GET:V128(64),0x8:I8,ITE(CmpLE32U(0x8:I32,t7),LDbe:I8(Add64(t6,0x8:I64)),0x0:I8)) PUT(64) = SetElem8x16(GET:V128(64),0x9:I8,ITE(CmpLE32U(0x9:I32,t7),LDbe:I8(Add64(t6,0x9:I64)),0x0:I8)) PUT(64) = SetElem8x16(GET:V128(64),0xA:I8,ITE(CmpLE32U(0xA:I32,t7),LDbe:I8(Add64(t6,0xA:I64)),0x0:I8)) PUT(64) = SetElem8x16(GET:V128(64),0xB:I8,ITE(CmpLE32U(0xB:I32,t7),LDbe:I8(Add64(t6,0xB:I64)),0x0:I8)) PUT(64) = SetElem8x16(GET:V128(64),0xC:I8,ITE(CmpLE32U(0xC:I32,t7),LDbe:I8(Add64(t6,0xC:I64)),0x0:I8)) PUT(64) = SetElem8x16(GET:V128(64),0xD:I8,ITE(CmpLE32U(0xD:I32,t7),LDbe:I8(Add64(t6,0xD:I64)),0x0:I8)) PUT(64) = SetElem8x16(GET:V128(64),0xE:I8,ITE(CmpLE32U(0xE:I32,t7),LDbe:I8(Add64(t6,0xE:I64)),0x0:I8)) PUT(64) = SetElem8x16(GET:V128(64),0xF:I8,ITE(CmpLE32U(0xF:I32,t7),LDbe:I8(Add64(t6,0xF:I64)),0x0:I8)) PUT(720) = 0x4018934:I64
Found it: /* A ternary if-then-else operator. It returns iftrue if cond is nonzero, iffalse otherwise. Note that it is STRICT, ie. both iftrue and iffalse are evaluated in all cases. ppIRExpr output: ITE(<cond>,<iftrue>,<iffalse>), eg. ITE(t6,t7,t8) */ So we need to emit a guard before the loads that go beyond the load limit.
Created attachment 114302 [details] Use guarded loads in the guest This patch changes the guest code to use guarded loads. However, this insufficient because the isel implementation does not handle Ist_LoadG. I don't know how to fix this. s390_insn_cond_move does not perform the U8 → U32 zero extension, so it is not a good fit here (and even Ist_LoadG with ILGop_8Uto32 is a bit off in this context, we would need ILGop_Ident8).
Created attachment 114305 [details] Implement early exit in s390_vr_loadWithLength Updated patch. This seems to fix the VLL issue. Limited testing only. However, these VLL instructions are typically part of the GCC inline expansion of strlen, and the result is used with VFENEZBS. Since VLL is used to over-load bytes until the next boundary, the vector result is partially undefined. VFENEZBS is implemented with a host assist only, though, so we get tons of warnings from inline copies of strlen: ==52238== Conditional jump or move depends on uninitialised value(s) ==52238== at 0x401B202: strdup (strdup.c:41) ==52238== by 0x40090AF: _dl_map_object (dl-load.c:2173) As a result, running just /bin/true on a z13-compiled distribution results in lots of errors: ==52238== ERROR SUMMARY: 7969 errors from 450 contexts (suppressed: 0 from 0)
(In reply to Florian Weimer from comment #29) > Created attachment 114302 [details] > Use guarded loads in the guest > > This patch changes the guest code to use guarded loads. > > However, this insufficient because the isel implementation does not handle > Ist_LoadG. I don't know how to fix this. Look at how the amd64 back end does it. It basically is going to be a completely normal unconditional load that has a conditional branch in front of it. > s390_insn_cond_move does not perform the U8 → U32 zero extension, so it is > not a good fit here (and even Ist_LoadG with ILGop_8Uto32 is a bit off in > this context, we would need ILGop_Ident8). This doesn't matter very much. Presumably s390 has an instruction that loads 8 bits from memory and zero extends the loaded value out to 64 bits. Then you can use the ILGop_8Uto64 descriptor in your conditional load. The narrowing operations that you will have to use afterwards (64to8, or 32to8) become no-ops in the generated code, so there's no performance disadvantage to using an 8bits-to-64bits-and-back-to-8bits scheme.
(In reply to Florian Weimer from comment #30) > Created attachment 114305 [details] > Implement early exit in s390_vr_loadWithLength > > Updated patch. This seems to fix the VLL issue. Limited testing only. It would be better to implement guarded loads and use that scheme, so that we don't have to end the superblock every time we see VLL. > However, these VLL instructions are typically part of the GCC inline > expansion of strlen, and the result is used with VFENEZBS. Since VLL is > used to over-load bytes until the next boundary, the vector result is > partially undefined. VFENEZBS is implemented with a host assist only, > though, so we get tons of warnings from inline copies of strlen: What does VFENEZBS do? I imagine this will need to be fixed by making a translation of it that is more Memcheck-friendly, and/or adjusting Memcheck to deal with the underlying IROps better.
(In reply to Andreas Arnez from comment #23) > Julian, is this OK now? Considering comment 30, alas, no. Let's try and get the Memcheck problems with this fixed first.
(In reply to Julian Seward from comment #32) > What does VFENEZBS do? VFENE V1, V2, V3 -- vector find element not equal It elementwise compares V2 and V3. If an unequal element is found, its index is stored in byte 7th (counting from 0th byte) of V1. If not, the value 16 is stored in 7th byte (counting from 0th byte) of V1. All other bytes of V1 are zeroed. VFENEZBS: B -- element size is 1 byte Z -- also compares elements of V2 with zero S -- set condition code with result of comparation So, "VFENEZBS %%v0, %%v0, %%v0" just searchs the first zero byte in %%v0, stores its index (or 16 if not found) and indicates if it was found in Conditional Code (CC).
(In reply to Florian Weimer from comment #30) Are you sure that VLL is the reason of the bug with strlen? If so, I believe that the conditional load implementation is the most natural. I'll add support for conditional load (and store) in the host part of s390x VEX. It shouldn't be hard (I've already implemented conditional dirty helpers in "vector basic" patch)
(In reply to Vadim Barkov from comment #35) > (In reply to Florian Weimer from comment #30) > > Are you sure that VLL is the reason of the bug with strlen? Exactly which bug are you referring to? If you mean the problem that Florian describes in comment 25 to comment 28, then I agree with his analysis -- I think it is correct.
Created attachment 114327 [details] New VLL implementation Based on Florian's patch with guarded loads (attachment 114302 [details]) I've just added host part to his code and replaced Iop_SetElem8x16 to direct write. This patch passes tests but additional testing is needed.
(In reply to Florian Weimer from comment #30) Floarian, I can't reproduce the problem that you describe. Could you take my patch (attachment 114327 [details]) and test against your executable?
(In reply to Vadim Barkov from comment #38) > (In reply to Florian Weimer from comment #30) > > Floarian, > > I can't reproduce the problem that you describe. Could you take my patch > (attachment 114327 [details]) and test against your executable? Do you have GCC 7 installed? Then you can build glibc like this: git clone --depth 1 git://sourceware.org/git/glibc.git cd glibc mkdir build cd build ../configure --prefix=/usr CFLAGS="-O2 -g -march=z13 -mtune=z13" make (Parallel make is supported.) After that, if valgrind is on PATH, you can run /bin/true like this: ./testrun.sh --tool=valgrind /bin/true This should report no issues.
Created attachment 114334 [details] simple z13 executable with strlen inlined Here is a hopefully simpler reproducer for the following program compiled with GCC 8.2.1 with gcc -march=z13 -mtune=z14 -Wall -g -O2 -o t t.c # cat t.c #include <string.h> #include <stdlib.h> int main (int argc, char **argv) { int status = argc > 1; if (status) { char *str = strdup (argv[1]); int len1 = strlen (argv[1]); int len2 = strlen (str); status &= len1 < 8 && len2 > 4; free (str); } exit (!status); } # gcc -march=z13 -mtune=z14 -Wall -g -O2 -o t t.c # ./vg-in-place -q ./t hello ==6001== Conditional jump or move depends on uninitialised value(s) ==6001== at 0x10005E8: main (t.c:12) ==6001== ==6001== Conditional jump or move depends on uninitialised value(s) ==6001== at 0x1000610: main (t.c:12) ==6001== ==6001== Conditional jump or move depends on uninitialised value(s) ==6001== at 0x1000614: main (t.c:12) ==6001== ==6001== Conditional jump or move depends on uninitialised value(s) ==6001== at 0x100062C: main (t.c:13) ==6001== # ./vg-in-place --vgdb-error=0 ./t hello # gdb ./t 0x00000000040013c0 in _start () from /lib/ld64.so.1 (gdb) c Continuing. Program received signal SIGTRAP, Trace/breakpoint trap. 0x00000000010005e8 in main (argc=<optimized out>, argv=0x1ffefffcb8) at t.c:12 12 int len2 = strlen (str); (gdb) disassemble Dump of assembler code for function main: 0x0000000001000528 <+0>: stmg %r12,%r15,96(%r15) 0x000000000100052e <+6>: lay %r15,-160(%r15) 0x0000000001000534 <+12>: cijh %r2,1,0x1000550 <main+40> 0x000000000100053a <+18>: lhi %r12,0 0x000000000100053e <+22>: lr %r2,%r12 0x0000000001000540 <+24>: xilf %r2,1 0x0000000001000546 <+30>: lgfr %r2,%r2 0x000000000100054a <+34>: brasl %r14,0x10004c8 <exit@plt> 0x0000000001000550 <+40>: lgr %r12,%r3 0x0000000001000554 <+44>: lg %r2,8(%r3) 0x000000000100055a <+50>: brasl %r14,0x1000508 <__strdup@plt> 0x0000000001000560 <+56>: lg %r3,8(%r12) 0x0000000001000566 <+62>: lghi %r4,0 0x000000000100056a <+66>: risbg %r1,%r3,60,191,0 0x0000000001000570 <+72>: je 0x100059a <main+114> 0x0000000001000574 <+76>: lghi %r5,15 0x0000000001000578 <+80>: sgr %r5,%r1 0x000000000100057c <+84>: vll %v0,%r5,0(%r3) 0x0000000001000582 <+90>: aghi %r4,16 0x0000000001000586 <+94>: vfenezbs %v0,%v0,%v0 0x000000000100058c <+100>: je 0x10005a2 <main+122> 0x0000000001000590 <+104>: vl %v0,0(%r4,%r3) 0x0000000001000596 <+110>: j 0x1000582 <main+90> 0x000000000100059a <+114>: lghi %r5,15 0x000000000100059e <+118>: j 0x1000590 <main+104> 0x00000000010005a2 <+122>: vlgvb %r1,%v0,7 0x00000000010005a8 <+128>: llgcr %r1,%r1 0x00000000010005ac <+132>: cgr %r1,%r5 0x00000000010005b0 <+136>: la %r5,1(%r5) 0x00000000010005b4 <+140>: locgrh %r4,%r5 0x00000000010005b8 <+144>: jh 0x100059a <main+114> 0x00000000010005bc <+148>: lay %r5,-16(%r4,%r1) 0x00000000010005c2 <+154>: lghi %r4,0 0x00000000010005c6 <+158>: risbg %r1,%r2,60,191,0 0x00000000010005cc <+164>: je 0x10005f6 <main+206> 0x00000000010005d0 <+168>: lghi %r3,15 0x00000000010005d4 <+172>: sgr %r3,%r1 0x00000000010005d8 <+176>: vll %v0,%r3,0(%r2) 0x00000000010005de <+182>: aghi %r4,16 0x00000000010005e2 <+186>: vfenezbs %v0,%v0,%v0 => 0x00000000010005e8 <+192>: je 0x10005fe <main+214> 0x00000000010005ec <+196>: vl %v0,0(%r4,%r2) 0x00000000010005f2 <+202>: j 0x10005de <main+182> 0x00000000010005f6 <+206>: lghi %r3,15 0x00000000010005fa <+210>: j 0x10005ec <main+196> 0x00000000010005fe <+214>: vlgvb %r1,%v0,7 0x0000000001000604 <+220>: llgcr %r1,%r1 0x0000000001000608 <+224>: cgr %r1,%r3 0x000000000100060c <+228>: la %r3,1(%r3) 0x0000000001000610 <+232>: locgrh %r4,%r3 0x0000000001000614 <+236>: jh 0x10005f6 <main+206> 0x0000000001000618 <+240>: lay %r1,-16(%r4,%r1) 0x000000000100061e <+246>: cijh %r5,7,0x100063c <main+276> 0x0000000001000624 <+252>: chi %r1,4 0x0000000001000628 <+256>: lhi %r12,0 0x000000000100062c <+260>: lochih %r12,1 0x0000000001000632 <+266>: brasl %r14,0x10004a8 <free@plt> 0x0000000001000638 <+272>: j 0x100053e <main+22> 0x000000000100063c <+276>: lhi %r12,0 0x0000000001000640 <+280>: j 0x1000632 <main+266> End of assembler dump.
Created attachment 114361 [details] New VFENE Implementation Fixed the strlen issue via new VFENE implementation
Good news! I've solved the strlen inlining issue. The old VFENE implementation used the dirty helper, which marked second and third argument fully read (16 bytes). This is semantically wrong because this instruction reads its arguments byte by byte until needed element will be found (or all bytes are read). The new implementation is correct from this perspective. Now I am going to review other string instructions to this bug and fix them too.
Generally speaking I think that warning in inline strlen is correct because the VFENE specification says: If ZS (zero search) flag is set, then each element in the second operand is compared for equality with zero. So, if we have this instruction: vfenezbs %v0, %v0, %v0 then all bytes of %v0 (second operand) are read in all cases (if the needed element found or not) So, if some bytes of %v0 are uninitialized for some reason, then the resulting condition code of VFENEZBS instruction is undefined. Anyway, the VFENE implementation is rewritten because in general case (when the second and third operand are different) the third operand is not read fully. What do you think about it?
(In reply to Vadim Barkov from comment #43) > What do you think about it? Honestly, I do not understand. You need to explain much more clearly: * what these instructions actually do * what your implementation strategy is * what the problem cases are I have the impression from comment 34 that vfenezbs %v0, %v0, %v0 finds the index of the first zero in the operand and writes that somewhere in the result register. (But why do they all have to to be v0 ?). We have solved such problems in the past for amd64 and so it would help if you could relate your work to the amd64 implementation. For amd64, we generate the sequence t1 = CmpEQ8x16(vec, zero-vector) t2 = pmovmskb(t1), which moves one bit from each lane into t2 t3 = count-leading-zeroes(t2) Because the count-leading-zeroes operation can be done in a way in which the result is defined even if there are undefined bits to the right of the leftmost 1 bit, this means that the resulting t3 value will be defined if the input vector consisted of defined non-zero bytes terminated by a defined-zero, even if the bytes after that are undefined.
(In reply to Julian Seward from comment #44) The new VFENE implementation has no problems. I just want to point you that in my opinion the memchecks' "Conditional jump or move depends on uninitialised value(s)" error for "VFENEZB %v0, %v0, %v0" instruction (which is used in inline ) can be considered as correct error (from some point of view it is not false positive). Let's assume that we have a vector register %v0: 0xaa11223344556677, 0x8800???????????? -- the contents of %v0 "?" sign means that that bytes are uninitialized/undefined for some reason (e.g. because of "vector load with length" VLL instruction) We have "VFENEZBS %v0, %v0, %v0" instruction in code. It searchs for the zero byte in %v0, writes the index of zero byte in byte 7 of %v0 and upgrades condition code (CC, used for branches) The result of execution is: 0x000000000000009, 0x0000000000000000 -- the new contents of %v0 On the one hand this result is memcheck clean (there is issues like "Conditional jump or move depends on uninitialised value(s)") because the CPU works using this algorithm: int index; for(index = 0; index < 16; index++) { if(v0[index] == 0) break; } return index; But on the other hand the S390x specification ("z/Archtecture. Principles of Operation", SA22-7832-10) says that VFENE with ZS flag set (if mnemonic of operation contains the letter "Z") reads all bytes of second operand (in our example from %v0). So we get situation when instruction reads uninitialized bytes. Is the result of this instruction detemined in the terms of valgrind's terminology? That's my question.
Since the issue is solved, is it ready to merge? Or additional testing is needed?
(In reply to Vadim Barkov from comment #46) > Since the issue is solved, I don't consider it to be solved. You didn't answer my questions in comment 44: * what your implementation strategy is * about the relationship to the amd64 implementation. I don't believe that what you have will generate correct definedness results for strlen-like inputs with defined bytes up to a defined zero and undefined bytes after that. It might be correct, but since there is no description of the implementation, there's no way to assess that. It would also help if you had a set of test cases to demonstrate that * the vfene sequence applied to string-end sequences of the form DDDD0UUUU generates a defined index value * and that when applied to sequences of the form DDDDUUUU without a defined zero, it generates and undefined index value
With the current set of patches, I still see failures for the following glibc string function tests: string/test-memcmp string/test-strcasecmp string/test-strcmp string/test-strncasecmp string/test-strncmp You can build those tests using: make subdirs=string check These failures happen with --tool=none as well. Furthermore, with the system binutils on a non-z13 distribution (not recompiled for z13), I get this when running readelf: ./testrun.sh --tool=valgrind /usr/bin/readelf -s /bin/true … *** buffer overflow detected ***: /usr/bin/readelf terminated Also happens with --tool=none as well. This could the same issue as above, or a different one.
(In reply to Florian Weimer from comment #48) > With the current set of patches, I still see failures for the following > glibc string function tests: > > string/test-memcmp > string/test-strcasecmp > string/test-strcmp > string/test-strncasecmp > string/test-strncmp This issue occurs when I've compiled glibc with gcc-8. There is no failures when I use gcc-7 (gcc (SUSE Linux) 7.3.1 20180323).(In reply to Florian Weimer from comment #48) > With the current set of patches, I still see failures for the following > glibc string function tests: > > string/test-memcmp > string/test-strcasecmp > string/test-strcmp > string/test-strncasecmp > string/test-strncmp I confirm this failures when compiling glibc with gcc-8. When I use gcc-7 (gcc (SUSE Linux) 7.3.1 20180323), it's fine.
Vadim, thanks for the reworks of VLL and VFENE. These help quite considerably, and the false positives we've seen before are gone. Now I'm debugging an internal error that seems related to the z13 patches, but not necessarily to VLL and VFENE ("valgrind: the 'impossible' happened: LibVEX called failure_exit()"). I'll let you know when I find out more about its cause. In the meantime I noticed that we have another fundamental problem that comes with GCC's inlined strlen: The last vector load (VL, or with length = VLL) may read past an allocated buffer, causing an "invalid read". But of course reading beyond the buffer is actually OK, as long as no page boundary is crossed and the data beyond the buffer is ignored. I'm not sure how to get rid of these false positives. Any advice?
Created attachment 114453 [details] Fix vector register move insns This fixes the internal error I've mentioned in comment #50.
Created attachment 114484 [details] Refactoring Refactoring Simplified code for most of vector integer instructions (and some vector basic ones). Mostly I've replaced a lot of switches with IROp arrays (it's shorter and nicer to read).
Created attachment 114485 [details] Refactoring Fixed wrong line endings
(In reply to Andreas Arnez from comment #50) > In the meantime I noticed that we have another fundamental problem that > comes with GCC's inlined strlen: The last vector load (VL, or with length = > VLL) may read past an allocated buffer, causing an "invalid read". But of > course reading beyond the buffer is actually OK, as long as no page boundary > is crossed and the data beyond the buffer is ignored. I'm not sure how to > get rid of these false positives. Any advice? Memcheck's "partial-loads-OK" heuristic, which is now enabled by default, should remove these complaints. So the question is, why isn't it? See mc_LOADV_128_or_256_slow in mc_main.c. To be clear, what this heuristic allows is: * a 16 byte load, with 16 byte alignment, which loads memory over a transition over a buffer end, will not report an error * instead, the bytes loaded from the area inside the buffer are marked with the definedness of the loaded data. The bytes corresponding to the area beyond the end of the buffer are marked as undefined. * Hence, provided the loaded data after the terminating zero is ignored, there will be no error. It would help a lot to have a proper answers to the questions I asked in comments 44 and 47. To be blunt, I am not happy to have this stuff land until there's a proper discussion of the implementation strategy for these vector loads. It is not enough that programs using these vector load instructions run correctly. It is also necessary that they describe the data flow accurately enough that Memcheck is not going to produce a lot of false positives on strlen-like code. But so far I am not convinced that that is the case.
(In reply to Julian Seward from comment #47) > (In reply to Vadim Barkov from comment #46) > > Since the issue is solved, > > I don't consider it to be solved. > > You didn't answer my questions in comment 44: > > * what your implementation strategy is > The VFENE instruction is implemented in this way: /* VFENE %%v1, %%v2, %%v3 */ i = 0; while(i < 16) { elem1 = GetElem(V128(v2), i) elem2 = GetElem(V128(v3), i) if(elem1 != elem2) break; i++ } v1 = 64HLto128(U64(i), U64(0)) VFENEZ also breaks the loop if (elem1 == 0) I think this code should generate correct valgrind warnings () > * about the relationship to the amd64 implementation. > I am not sure that understand it. PS I am sorry for late response.
(In reply to Julian Seward from comment #54) > [...] > Memcheck's "partial-loads-OK" heuristic, which is now enabled by default, > should remove these complaints. So the question is, why isn't it? See > mc_LOADV_128_or_256_slow in mc_main.c. This heuristic doesn't currently apply because VLL is implemented with up to 16 single-byte loads. I'll attach a patch that performs a single 16-byte load instead. Unless the VLL crosses a 16-byte boundary, that load will aligned as well.
Created attachment 114932 [details] Implement VLL with aligned loads This fixes complaints from memcheck in cases where VLL is used for strlen, strcpy, etc. This is achieved by implementing a VLL that stays within a 16-byte aligned chunk with an aligned load instead. Then memcheck will apply its "partial-loads-OK" heuristic and suppress complaints about reading past the allocated string. One problem still remains with VLBB (vector load to block boundary), which assures that no page boundary is crossed and is often used for loading the initial bytes of a string. VLBB may be safely used to load from an unaligned address, but memcheck's heuristic does not apply in this case. Splitting the load into two aligned 16-byte loads wouldn't help either, because the second load would not necessarily overlap with the allocated string at all. I wonder whether memcheck's heuristic could be adjusted such that it allows for unaligned 16-byte loads as long as no page boundary is crossed? Or are there any other suggestions?
(In reply to Andreas Arnez from comment #57) > Created attachment 114932 [details] > Implement VLL with aligned loads > Subject: [PATCH] s390x: Implement VLL/VLBB with aligned loads Looks ok to me. OK to land with the following changes: * Add a short comment before the uses of ShlV128/ShlV128 explaining that the shift value can never be > 127. * you use a construction "mkexpr(mktemp(type, expr))" which I am not sure what the purpose is. For expressions that will get used more than once, you should bind them to an IRTemp, otherwise multiple uses of the expression will lead to it being code-generated multiple times. Eg instead of IRExpr *zeroed = mkexpr(mktemp(Ity_I64, binop(Iop_Sub64, mkU64(15), maxIdx))); do IRTemp zeroed = newTemp(Ity_I64); assign(zeroed, binop(Iop_Sub64, mkU64(15), maxIdx)); and then use mkexpr(zeroed) instead of simply |zeroed| at all the use points. This forces Vex to compute |zeroed| into a register and use that register multiple times. (Vex can fix this stuff itself by doing CSE later, but that's expensive and so is only sometimes done). I see that |offset|, |zeroed| and |back| are all used multiple times, so please bind them to temps as above.
(In reply to Julian Seward from comment #58) > * you use a construction "mkexpr(mktemp(type, expr))" which I am not > sure what the purpose is. For expressions that will get used more > than once, you should bind them to an IRTemp, otherwise multiple uses > of the expression will lead to it being code-generated multiple times. Actually, using an IRTemp is the purpose of this construction. "mktemp(type, expr)" is just a shortcut for "temp = newTemp(type)" followed by "assign(temp, expr)". But since this obviously doesn't help readability, I'll update the patch to avoid mktemp().
Created attachment 114958 [details] Implement VLL/VLBB with aligned loads I believe this addresses all of Julian's comments. In addition this version also adjusts the criteria (on s390x only) for applying the partial-loads-ok exemption in mc_LOADV_128_or_256_slow such that it accepts unaligned 16-byte loads as long as no page boundary is crossed. This should fix all known issues with VLL and VLBB.
Is this OK now?
(In reply to Andreas Arnez from comment #60) > Created attachment 114958 [details] > Implement VLL/VLBB with aligned loads Yes. Good. That looks fine to me. Thank you for redoing it.
(In reply to Julian Seward from comment #62) > Yes. Good. That looks fine to me. Thank you for redoing it. Great! Can the patches attached to this Bugzilla go upstream then?
Which attached patches (there are 7 now) and in which order should they be applied?
(In reply to Mark Wielaard from comment #64) > Which attached patches (there are 7 now) and in which order should they be > applied? The answer to that is (it is actually 8 patches) and they should all be applied in order. Should some of them be combined before being committed to git? I tried this, plus the trapping instruction patch from bug #396839, on a system that has a glibc build with gcc -march=z13. Previously we would die very quickly inside ld.so. But with this patch at least valgrind /bin/true does work as expected.
(In reply to Mark Wielaard from comment #65) > The answer to that is (it is actually 8 patches) and they should all be > applied in order. Correct. > Should some of them be combined before being committed to > git? I don't have a strong opinion on this. Since all of these are required to fix this Bugzilla, we could combine them all. Or we could separate Vadim's patches from mine, to distinguish authors. Or we could just leave them separate, to document our struggle ;-). What do you prefer? > I tried this, plus the trapping instruction patch from bug #396839, on a > system that has a glibc build with gcc -march=z13. Previously we would die > very quickly inside ld.so. But with this patch at least valgrind /bin/true > does work as expected. Great! Thanks for testing.
(In reply to Andreas Arnez from comment #66) > (In reply to Mark Wielaard from comment #65) > > The answer to that is (it is actually 8 patches) and they should all be > > applied in order. > Correct. > > Should some of them be combined before being committed to > > git? > I don't have a strong opinion on this. Since all of these are required to > fix this Bugzilla, we could combine them all. Or we could separate Vadim's > patches from mine, to distinguish authors. Or we could just leave them > separate, to document our struggle ;-). What do you prefer? To be honest I don't care if these patches will be applied separate or combined.
Created attachment 115209 [details] s390x: Vector integer and string instruction support As discussed on IRC with Julian, here's a merged patch (without the tests) for easier review. I also made some formatting improvements (reduced overlong lines, fixed indentation, fixed two spelling mistakes). Otherwise it should be exactly the same as combining all the active patches above.
Created attachment 115210 [details] s390x: Vector string and insn support -- tests
(In reply to Andreas Arnez from comment #69) > Created attachment 115210 [details] > s390x: Vector string and insn support -- tests This is fine. OK to land.
(In reply to Andreas Arnez from comment #68) > Created attachment 115209 [details] > s390x: Vector integer and string instruction support Thank you for finishing this up. OK to land provided the stuff below is fixed (none of it is a big deal) and the test in the next para passes. It is also important to check that all new or modified test cases run on Memcheck without asserting or otherwise failing. The regression test driver will only check them with the "none" tool since the tests are inside none/tests/s390. You can do this manually by running the relevant tests directly on memcheck (just run "by hand"). --- a/VEX/priv/guest_s390_defs.h +++ b/VEX/priv/guest_s390_defs.h +/* Arguments of s390x_dirtyhelper_vec_op(...) which are packed into one + ULong variable. + */ +typedef union { .. +} s390x_vec_op_details_t; add STATIC_ASSERT(sizeof(s390x_vec_op_details_t) == sizeof(ULong)) --- a/VEX/priv/guest_s390_helpers.c +++ b/VEX/priv/guest_s390_helpers.c +static void +s390_cc_set_expr(IRExpr* cc) +{ + vassert(typeOfIRExpr(irsb->tyenv, cc) == Ity_I64); + + s390_cc_thunk_fill(mkU64(S390_CC_OP_SET), cc, mkU64(0), mkU64(0)); +} Similar to conversation last week this is again a case where, if |cc| is used more than once (maybe in the callee of this function) then code for it will get generated to evaluate it multiple times, and it's quite likely that those duplicates will not get removed during IR optimisation. It would be safer to pass around an IRTemp and use |mkexpr(cc)| instead. If this is a big amount of work, don't worry; but if it's small and easy to fix then it would be good to do so now. +static IRExpr* +s390_V128_compareLT128x1(IRExpr* arg1, IRExpr* arg2, Bool allow_equal) Just a sanity check: is the logic with |allow_equal| here correct? I am not saying it isn't, but I think a second look would not be a bad idea. +/* Permorms "arg1 + arg2 + carry_out_bit(arg1 + arg2)". "Performs" + if (UNLIKELY(vex_traceflags & VEX_TRACE_FE)) + s390_disasm(ENC5(MNM, VR, UDXB, VR, UINT), mnm, v1, d2, 0, b2, v3, m4); and many similar s390_disasm calls: have you checked that s390_disasm can print all of these new insns? One way to try is to run your test suites manually with --trace-flags=10000000 --trace-notbelow=0. You'll get gigabytes of output, but you can at least see if easily if s390_disasm aborts, or prints some kind of "I can't handle this" message. + const IROp ops[] = { Iop_InterleaveHI8x16, Iop_InterleaveHI16x8, + Iop_InterleaveHI32x4, Iop_InterleaveHI64x2 }; + vassert(m4 < sizeof(ops)); + put_vr_qw(v1, binop(ops[m4], get_vr_qw(v2), get_vr_qw(v3))); That assertion is surely not right, or at least it assumes that sizeof(IROp) == 1, which I think isn't true. Shouldn't it be "m4 < sizeof(ops)/sizeof(ops[0]))" -- viz, the traditional idiom for "the number of elements in this array" ? Please fix, also any other similar that you can find in this diff. I saw it also in the following, and maybe there are more? s390_irgen_VMRH(UChar v1, UChar v2, UChar v3, UChar m4) s390_irgen_VMRL(UChar v1, UChar v2, UChar v3, UChar m4) s390_irgen_VPK(UChar v1, UChar v2, UChar v3, UChar m4) s390_irgen_VUPH(UChar v1, UChar v2, UChar m3) s390_irgen_VUPLH(UChar v1, UChar v2, UChar m3) s390_irgen_VUPL(UChar v1, UChar v2, UChar m3) s390_irgen_VUPLL(UChar v1, UChar v2, UChar m3) s390_irgen_VPKS(UChar v1, UChar v2, UChar v3, UChar m4, UChar m5) Also s390_irgen_VMO(UChar v1, UChar v2, UChar v3, UChar m4) and its friends. static const HChar * s390_irgen_VSEL(UChar v1, UChar v2, UChar v3, UChar v4) { - IRExpr* vA = get_vr_qw(v3); - IRExpr* vB = get_vr_qw(v2); - IRExpr* vC = get_vr_qw(v4); - - /* result = (vA & ~vC) | (vB & vC) */ - put_vr_qw(v1, - binop(Iop_OrV128, - binop(Iop_AndV128, vA, unop(Iop_NotV128, vC)), - binop(Iop_AndV128, vB, vC) - ) - ); + IRExpr* vIfTrue = get_vr_qw(v2); + IRExpr* vIfFalse = get_vr_qw(v3); + IRExpr* vCond = get_vr_qw(v4); + + put_vr_qw(v1, s390_V128_bitwiseITE(vCond, vIfTrue, vIfFalse)); return "vsel"; } This kind of thing potentially also suffers from the IRExpr* duplication problem, assuming that s390_V128_bitwiseITE uses vCond twice. You could fix it directly in s390_V128_bitwiseITE by allocating a new temp, binding (assign()-ing) vCond to that and then using the temp twice (if it doesn't already do that). --- a/VEX/priv/host_s390_defs.h +++ b/VEX/priv/host_s390_defs.h + S390_VEC_COUNT_LEADING_ZEROEZ, + S390_VEC_COUNT_TRAILING_ZEROEZ, Really, _ZEROEZ and not _ZEROES ? Can we use the traditional spelling, unless ZEROEZ is really what was intended? --- a/VEX/priv/host_s390_isel.c +++ b/VEX/priv/host_s390_isel.c +#define IRCONST_IS_EQUAL_U8(arg, val) \ + ( ((arg)->tag == Iex_Const) && ((arg)->Iex.Const.con->Ico.U8 == (val)) ) This misses out (or assumes) the check that |arg| has type Ity_I8. This might work, but it's somewhat dangerous in the case of programming mistakes. Please change it to be ((arg)->tag == Iex_Const) && ((arg)->Iex.Const.tag == Ico_U8) <--- the missing check && ((arg)->Iex.Const.con->Ico.U8 == (val)) - s390_unop_t vec_op = 0; + UInt vec_op = 0; Is it necessary to "weaken" the type of vec_op here? + case Iop_CmpNEZ128x1: { + IRExpr* lowPartNEZ = IRExpr_Unop(Iop_CmpNEZ64, + IRExpr_Unop(Iop_V128to64, arg)); + IRExpr* highPartNEZ = IRExpr_Unop(Iop_CmpNEZ64, + IRExpr_Unop(Iop_V128HIto64, arg)); + reg1 = s390_isel_int_expr(env, + IRExpr_Binop(Iop_Or64, + IRExpr_Unop(Iop_1Sto64, lowPartNEZ), + IRExpr_Unop(Iop_1Sto64, highPartNEZ) + ) + ); + + dst = newVRegV(env); + addInstr(env, s390_insn_vec_binop(size, S390_VEC_INIT_FROM_GPRS, + dst, reg1, reg1)); + + return dst; + } This is (I think!) correct but it's also very inefficient. We're given a 128 bit value |arg| and we want to make |dst| be 0..(126)..0 if |arg| is |all| zeroes and 1..(126)..1 if arg is all ones. Better would be like this IRExpr* low64 = IRExpr_Unop(Iop_V128to64, arg); IRExpr* high64 = IRExpr_Unop(Iop_V128HIto64, arg); IRExpr* both = IRExpr_Binop(Iop_Or64, low64, high64); IRExpr* anyNonZ = IRExpr_Unop(Iop_CmpNEZ64, both); IRExpr* anyNonZ64 = IRExpr_Unop(Iop_1Sto64, anyNonZ); reg1 = s390_isel_int_expr(env, anyNonZ64); Even then, it's pretty much a hack to create s390 code by first generating *even more* IR and then handing that off to some other part of the instruction selector. It also means that the tree for |arg| will get instruction-selected twice. Much better would just be to generate the relevant s390 insns directly. But fixing that is beyond the scope of this round of reviewing/bug fixing. + case Iop_PwAddL8Ux16: { + /* There is no such instruction. We have to emulate it. */ + dst = s390_isel_vec_expr(env, + IRExpr_Binop(Iop_Add16x8, + IRExpr_Binop(Iop_InterleaveEvenLanes8x16, + IRExpr_Const(IRConst_V128(0x00)), + arg), + IRExpr_Binop(Iop_InterleaveOddLanes8x16, + IRExpr_Const(IRConst_V128(0x00)), + arg) + ) + ); + + return dst; + } ditto (re "pretty much a hack".) At least, please format to fit inside 80 cols. Please write the arguments to IRConst_V128 here as 0x0000. This is consistent with the style of other use of it and emphasises the fact that the argument is a 16-bit value (one bit for each byte of a 16 byte vector). + case Iop_PwAddL16Ux8: + if ( (arg->tag == Iex_Unop) && (arg->Iex.Unop.op == Iop_PwAddL8Ux16) ) + { + size = 1; + arg = arg->Iex.Unop.arg; + } + else + { + size = 2; + } + vec_op = S390_VEC_PWSUM_W; + goto Iop_Pairwise_wrk; Please change to use the house style, not so verbose: + case Iop_PwAddL16Ux8: + if (arg->tag == Iex_Unop && arg->Iex.Unop.op == Iop_PwAddL8Ux16) { + size = 1; + arg = arg->Iex.Unop.arg; + } else { + size = 2; + } + vec_op = S390_VEC_PWSUM_W; + goto Iop_Pairwise_wrk;
(In reply to Julian Seward from comment #71) > Thank you for finishing this up. OK to land provided the stuff below > is fixed (none of it is a big deal) and the test in the next para > passes. Thanks for reviewing! > It is also important to check that all new or modified test cases run > on Memcheck without asserting or otherwise failing. The regression > test driver will only check them with the "none" tool since the tests > are inside none/tests/s390. You can do this manually by running the > relevant tests directly on memcheck (just run "by hand"). Done. > [...] > +typedef union { > .. > +} s390x_vec_op_details_t; > > add STATIC_ASSERT(sizeof(s390x_vec_op_details_t) == sizeof(ULong)) Done. > +static void > +s390_cc_set_expr(IRExpr* cc) > +{ > + vassert(typeOfIRExpr(irsb->tyenv, cc) == Ity_I64); > + > + s390_cc_thunk_fill(mkU64(S390_CC_OP_SET), cc, mkU64(0), mkU64(0)); > +} > > Similar to conversation last week this is again a case where, if |cc| > is used more than once (maybe in the callee of this function) then > code for it will get generated to evaluate it multiple times, and it's > quite likely that those duplicates will not get removed during IR > optimisation. It would be safer to pass around an IRTemp and use > |mkexpr(cc)| instead. Done. Most of the time there was an IRTemp already, so it wasn't a real problem. But I agree we're on the safe side this way. Also renamed the function to s390_cc_set and the previously named s390_cc_set to s390_cc_set_val. > +static IRExpr* > +s390_V128_compareLT128x1(IRExpr* arg1, IRExpr* arg2, Bool allow_equal) > > Just a sanity check: is the logic with |allow_equal| here correct? I > am not saying it isn't, but I think a second look would not be a bad > idea. Yes, the logic looks correct to me: `allow_equal' is checked when the high halves are equal, because in that case the operands might be fully equal. Otherwise they certainly differ and `allow_equal' doesn't matter. > +/* Permorms "arg1 + arg2 + carry_out_bit(arg1 + arg2)". > > "Performs" Done. > [...] have you checked that s390_disasm can print all of these new > insns? [...] I haven't checked whether all insns can be printed correctly. The ones I looked at were OK. I will look into that more thoroughly tomorrow. > + const IROp ops[] = { Iop_InterleaveHI8x16, Iop_InterleaveHI16x8, > + Iop_InterleaveHI32x4, Iop_InterleaveHI64x2 }; > + vassert(m4 < sizeof(ops)); > + put_vr_qw(v1, binop(ops[m4], get_vr_qw(v2), get_vr_qw(v3))); > > That assertion is surely not right [...] > > Please fix, also any other similar that you can find in this diff. > [...] Hm, yeah, this is broken -- luckily it doesn't impact the execution of correct code. Fixed all occurrences I found. > static const HChar * > s390_irgen_VSEL(UChar v1, UChar v2, UChar v3, UChar v4) > { > - IRExpr* vA = get_vr_qw(v3); > - IRExpr* vB = get_vr_qw(v2); > - IRExpr* vC = get_vr_qw(v4); > - > - /* result = (vA & ~vC) | (vB & vC) */ > - put_vr_qw(v1, > - binop(Iop_OrV128, > - binop(Iop_AndV128, vA, unop(Iop_NotV128, vC)), > - binop(Iop_AndV128, vB, vC) > - ) > - ); > + IRExpr* vIfTrue = get_vr_qw(v2); > + IRExpr* vIfFalse = get_vr_qw(v3); > + IRExpr* vCond = get_vr_qw(v4); > + > + put_vr_qw(v1, s390_V128_bitwiseITE(vCond, vIfTrue, vIfFalse)); > return "vsel"; > } > > This kind of thing potentially also suffers from the IRExpr* > duplication problem, assuming that s390_V128_bitwiseITE uses vCond > twice. You could fix it directly in s390_V128_bitwiseITE by > allocating a new temp, binding (assign()-ing) vCond to that and then > using the temp twice (if it doesn't already do that). Done. > --- a/VEX/priv/host_s390_defs.h > +++ b/VEX/priv/host_s390_defs.h > > + S390_VEC_COUNT_LEADING_ZEROEZ, > + S390_VEC_COUNT_TRAILING_ZEROEZ, > > Really, _ZEROEZ and not _ZEROES ? Can we use the traditional > spelling, unless ZEROEZ is really what was intended? Hehe, these were cool names though ;-). Anyhow -- fixed. > --- a/VEX/priv/host_s390_isel.c > +++ b/VEX/priv/host_s390_isel.c > > +#define IRCONST_IS_EQUAL_U8(arg, val) \ > + ( ((arg)->tag == Iex_Const) && ((arg)->Iex.Const.con->Ico.U8 == (val)) ) > > This misses out (or assumes) the check that |arg| has type Ity_I8. > This might work, but it's somewhat dangerous in the case of > programming mistakes. Please change it to be > > ((arg)->tag == Iex_Const) > && ((arg)->Iex.Const.tag == Ico_U8) <--- the missing check > && ((arg)->Iex.Const.con->Ico.U8 == (val)) Done. > - s390_unop_t vec_op = 0; > + UInt vec_op = 0; > > Is it necessary to "weaken" the type of vec_op here? Reverted, and also removed the dubious initialization. The compiler doesn't complain, so the weakening probably wasn't necessary to please the compiler, and I haven't found any other reason. > + case Iop_CmpNEZ128x1: { > [...] > + } > > This is (I think!) correct but it's also very inefficient. We're > given a 128 bit value |arg| and we want to make |dst| be 0..(126)..0 > if |arg| is |all| zeroes and 1..(126)..1 if arg is all ones. > > Better would be like this > > IRExpr* low64 = IRExpr_Unop(Iop_V128to64, arg); > IRExpr* high64 = IRExpr_Unop(Iop_V128HIto64, arg); > IRExpr* both = IRExpr_Binop(Iop_Or64, low64, high64); > IRExpr* anyNonZ = IRExpr_Unop(Iop_CmpNEZ64, both); > IRExpr* anyNonZ64 = IRExpr_Unop(Iop_1Sto64, anyNonZ); > reg1 = s390_isel_int_expr(env, anyNonZ64); Done. > > Even then, it's pretty much a hack to create s390 code by first > generating *even more* IR and then handing that off to some other part > of the instruction selector. It also means that the tree for |arg| > will get instruction-selected twice. > > Much better would just be to generate the relevant s390 insns > directly. But fixing that is beyond the scope of this round of > reviewing/bug fixing. Yeah, I didn't spend time on that yet. > + case Iop_PwAddL8Ux16: { > [...] > + } > > ditto (re "pretty much a hack".) At least, please format to fit > inside 80 cols. Done. > Please write the arguments to IRConst_V128 here as 0x0000. This is > consistent with the style of other use of it and emphasises the fact > that the argument is a 16-bit value (one bit for each byte of a 16 > byte vector). Done. > + case Iop_PwAddL16Ux8: > [...] > > Please change to use the house style, not so verbose: > > + case Iop_PwAddL16Ux8: > + if (arg->tag == Iex_Unop && arg->Iex.Unop.op == Iop_PwAddL8Ux16) { > + size = 1; > + arg = arg->Iex.Unop.arg; > + } else { > + size = 2; > + } > + vec_op = S390_VEC_PWSUM_W; > + goto Iop_Pairwise_wrk; Done.
Created attachment 115231 [details] Fixes based on Julian's comment #71 For reference, here's what I changed based on the comments above. I'll look more into the disassembler tomorrow and commit the patches by end of tomorrow if no more findings appear.
(In reply to Andreas Arnez from comment #73) > Created attachment 115231 [details] > Fixes based on Julian's comment #71 > > For reference, here's what I changed based on the comments above. I'll look > more into the disassembler tomorrow and commit the patches by end of > tomorrow if no more findings appear. Looks fine to me. Just two minor nits: --- a/VEX/priv/host_s390_isel.c +++ b/VEX/priv/host_s390_isel.c - UInt vec_op = 0; + s390_unop_t vec_op; Myself, I would have left the redundant initialiser there. I like initialisation :-) + IRExpr *low64 = IRExpr_Unop(Iop_V128to64, arg); + IRExpr *high64 = IRExpr_Unop(Iop_V128HIto64, arg); + IRExpr *both = IRExpr_Binop(Iop_Or64, low64, high64); + IRExpr *anyNonZ = IRExpr_Unop(Iop_CmpNEZ64, both); + IRExpr *anyNonZ64 = IRExpr_Unop(Iop_1Sto64, anyNonZ); Can you put the * next to the type? Yeah, I know it's technically wrong/misleading etc. But in the tree we use "T* v" (non-ugly, but misleading) when there's only one variable, and "T *v1, *v2" in the case where we have to be Technically Correct :-)
(In reply to Julian Seward from comment #74) > Looks fine to me. Just two minor nits: > > --- a/VEX/priv/host_s390_isel.c > +++ b/VEX/priv/host_s390_isel.c > > - UInt vec_op = 0; > + s390_unop_t vec_op; > > Myself, I would have left the redundant initialiser there. I like > initialisation :-) Hm, I'd really prefer to drop this initialization, because (1) zero translates to S390_ZERO_EXTEND_8, which I don't consider a useful initialization value in this case and (2) the initialization would prevent the compiler from identifying code paths where the variable may be used before assigned to. > > + IRExpr *low64 = IRExpr_Unop(Iop_V128to64, arg); > + IRExpr *high64 = IRExpr_Unop(Iop_V128HIto64, arg); > + IRExpr *both = IRExpr_Binop(Iop_Or64, low64, high64); > + IRExpr *anyNonZ = IRExpr_Unop(Iop_CmpNEZ64, both); > + IRExpr *anyNonZ64 = IRExpr_Unop(Iop_1Sto64, anyNonZ); > > Can you put the * next to the type? Yeah, I know it's technically > wrong/misleading etc. But in the tree we use "T* v" (non-ugly, but > misleading) when there's only one variable, and "T *v1, *v2" in the > case where we have to be Technically Correct :-) I can surely do that. The reason I put the * next to the name is because the vast majority of "IRExpr *" declarations in this file is written like that.
(In reply to Andreas Arnez from comment #75) > (In reply to Julian Seward from comment #74) > > Looks fine to me. Just two minor nits: Ok, no problem; I don't have a strong preference for my suggestions. Leave it as it is, then.
Regarding the question whether the disassembler works for the new insns, I've experimented a bit with --trace-flags=10000000. It turned out that the vector_integer test case crashed with this option, because when ppIROp() in ir_defs.c was called to print the names of Iop_Avg64Ux2 and Iop_Avg64Sx2, it ran into the `default' case instead and called vpanic. Thus I added cases for these two operations to ppIROp(). While at it, I also added handling for Ijk_SigFPE to ppIRJumpKind(). With these changes the tracing works, and skimming over the disassembled insns didn't reveal any obvious problems. I'm going to push the patch with these changes now.
Committed and pushed.