Summary: | s390x: z13 vector floating-point instructions not implemented | ||
---|---|---|---|
Product: | [Developer tools] valgrind | Reporter: | Andreas Arnez <arnez> |
Component: | vex | Assignee: | Julian Seward <jseward> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | fweimer, vbrkov |
Priority: | NOR | ||
Version: | unspecified | ||
Target Milestone: | --- | ||
Platform: | Other | ||
OS: | Linux | ||
Latest Commit: | Version Fixed In: | ||
Sentry Crash Report: | |||
Attachments: |
z13 vector floating point support (tests)
z13 vector floating point support (code) Vector floating point support (v2) Tests for vector FP support (v2) Change unsigned int to UInt Address feedback for vector FP patch |
Description
Andreas Arnez
2017-10-05 17:59:59 UTC
I've started to work on it. (see https://github.com/barkovv/valgrind/issues/1 for code and status) Created attachment 115426 [details]
z13 vector floating point support (tests)
Created attachment 115427 [details]
z13 vector floating point support (code)
Please take a look
(In reply to Vadim Barkov from comment #3) > Created attachment 115427 [details] > z13 vector floating point support (code) > > Please take a look Thanks, this looks pretty good already. One thing I'd like to note is that the patch contains many lines longer than 80 characters. Please avoid that. Apart from that, I'm currently testing the patch with various binaries, with mostly pleasing results. I have encountered one potential problem, which may or may not be caused by your changes. I'll look further into it and let you know when I know more. (In reply to Andreas Arnez from comment #4) > [...] I have encountered one potential problem, which > may or may not be caused by your changes. I'll look further into it and let > you know when I know more. OK, I found two problems and documented them in Bugs 400490 and 400491. Both of these bugs were already contained in the previous z13 patches. I also attached fixes, with which I can now execute the previously misbehaving vector-FP binaries without any problems. (In reply to Andreas Arnez from comment #5) Good news! Is any additional work needed on this issue? (In reply to Vadim Barkov from comment #6) > Good news! Is any additional work needed on this issue? As I said before, long lines should be avoided. And there are a few other nits in the implementation (attachment 115427 [details]): > Subject: [PATCH 2/2] Vector floating point implementation (code) > [...] > --- a/VEX/priv/guest_s390_toIR.c > +++ b/VEX/priv/guest_s390_toIR.c > [...] > return "vmalh"; > } > > +static void > +s390_vector_fp_from_or_to_operation(IROp op, IRType fromType, IRType toTy… Very long line; please split appropriately. It may also help to shorten the function name to something like "s390_vector_fp_convert". I think this would improve readability of the function's invocations as well. Source lines in Valgrind should generally not exceed 80 columns. You can find long lines in your patch with a grep command like this: grep -nH '^+ .\{81\}' my.patch > [...] > + for Iop_F64toF32 we do this: > + f64[0] -> f32[0] > + f64[1] -> f32[2] > + > + The magic below with scaling factors is used to achive the logic de… Replace "achive" by "achieve". > [...] > + vassert(m3 == 3); > + > + IRExpr* result; > + switch (m5) { > + case 0: { Space at end of line. > [...] > + get_vr_qw(v3)); > + result = triop(addOrSub, > + irrm, > + mulResult, > + get_vr_qw(v4)); You should probably add a comment here, explaining that the separate multiply and add/sub operations are combined to a fused multiply-add/sub by s390_isel_vec_expr_wrk() in host_s390_isel.c. > [...] > + if (!s390_vr_is_cs_set(m6)) { > + if (LIKELY(!isSingleElementOp)) { > + put_vr_qw(v1, binop(Iop_CmpEQ64Fx2, get_vr_qw(v2), get_vr_qw(v3)… > + } else { > + IRExpr* comparationResult = binop(Iop_CmpF64, get_vr(v2, Ity_F64… Replace "comparation" by "comparison". Same in s390_irgen_VFCH and s390_irgen_VFCHE. > [...] > --- a/VEX/priv/host_s390_isel.c > +++ b/VEX/priv/host_s390_isel.c > [...] > > + Iop_irrm_V_wrk: { > + set_bfp_rounding_mode_in_fpc(env, arg1); > + reg1 = s390_isel_vec_expr(env, arg2); > + Spaces at end of line. > [...] > + case Iop_Add64Fx2: > + size = 8; > + > + /* Add64Fx2(Mul64Fx2(arg1, arg2), arg3) -> MAdd(arg1, arg2, arg3… > + if (UNLIKELY((arg2->tag == Iex_Triop) Space at end of line. > + && (arg2->Iex.Triop.details->op == Iop_Mul64Fx2) > + && (arg1 == arg2->Iex.Triop.details->arg1)) > + ) { > + vec_op = S390_VEC_FLOAT_MADD; > + goto Iop_irrm_MAddOrSub; OK, so this is where you combine separate multiply and add operations to a fused multiply-add. Is it guaranteed that this logic triggers whenever the original instruction was a multiply-add, and only then? We must not do this if the program executes the instructions separately, because it would change the result. > [...] > + case Iop_Sub64Fx2: > + size = 8; > + > + /* Sub64Fx2(Mul64Fx2(arg1, arg2), arg3) -> MSub(arg1, arg2, arg3… > + if (UNLIKELY((arg2->tag == Iex_Triop) Space at end of line. (In reply to Vadim Barkov from comment #2) > Created attachment 115426 [details] > z13 vector floating point support (tests) Here are some more comments about the test cases: > [...] > --- a/none/tests/s390x/vector.h > +++ b/none/tests/s390x/vector.h > [...] > @@ -38,22 +42,57 @@ typedef union { > [...] > +void print_f32(const V128 value) { > + printf("%.5e | %.5e | %.5e | %.5e\n", value.f32[0], value.f32[1], valu… Why don't we show all significant digits? This may be relevant when checking the effect of the various rounding modes. Thus I'd prefer to use the "%a" format here instead of "%.5e". Also note that this function is currently unused; is this intentional? > [...] > +void print_f64(const V128 value) { > + printf("%.9e | %.9e\n", value.f64[0], value.f64[1]); Same here, better use "%a" instead of "%.9e". > [...] > --- /dev/null > +++ b/none/tests/s390x/vector_float.c > @@ -0,0 +1,174 @@ > [...] > +int main() { > [...] > + s390_call_float_test(cdlgb, (V128_V_RES_AS_FLOAT64 | V128_V_ARG1_AS_IN… > + s390_call_float_test(cgdb, (V128_V_RES_AS_INT | V128_V_ARG1_AS_FLOAT64… > + s390_call_float_test(clgdb, V128_V_RES_AS_INT | V128_V_ARG1_AS_FLOAT64… > + s390_call_float_test(fidb, V128_V_RES_AS_FLOAT64 | V128_V_ARG1_AS_FLOA… > + s390_call_float_test(ledb, V128_V_RES_AS_FLOAT64 | V128_V_ARG1_AS_FLOA… For vledb, the result is in fact a pair of FLOAT32 values. But switching to V128_V_RES_AS_FLOAT32 is not quite right either, because the odd elements in the result are "unpredictable". Your implementation of VFLR fills the odd elements with 0xff, whereas the hardware typically fills them with zeroes. Thus the test fails with a diff in stdout when displaying floating-point values with "%a" as suggested above. To fix this, the test should probably just skip printing the odd elements. Maybe add another flag to indicate this, something like V128_V_RES_EVEN_ONLY. > [...] > + > + test_with_selective_printing(vfmadb, (V128_V_RES_AS_FLOAT64 | V128_V_A… > + test_with_selective_printing(wfmadb, (V128_V_RES_AS_FLOAT64 | V128_V_A… > + test_with_selective_printing(vfmsdb, (V128_V_RES_AS_FLOAT64 | V128_V_A… > + test_with_selective_printing(wfmsdb, (V128_V_RES_AS_FLOAT64 | V128_V_A… Very long lines (140 chars); please split. Or maybe you can pass the flags in a variable. It may also help to shorten the flags' names to something like "V_ARG1_F64x2" etc. For the record, Julian Seward commented the following in IRC: * Regarding the fused multiply-add/subs: "I think the *right* fix here is to create new Iops, Iop_MAddF64x2 and IOpMSubF64x2 and use those instead (see libvex_ir.h, Iop_MAddF64 for description) or (ugly slow hack) in s390_vector_fp_mulAddOrSub_operation, for the vector case, split up the each operand into 2 64-bit scalars, use the existing scalar FMA operations, and reconstruct the vector (it will generate worse code, but it is less effort to implement because you don't need to change any code-generator stuff)." * About formal aspects: "...also (1) please get it inside 80 cols [as you requested] and (2) I think it would be clearer to remove the LIKELY/UNLIKELY hints in the new code in guest_s390_toIR.c." Created attachment 116468 [details] Vector floating point support (v2) Revised version of vector FP support. This should address all the feedback from comment #7 and comment #9. The fused vector multiply-add/sub is now implemented with two scalar multiply-add/sub operations. Created attachment 116469 [details] Tests for vector FP support (v2) Revised version of the tests for the vector FP support. This version addresses the feedback from comment #8 and also makes some internal updates that had been missing before. (In reply to Andreas Arnez from comment #10) > Created attachment 116468 [details] > Vector floating point support (v2) > > Revised version of vector FP support. This should address all the feedback > from comment #7 and comment #9. The fused vector multiply-add/sub is now > implemented with two scalar multiply-add/sub operations. Nicely done; looks pretty clean. OK to land providing you fix the (very minor) things below. diff --git a/VEX/priv/guest_s390_defs.h b/VEX/priv/guest_s390_defs.h index 3bfecbe31..d72cc9f6d 100644 --- a/VEX/priv/guest_s390_defs.h +++ b/VEX/priv/guest_s390_defs.h + struct { + unsigned int op1 : 8; + unsigned int v1 : 4; + unsigned int v2 : 4; etc For new structs like this, I'd prefer you use the house type "UInt" rather than "unsigned int". +/* Check if "Single-Element-Control" bit is set. s/if/if the +static const HChar * +s390_irgen_VCDG(UChar v1, UChar v2, UChar m3, UChar m4, UChar m5) +{ + vassert(m3 == 3); (and the same assertion in various functions below). The assertion really checks that there is some logic error in the s390 front end, right? In particular you're sure that it won't fail even in the case of unknown/undecodable instructions? @@ -19357,6 +20039,18 @@ s390_decode_6byte_and_irgen(const UChar *bytes) unsigned int rxb : 4; unsigned int op2 : 8; } VRR; + struct { + unsigned int op1 : 8; Same comment as above re UInt. (In reply to Andreas Arnez from comment #11) > Created attachment 116469 [details] > Tests for vector FP support (v2) > > Revised version of the tests for the vector FP support. This version > addresses the feedback from comment #8 and also makes some internal updates > that had been missing before. This looks fine to me. I would suggest running it by hand on Memcheck just to check that there's no use of uninitialised memory, etc, and also to check that the instrumentation code that Memcheck generates for these operations is actually handled by the s390 back end. OK to commit once that's done. More generally, it would be advisable also to run, by hand, all of the none/tests/s390 tests on Memcheck, each four times, for the 4 flag combinations --track-origins={yes,no} --expensive-definedness-checks={yes,no}. This has shown up failures on other targets in the past. (In reply to Andreas Arnez from comment #9) > For the record, Julian Seward commented the following in IRC: > > * Regarding the fused multiply-add/subs: > > "I think the *right* fix here is to create new Iops, Iop_MAddF64x2 and > IOpMSubF64x2 and use those instead (see libvex_ir.h, Iop_MAddF64 for > description) or (ugly slow hack) in s390_vector_fp_mulAddOrSub_operation, > for the vector case, split up the each operand into 2 64-bit scalars, use > the existing scalar FMA operations, and reconstruct the vector (it will > generate worse code, but it is less effort to implement because you don't > need to change any code-generator stuff)." Regarding the split/scalar-op/reconstruct strategy. That's probably not too bad for dealing with 64 bit lanes in a 128 bit vector. But as the vector width grows (next year, 256 bit vectors in s390, maybe?) and the lane size gets smaller (F32 ops, I16 ops) etc, this begins to generate terribly slow and verbose code compared to the "right" fix. I mention this because, in the amd64 (x86_64) front end, this problem has now become extreme -- eg, a 256 vector instruction comprising 8 x (convert i32 to f32) becomes easily 200 instructions at the back end. So investing in the "right" fix might be necessary at some point. (In reply to Julian Seward from comment #12) > (In reply to Andreas Arnez from comment #10) > [...] > + unsigned int v2 : 4; > etc > > For new structs like this, I'd prefer you use the house type "UInt" rather > than "unsigned int". Sure, I can change this. Just note that there are many such bit fields in the s390-specific code already, and they generally seem to be declared as "unsigned int". This particularly applies to the code surrounding the added structs, which means that their different style will stick out. Or should we better create a separate patch for changing all such occurrences in the s390-specific code at once? > +/* Check if "Single-Element-Control" bit is set. > s/if/if the OK. > +static const HChar * > +s390_irgen_VCDG(UChar v1, UChar v2, UChar m3, UChar m4, UChar m5) > +{ > + vassert(m3 == 3); > (and the same assertion in various functions below). The assertion > really checks that there is some logic error in the s390 front end, > right? In particular you're sure that it won't fail even in the > case of unknown/undecodable instructions? No. I stumbled upon that as well, but honestly didn't pay much attention to it. Unfortunately the current code structure in guest_s390_toIR.c doesn't provide a clean way of reporting a decoding error up through the call chain. See my patch below for a possible solution. > @@ -19357,6 +20039,18 @@ s390_decode_6byte_and_irgen(const UChar *bytes) > unsigned int rxb : 4; > unsigned int op2 : 8; > } VRR; > + struct { > + unsigned int op1 : 8; > > Same comment as above re UInt. The same as above applies. Created attachment 116556 [details]
Change unsigned int to UInt
This is a delta-patch that changes the bit fields members from unsigned int to UInt. This patch only performs this change for the structures added with the introduction of vector FP support and is meant to be merged with the base patch (if desired).
Created attachment 116557 [details] Address feedback for vector FP patch Another delta-patch that addresses the remaining aspects from comment #12. The vassert() invocations that actually check for specification exceptions are replaced by invocations to a new macro, s390_insn_assert(). If a specification exception is recognized, Valgrind will now emit a message like this: vex s390->IR: specification exception: E720 2000 20E3 ==158317== valgrind: Unrecognised instruction at address 0x10004c2. This patch is also meant to be merged with the base vector FP support before committing. Pushed after approval from Julian on IRC yesterday -- "(03:05:09 PM) sewardj: so, yes, land all the 385411 patches" -- as the following commits: 600a0099a Bug 385411 s390x: Add z13 vector floating point support 86bd88945 Bug 385411 s390x: Tests and internals for z13 vector FP support Where the patches from comment 16 and comment 17 have been merged into the former. |