385411 – s390x: z13 vector floating-point instructions not implemented

Bug 385411 - s390x: z13 vector floating-point instructions not implemented

Summary: s390x: z13 vector floating-point instructions not implemented

Status:	RESOLVED FIXED

Alias:	None

Product:	valgrind
Classification:	Developer tools
Component:	vex (show other bugs)
Version:	unspecified
Platform:	Other Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	Julian Seward

URL:
Keywords:

Depends on:
Blocks:

Reported:	2017-10-05 17:59 UTC by Andreas Arnez
Modified:	2018-11-30 14:18 UTC (History)
CC List:	2 users (show)

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:

Attachments
z13 vector floating point support (tests) (88.72 KB, patch) 2018-10-05 11:35 UTC, Vadim Barkov	Details
z13 vector floating point support (code) (56.81 KB, patch) 2018-10-05 11:36 UTC, Vadim Barkov	Details
Vector floating point support (v2) (57.21 KB, patch) 2018-11-22 12:11 UTC, Andreas Arnez	Details
Tests for vector FP support (v2) (133.78 KB, patch) 2018-11-22 12:16 UTC, Andreas Arnez	Details
Change unsigned int to UInt (3.35 KB, patch) 2018-11-28 19:57 UTC, Andreas Arnez	Details
Address feedback for vector FP patch (9.52 KB, patch) 2018-11-28 20:02 UTC, Andreas Arnez	Details
Show Obsolete (2) View All Add an attachment

Note You need to log in before you can comment on or make changes to this bug.

Description Andreas Arnez 2017-10-05 17:59:59 UTC

Valgrind currently lacks support for the z/Architecture vector floating-point instructions introduced with z13.  These are documented in the z/Architecture Principles of Operation, Eleventh Edition (March, 2015), chapter 24: "Vector Floating-Point Instructions".

Comment 1 Vadim Barkov 2018-10-03 02:58:04 UTC

I've started to work on it. (see https://github.com/barkovv/valgrind/issues/1 for code and status)

Comment 2 Vadim Barkov 2018-10-05 11:35:49 UTC

Created attachment 115426 [details]
z13 vector floating point support (tests)

Comment 3 Vadim Barkov 2018-10-05 11:36:34 UTC

Created attachment 115427 [details]
z13 vector floating point support (code)

Please take a look

Comment 4 Andreas Arnez 2018-10-12 17:49:52 UTC

(In reply to Vadim Barkov from comment #3)
> Created attachment 115427 [details]
> z13 vector floating point support (code)
> 
> Please take a look
Thanks, this looks pretty good already.  One thing I'd like to note is that the patch contains many lines longer than 80 characters.  Please avoid that.

Apart from that, I'm currently testing the patch with various binaries, with mostly pleasing results.  I have encountered one potential problem, which may or may not be caused by your changes.  I'll look further into it and let you know when I know more.

Comment 5 Andreas Arnez 2018-10-30 18:58:57 UTC

(In reply to Andreas Arnez from comment #4)
> [...]  I have encountered one potential problem, which
> may or may not be caused by your changes.  I'll look further into it and let
> you know when I know more.
OK, I found two problems and documented them in Bugs 400490 and 400491.  Both of these bugs were already contained in the previous z13 patches.  I also attached fixes, with which I can now execute the previously misbehaving vector-FP binaries without any problems.

Comment 6 Vadim Barkov 2018-11-04 06:38:38 UTC

(In reply to Andreas Arnez from comment #5)

Good news! Is any additional work needed on this issue?

Comment 7 Andreas Arnez 2018-11-06 19:29:03 UTC

(In reply to Vadim Barkov from comment #6)
> Good news! Is any additional work needed on this issue?
As I said before, long lines should be avoided.  And there are a few other nits in the implementation (attachment 115427 [details]):

> Subject: [PATCH 2/2] Vector floating point implementation (code)
> [...]
> --- a/VEX/priv/guest_s390_toIR.c
> +++ b/VEX/priv/guest_s390_toIR.c
> [...]
>     return "vmalh";
>  }
>  
> +static void
> +s390_vector_fp_from_or_to_operation(IROp op, IRType fromType, IRType toTy…
Very long line; please split appropriately.  It may also help to shorten
the function name to something like "s390_vector_fp_convert".  I think
this would improve readability of the function's invocations as well.

Source lines in Valgrind should generally not exceed 80 columns.  You can
find long lines in your patch with a grep command like this:

  grep -nH '^+ .\{81\}' my.patch

> [...]
> +      for Iop_F64toF32 we do this:
> +      f64[0] -> f32[0]
> +      f64[1] -> f32[2]
> +
> +      The magic below with scaling factors is used to achive the logic de…
Replace "achive" by "achieve".

> [...]
> +   vassert(m3 == 3);
> +
> +   IRExpr* result;
> +   switch (m5) {
> +   case 0: { 
Space at end of line.

> [...]
> +                                get_vr_qw(v3));
> +      result = triop(addOrSub,
> +                     irrm,
> +                     mulResult,
> +                     get_vr_qw(v4));
You should probably add a comment here, explaining that the separate
multiply and add/sub operations are combined to a fused multiply-add/sub
by s390_isel_vec_expr_wrk() in host_s390_isel.c.

> [...]
> +   if (!s390_vr_is_cs_set(m6)) {
> +      if (LIKELY(!isSingleElementOp)) {
> +         put_vr_qw(v1, binop(Iop_CmpEQ64Fx2, get_vr_qw(v2), get_vr_qw(v3)…
> +      } else {
> +         IRExpr* comparationResult = binop(Iop_CmpF64, get_vr(v2, Ity_F64…
Replace "comparation" by "comparison".  Same in s390_irgen_VFCH and
s390_irgen_VFCHE.

> [...]
> --- a/VEX/priv/host_s390_isel.c
> +++ b/VEX/priv/host_s390_isel.c
> [...]
>  
> +      Iop_irrm_V_wrk: {
> +         set_bfp_rounding_mode_in_fpc(env, arg1);
> +         reg1 = s390_isel_vec_expr(env, arg2);
> +        
Spaces at end of line.

> [...]
> +      case Iop_Add64Fx2:
> +         size = 8;
> +
> +         /* Add64Fx2(Mul64Fx2(arg1, arg2), arg3) -> MAdd(arg1, arg2, arg3…
> +         if (UNLIKELY((arg2->tag == Iex_Triop) 
Space at end of line.

> +                      && (arg2->Iex.Triop.details->op == Iop_Mul64Fx2)
> +                      && (arg1 == arg2->Iex.Triop.details->arg1))
> +             ) {
> +            vec_op = S390_VEC_FLOAT_MADD;
> +            goto Iop_irrm_MAddOrSub;
OK, so this is where you combine separate multiply and add operations to a
fused multiply-add.  Is it guaranteed that this logic triggers whenever
the original instruction was a multiply-add, and only then?  We must not
do this if the program executes the instructions separately, because it
would change the result.

> [...]
> +      case Iop_Sub64Fx2:
> +         size = 8;
> +
> +         /* Sub64Fx2(Mul64Fx2(arg1, arg2), arg3) -> MSub(arg1, arg2, arg3…
> +         if (UNLIKELY((arg2->tag == Iex_Triop) 
Space at end of line.

Comment 8 Andreas Arnez 2018-11-08 12:09:26 UTC

(In reply to Vadim Barkov from comment #2)
> Created attachment 115426 [details]
> z13 vector floating point support (tests)
Here are some more comments about the test cases:

> [...]
> --- a/none/tests/s390x/vector.h
> +++ b/none/tests/s390x/vector.h
> [...]
> @@ -38,22 +42,57 @@ typedef union {
> [...]
> +void print_f32(const V128 value) {
> +   printf("%.5e | %.5e | %.5e | %.5e\n", value.f32[0], value.f32[1], valu…
Why don't we show all significant digits?  This may be relevant when
checking the effect of the various rounding modes.  Thus I'd prefer to use
the "%a" format here instead of "%.5e".  Also note that this function is
currently unused; is this intentional?

> [...]
> +void print_f64(const V128 value) {
> +   printf("%.9e | %.9e\n", value.f64[0], value.f64[1]);
Same here, better use "%a" instead of "%.9e".

> [...]
> --- /dev/null
> +++ b/none/tests/s390x/vector_float.c
> @@ -0,0 +1,174 @@
> [...]
> +int main() {
> [...]
> +   s390_call_float_test(cdlgb, (V128_V_RES_AS_FLOAT64 | V128_V_ARG1_AS_IN…
> +   s390_call_float_test(cgdb, (V128_V_RES_AS_INT | V128_V_ARG1_AS_FLOAT64…
> +   s390_call_float_test(clgdb, V128_V_RES_AS_INT | V128_V_ARG1_AS_FLOAT64…
> +   s390_call_float_test(fidb, V128_V_RES_AS_FLOAT64 | V128_V_ARG1_AS_FLOA…
> +   s390_call_float_test(ledb, V128_V_RES_AS_FLOAT64 | V128_V_ARG1_AS_FLOA…
For vledb, the result is in fact a pair of FLOAT32 values.  But switching
to V128_V_RES_AS_FLOAT32 is not quite right either, because the odd
elements in the result are "unpredictable".  Your implementation of VFLR
fills the odd elements with 0xff, whereas the hardware typically fills
them with zeroes.  Thus the test fails with a diff in stdout when
displaying floating-point values with "%a" as suggested above.  To fix
this, the test should probably just skip printing the odd elements.  Maybe
add another flag to indicate this, something like V128_V_RES_EVEN_ONLY.

> [...]
> +
> +   test_with_selective_printing(vfmadb, (V128_V_RES_AS_FLOAT64 | V128_V_A…
> +   test_with_selective_printing(wfmadb, (V128_V_RES_AS_FLOAT64 | V128_V_A…
> +   test_with_selective_printing(vfmsdb, (V128_V_RES_AS_FLOAT64 | V128_V_A…
> +   test_with_selective_printing(wfmsdb, (V128_V_RES_AS_FLOAT64 | V128_V_A…
Very long lines (140 chars); please split.  Or maybe you can pass the
flags in a variable.  It may also help to shorten the flags' names to
something like "V_ARG1_F64x2" etc.

Comment 9 Andreas Arnez 2018-11-14 18:56:52 UTC

For the record, Julian Seward commented the following in IRC:

* Regarding the fused multiply-add/subs:

"I think the *right* fix here is to create new Iops, Iop_MAddF64x2 and
IOpMSubF64x2 and use those instead (see libvex_ir.h, Iop_MAddF64 for
description) or (ugly slow hack) in s390_vector_fp_mulAddOrSub_operation,
for the vector case, split up the each operand into 2 64-bit scalars, use
the existing scalar FMA operations, and reconstruct the vector (it will
generate worse code, but it is less effort to implement because you don't
need to change any code-generator stuff)."

* About formal aspects:

"...also (1) please get it inside 80 cols [as you requested] and (2) I think
it would be clearer to remove the LIKELY/UNLIKELY hints in the new code in
guest_s390_toIR.c."

Comment 10 Andreas Arnez 2018-11-22 12:11:06 UTC

Created attachment 116468 [details]
Vector floating point support (v2)

Revised version of vector FP support.  This should address all the feedback from comment #7 and comment #9.  The fused vector multiply-add/sub is now implemented with two scalar multiply-add/sub operations.

Comment 11 Andreas Arnez 2018-11-22 12:16:29 UTC

Created attachment 116469 [details]
Tests for vector FP support (v2)

Revised version of the tests for the vector FP support.  This version addresses the feedback from comment #8 and also makes some internal updates that had been missing before.

Comment 12 Julian Seward 2018-11-28 07:30:22 UTC

(In reply to Andreas Arnez from comment #10)
> Created attachment 116468 [details]
> Vector floating point support (v2)
> 
> Revised version of vector FP support.  This should address all the feedback
> from comment #7 and comment #9.  The fused vector multiply-add/sub is now
> implemented with two scalar multiply-add/sub operations.

Nicely done; looks pretty clean.  OK to land providing you fix the (very minor)
things below.


diff --git a/VEX/priv/guest_s390_defs.h b/VEX/priv/guest_s390_defs.h
index 3bfecbe31..d72cc9f6d 100644
--- a/VEX/priv/guest_s390_defs.h
+++ b/VEX/priv/guest_s390_defs.h
+      struct {
+         unsigned int op1 : 8;
+         unsigned int v1  : 4;
+         unsigned int v2  : 4;
etc

For new structs like this, I'd prefer you use the house type "UInt" rather
than "unsigned int".


+/* Check if "Single-Element-Control" bit is set.
s/if/if the


+static const HChar *
+s390_irgen_VCDG(UChar v1, UChar v2, UChar m3, UChar m4, UChar m5)
+{
+   vassert(m3 == 3);
(and the same assertion in various functions below).  The assertion
really checks that there is some logic error in the s390 front end,
right?  In particular you're sure that it won't fail even in the
case of unknown/undecodable instructions?


@@ -19357,6 +20039,18 @@ s390_decode_6byte_and_irgen(const UChar *bytes)
         unsigned int rxb : 4;
         unsigned int op2 : 8;
       } VRR;
+      struct {
+         unsigned int op1 : 8;

Same comment as above re UInt.

Comment 13 Julian Seward 2018-11-28 07:43:10 UTC

(In reply to Andreas Arnez from comment #11)
> Created attachment 116469 [details]
> Tests for vector FP support (v2)
> 
> Revised version of the tests for the vector FP support.  This version
> addresses the feedback from comment #8 and also makes some internal updates
> that had been missing before.

This looks fine to me.  I would suggest running it by hand on Memcheck just to
check that there's no use of uninitialised memory, etc, and also to check that
the instrumentation code that Memcheck generates for these operations is
actually handled by the s390 back end.  OK to commit once that's done.

More generally, it would be advisable also to run, by hand, all of the
none/tests/s390 tests on Memcheck, each four times, for the 4 flag
combinations --track-origins={yes,no} --expensive-definedness-checks={yes,no}.
This has shown up failures on other targets in the past.

Comment 14 Julian Seward 2018-11-28 07:48:38 UTC

(In reply to Andreas Arnez from comment #9)
> For the record, Julian Seward commented the following in IRC:
> 
> * Regarding the fused multiply-add/subs:
> 
> "I think the *right* fix here is to create new Iops, Iop_MAddF64x2 and
> IOpMSubF64x2 and use those instead (see libvex_ir.h, Iop_MAddF64 for
> description) or (ugly slow hack) in s390_vector_fp_mulAddOrSub_operation,
> for the vector case, split up the each operand into 2 64-bit scalars, use
> the existing scalar FMA operations, and reconstruct the vector (it will
> generate worse code, but it is less effort to implement because you don't
> need to change any code-generator stuff)."

Regarding the split/scalar-op/reconstruct strategy.  That's probably not
too bad for dealing with 64 bit lanes in a 128 bit vector.  But as the
vector width grows (next year, 256 bit vectors in s390, maybe?) and the
lane size gets smaller (F32 ops, I16 ops) etc, this begins to generate
terribly slow and verbose code compared to the "right" fix.  I mention
this because, in the amd64 (x86_64) front end, this problem has now become
extreme -- eg, a 256 vector instruction comprising 8 x (convert i32 to f32)
becomes easily 200 instructions at the back end.

So investing in the "right" fix might be necessary at some point.

Comment 15 Andreas Arnez 2018-11-28 19:52:30 UTC

(In reply to Julian Seward from comment #12)
> (In reply to Andreas Arnez from comment #10)
> [...]
> +         unsigned int v2  : 4;
> etc
> 
> For new structs like this, I'd prefer you use the house type "UInt" rather
> than "unsigned int".
Sure, I can change this.  Just note that there are many such bit fields in the s390-specific code already, and they generally seem to be declared as "unsigned int".  This particularly applies to the code surrounding the added structs, which means that their different style will stick out.  Or should we better create a separate patch for changing all such occurrences in the s390-specific code at once?

> +/* Check if "Single-Element-Control" bit is set.
> s/if/if the
OK.

> +static const HChar *
> +s390_irgen_VCDG(UChar v1, UChar v2, UChar m3, UChar m4, UChar m5)
> +{
> +   vassert(m3 == 3);
> (and the same assertion in various functions below).  The assertion
> really checks that there is some logic error in the s390 front end,
> right?  In particular you're sure that it won't fail even in the
> case of unknown/undecodable instructions?
No.  I stumbled upon that as well, but honestly didn't pay much attention to it.  Unfortunately the current code structure in guest_s390_toIR.c doesn't provide a clean way of reporting a decoding error up through the call chain.  See my patch below for a possible solution.

> @@ -19357,6 +20039,18 @@ s390_decode_6byte_and_irgen(const UChar *bytes)
>          unsigned int rxb : 4;
>          unsigned int op2 : 8;
>        } VRR;
> +      struct {
> +         unsigned int op1 : 8;
> 
> Same comment as above re UInt.
The same as above applies.

Comment 16 Andreas Arnez 2018-11-28 19:57:22 UTC

Created attachment 116556 [details]
Change unsigned int to UInt

This is a delta-patch that changes the bit fields members from unsigned int to UInt.  This patch only performs this change for the structures added with the introduction of vector FP support and is meant to be merged with the base patch (if desired).

Comment 17 Andreas Arnez 2018-11-28 20:02:45 UTC

Created attachment 116557 [details]
Address feedback for vector FP patch

Another delta-patch that addresses the remaining aspects from comment #12.  The vassert() invocations that actually check for specification exceptions are replaced by invocations to a new macro, s390_insn_assert().  If a specification exception is recognized, Valgrind will now emit a message like this:
  vex s390->IR: specification exception: E720 2000 20E3
  ==158317== valgrind: Unrecognised instruction at address 0x10004c2.
This patch is also meant to be merged with the base vector FP support before committing.

Comment 18 Andreas Arnez 2018-11-30 14:18:13 UTC

Pushed after approval from Julian on IRC yesterday --
  "(03:05:09 PM) sewardj: so, yes, land all the 385411 patches"
-- as the following commits:

  600a0099a Bug 385411 s390x: Add z13 vector floating point support
  86bd88945 Bug 385411 s390x: Tests and internals for z13 vector FP support

Where the patches from comment 16 and comment 17 have been merged into the former.