Bug 276784

Summary: Add support for IBM Power ISA 2.06 -- stage 2
Product: [Developer tools] valgrind Reporter: Maynard Johnson <maynardj>
Component: generalAssignee: Julian Seward <jseward>
Status: RESOLVED FIXED    
Severity: normal    
Priority: NOR    
Version First Reported In: 3.7 SVN   
Target Milestone: ---   
Platform: Unlisted Binaries   
OS: Linux   
Latest Commit: Version Fixed/Implemented In:
Sentry Crash Report:
Attachments: Patch to add support for additional POWER7 instructions
Testcase patch
Version 2 of stage 2 patch for Power ISA 2.06
Version 2 of testcase patch for stage 2 Power ISA 2.06

Description Maynard Johnson 2011-06-29 21:32:12 UTC
The purpose of this bug is to add support for another subset of the new instructions in IBM Power ISA 2.06 (i.e., IBM POWER7 processor).  The patch attached to this bug is stage 2 of a set of patches that will be submitted over time.

Many of the new instructions supported by this patch are those which are generated by gcc when the LAMP stack packages are compiled with '-mcpu=power7'.  Thus, with this patch added to Valgrind, all LAMP stack binaries built with this option are now able to be run under Valgrind on a POWER7 system.
Comment 1 Maynard Johnson 2011-06-29 21:34:12 UTC
Created attachment 61459 [details]
Patch to add support for additional POWER7 instructions

This is the "code" patch to add support in VEX for new POWER7 instructions.
Comment 2 Maynard Johnson 2011-06-29 21:38:56 UTC
Created attachment 61460 [details]
Testcase patch

This patch adds a new testcase in none/tests/ppc{32|64} to test the stage 2 POWER7 support patch.
Comment 3 Maynard Johnson 2011-06-29 21:57:23 UTC
Julian,
I needed to add three new Iop's for this new POWER7 support: Iop_DivS64E, Iop_DivU32E, and Iop_F64toI64U.  I presume you won't have an objection to the Iop_F64toI64U, since it complements the other F64-to-x Iop's already there.  But I suspect you won't be too happy about the new Iop_Div* stuff.  These are for the "divide extended" operations in ISA 2.06.  These were done early on in the development of the patch (back in March, I think), and to be honest, I can't remember why I saw the need for the new Iop's.

I had already created this bug and attached the patches and then remembered to run the new testcase under memcheck.  Unfortunately, it fails under memcheck because I hadn't yet added support for Iop_DivU32E.  I'll wait to get your initial feedback before I do anything to fix the memcheck failure.  Thanks.
Comment 4 Maynard Johnson 2011-06-30 14:02:23 UTC
(In reply to comment #3)
> Julian,
> I needed to add three new Iop's for this new POWER7 support: Iop_DivS64E,
> Iop_DivU32E, and Iop_F64toI64U.  I presume you won't have an objection to the
> Iop_F64toI64U, since it complements the other F64-to-x Iop's already there. 
> But I suspect you won't be too happy about the new Iop_Div* stuff.  These are
> for the "divide extended" operations in ISA 2.06.  These were done early on in
> the development of the patch (back in March, I think), and to be honest, I
> can't remember why I saw the need for the new Iop's.

I refreshed my memory on why these new "Div" ops were necessary.

For divde* instructions (64-bit category), the hardware takes the 64-bit dividend passed in RA and concatenates it with 64 0's that provides the bottom half of a 128-bit dividend.  The divisor is 64-bit (passed in RB).

Similarly, for the divwe* instructions, the dividend is 32-bits from RA (if 64-bit mode, the lower 32 bits of RA are used), and then the hardware concatenates that with 32 0's that provides the bottom half of a 64-bit dividend. The divisor is 32-bit (passed RB).

I did not see a straightforward way to emulate this behavior with existing Iop's, which is why I created the new Iop's and implemented these div instructions natively.

> 
> I had already created this bug and attached the patches and then remembered to
> run the new testcase under memcheck.  Unfortunately, it fails under memcheck
> because I hadn't yet added support for Iop_DivU32E.  I'll wait to get your
> initial feedback before I do anything to fix the memcheck failure.  Thanks.
Comment 5 Julian Seward 2011-07-01 13:23:54 UTC
> For divde* instructions (64-bit category), the hardware takes the 64-bit
> dividend passed in RA and concatenates it with 64 0's that provides the bottom
> half of a 128-bit dividend.  The divisor is 64-bit (passed in RB).

IIUC, the meaning is

   DivU32E(x:I32, y:I32) = (x ++ 0--(32)--0) / y

and similarly for the 64-bit version.

I can't see an easy way to construct the same behaviour from the
existing IROps, true.  But I would like to know, what overflow 
behaviour do you intend?  It's clear that there will be an
overflowed result in any situation where x >= y.
Comment 6 Maynard Johnson 2011-07-01 14:08:00 UTC
(In reply to comment #5)
> > For divde* instructions (64-bit category), the hardware takes the 64-bit
> > dividend passed in RA and concatenates it with 64 0's that provides the bottom
> > half of a 128-bit dividend.  The divisor is 64-bit (passed in RB).
> 
> IIUC, the meaning is
> 
>    DivU32E(x:I32, y:I32) = (x ++ 0--(32)--0) / y
> 
> and similarly for the 64-bit version.
Yes, that's correct.
> 
> I can't see an easy way to construct the same behaviour from the
> existing IROps, true.  But I would like to know, what overflow 
> behaviour do you intend?  It's clear that there will be an
> overflowed result in any situation where x >= y.
Yes, the ISA 2.06 doc says the following:

-----------------------
If (RA) . (RB), or if an attempt is made to perform the division
        <anything> / 0
then the contents of register RT are undefined as are (if Rc=1) the contents of the LT, GT, and EQ bits of CR Field 0. In these cases, if OE=1 then OV is set to 1.
-----------------------

So, after setting up the Iop_DivU32E (in guest_ppc_toIR.c), I check if the OE flag was set in the instruction; if so, I call set_XER_OV_32 with PPCG_FLAG_OP_DIVWEU so that the OV flag is set only when the above conditions are met.
Comment 7 Maynard Johnson 2011-07-01 14:10:04 UTC
(In reply to comment #6)
> (In reply to comment #5)
> > > For divde* instructions (64-bit category), the hardware takes the 64-bit
> > > dividend passed in RA and concatenates it with 64 0's that provides the bottom
> > > half of a 128-bit dividend.  The divisor is 64-bit (passed in RB).
> > 
> > IIUC, the meaning is
> > 
> >    DivU32E(x:I32, y:I32) = (x ++ 0--(32)--0) / y
> > 
> > and similarly for the 64-bit version.
> Yes, that's correct.
> > 
> > I can't see an easy way to construct the same behaviour from the
> > existing IROps, true.  But I would like to know, what overflow 
> > behaviour do you intend?  It's clear that there will be an
> > overflowed result in any situation where x >= y.
> Yes, the ISA 2.06 doc says the following:
> 
> -----------------------
> If (RA) . (RB), or if an attempt is made to perform the division
Darn . . . copy/paste from the ISA pdf didn't work so well.  The above is supposed to be:
  "If (RA)>=(RB)...."
>         <anything> / 0
> then the contents of register RT are undefined as are (if Rc=1) the contents of
> the LT, GT, and EQ bits of CR Field 0. In these cases, if OE=1 then OV is set
> to 1.
> -----------------------
> 
> So, after setting up the Iop_DivU32E (in guest_ppc_toIR.c), I check if the OE
> flag was set in the instruction; if so, I call set_XER_OV_32 with
> PPCG_FLAG_OP_DIVWEU so that the OV flag is set only when the above conditions
> are met.
Comment 8 Julian Seward 2011-07-05 10:23:36 UTC
OK for the new IR ops DivS64E and DivU32E, but as per previous
discussion please add a comment defining overflow and div-by-zero
behaviour.


+         /* There is no Iop_Div32Fx4, so there are two choices:
Not so .. it's at libvex_ir.h:985


This seems to massively	mash around the	existing FP infrastructure
(which isn't optimal; is there any way to partition it more cleanly?)                    


Does it cause any performance regressions?


+// This file should be kept in sync between none/tests/ppc32 and
none/tests/ppc64.
Don't duplicate	the file; instead make the ppc64 one be a symlink                        
to the ppc32 one.  Try (eg) ls -l none/tests/ppc64/jm-insns.c
It's OK	for the	expected results files to be separate, though;                           
the output could be different for the 32- vs 64-bit versions
in general.


Do all the new tests run memcheck-clean, in both 32- and 64-bit mode?
Comment 9 Maynard Johnson 2011-07-05 14:43:07 UTC
(In reply to comment #8)
> OK for the new IR ops DivS64E and DivU32E, but as per previous
> discussion please add a comment defining overflow and div-by-zero
> behaviour.
> 
> 
> +         /* There is no Iop_Div32Fx4, so there are two choices:
> Not so .. it's at libvex_ir.h:985

The comment is wrong . . . I intended to say that Iop_Div32Fx4 is not implemented for ppc64.  The Iop_{Add|Sub}32Fx4 ops were previously implemented (in host_ppc_isel.c and host_ppc_defs.c) for the add and subtract VMX instructions (e.g., vaddfp, vsubfp), but there was no corresponding divide VMX instruction, so Iop_Div32Fx4 was never implemented for ppc64.  I could have done so here, implementing the xvdivsp insn natively.  But as the rest of my comment said, I opted not to "due to the general philosophy of reusing existing implementations when practical."

> 
> 
> This seems to massively    mash around the    existing FP infrastructure
> (which isn't optimal; is there any way to partition it more cleanly?)           
> 
> Does it cause any performance regressions?

Sorry, I don't know what you're referring to here.  Is this comment in regard to the implementation for Iop_Div32Fx4?
> 
> 
> +// This file should be kept in sync between none/tests/ppc32 and
> none/tests/ppc64.
> Don't duplicate    the file; instead make the ppc64 one be a symlink            
> to the ppc32 one.  Try (eg) ls -l none/tests/ppc64/jm-insns.c
> It's OK    for the    expected results files to be separate, though;            
> the output could be different for the 32- vs 64-bit versions
> in general.

OK, I can do that.  It seems like there was a change in philosophy on this -- from symbolic links, to copies, and back to symbolic links, and I started my stage 1 POWER7 patch using valgrind 3.6.0 when copies were used (at least insofar as none/tests/ppc{32|64}/jm-insns.c).  But I see now that in current SVN, ppc64/jm-insns.c has gone back to a symbolic link.  BTW, my stage 1 POWER7 testcase for 64-bit (none/tests/ppc64/test_isa_2_06_part1.c) is identical to the 32-bit.  We should probably change that to a symbolic link, too.
> 
> 
> Do all the new tests run memcheck-clean, in both 32- and 64-bit mode?
No.  See paragraph 2 in comment #3.  Assuming you are OK with the three new Iops I introduced, I will respin the patch and add the support for the missing Iop_DivU32E to memcheck.  But I'll wait to respin until I get your feedback on my responses from the other issues you've raised.

Thanks a lot for the review effort so far.
Comment 10 Julian Seward 2011-07-06 06:45:07 UTC
(In reply to comment #9)
> > +         /* There is no Iop_Div32Fx4, so there are two choices:
> > Not so .. it's at libvex_ir.h:985
> 
> The comment is wrong . . . I intended to say that Iop_Div32Fx4 is not

Ok fine; but pls fix the comment then.


> > This seems to massively    mash around the    existing FP infrastructure
> > (which isn't optimal; is there any way to partition it more cleanly?)           
> > 
> > Does it cause any performance regressions?
> 
> Sorry, I don't know what you're referring to here.  Is this comment in regard
> to the implementation for Iop_Div32Fx4?

I mean for the existing FP implementation as a whole, not just for
Iop_Div32Fx4.

w.r.t. "partition it more cleanly", what I mean is, move the
instruction decoding for these new instructions into their own
function, rather than have it intermingled with the existing
instruction decoder.  eg, in guest_arm_toIR.c, see
decode_V6MEDIA_instruction(), decode_CP10_CP11_instruction(),
decode_NEON_instruction(): these 3 instruction set extensions are
handled by separate functions.


> OK, I can do that.  It seems like there was a change in philosophy
> on this --

Hmm ok, I didn't know that.  Symlinks for large duplicated test source
files seem like a better approach.

> SVN, ppc64/jm-insns.c has gone back to a symbolic link.  BTW, my stage 1 POWER7
> testcase for 64-bit (none/tests/ppc64/test_isa_2_06_part1.c) is identical to
> the 32-bit.  We should probably change that to a symbolic link, too.

Ok .. maybe file a new bug for that.


> > Do all the new tests run memcheck-clean, in both 32- and 64-bit mode?
> No.  See paragraph 2 in comment #3.  Assuming you are OK with the three new
> Iops I introduced, I will respin the patch and add the support for the missing
> Iop_DivU32E to memcheck.

By "memcheck-clean" I didn't mean "they don't assert on memcheck".  I
meant, does memcheck report any errors in them?
Comment 11 Maynard Johnson 2011-07-06 14:00:40 UTC
(In reply to comment #10)
> (In reply to comment #9)
> > > +         /* There is no Iop_Div32Fx4, so there are two choices:
> > > Not so .. it's at libvex_ir.h:985
> > 
> > The comment is wrong . . . I intended to say that Iop_Div32Fx4 is not
> 
> Ok fine; but pls fix the comment then.
Will do.
> 
> 
> > > This seems to massively    mash around the    existing FP infrastructure
> > > (which isn't optimal; is there any way to partition it more cleanly?)           
> > > 
> > > Does it cause any performance regressions?
> > 
> > Sorry, I don't know what you're referring to here.  Is this comment in regard
> > to the implementation for Iop_Div32Fx4?
> 
> I mean for the existing FP implementation as a whole, not just for
> Iop_Div32Fx4.
> 
> w.r.t. "partition it more cleanly", what I mean is, move the
> instruction decoding for these new instructions into their own
> function, rather than have it intermingled with the existing
> instruction decoder.  eg, in guest_arm_toIR.c, see
> decode_V6MEDIA_instruction(), decode_CP10_CP11_instruction(),
> decode_NEON_instruction(): these 3 instruction set extensions are
> handled by separate functions.
OK.  I'll make sure not to mix basic FP and VSX FP in the same decoding function.
> 
> 
> > OK, I can do that.  It seems like there was a change in philosophy
> > on this --
> 
> Hmm ok, I didn't know that.  Symlinks for large duplicated test source
> files seem like a better approach.
> 
> > SVN, ppc64/jm-insns.c has gone back to a symbolic link.  BTW, my stage 1 POWER7
> > testcase for 64-bit (none/tests/ppc64/test_isa_2_06_part1.c) is identical to
> > the 32-bit.  We should probably change that to a symbolic link, too.
> 
> Ok .. maybe file a new bug for that.
I opened bug # 277199 for this issue.
> 
> 
> > > Do all the new tests run memcheck-clean, in both 32- and 64-bit mode?
> > No.  See paragraph 2 in comment #3.  Assuming you are OK with the three new
> > Iops I introduced, I will respin the patch and add the support for the missing
> > Iop_DivU32E to memcheck.
> 
> By "memcheck-clean" I didn't mean "they don't assert on memcheck".  I
> meant, does memcheck report any errors in them?
I'll have to finish updating memcheck so it doesn't assert on the Iop_DivU32E before I can answer.  But I'll be sure memcheck runs clean before attaching the updated patch.

I think I have enough feedback from you (at least for this round) that I can go ahead and update and re-submit the patch.  Assuming no major distractions, I should be able to get that done before the end of the week.  Thanks.
Comment 12 Maynard Johnson 2011-07-06 17:04:18 UTC
(In reply to comment #11)
> (In reply to comment #10)
> > (In reply to comment #9)
[snip]
> > > > This seems to massively    mash around the    existing FP infrastructure
> > > > (which isn't optimal; is there any way to partition it more cleanly?)           
> > > > 
> > > > Does it cause any performance regressions?
> > > 
> > > Sorry, I don't know what you're referring to here.  Is this comment in regard
> > > to the implementation for Iop_Div32Fx4?
> > 
> > I mean for the existing FP implementation as a whole, not just for
> > Iop_Div32Fx4.
> > 
> > w.r.t. "partition it more cleanly", what I mean is, move the
> > instruction decoding for these new instructions into their own
> > function, rather than have it intermingled with the existing
> > instruction decoder.  eg, in guest_arm_toIR.c, see
> > decode_V6MEDIA_instruction(), decode_CP10_CP11_instruction(),
> > decode_NEON_instruction(): these 3 instruction set extensions are
> > handled by separate functions.
> OK.  I'll make sure not to mix basic FP and VSX FP in the same decoding
> function.
hmmm . . . I should have looked before I replied here . . .  All of the decoding for new VSX instructions is being done in new functions with names beginning with "dis_vx", so I'm not mixing any pure floating point ops with VSX floating point ops.  I even separate out the scalar from the vector ops of the same type (e.g. xsadddp is decoded in dis_vxs_arith and xvadddp is decoded in dis_vxv_dp_arith).  So if there's some additional partitioning you want, please let me know.
[snip]
Comment 13 Maynard Johnson 2011-07-07 17:44:52 UTC
Created attachment 61684 [details]
Version 2 of stage 2 patch for Power ISA 2.06

This version of the stage 2 patch for Power ISA 2.06 support addresses all of the review comments so far, with the exception of the floating point partitioning issue.  See comment #12.  If you see a need for further partitioning, please let me know.
Comment 14 Maynard Johnson 2011-07-07 17:49:28 UTC
Created attachment 61687 [details]
Version 2 of testcase patch for stage 2 Power ISA 2.06

This version of the testcase patch addresses the review comments so far.  In particular, the file none/tests/ppc64/test_isa_2_06_part2.c file (which was identical to the same-named file in none/tests/ppc32) has been removed.  When this patch is applied to upstream SVN, a symbolic link should be added in none/tests/ppc64 to the none/tests/ppc32/test_isa_2_06_part2.c file.

This testcase runs clean under memcheck.
Comment 15 Julian Seward 2011-07-24 14:26:25 UTC
Committed, r2184/11907/11908.
Comment 16 Maynard Johnson 2011-07-25 13:43:21 UTC
I did a checkout from svn after the code was committed and everything worked perfectly on my POWER7 system.  Thank you very much!