| Summary: | Add support for IBM Power ISA 2.06 -- stage 2 | ||
|---|---|---|---|
| Product: | [Developer tools] valgrind | Reporter: | Maynard Johnson <maynardj> |
| Component: | general | Assignee: | Julian Seward <jseward> |
| Status: | RESOLVED FIXED | ||
| Severity: | normal | ||
| Priority: | NOR | ||
| Version First Reported In: | 3.7 SVN | ||
| Target Milestone: | --- | ||
| Platform: | Unlisted Binaries | ||
| OS: | Linux | ||
| Latest Commit: | Version Fixed/Implemented In: | ||
| Sentry Crash Report: | |||
| Attachments: |
Patch to add support for additional POWER7 instructions
Testcase patch Version 2 of stage 2 patch for Power ISA 2.06 Version 2 of testcase patch for stage 2 Power ISA 2.06 |
||
|
Description
Maynard Johnson
2011-06-29 21:32:12 UTC
Created attachment 61459 [details]
Patch to add support for additional POWER7 instructions
This is the "code" patch to add support in VEX for new POWER7 instructions.
Created attachment 61460 [details]
Testcase patch
This patch adds a new testcase in none/tests/ppc{32|64} to test the stage 2 POWER7 support patch.
Julian, I needed to add three new Iop's for this new POWER7 support: Iop_DivS64E, Iop_DivU32E, and Iop_F64toI64U. I presume you won't have an objection to the Iop_F64toI64U, since it complements the other F64-to-x Iop's already there. But I suspect you won't be too happy about the new Iop_Div* stuff. These are for the "divide extended" operations in ISA 2.06. These were done early on in the development of the patch (back in March, I think), and to be honest, I can't remember why I saw the need for the new Iop's. I had already created this bug and attached the patches and then remembered to run the new testcase under memcheck. Unfortunately, it fails under memcheck because I hadn't yet added support for Iop_DivU32E. I'll wait to get your initial feedback before I do anything to fix the memcheck failure. Thanks. (In reply to comment #3) > Julian, > I needed to add three new Iop's for this new POWER7 support: Iop_DivS64E, > Iop_DivU32E, and Iop_F64toI64U. I presume you won't have an objection to the > Iop_F64toI64U, since it complements the other F64-to-x Iop's already there. > But I suspect you won't be too happy about the new Iop_Div* stuff. These are > for the "divide extended" operations in ISA 2.06. These were done early on in > the development of the patch (back in March, I think), and to be honest, I > can't remember why I saw the need for the new Iop's. I refreshed my memory on why these new "Div" ops were necessary. For divde* instructions (64-bit category), the hardware takes the 64-bit dividend passed in RA and concatenates it with 64 0's that provides the bottom half of a 128-bit dividend. The divisor is 64-bit (passed in RB). Similarly, for the divwe* instructions, the dividend is 32-bits from RA (if 64-bit mode, the lower 32 bits of RA are used), and then the hardware concatenates that with 32 0's that provides the bottom half of a 64-bit dividend. The divisor is 32-bit (passed RB). I did not see a straightforward way to emulate this behavior with existing Iop's, which is why I created the new Iop's and implemented these div instructions natively. > > I had already created this bug and attached the patches and then remembered to > run the new testcase under memcheck. Unfortunately, it fails under memcheck > because I hadn't yet added support for Iop_DivU32E. I'll wait to get your > initial feedback before I do anything to fix the memcheck failure. Thanks.
> For divde* instructions (64-bit category), the hardware takes the 64-bit
> dividend passed in RA and concatenates it with 64 0's that provides the bottom
> half of a 128-bit dividend. The divisor is 64-bit (passed in RB).
IIUC, the meaning is
DivU32E(x:I32, y:I32) = (x ++ 0--(32)--0) / y
and similarly for the 64-bit version.
I can't see an easy way to construct the same behaviour from the
existing IROps, true. But I would like to know, what overflow
behaviour do you intend? It's clear that there will be an
overflowed result in any situation where x >= y.
(In reply to comment #5) > > For divde* instructions (64-bit category), the hardware takes the 64-bit > > dividend passed in RA and concatenates it with 64 0's that provides the bottom > > half of a 128-bit dividend. The divisor is 64-bit (passed in RB). > > IIUC, the meaning is > > DivU32E(x:I32, y:I32) = (x ++ 0--(32)--0) / y > > and similarly for the 64-bit version. Yes, that's correct. > > I can't see an easy way to construct the same behaviour from the > existing IROps, true. But I would like to know, what overflow > behaviour do you intend? It's clear that there will be an > overflowed result in any situation where x >= y. Yes, the ISA 2.06 doc says the following: ----------------------- If (RA) . (RB), or if an attempt is made to perform the division <anything> / 0 then the contents of register RT are undefined as are (if Rc=1) the contents of the LT, GT, and EQ bits of CR Field 0. In these cases, if OE=1 then OV is set to 1. ----------------------- So, after setting up the Iop_DivU32E (in guest_ppc_toIR.c), I check if the OE flag was set in the instruction; if so, I call set_XER_OV_32 with PPCG_FLAG_OP_DIVWEU so that the OV flag is set only when the above conditions are met. (In reply to comment #6) > (In reply to comment #5) > > > For divde* instructions (64-bit category), the hardware takes the 64-bit > > > dividend passed in RA and concatenates it with 64 0's that provides the bottom > > > half of a 128-bit dividend. The divisor is 64-bit (passed in RB). > > > > IIUC, the meaning is > > > > DivU32E(x:I32, y:I32) = (x ++ 0--(32)--0) / y > > > > and similarly for the 64-bit version. > Yes, that's correct. > > > > I can't see an easy way to construct the same behaviour from the > > existing IROps, true. But I would like to know, what overflow > > behaviour do you intend? It's clear that there will be an > > overflowed result in any situation where x >= y. > Yes, the ISA 2.06 doc says the following: > > ----------------------- > If (RA) . (RB), or if an attempt is made to perform the division Darn . . . copy/paste from the ISA pdf didn't work so well. The above is supposed to be: "If (RA)>=(RB)...." > <anything> / 0 > then the contents of register RT are undefined as are (if Rc=1) the contents of > the LT, GT, and EQ bits of CR Field 0. In these cases, if OE=1 then OV is set > to 1. > ----------------------- > > So, after setting up the Iop_DivU32E (in guest_ppc_toIR.c), I check if the OE > flag was set in the instruction; if so, I call set_XER_OV_32 with > PPCG_FLAG_OP_DIVWEU so that the OV flag is set only when the above conditions > are met. OK for the new IR ops DivS64E and DivU32E, but as per previous discussion please add a comment defining overflow and div-by-zero behaviour. + /* There is no Iop_Div32Fx4, so there are two choices: Not so .. it's at libvex_ir.h:985 This seems to massively mash around the existing FP infrastructure (which isn't optimal; is there any way to partition it more cleanly?) Does it cause any performance regressions? +// This file should be kept in sync between none/tests/ppc32 and none/tests/ppc64. Don't duplicate the file; instead make the ppc64 one be a symlink to the ppc32 one. Try (eg) ls -l none/tests/ppc64/jm-insns.c It's OK for the expected results files to be separate, though; the output could be different for the 32- vs 64-bit versions in general. Do all the new tests run memcheck-clean, in both 32- and 64-bit mode? (In reply to comment #8) > OK for the new IR ops DivS64E and DivU32E, but as per previous > discussion please add a comment defining overflow and div-by-zero > behaviour. > > > + /* There is no Iop_Div32Fx4, so there are two choices: > Not so .. it's at libvex_ir.h:985 The comment is wrong . . . I intended to say that Iop_Div32Fx4 is not implemented for ppc64. The Iop_{Add|Sub}32Fx4 ops were previously implemented (in host_ppc_isel.c and host_ppc_defs.c) for the add and subtract VMX instructions (e.g., vaddfp, vsubfp), but there was no corresponding divide VMX instruction, so Iop_Div32Fx4 was never implemented for ppc64. I could have done so here, implementing the xvdivsp insn natively. But as the rest of my comment said, I opted not to "due to the general philosophy of reusing existing implementations when practical." > > > This seems to massively mash around the existing FP infrastructure > (which isn't optimal; is there any way to partition it more cleanly?) > > Does it cause any performance regressions? Sorry, I don't know what you're referring to here. Is this comment in regard to the implementation for Iop_Div32Fx4? > > > +// This file should be kept in sync between none/tests/ppc32 and > none/tests/ppc64. > Don't duplicate the file; instead make the ppc64 one be a symlink > to the ppc32 one. Try (eg) ls -l none/tests/ppc64/jm-insns.c > It's OK for the expected results files to be separate, though; > the output could be different for the 32- vs 64-bit versions > in general. OK, I can do that. It seems like there was a change in philosophy on this -- from symbolic links, to copies, and back to symbolic links, and I started my stage 1 POWER7 patch using valgrind 3.6.0 when copies were used (at least insofar as none/tests/ppc{32|64}/jm-insns.c). But I see now that in current SVN, ppc64/jm-insns.c has gone back to a symbolic link. BTW, my stage 1 POWER7 testcase for 64-bit (none/tests/ppc64/test_isa_2_06_part1.c) is identical to the 32-bit. We should probably change that to a symbolic link, too. > > > Do all the new tests run memcheck-clean, in both 32- and 64-bit mode? No. See paragraph 2 in comment #3. Assuming you are OK with the three new Iops I introduced, I will respin the patch and add the support for the missing Iop_DivU32E to memcheck. But I'll wait to respin until I get your feedback on my responses from the other issues you've raised. Thanks a lot for the review effort so far. (In reply to comment #9) > > + /* There is no Iop_Div32Fx4, so there are two choices: > > Not so .. it's at libvex_ir.h:985 > > The comment is wrong . . . I intended to say that Iop_Div32Fx4 is not Ok fine; but pls fix the comment then. > > This seems to massively mash around the existing FP infrastructure > > (which isn't optimal; is there any way to partition it more cleanly?) > > > > Does it cause any performance regressions? > > Sorry, I don't know what you're referring to here. Is this comment in regard > to the implementation for Iop_Div32Fx4? I mean for the existing FP implementation as a whole, not just for Iop_Div32Fx4. w.r.t. "partition it more cleanly", what I mean is, move the instruction decoding for these new instructions into their own function, rather than have it intermingled with the existing instruction decoder. eg, in guest_arm_toIR.c, see decode_V6MEDIA_instruction(), decode_CP10_CP11_instruction(), decode_NEON_instruction(): these 3 instruction set extensions are handled by separate functions. > OK, I can do that. It seems like there was a change in philosophy > on this -- Hmm ok, I didn't know that. Symlinks for large duplicated test source files seem like a better approach. > SVN, ppc64/jm-insns.c has gone back to a symbolic link. BTW, my stage 1 POWER7 > testcase for 64-bit (none/tests/ppc64/test_isa_2_06_part1.c) is identical to > the 32-bit. We should probably change that to a symbolic link, too. Ok .. maybe file a new bug for that. > > Do all the new tests run memcheck-clean, in both 32- and 64-bit mode? > No. See paragraph 2 in comment #3. Assuming you are OK with the three new > Iops I introduced, I will respin the patch and add the support for the missing > Iop_DivU32E to memcheck. By "memcheck-clean" I didn't mean "they don't assert on memcheck". I meant, does memcheck report any errors in them? (In reply to comment #10) > (In reply to comment #9) > > > + /* There is no Iop_Div32Fx4, so there are two choices: > > > Not so .. it's at libvex_ir.h:985 > > > > The comment is wrong . . . I intended to say that Iop_Div32Fx4 is not > > Ok fine; but pls fix the comment then. Will do. > > > > > This seems to massively mash around the existing FP infrastructure > > > (which isn't optimal; is there any way to partition it more cleanly?) > > > > > > Does it cause any performance regressions? > > > > Sorry, I don't know what you're referring to here. Is this comment in regard > > to the implementation for Iop_Div32Fx4? > > I mean for the existing FP implementation as a whole, not just for > Iop_Div32Fx4. > > w.r.t. "partition it more cleanly", what I mean is, move the > instruction decoding for these new instructions into their own > function, rather than have it intermingled with the existing > instruction decoder. eg, in guest_arm_toIR.c, see > decode_V6MEDIA_instruction(), decode_CP10_CP11_instruction(), > decode_NEON_instruction(): these 3 instruction set extensions are > handled by separate functions. OK. I'll make sure not to mix basic FP and VSX FP in the same decoding function. > > > > OK, I can do that. It seems like there was a change in philosophy > > on this -- > > Hmm ok, I didn't know that. Symlinks for large duplicated test source > files seem like a better approach. > > > SVN, ppc64/jm-insns.c has gone back to a symbolic link. BTW, my stage 1 POWER7 > > testcase for 64-bit (none/tests/ppc64/test_isa_2_06_part1.c) is identical to > > the 32-bit. We should probably change that to a symbolic link, too. > > Ok .. maybe file a new bug for that. I opened bug # 277199 for this issue. > > > > > Do all the new tests run memcheck-clean, in both 32- and 64-bit mode? > > No. See paragraph 2 in comment #3. Assuming you are OK with the three new > > Iops I introduced, I will respin the patch and add the support for the missing > > Iop_DivU32E to memcheck. > > By "memcheck-clean" I didn't mean "they don't assert on memcheck". I > meant, does memcheck report any errors in them? I'll have to finish updating memcheck so it doesn't assert on the Iop_DivU32E before I can answer. But I'll be sure memcheck runs clean before attaching the updated patch. I think I have enough feedback from you (at least for this round) that I can go ahead and update and re-submit the patch. Assuming no major distractions, I should be able to get that done before the end of the week. Thanks. (In reply to comment #11) > (In reply to comment #10) > > (In reply to comment #9) [snip] > > > > This seems to massively mash around the existing FP infrastructure > > > > (which isn't optimal; is there any way to partition it more cleanly?) > > > > > > > > Does it cause any performance regressions? > > > > > > Sorry, I don't know what you're referring to here. Is this comment in regard > > > to the implementation for Iop_Div32Fx4? > > > > I mean for the existing FP implementation as a whole, not just for > > Iop_Div32Fx4. > > > > w.r.t. "partition it more cleanly", what I mean is, move the > > instruction decoding for these new instructions into their own > > function, rather than have it intermingled with the existing > > instruction decoder. eg, in guest_arm_toIR.c, see > > decode_V6MEDIA_instruction(), decode_CP10_CP11_instruction(), > > decode_NEON_instruction(): these 3 instruction set extensions are > > handled by separate functions. > OK. I'll make sure not to mix basic FP and VSX FP in the same decoding > function. hmmm . . . I should have looked before I replied here . . . All of the decoding for new VSX instructions is being done in new functions with names beginning with "dis_vx", so I'm not mixing any pure floating point ops with VSX floating point ops. I even separate out the scalar from the vector ops of the same type (e.g. xsadddp is decoded in dis_vxs_arith and xvadddp is decoded in dis_vxv_dp_arith). So if there's some additional partitioning you want, please let me know. [snip] Created attachment 61684 [details] Version 2 of stage 2 patch for Power ISA 2.06 This version of the stage 2 patch for Power ISA 2.06 support addresses all of the review comments so far, with the exception of the floating point partitioning issue. See comment #12. If you see a need for further partitioning, please let me know. Created attachment 61687 [details]
Version 2 of testcase patch for stage 2 Power ISA 2.06
This version of the testcase patch addresses the review comments so far. In particular, the file none/tests/ppc64/test_isa_2_06_part2.c file (which was identical to the same-named file in none/tests/ppc32) has been removed. When this patch is applied to upstream SVN, a symbolic link should be added in none/tests/ppc64 to the none/tests/ppc32/test_isa_2_06_part2.c file.
This testcase runs clean under memcheck.
Committed, r2184/11907/11908. I did a checkout from svn after the code was committed and everything worked perfectly on my POWER7 system. Thank you very much! |