The IBM ISA 3.0 has been released. This bugzilla is the fourth in a series of five bugzillas for adding the needed ISA 3.0 support to Valgrind. The series of bugzillas are: 359767, 361207, 362329. Reproducible: Always
Created attachment 99327 [details] Patch 4 of 5 to add VEX support for the POWER ISA 3.0 instructions Patch 4 of 5 to add VEX support for the POWER ISA 3.0 instructions This patch add support to emulate more of the ISA 3.0 instruction in Valgrind.
Created attachment 99328 [details] Patch 4 of 5 to add testsuite support for the POWER ISA 3.0 instructions Patch 4 of 5 to add testsuite support for the POWER ISA 3.0 instructions This patch adds test suite support for the instructions added in the first patch. Note, I had to manually chop this patch into multiple pieces due to the limit on the size of an attachment. This part has the test_isa_3.0.c changes and the first part of the expected results.
Created attachment 99329 [details] Patch 4 of 5 to add testsuite support for the POWER ISA 3.0 instructions, part 2 Second part of the testsuite patch. More of the expected output.
Created attachment 99330 [details] Patch 4 of 5 to add testsuite support for the POWER ISA 3.0 instructions, part 3 final part of the testsuite patch expected results. The vgtest file change is at the very end of this part.
(In reply to Carl Love from comment #1) > Created attachment 99327 [details] > Patch 4 of 5 to add VEX support for the POWER ISA 3.0 instructions Main comments are: * does this have any performance effect on plain boring integer code? I ask because it seems like it might affect the IR generated for CR321 and CR0, which are used a lot for comparisons, but I am not sure. "make perf", then perf/vg_perf --vg=/path/to/old/tree --vg=/path/to/new/tree --reps=5 gives you some idea * I am somewhat concerned that some of these translations are going to generate a huge amount of IR and hence a huge amount of code at the back end, and I wonder if the more complex ones would be better done with C helper functions. Can you trigger any JIT assertion failures with worst-case code-size settings, by running all the test cases with --tool=memcheck --track-origins=yes --vex-iropt-level=0 ? * And some minor things below. VEX/priv/guest_ppc_toIR.c You swapped putCR321 and putCR0 ? Any actual change? Is confusing. Please un-swap. +static IRExpr * CmpGT128U (IRExpr *src1, IRExpr *src2) + /* Do and unsigend compare of two 128-bit values */ Fix typos +static IRTemp increment_BCDstring (IRExpr *src, IRExpr *carry_in) This seems like it will generate a huge amount of IR. Is it something you can do in a helper function, in the style of amd64g_dirtyhelper_AESKEYGENASSIST ? +static Bool dis_int_misc ( UInt theInstr ) + * to be suspende. Instruction fetching and execution are resumed suspende += d Un-indent the default: case one level. It is misleadingly indented. + case 0x12D: // lxvll (Load VSX Vector Left-Justified with Length XX1 form) Without looking at the context I can't tell, but .. does this need "gating" (ie is only allowed for some hardware variants) ? + /* nb_zero is 0xFF..FF if the nb_field = 0 */ + assign( nb_zero, binop( Iop_64HLtoV128, + unop( Iop_1Sto64, + binop( Iop_CmpEQ64, + mkexpr( nb ), + mkU64( 0 ) ) ), + unop( Iop_1Sto64, + binop( Iop_CmpEQ64, + mkexpr( nb ), + mkU64( 0 ) ) ) ) ); Lift the duplicated 1Sto64(CmpEQ64, nb, 0) term onto its own IRTemp and use that. Or maybe we should enable CSE by default in iropt for POWER? That has a JIT-time cost, though. + case 0x1AD: // stxvll (Store VSX Vector Left-justified with length XX1-form) Comments as for lxvll +static Bool dis_av_bcd_misc ( UInt theInstr ) + switch (opc2) { + case 0x341: // bcdcpsgn. Decimal Copy Sign VX-form + { Please fix indentation to match house style
The test cases look OK to me. One thing: did you check this tree still builds/runs fine on older hardware? eg, at least on Power5 64 and 32-bit and preferably some older 32-bit too?
I ran the perfomance tests comparing the source tree with patch 3 and the source tree with patch 4. The results were run for 10 repetitions and 5 repetitions. The results for the two are very consistent. The 10 repetition results are as follows: -- Running tests in perf ---------------------------------------------- -- bigcode1 -- bigcode1 valgrind-patch3:0.19s no: 2.1s (11.0x, -----) me: 4.4s (23.0x, -----) bigcode1 valgrind-patch4:0.19s no: 2.1s (10.9x, 1.0%) me: 4.3s (22.9x, 0.5%) -- bigcode2 -- bigcode2 valgrind-patch3:0.19s no: 4.7s (24.6x, -----) me:10.7s (56.4x, -----) bigcode2 valgrind-patch4:0.19s no: 4.7s (24.6x, 0.0%) me:10.7s (56.4x, 0.1%) -- bz2 -- bz2 valgrind-patch3:0.57s no: 3.4s ( 5.9x, -----) me: 8.6s (15.1x, -----) bz2 valgrind-patch4:0.57s no: 3.4s ( 6.0x, -0.9%) me: 8.6s (15.1x, -0.2%) -- fbench -- fbench valgrind-patch3:0.25s no: 1.7s ( 6.7x, -----) me: 4.3s (17.3x, -----) fbench valgrind-patch4:0.25s no: 1.7s ( 6.8x, -1.8%) me: 4.3s (17.3x, -0.2%) -- ffbench -- ffbench valgrind-patch3:0.19s no: 0.7s ( 3.6x, -----) me: 1.8s ( 9.6x, -----) ffbench valgrind-patch4:0.19s no: 0.7s ( 3.6x, 0.0%) me: 1.8s ( 9.6x, 0.5%) -- heap -- heap valgrind-patch3:0.25s no: 1.7s ( 6.9x, -----) me: 7.5s (29.9x, -----) heap valgrind-patch4:0.25s no: 1.7s ( 6.8x, 1.2%) me: 7.5s (29.8x, 0.4%) -- heap_pdb4 -- heap_pdb4 valgrind-patch3:0.29s no: 1.8s ( 6.3x, -----) me:10.8s (37.2x, -----) heap_pdb4 valgrind-patch4:0.29s no: 1.8s ( 6.3x, 0.0%) me:10.8s (37.3x, -0.1%) -- many-loss-records -- many-loss-records valgrind-patch3:0.03s no: 0.4s (13.3x, -----) me: 1.6s (55.0x, -----) many-loss-records valgrind-patch4:0.03s no: 0.4s (13.7x, -2.5%) me: 1.7s (55.3x, -0.6%) -- many-xpts -- many-xpts valgrind-patch3:0.05s no: 0.6s (11.4x, -----) me: 2.2s (44.0x, -----) many-xpts valgrind-patch4:0.05s no: 0.6s (11.4x, 0.0%) me: 2.2s (43.6x, 0.9%) -- memrw -- memrw valgrind-patch3:0.05s no: 0.8s (17.0x, -----) me: 1.9s (39.0x, -----) memrw valgrind-patch4:0.05s no: 0.8s (16.8x, 1.2%) me: 2.0s (39.2x, -0.5%) -- sarp -- sarp valgrind-patch3:0.02s no: 0.3s (16.0x, -----) me: 2.5s (125.5x, -----) sarp valgrind-patch4:0.02s no: 0.3s (16.0x, 0.0%) me: 2.5s (125.5x, 0.0%) -- tinycc -- tinycc valgrind-patch3:0.21s no: 2.3s (11.0x, -----) me:10.1s (48.2x, -----) tinycc valgrind-patch4:0.21s no: 2.3s (11.0x, 0.0%) me:10.3s (49.1x, -1.9%) -- Finished tests in perf ---------------------------------------------- == 12 programs, 48 timings ================= I investigated the second comment by running the requested test. After fixing a bug that was found found with some missing register support, it ran fine. After some private discussions with Julian, the is_BCDstring(), increment_BCDstring(), and the support for bcdctz., bcdctn., bcdcfz., bcdcfn., bcdsetsgn.instructions were reimplemented using clean helpers. The minor comments:. There is no change in the functions, putCR321 and putCR0. They were unswapped so there is no change in the patch. The duplicated code was replaced with a temp. The typos were fixed. Note, it was found that one of the instructions in the test suite was missing. The test suite was updated. The VEX support patch and test suite patch have been updated.
Created attachment 99731 [details] Patch 4 of 5 to add VEX support for the POWER ISA 3.0 instructions
Created attachment 99732 [details] Patch 4 of 5 to add testsuite support for the POWER ISA 3.0 instructions, part 1
Created attachment 99733 [details] Patch 4 of 5 to add testsuite support for the POWER ISA 3.0 instructions, part 2
Created attachment 99734 [details] Patch 4 of 5 to add testsuite support for the POWER ISA 3.0 instructions, part 3
All looks ok to me. OK to land. Thanks for the fixings.
Patches committed VEX commit 3222 valgrind commit 15896 The patches were tested on 64-bit Power 6, Power 7, Power 8 big endian and Power 8 little endian machines. The regression tests showed no new test failures on these systems.