The Power BCD ADD and BCD SUB instructions generate a result with shadow bits set. This then results in valgrind generating a message: ==8979== Use of uninitialised value of size 8 ==8979== at 0x40F8BD8: _itoa_word (_itoa.c:180) ==8979== by 0x40FB05B: vfprintf@@GLIBC_2.17 (vfprintf.c:1641) ==8979== by 0x1000084F: main (test_vector_misc.c:83 The shadow bits for both of the source operands are all zeros. The vbits should all be set to zero as a result of executing the instruction, However, two of the 128 shadow bits are set to 1 causing Valgrind to generate the above error message. The gdb interface to print the contents of the Power VR registers and its shadow bits doesn't work. See bugzilla 360008. If you edit coregrind/m_gdbserver/valgrind-low-ppc64.c to put in debug prints that show the contents of the VSR32 and VSR33 as follows: case 72: *mod = False; break; // GDBTD???? VEX { "trap", 4512, 64 }, case 73: VG_(transfer) (&ppc64->guest_VSR32, buf, dir, size, mod); VG_(printf)("\nCARLL, VSR32 = 0x%x %x %x %x\n", ppc64->guest_VSR32[3], ppc64->guest_VSR32[2], ppc64->guest_VSR32[1], ppc64->guest_VSR32[0]); break; case 74: VG_(transfer) (&ppc64->guest_VSR33, buf, dir, size, mod); VG_(printf)("CARLL, VSR33 = 0x%x %x %x %x\n", ppc64->guest_VSR33[3], ppc64->guest_VSR33[2], ppc64->guest_VSR33[1], ppc64->guest_VSR33[0]); break; You can see the correct values in the registers. The P9/Tests/test_vector_convert test case demonstrates this. Run the test case under valgrind with the option --vgdb-shadow-registers=yes to display the register and shadow register contents. The command is: valgrind --vex-iropt-register-updates=allregs-at-each-insn --vgdb=yes --vgdb-error=0 --vgdb-shadow-registers=yes ./test_vector_misc Stop at function test_vector_convert. Put gdb into assembly mode with the command "layout asm". Do stepi to get to the bcdadd instruction. 0x10000894 <test_vector_convert+28> nop │0x10000898 <test_vector_convert+32> addi r9,r2,-32624 │0x1000089c <test_vector_convert+36> lxvd2x vs0,0,r9 │0x100008a0 <test_vector_convert+40> xxswapd vs0,vs0 │0x100008a4 <test_vector_convert+44> nop │0x100008a8 <test_vector_convert+48> addi r9,r2,-32560 │0x100008ac <test_vector_convert+52> lxvd2x vs12,0,r9 │0x100008b0 <test_vector_convert+56> xxswapd vs12,vs12 │0x100008b4 <test_vector_convert+60> xxlor vs32,vs0,vs0 │0x100008b8 <test_vector_convert+64> xxlor vs33,vs12,vs12 │0x100008bc <test_vector_convert+68> bcdadd. v0,v0,v1,0 │0x100008c0 <test_vector_convert+72> xxlor vs0,vs32,vs32 Note, the Power VS register 0 to 31 map to the VSR32 to VSR64 registers. The floating point registers to the upper 64-bits of the VSR0 to VSR31 registers. The instruction is using vr0 and vr1 which correspond to VSR32 and VSR33. When you stop at the bcdadd instruction the register contents are: register contents CARLL, VSR32 = 0xa0 0 0 8 CARLL, VSR33 = 0x0 99999999 99999999 9999959c shadow s1 CARLL, VSR32 = 0x0 0 0 0 CARLL, VSR33 = 0x0 0 0 0 shadow s2 CARLL, VSR32 = 0x0 0 0 0 CARLL, VSR33 = 0x0 0 0 0 on gdb use "info registers all" to have gdb display the register contents: vr0 {uint128 = 0x00000000000000000000000000000000, v4_float = {0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}} vr1 {uint128 = 0x00000000000000000000000000000000, v4_float = {0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ... vr0s1 {uint128 = 0x00000000000000000000000000000000, v4_float = {0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}} vr1s1 {uint128 = 0x00000000000000000000000000000000, v4_float = {0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}} Not that the contents are all zeros according to gdb but the debug prints say otherwise. Stepi one more time to execute the bcd instructiion and you get: register contents CARLL, VSR32 = 0x100 99999999 99999999 9999959c CARLL, VSR33 = 0x0 99999999 99999999 9999959c the s1 shadow contents CARLL, VSR32 = 0x0 0 0 c CARLL, VSR33 = 0x0 0 0 0 the s2 shadow contents CARLL, VSR32 = 0x0 0 0 0 CARLL, VSR33 = 0x0 0 0 0 gdb gives the following vr0 {uint128 = 0x00000000000000000000000000000000, v4_float = {0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}} vr1 {uint128 = 0x00000000000000000000000000000000, v4_float = {0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ... vr0s1 {uint128 = 0x00000000000000000000000000000000, v4_float = {0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}} vr1s1 {uint128 = 0x00000000000000000000000000000000, v4_float = {0x0, 0x0, 0x0, 0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}} The value VSR32 = 0x100 99999999 99999999 9999959c is the correct result. It will be printed by the test case. Note, the shadow bits are not all zeros, that is the bug!!! So we see that the source register VBits were all valid before the instruction but the result of the bcdadd instruction has two of the shadow register vbits set indicating the bits are uninitialized. The vbits are set by the function do_shadow_store() in memcheck/MC_translate.c, line 5018. We are doing 128-bit operations so in the " else if (UNLIKELY(ty == Ity_V128)) {" block we have a call to MC_(helperc_STOREV64le) which calls mc_storev64() do the actual vbit store. mc_storev64() is defined in memcheck/mc_main.c at about line 4646. If you put an if statement in the code: /*------------------------------------------------------------*/ /*--- STOREV64 ---*/ /*------------------------------------------------------------*/ static INLINE void mc_STOREV64 ( Addr a, ULong vbits64, Bool isBigEndian ) { PROF_EVENT(MCPE_STOREV64); #ifndef PERF_FAST_STOREV // XXX: this slow case seems to be marginally faster than the fast case! // Investigate further. mc_STOREVn_slow( a, 64, vbits64, isBigEndian ); #else { UWord sm_off16, vabits16; SecMap* sm; if (vbits64 == 0xC) VG_(printf)("CARLL, write bad vbits\n"); if (UNLIKELY( UNALIGNED_OR_HIGH(a,64) )) { PROF_EVENT(MCPE_STOREV64_SLOW1); mc_STOREVn_slow( a, 64, vbits64, isBigEndian ); return; } sm = get_secmap_for_reading_low(a); sm_off16 = SM_OFF_16(a); vabits16 = ((UShort*)(sm->vabits8))[sm_off16]; ... } You can get it to print the message "write bad vbits" for this particular case which corresponds to the valgrind generated messages: ==8979== Use of uninitialised value of size 8 ==8979== at 0x40F8BD8: _itoa_word (_itoa.c:180) ==8979== by 0x40FB05B: vfprintf@@GLIBC_2.17 (vfprintf.c:1641) ==8979== by 0x1000084F: main (test_vector_misc.c:83 What I have not been able to do is to find exactly where the vbits64 value gets generated when the bcdadd instruction executes. I don't know if it is done via generated assembly code or a call out to a C function like the mc_STOREV64() function that stores the vgit value. Need help to figure out why the vbits64 value is calculation is wrong. Reproducible: Always Steps to Reproduce: 1. given in the details. Attached the test program and a script to compile it. 2. 3.
Created attachment 97657 [details] test case main program
Created attachment 97658 [details] assembly function for the bcd add instruction
Created attachment 97659 [details] script to compile test case
Added Will to CC
I looked at the V-bit code generation stuff in expr2vbits_Triop for these cases. It is: /* BCDIops */ case Iop_BCDAdd: case Iop_BCDSub: complainIfUndefined(mce, atom3, NULL); return assignNew('V', mce, Ity_V128, triop(op, vatom1, vatom2, atom3)); I think that might be the problem. What this does is to apply the BCD add/sub operation to the definedness (V) bits of the first two operands: triop(op, vatom1, vatom2, atom3) Applying the original operation to the definedness bits is appropriate in cases where the original operation rearranges/copies bits in the arguments, so we want to rearrange/copy the V bits in the same way, for example in the cases just above -- Slice64 and SetElem*. But that's not the case here: BCD{add,sub} actually do computations with their operands, at least the first two. I'm not clear what the "lane" behaviour of these instructions is. Are the two operands treated as single 128-bit entities? If so, I think the following is possibly appropriate: /* BCDIops */ case Iop_BCDAdd: case Iop_BCDSub: return mkLazy3(mce, Ity_V128, vatom1, vatom2, vatom3); What this will do is to mark the entire 128 bit result as undefined if any of the bits in the three operands (V128, V128, I8) is undefined. As far as I can see, the third operand is an 8 bit control field copied out of the original instruction, and will always be defined, so passing its V bits to mkLazy3 is pointless, but it doesn't matter, and in any case the post-instrumentation optimisation pass should fold out any unnecessary code.
>I'm not clear what the "lane" behaviour of these instructions is. Are the two operands treated >as single 128-bit entities? Yes, the BCD add/subtract instructions are just scalar arithmetic operations that take two numbers stored as 128-bit values and produce a single 128-bit result. They are not vector operations where you are operating on a sequence of smaller values, i.e. lanes. Thanks for the explanation, I will try it out.
Created attachment 98176 [details] Patch to fix BCD v-bit error The fix required changing the BCDadd and subtract case to return mkLazy3(mce, Ity_V128, vatom1, vatom2, vatom3); As mentioned by Julian. It also required adding support to mkLazy3() for V128 and support for V128 to mkPCastTo(). This patch needs review by someone more familiar with the V-bit code then I am to make sure I got it right.
(In reply to Carl Love from comment #7) > Created attachment 98176 [details] > Patch to fix BCD v-bit error That will work. It is probably possible to generate a faster piece of code that doesn't require so many PCasts, but if these BCD instructions are relatively rarely used, that's unlikely to be important. Only one thing that needs to be changed. You have: @@ -2892,7 +2917,7 @@ IRAtom* expr2vbits_Triop ( MCEnv* mce, case Iop_BCDAdd: case Iop_BCDSub: complainIfUndefined(mce, atom3, NULL); - return assignNew('V', mce, Ity_V128, triop(op, vatom1, vatom2, atom3)); + return mkLazy3(mce, Ity_V128, vatom1, vatom2, vatom3); Please remove the complainIfUndefined call. The mkLazy3 call causes any undefinedness in the 3rd (and 1st and 2nd) args to be propagated into the result, so at some point later it will be reported if the result is used to control a conditional branch. So there's no point in adding a definedness check at this point too.
Not sure that we will see a lot of BCD instructions. They are somewhat specialized for financial applications.
The changes fix Valgrind from generating the incorrect vbit result when either of the first two V128 operands have an undefined bit. However, the vbits are not being set if one of the vbits from the third operand, I8, are set. The vbit-test picked this up when I ran the full regression test.
Created attachment 98192 [details] Updated patch to fix BCD add/subtract vbit error The previous patch broke the vbit-test test. The issue is the Iop_BCDAdd and Iop_BCDSub pass the ps value down to the instruction generation code. The issue is the ps value is a constant passed in the triop() function call in function dis_av_bcd ( ), VEX/priv/guest_ppc_toIR.c. The vbit-test assumes that the arguments of the triop() are expressions so it can set the vbits for each of the arguments. However, the ps value is a constant so the test couldn't set the vbits for ps. The implementation of the BCD add and subtract instructions was changed to not pass the ps value, the vbit-test was updated to remove some no longer needed comments and code for the ps argument. The memcheck code was updated to handle the new implementation of the BCD add and subtract instructions.
(In reply to Carl Love from comment #11) > Created attachment 98192 [details] > Updated patch to fix BCD add/subtract vbit error Looks fine to me; OK to land.
Patch committed VEX commit 3218 valgrind commit 15871 valgrind commit 15872 (updated the NEWS file)