360035 – POWER PC instruction bcdadd and bcdsubtract generate result with non-zero shadow bits

Bug 360035 - POWER PC instruction bcdadd and bcdsubtract generate result with non-zero shadow bits

Summary: POWER PC instruction bcdadd and bcdsubtract generate result with non-zero sha...

Status:	CLOSED FIXED

Alias:	None

Product:	valgrind
Classification:	Developer tools
Component:	vex (show other bugs)
Version:	3.12 SVN
Platform:	Other Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	Julian Seward

URL:
Keywords:

Depends on:
Blocks:

Reported:	2016-03-03 17:19 UTC by Carl Love
Modified:	2016-04-26 23:22 UTC (History)
CC List:	1 user (show)

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:

Attachments
test case main program (2.22 KB, text/x-csrc) 2016-03-03 17:20 UTC, Carl Love	Details
assembly function for the bcd add instruction (1.02 KB, text/x-csrc) 2016-03-03 17:21 UTC, Carl Love	Details
script to compile test case (597 bytes, application/x-shellscript) 2016-03-03 17:22 UTC, Carl Love	Details
Patch to fix BCD v-bit error (2.52 KB, patch) 2016-03-31 15:51 UTC, Carl Love	Details
Updated patch to fix BCD add/subtract vbit error (16.19 KB, patch) 2016-04-01 21:07 UTC, Carl Love	Details
View All Add an attachment

Note You need to log in before you can comment on or make changes to this bug.

Description Carl Love 2016-03-03 17:19:13 UTC

The Power BCD ADD and BCD SUB instructions generate a result with shadow bits set.    
This then results in valgrind generating a message:                             
                                                                                
==8979== Use of uninitialised value of size 8                                   
==8979==    at 0x40F8BD8: _itoa_word (_itoa.c:180)                              
==8979==    by 0x40FB05B: vfprintf@@GLIBC_2.17 (vfprintf.c:1641)                
==8979==    by 0x1000084F: main (test_vector_misc.c:83                          
                                                                                
The shadow bits for both of the source operands are all zeros.                  
The vbits should all be set to zero as a result of executing the instruction,   
However, two of the 128 shadow bits are set to 1 causing Valgrind to generate    
the above error message.                                                        
                                                                                
The gdb interface to print the contents of the Power VR registers and its       
shadow bits doesn't work.   See bugzilla 360008.  If you edit                   
coregrind/m_gdbserver/valgrind-low-ppc64.c to put in debug prints that          
show the contents of the VSR32 and VSR33 as follows:                            
                                                                                
   case 72:  *mod = False; break; // GDBTD???? VEX { "trap", 4512, 64 },        
   case 73:  VG_(transfer) (&ppc64->guest_VSR32, buf, dir, size, mod);          
      VG_(printf)("\nCARLL, VSR32 = 0x%x %x %x %x\n",                           
                 ppc64->guest_VSR32[3], ppc64->guest_VSR32[2],                  
                 ppc64->guest_VSR32[1], ppc64->guest_VSR32[0]);                 
      break;                                                                    
                                                                                
   case 74:  VG_(transfer) (&ppc64->guest_VSR33, buf, dir, size, mod);          
      VG_(printf)("CARLL, VSR33 = 0x%x %x %x %x\n",                             
                  ppc64->guest_VSR33[3], ppc64->guest_VSR33[2],                 
                  ppc64->guest_VSR33[1], ppc64->guest_VSR33[0]);                
      break;                                                                    
                                                                                
You can see the correct values in the registers.                                
                                                                                
The P9/Tests/test_vector_convert test case demonstrates this.  Run the test     
case under valgrind with the option --vgdb-shadow-registers=yes to display      
the register and shadow register contents.   The command is:                    
                                                                                
   valgrind --vex-iropt-register-updates=allregs-at-each-insn --vgdb=yes        
               --vgdb-error=0 --vgdb-shadow-registers=yes ./test_vector_misc    
                                                                                
Stop at function test_vector_convert.  Put gdb into assembly mode with the      
command "layout asm".  Do stepi to get to the bcdadd instruction.            
                                                                                          
0x10000894 <test_vector_convert+28>     nop                                               
   │0x10000898 <test_vector_convert+32>     addi    r9,r2,-32624                          
   │0x1000089c <test_vector_convert+36>     lxvd2x  vs0,0,r9                              
   │0x100008a0 <test_vector_convert+40>     xxswapd vs0,vs0                               
   │0x100008a4 <test_vector_convert+44>     nop                                           
   │0x100008a8 <test_vector_convert+48>     addi    r9,r2,-32560                          
   │0x100008ac <test_vector_convert+52>     lxvd2x  vs12,0,r9                             
   │0x100008b0 <test_vector_convert+56>     xxswapd vs12,vs12                             
   │0x100008b4 <test_vector_convert+60>     xxlor   vs32,vs0,vs0                          
   │0x100008b8 <test_vector_convert+64>     xxlor   vs33,vs12,vs12                        
   │0x100008bc <test_vector_convert+68>     bcdadd. v0,v0,v1,0                            
   │0x100008c0 <test_vector_convert+72>     xxlor   vs0,vs32,vs32                         
                                                                                          
Note, the Power VS register 0 to 31 map to the VSR32 to VSR64 registers.  The             
floating point registers to the upper 64-bits of the VSR0 to VSR31 registers.      The instruction
is using vr0 and vr1 which correspond to VSR32 and VSR33.    
                                                                                          
When you stop at the bcdadd instruction the register contents are:                                        
                                                                                          
register  contents                                                                               
CARLL, VSR32 = 0xa0 0 0 8                                                                 
CARLL, VSR33 = 0x0 99999999 99999999 9999959c                                             
shadow s1                                                                                 
CARLL, VSR32 = 0x0 0 0 0                                                                  
CARLL, VSR33 = 0x0 0 0 0                                                                  
shadow s2
CARLL, VSR32 = 0x0 0 0 0                                                                  
CARLL, VSR33 = 0x0 0 0 0                                                                  
                                                                                          
on gdb use "info registers all" to have gdb display the register contents:                
                                                                                          
vr0            {uint128 = 0x00000000000000000000000000000000, v4_float = {0x0, 0x0, 0x0,  
    0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0,      
    0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}}                                       
vr1            {uint128 = 0x00000000000000000000000000000000, v4_float = {0x0, 0x0, 0x0,  
    0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0,      
                                                                                          
...                                                                                       
vr0s1          {uint128 = 0x00000000000000000000000000000000, v4_float = {0x0, 0x0, 0x0,  
    0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0,      
    0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}}                                       
vr1s1          {uint128 = 0x00000000000000000000000000000000, v4_float = {0x0, 0x0, 0x0,  
    0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0,      
    0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}}

Not that the contents are all zeros according to gdb but the debug prints say otherwise.                                                                                     
                                                                                               
Stepi one more time to execute the bcd instructiion and you get:                               
   
register contents                                                                                            
CARLL, VSR32 = 0x100 99999999 99999999 9999959c                                                
CARLL, VSR33 = 0x0 99999999 99999999 9999959c                                                  
the s1 shadow contents
CARLL, VSR32 = 0x0 0 0 c                                                                       
CARLL, VSR33 = 0x0 0 0 0                                                                       
the s2 shadow contents
CARLL, VSR32 = 0x0 0 0 0                                                                       
CARLL, VSR33 = 0x0 0 0 0                                                                       
                                                                                              
  gdb gives the following
                                                                                       
vr0            {uint128 = 0x00000000000000000000000000000000, v4_float = {0x0, 0x0, 0x0,        
    0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0,           
    0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}}                                            
vr1            {uint128 = 0x00000000000000000000000000000000, v4_float = {0x0, 0x0, 0x0,       
    0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0,           
                                                                                               
...                                                                                            
vr0s1          {uint128 = 0x00000000000000000000000000000000, v4_float = {0x0, 0x0, 0x0,       
    0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0,           
    0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}}                                            
vr1s1          {uint128 = 0x00000000000000000000000000000000, v4_float = {0x0, 0x0, 0x0,       
    0x0}, v4_int32 = {0x0, 0x0, 0x0, 0x0}, v8_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0,           
    0x0, 0x0}, v16_int8 = {0x0 <repeats 16 times>}}
                                                                                               
The value VSR32 = 0x100 99999999 99999999 9999959c is the correct result.  It                  
will be printed by the test case.  Note, the shadow bits are not all zeros,                    
that is the bug!!!                                                                             
                                                                                               
So we see that the source register VBits were all valid before the instruction                 
but the result of the bcdadd instruction has two of the shadow register                        
vbits set indicating the bits are uninitialized.                                               
                                                                                               
The vbits are set by the function do_shadow_store() in                                         
memcheck/MC_translate.c, line 5018.  We are doing 128-bit                                      
operations so in the " else if (UNLIKELY(ty == Ity_V128)) {"                                   
block we have a call to MC_(helperc_STOREV64le) which calls                                    
mc_storev64() do the actual vbit store.  mc_storev64() is defined in                           
memcheck/mc_main.c at about line 4646.  If you put an if statement                             
in the code:                                                                                   
                                                                                               
/*------------------------------------------------------------*/                               
/*--- STOREV64                                             ---*/                               
/*------------------------------------------------------------*/                               
                                                                                               
static INLINE                                                                                  
void mc_STOREV64 ( Addr a, ULong vbits64, Bool isBigEndian )                                   
{                                                                                              
   PROF_EVENT(MCPE_STOREV64);                                                                  
                                                                                               
#ifndef PERF_FAST_STOREV                                                                       
   // XXX: this slow case seems to be marginally faster than the fast case!                    
   // Investigate further.                                                                     
   mc_STOREVn_slow( a, 64, vbits64, isBigEndian );                                             
#else                                                                                          
   {                                                                                           
      UWord   sm_off16, vabits16;                                                              
      SecMap* sm;                                                                              
                                                                                               
      if (vbits64 == 0xC)                                                                      
         VG_(printf)("CARLL, write bad vbits\n");                                              
                                                                                               
      if (UNLIKELY( UNALIGNED_OR_HIGH(a,64) )) {                                               
         PROF_EVENT(MCPE_STOREV64_SLOW1);                                                      
         mc_STOREVn_slow( a, 64, vbits64, isBigEndian );                                       
         return;                                                                               
      }    
                                                                                                                                 
      sm       = get_secmap_for_reading_low(a);                                                
      sm_off16 = SM_OFF_16(a);                                                                 
      vabits16 = ((UShort*)(sm->vabits8))[sm_off16];                                           
                                                                                               
...                                                                                            
}                                                                                              
                                                                                               
You can get it to print the message "write bad vbits" for this particular case which corresponds to the valgrind generated messages:                                                                   
                                                                                               
==8979== Use of uninitialised value of size 8                                                  
==8979==    at 0x40F8BD8: _itoa_word (_itoa.c:180)                                             
==8979==    by 0x40FB05B: vfprintf@@GLIBC_2.17 (vfprintf.c:1641)                               
==8979==    by 0x1000084F: main (test_vector_misc.c:83                                         
                                                                                                                                                            
What I have not been able to do is to find exactly where the vbits64 value                     
gets generated when the bcdadd instruction executes.  I don't know if it is                    
done via generated assembly code or a call out to a C function like the                        
mc_STOREV64() function that stores the vgit value.  Need help to figure out why                     
the vbits64 value is calculation is wrong.                         

Reproducible: Always

Steps to Reproduce:
1. given in the details.  Attached the test program and a script to compile it.
2.
3.

Comment 1 Carl Love 2016-03-03 17:20:11 UTC

Created attachment 97657 [details]
test case main program

Comment 2 Carl Love 2016-03-03 17:21:05 UTC

Created attachment 97658 [details]
assembly function for the bcd add instruction

Comment 3 Carl Love 2016-03-03 17:22:05 UTC

Created attachment 97659 [details]
script to compile test case

Comment 4 Carl Love 2016-03-03 17:22:47 UTC

Added Will to CC

Comment 5 Julian Seward 2016-03-29 09:03:28 UTC

I looked at the V-bit code generation stuff in expr2vbits_Triop for
these cases.  It is:

   /* BCDIops */
   case Iop_BCDAdd:
   case Iop_BCDSub:
      complainIfUndefined(mce, atom3, NULL);
      return assignNew('V', mce, Ity_V128, triop(op, vatom1, vatom2, atom3));

I think that might be the problem.  What this does is to apply the BCD
add/sub operation to the definedness (V) bits of the first two operands:

  triop(op, vatom1, vatom2, atom3)

Applying the original operation to the definedness bits is appropriate
in cases where the original operation rearranges/copies bits in the
arguments, so we want to rearrange/copy the V bits in the same way,
for example in the cases just above -- Slice64 and SetElem*.  But
that's not the case here: BCD{add,sub} actually do computations with
their operands, at least the first two.

I'm not clear what the "lane" behaviour of these instructions is.  Are
the two operands treated as single 128-bit entities?  If so, I think
the following is possibly appropriate:

      /* BCDIops */
      case Iop_BCDAdd:
      case Iop_BCDSub:
         return mkLazy3(mce, Ity_V128, vatom1, vatom2, vatom3);

What this will do is to mark the entire 128 bit result as undefined if
any of the bits in the three operands (V128, V128, I8) is undefined.
As far as I can see, the third operand is an 8 bit control field
copied out of the original instruction, and will always be defined, so
passing its V bits to mkLazy3 is pointless, but it doesn't matter, and
in any case the post-instrumentation optimisation pass should fold out
any unnecessary code.

Comment 6 Carl Love 2016-03-29 15:27:48 UTC

>I'm not clear what the "lane" behaviour of these instructions is. Are the two operands treated 
>as single 128-bit entities?

Yes, the BCD add/subtract instructions are just scalar arithmetic operations that take two numbers stored as 128-bit values and produce a single 128-bit result.  They are not vector operations where you are operating on a sequence of smaller values, i.e. lanes. 

Thanks for the explanation, I will try it out.

Comment 7 Carl Love 2016-03-31 15:51:55 UTC

Created attachment 98176 [details]
Patch to fix BCD v-bit error

The fix required changing the BCDadd and subtract case to

   return mkLazy3(mce, Ity_V128, vatom1, vatom2, vatom3);

As mentioned by Julian.  It also required adding support to mkLazy3() for V128 and support 
for V128 to mkPCastTo().

This patch needs review by someone more familiar with the V-bit code then I am to make sure I got it right.

Comment 8 Julian Seward 2016-03-31 16:17:33 UTC

(In reply to Carl Love from comment #7)
> Created attachment 98176 [details]
> Patch to fix BCD v-bit error

That will work.  It is probably possible to generate a faster piece of
code that doesn't require so many PCasts, but if these BCD instructions
are relatively rarely used, that's unlikely to be important.

Only one thing that needs to be changed.  You have:

@@ -2892,7 +2917,7 @@ IRAtom* expr2vbits_Triop ( MCEnv* mce,
       case Iop_BCDAdd:
       case Iop_BCDSub:
          complainIfUndefined(mce, atom3, NULL);
-         return assignNew('V', mce, Ity_V128, triop(op, vatom1, vatom2, atom3));
+         return mkLazy3(mce, Ity_V128, vatom1, vatom2, vatom3);

Please remove the complainIfUndefined call.  The mkLazy3 call causes any
undefinedness in the 3rd (and 1st and 2nd) args to be propagated into the result,
so at some point later it will be reported if the result is used to control a conditional
branch.  So there's no point in adding a definedness check at this point too.

Comment 9 Carl Love 2016-03-31 16:38:01 UTC

Not sure that we will see a lot of BCD instructions.  They are somewhat specialized for financial applications.

Comment 10 Carl Love 2016-03-31 17:12:49 UTC

The changes fix Valgrind from generating the incorrect vbit result when either of the first two V128 operands have an undefined bit.  However, the vbits are not being set if one of the vbits from the third operand, I8, are set.  The vbit-test picked this up when I ran the full regression test.

Comment 11 Carl Love 2016-04-01 21:07:39 UTC

Created attachment 98192 [details]
Updated patch to fix BCD add/subtract vbit error

The previous patch broke the vbit-test test.  

The issue is the Iop_BCDAdd and Iop_BCDSub pass the ps value down to the instruction generation code.  The issue is the ps value is a constant passed in the triop() function call in  function dis_av_bcd ( ), VEX/priv/guest_ppc_toIR.c.  The vbit-test assumes that the arguments of the triop() are expressions so it can set the vbits for each of the arguments.  However, the ps value is a constant so the test couldn't set the vbits for ps.  

The implementation of the BCD add and subtract instructions was changed to not pass the ps value, the vbit-test was updated to remove some no longer needed comments and code for the ps argument.  The memcheck code was updated to handle the new implementation of the BCD add and subtract instructions.

Comment 12 Julian Seward 2016-04-26 12:24:27 UTC

(In reply to Carl Love from comment #11)
> Created attachment 98192 [details]
> Updated patch to fix BCD add/subtract vbit error

Looks fine to me; OK to land.

Comment 13 Carl Love 2016-04-26 23:21:42 UTC

Patch committed
VEX commit   3218
valgrind commit 15871
valgrind commit 15872  (updated the NEWS file)