Bug 373166

Summary: drd and helgrind will crash wth vex amd64->IR: unhandled instruction bytes 0xFF 0xFF 0x48 0x85 0xC0 0x74 0x8 0x4D 0x89 0xE7
Product: [Developer tools] valgrind Reporter: Tushar <ntushar>
Component: vexAssignee: Bart Van Assche <bart.vanassche+kde>
Status: REPORTED ---    
Severity: crash CC: jseward
Priority: NOR    
Version: 3.12.0   
Target Milestone: ---   
Platform: RedHat Enterprise Linux   
OS: Linux   
Latest Commit: Version Fixed In:

Description Tushar 2016-12-01 22:59:41 UTC
The tests drd/tests/sem_open, sem_open2, sem_open3 and sem_open_traced all crash with unhandled instruction error

==14999== drd, a thread error detector
==14999== Copyright (C) 2006-2015, and GNU GPL'd, by Bart Van Assche.
==14999== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info
==14999== Command: ./sem_open -m
==14999== 
vex amd64->IR: unhandled instruction bytes: 0xFF 0xFF 0x48 0x85 0xC0 0x74 0x8 0x4D 0x89 0xE7
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
==14999== valgrind: Unrecognised instruction at address 0x382bf0a2f0.
==14999==    at 0x382BF0A2F0: check_add_mapping (in /lib64/tls/libpthread-2.3.4.so)
==14999==    by 0x4A14EDF: sem_open_intercept (drd_pthread_intercepts.c:1391)
==14999==    by 0x4A14EDF: sem_open (drd_pthread_intercepts.c:1398)
==14999==    by 0x400AFA: main (sem_open.c:63)
==14999== Your program just tried to execute an instruction that Valgrind
==14999== did not recognise.  There are two possible reasons for this.
==14999== 1. Your program has a bug and erroneously jumped to a non-code
==14999==    location.  If you are running Memcheck and you just saw a
==14999==    warning about a bad jump, it's probably your program's fault.
==14999== 2. The instruction is legitimate but Valgrind doesn't handle it,
==14999==    i.e. it's Valgrind's fault.  If you think this is the case or
==14999==    you are not sure, please let us know and we'll try to fix it.
==14999== Either way, Valgrind will now raise a SIGILL signal which will
==14999== probably kill your program.
==14999== 
==14999== Process terminating with default action of signal 4 (SIGILL): dumping core
==14999==  Illegal opcode at address 0x382BF0A2F0
==14999==    at 0x382BF0A2F0: check_add_mapping (in /lib64/tls/libpthread-2.3.4.so)
==14999==    by 0x4A14EDF: sem_open_intercept (drd_pthread_intercepts.c:1391)
==14999==    by 0x4A14EDF: sem_open (drd_pthread_intercepts.c:1398)
==14999==    by 0x400AFA: main (sem_open.c:63)
==14999== 
==14999== For counts of detected and suppressed errors, rerun with: -v
==14999== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Illegal instruction
Comment 1 Tushar 2016-12-01 23:08:09 UTC
cpuinfo

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           X5355  @ 2.66GHz
stepping        : 7
cpu MHz         : 2660.004
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm pni monitor ds_cpl est tm2 cx16 xtpr lahf_lm
bogomips        : 5323.68
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

glibc version

glibc-2.3.4-2.57.x86_64

Running the tests outside valgrind produces no errors

Context form  /lib64/tls/libpthread-2.3.4.so

  382bf0a2d1:   e8 92 ab ff ff          callq  382bf04e68 <memcpy@plt>
  382bf0a2d6:   49 83 fc ff             cmp    $0xffffffffffffffff,%r12
  382bf0a2da:   74 23                   je     382bf0a2ff <check_add_mapping+0x19f>
  382bf0a2dc:   48 8d 15 3d fe ff ff    lea    -0x1c3(%rip),%rdx        # 382bf0a120 <__sem_search>
  382bf0a2e3:   48 8d 35 66 a0 10 00    lea    0x10a066(%rip),%rsi        # 382c014350 <__sem_mappings>
  382bf0a2ea:   48 89 df                mov    %rbx,%rdi
  382bf0a2ed:   e8 46 ad ff ff          callq  382bf05038 <tsearch@plt>
  382bf0a2f2:   48 85 c0                test   %rax,%rax
  382bf0a2f5:   74 08                   je     382bf0a2ff <check_add_mapping+0x19f>
  382bf0a2f7:   4d 89 e7                mov    %r12,%r15
  382bf0a2fa:   e9 74 ff ff ff          jmpq   382bf0a273 <check_add_mapping+0x113>
  382bf0a2ff:   48 89 df                mov    %rbx,%rdi
  382bf0a302:   e8 71 ad ff ff          callq  382bf05078 <free@plt>
  382bf0a307:   e9 67 ff ff ff          jmpq   382bf0a273 <check_add_mapping+0x113>
  382bf0a30c:   45 31 c9                xor    %r9d,%r9d
  382bf0a30f:   45 89 f0                mov    %r14d,%r8d
Comment 2 Tushar 2016-12-02 02:58:18 UTC
This problem only seems to be happening with glibc 2.3.4 which is part of RHEL4.
glibc 2.5 (RHEL5) and glibc 2.12 (RHEL 6) don't exhibit this problem
Comment 3 Bart Van Assche 2016-12-02 03:12:44 UTC
Hi Tushar, are you aware that the "unhandled instruction bytes" message indicates a bug in VEX and not in DRD nor in Helgrind? See also http://valgrind.org/docs/manual/faq.html#faq.msgdeath.
Comment 4 Tushar 2016-12-02 03:24:06 UTC
Thank you for pointing to the right component.  Since the error was happening in the context of drd and helgrind, I picked those.
Comment 5 Tushar 2016-12-02 20:08:39 UTC
When running memcheck, this error does NOT occur.  Does VEX instrument libraries differently depending on the checker used?  

The same problem persists in version 3.8.0 and 3.11.0
Comment 6 Tushar 2016-12-07 22:15:01 UTC
Any help/suggestions on how to proceed?
Comment 7 Bart Van Assche 2016-12-08 19:18:03 UTC
Everyone RHEL user I know either uses RHEL 6 or RHEL 7. How relevant is RHEL 4 today?
Comment 8 Tushar 2016-12-08 20:04:53 UTC
Our computers are part of a $20+ million semiconductor capital equipment whose annual maintenance contract itself runs into hundreds of thousands of dollars a year.  So we can't drop support for legacy systems easily.  So it is very relevant to us.

I can help with investigation and testing but some pointers would definitely help.

To summarize my findings are

The crash only happens with 64 bit drd and helgrind test case.  memcheck is OK.
32 bit test cases have not problem with either of the checkers.

so I'm wondering if  this issue contained inside VEX or do the checkers also have a role to play.  Am I on the right track with my thought process?
Comment 9 Bart Van Assche 2016-12-08 20:36:56 UTC
Analyzing software with Helgrind and DRD may trigger other code paths compared to analyzing the same software with memcheck. My proposal is to start with looking into why VEX reports an unhandled instruction. Have you tried to disassemble the executable with objdump? If not, can you do that and post here the disassembly for the unhandled instruction bytes?
Comment 10 Tushar 2016-12-09 19:23:36 UTC
Based on the crash information, I did disassemble the library in question (/lib64/tls/libpthread-2.3.4.so) and looked at the instructions at the address 0x382BF0A2F0.  The instructions are

  382bf0a2d1:   e8 92 ab ff ff          callq  382bf04e68 <memcpy@plt>
  382bf0a2d6:   49 83 fc ff             cmp    $0xffffffffffffffff,%r12
  382bf0a2da:   74 23                   je     382bf0a2ff <check_add_mapping+0x19f>
  382bf0a2dc:   48 8d 15 3d fe ff ff    lea    -0x1c3(%rip),%rdx        # 382bf0a120 <__sem_search>
  382bf0a2e3:   48 8d 35 66 a0 10 00    lea    0x10a066(%rip),%rsi        # 382c014350 <__sem_mappings>
  382bf0a2ea:   48 89 df                mov    %rbx,%rdi
  382bf0a2ed:   e8 46 ad ff ff          callq  382bf05038 <tsearch@plt>
  382bf0a2f2:   48 85 c0                test   %rax,%rax
  382bf0a2f5:   74 08                   je     382bf0a2ff <check_add_mapping+0x19f>
  382bf0a2f7:   4d 89 e7                mov    %r12,%r15
  382bf0a2fa:   e9 74 ff ff ff          jmpq   382bf0a273 <check_add_mapping+0x113>
  382bf0a2ff:   48 89 df                mov    %rbx,%rdi
  382bf0a302:   e8 71 ad ff ff          callq  382bf05078 <free@plt>
  382bf0a307:   e9 67 ff ff ff          jmpq   382bf0a273 <check_add_mapping+0x113>
  382bf0a30c:   45 31 c9                xor    %r9d,%r9d
  382bf0a30f:   45 89 f0                mov    %r14d,%r8d

The thing that caught my attention is that data @ 0x382BF0A2F0 seems to be part of  callq instruction @ 382bf0a2ed.  VEX seems to think it is the start of a new instruction.

The disassembly of the test code (sem_open) itself at the are of interest is

  400aca:       ba f6 0d 40 00          mov    $0x400df6,%edx
  400acf:       be 20 00 00 00          mov    $0x20,%esi
  400ad4:       b8 00 00 00 00          mov    $0x0,%eax
  400ad9:       e8 e2 fd ff ff          callq  4008c0 <snprintf@plt>
  400ade:       48 8d 7d c0             lea    -0x40(%rbp),%rdi
  400ae2:       b9 01 00 00 00          mov    $0x1,%ecx
  400ae7:       ba 80 01 00 00          mov    $0x180,%edx
  400aec:       be c0 00 00 00          mov    $0xc0,%esi
  400af1:       b8 00 00 00 00          mov    $0x0,%eax
  400af6:       e8 25 fe ff ff          callq  400920 <sem_open@plt>
  400afb:       48 89 05 b6 07 20 00    mov    %rax,0x2007b6(%rip)        # 6012b8 <s_sem>
  400b02:       48 83 3d ae 07 20 00    cmpq   $0x0,0x2007ae(%rip)        # 6012b8 <s_sem>
  400b09:       00
  400b0a:       75 24                   jne    400b30 <main+0xfa>
  400b0c:       48 8d 55 c0             lea    -0x40(%rbp),%rdx
  400b10:       be 10 0e 40 00          mov    $0x400e10,%esi
  400b15:       48 8b 3d 7c 07 20 00    mov    0x20077c(%rip),%rdi        # 601298 <stderr@@GLIBC_2.2.5>
  400b1c:       b8 00 00 00 00          mov    $0x0,%eax
  400b21:       e8 ea fd ff ff          callq  400910 <fprintf@plt>



I hope this information helps
Comment 11 Tushar 2016-12-21 21:32:23 UTC
Is the above information useful?
Comment 12 Julian Seward 2017-03-06 15:34:25 UTC
I looked at it but it is not immediately obvious to me what is 
going on.  It's very strange that VEX is losing sync with the
instruction stream.  VEX is normally very reliable at that.
Comment 13 Tushar 2017-03-07 05:19:58 UTC
Julian,

Thanks for getting back. I tried to find out what is happening and didn't make much headway.  The problem can be reliably reproduced on an RHEL 4 system.  So if you want me to instrument something or if you have debug binary for me to try,  I will be more than happy to do so