Bug 115869

Summary: Trying to valgrind the qemu does not work (or unhandled instruction bytes 0xF3 0xB8 0x92 0xE0)
Product: [Developer tools] valgrind Reporter: Igor Kovalenko <igor.v.kovalenko>
Component: generalAssignee: Julian Seward <jseward>
Status: RESOLVED NOT A BUG    
Severity: crash    
Priority: NOR    
Version: 3.1 SVN   
Target Milestone: ---   
Platform: Compiled Sources   
OS: Linux   
Latest Commit: Version Fixed In:

Description Igor Kovalenko 2005-11-07 19:59:42 UTC
I in qemu list that valgrding could take the qemu trace - but failed to confirm
that. I also tried with --smc-check=all but result stays the same.
Running with --tool=memcheck --leak-check=no hangs the valgrind (it does
something until the point where even --trace-cfi=yes stops the output) but it
did not crash. Is it OK to assume that self-modifying code checks does not work yet?
Anyway, the trace of --tool=none follows.
Sorry for 'M' version reporting - the only local modification is for callgrind
to build with current /trunk version (as output by svn diff):
--CUT LOCAL MODIFICATION--
Index: valgrind.pc.in
===================================================================
--- valgrind.pc.in      (revision 5032)
+++ valgrind.pc.in      (working copy)
@@ -5,6 +5,7 @@
 arch=@VG_ARCH@
 os=@VG_OS@
 platform=@VG_PLATFORM@
+valt_load_address=@VALT_LOAD_ADDRESS@

 Name: Valgrind
 Description: A dynamic binary instrumentation framework
--CUT LOCAL MODIFICATION--

==15870== Nulgrind, a binary JIT-compiler.
==15870== Copyright (C) 2002-2005, and GNU GPL'd, by Nicholas Nethercote.
==15870== Using LibVEX rev 1446M, a library for dynamic binary translation.
==15870== Copyright (C) 2004-2005, and GNU GPL'd, by OpenWorks LLP.
==15870== Using valgrind-3.1.SVN, a dynamic binary instrumentation framework.
==15870== Copyright (C) 2000-2005, and GNU GPL'd, by Julian Seward et al.
==15870==
--15870-- Valgrind library directory: /home/garrison/local/lib/valgrind
--15870-- Command line
--15870--    qemu
--15870--    -hda
--15870--    fc3.img-vg
--15870--    -boot
--15870--    c
--15870--    -m
--15870--    16
--15870--    -localtime
--15870-- Startup, with flags:
--15870--    -v
--15870--    -v
--15870--    --suppressions=/home/garrison/local/lib/valgrind/garrison.supp
--15870--    --tool=none
--15870--    --error-limit=no
--15870-- Contents of /proc/version:
--15870--   Linux version 2.6.14-rc5-git7-alsa64 (root@skyserv) (gcc version
4.0.3-pre20051023 (Gentoo 4.0.3_pre20051023)) #2 Thu Oct 27 23:49:00 MSD 2005
--15870-- TT/TC: VG_(init_tt_tc) (startup of code management)
--15870-- TT/TC: cache: 8 sectors of 5870592 bytes each = 46964736 total
--15870-- TT/TC: table: 524168 total entries, max occupancy 419328 (80%)
--15870-- Reading syms from /home/garrison/work/qemu/qemu-cvs/inst/bin/qemu
(0x400000)
--15870-- Reading syms from /lib64/ld-2.3.90.so (0x4000000)
--15870-- summarise_context(loc_start = 0xA3): cannot summarise(why=1):
0xC9: 104(r1) u  c-32 u  u  u  u  u  u  u  u  u  u  u  u  u  u  c-8 u  u  u
--15870-- summarise_context(loc_start = 0x0): cannot summarise(why=1):
0x2E: 0(r4) u  c0 u  u  u  u  c8 c48 u  u  u  u  c16 c24 c32 c40 c56 u  u  u
--15870-- Reading syms from /home/garrison/local/lib/valgrind/amd64-linux/none
(0x70000000)
--15870--    object doesn't have a dynamic symbol table
--15870-- Host CPU: arch = AMD64, subarch = NONE
--15870-- TT/TC: initialise sector 0
--15870-- Reading syms from
/home/garrison/local/lib/valgrind/amd64-linux/vgpreload_core.so (0x491B000)
--15870-- Reading syms from /lib64/libm-2.3.90.so (0x4A1C000)
--15870-- Reading syms from /lib64/libz.so.1.2.3 (0x4BA1000)
--15870-- Reading syms from /lib64/libpthread-2.3.90.so (0x4CB5000)
--15870-- Reading syms from /usr/lib64/libSDL-1.2.so.0.7.2 (0x4DCA000)
--15870-- Reading syms from /lib64/libutil-2.3.90.so (0x4F2C000)
--15870-- Reading syms from /lib64/libc-2.3.90.so (0x502F000)
--15870-- summarise_context(loc_start = 0x0): cannot summarise(why=1):
0x2E: 0(r4) u  c0 u  u  u  u  c8 c48 u  u  u  u  c16 c24 c32 c40 c56 u  u  u
--15870-- summarise_context(loc_start = 0x35): cannot summarise(why=1):
0x7D: 0(r4) u  c128 u  u  u  u  c120 c160 u  u  u  u  c72 c80 c88 c96 c168 u  u  u
--15870-- Reading syms from /lib64/libdl-2.3.90.so (0x525A000)
--15870-- Reading syms from /usr/lib64/libX11.so.6.2 (0x535E000)
--15870-- Reading syms from /usr/lib64/libXext.so.6.4 (0x553E000)
--15870-- REDIR: 0xFFFFFFFFFF600000 (???) redirected to 0x7000CB07 (???)
--15870-- Reading syms from /usr/lib64/libXcursor.so.1.0.2 (0x6AA1000)
--15870-- Reading syms from /usr/lib64/libXrender.so.1.2.2 (0x6BAB000)
--15870-- Reading syms from /usr/lib64/X11/locale/lib64/common/xlcDef.so.2
(0x6CB4000)
--15870-- REDIR: 0xFFFFFFFFFF600400 (???) redirected to 0x7000CB11 (???)
vex amd64->IR: unhandled instruction bytes: 0xF3 0xB8 0x92 0xE0
==15870== Your program just tried to execute an instruction that Valgrind
==15870== did not recognise.  There are two possible reasons for this.
==15870== 1. Your program has a bug and erroneously jumped to a non-code
==15870==    location.  If you are running Memcheck and you just saw a
==15870==    warning about a bad jump, it's probably your program's fault.
==15870== 2. The instruction is legitimate but Valgrind doesn't handle it,
==15870==    i.e. it's Valgrind's fault.  If you think this is the case or
==15870==    you are not sure, please let us know.
==15870== Either way, Valgrind will now raise a SIGILL signal which will
==15870== probably kill your program.
==15870==
==15870== Process terminating with default action of signal 4 (SIGILL)
==15870==  Illegal opcode at address 0x17043D7
==15870==    at 0x17043D7: ???
==15870==
--15870--     tt/tc: 18,610 tt lookups requiring 20,077 probes
--15870--     tt/tc: 18,610 fast-cache updates, 2 flushes
--15870-- translate: new        8,906 (221,912 -> 1,474,175; ratio 66:10) [0 scs]
--15870-- translate: dumped     0 (0 -> ??)
--15870-- translate: discarded  0 (0 -> ??)
--15870-- scheduler: 1,353,867 jumps (bb entries).
--15870-- scheduler: 27/10,284 major/minor sched events.
--15870--    sanity: 28 cheap, 2 expensive checks.
--15870--    exectx: 30,011 lists, 0 contexts (avg 0 per list)
--15870--    exectx: 0 searches, 0 full compares (0 per 1000)
--15870--    exectx: 0 cmp2, 0 cmp4, 0 cmpAll
--15870--
--15870-- ------ Valgrind's internal memory use stats follow ------
--15870-- core    :  1 sbs,     2 bs,  1/1  free bs, 1048576 mmap,      32 loan
--15870-- tool    :  1 sbs,     6 bs,  3/3  free bs, 1048576 mmap,     280 loan
--15870-- symtab  :  6 sbs,   119 bs, 16/16 free bs, 11350016 mmap, 9503408 loan
--15870-- (null)  :  0 sbs,     0 bs,  0/0  free bs,       0 mmap,       0 loan
--15870-- demangle:  0 sbs,     0 bs,  0/0  free bs,       0 mmap,       0 loan
--15870-- exectxt :  0 sbs,     0 bs,  0/0  free bs,       0 mmap,       0 loan
--15870-- errors  :  0 sbs,     0 bs,  0/0  free bs,       0 mmap,       0 loan
--15870-- ttaux   :  1 sbs,   241 bs, 27/27 free bs,   65536 mmap,   28568 loan
--15870-- ------
--15870-- core    :  1048576 mmap'd,       96/      32 max/curr
--15870-- tool    :  1048576 mmap'd,    12352/     280 max/curr
--15870-- symtab  : 11350016 mmap'd,  9504736/ 9503408 max/curr
--15870-- (null)  :        0 mmap'd,        0/       0 max/curr
--15870-- demangle:        0 mmap'd,        0/       0 max/curr
--15870-- exectxt :        0 mmap'd,        0/       0 max/curr
--15870-- errors  :        0 mmap'd,        0/       0 max/curr
--15870-- ttaux   :    65536 mmap'd,    28664/   28568 max/curr
--15870--
Comment 1 Igor Kovalenko 2005-11-07 20:02:50 UTC
yet once again sorry for custom suppressions - it it matters, without them the problem report is the same.
Comment 2 Julian Seward 2005-11-07 22:58:42 UTC
> vex amd64->IR: unhandled instruction bytes: 0xF3 0xB8 0x92 0xE0

Good that someone finally tried to use Valgrind on QEMU.  I guess
you are using qemu-softmmu, and no kqemu acceleration, yes?  It should
work if you use --smc-check=all.

0xF3 0xB8 really looks like an illegal amd64 instruction.  Is it
possible that some bug before this point made QEMU jump to an illegal
address?  Or that your custom suppressions are hiding a real problem?
Comment 3 Igor Kovalenko 2005-11-07 23:32:20 UTC
Yes, it is i386-softmmu one, no kqemu accelerator module. Host arch is x86_64, qemu itself is 64bit, and would try to run 32bit mode emulation. It goes up to the  point of displaying SDL-based emulator window, then it crashes under valgrind. I already tried --smc-check=all, and --vex-guest-max-insns=1 :) BTW qemu itself is working OK (I'm using it to run FC3 guest). A clean run trace of 'valgrind -v -v --smc-check=all --tool=none', this time without any custom suppressions is as follows (not much different I believe):

==16395== Nulgrind, a binary JIT-compiler.
==16395== Copyright (C) 2002-2005, and GNU GPL'd, by Nicholas Nethercote.
==16395== Using LibVEX rev 1446M, a library for dynamic binary translation.
==16395== Copyright (C) 2004-2005, and GNU GPL'd, by OpenWorks LLP.
==16395== Using valgrind-3.1.SVN, a dynamic binary instrumentation framework.
==16395== Copyright (C) 2000-2005, and GNU GPL'd, by Julian Seward et al.
==16395==
--16395-- Valgrind library directory: /home/garrison/local/lib/valgrind
--16395-- Command line
--16395--    qemu
--16395--    -hda
--16395--    fc3.img-vg
--16395--    -boot
--16395--    c
--16395--    -m
--16395--    16
--16395--    -localtime
--16395-- Startup, with flags:
--16395--    -v
--16395--    -v
--16395--    --smc-check=all
--16395--    --tool=none
--16395--    --error-limit=no
--16395-- Contents of /proc/version:
--16395--   Linux version 2.6.14-git10-alsa64 (root@skyserv) (gcc version 4.0.3-pre20051104 (Gentoo 4.0.3_pre20051104)) #1 Mon Nov 7 23:00:18 MSK 2005
--16395-- TT/TC: VG_(init_tt_tc) (startup of code management)
--16395-- TT/TC: cache: 8 sectors of 5870592 bytes each = 46964736 total
--16395-- TT/TC: table: 524168 total entries, max occupancy 419328 (80%)
--16395-- Reading syms from /home/garrison/work/qemu/qemu-cvs/inst/bin/qemu (0x400000)
--16395-- Reading syms from /lib64/ld-2.3.90.so (0x4000000)
--16395-- summarise_context(loc_start = 0xA3): cannot summarise(why=1):
0xC9: 104(r1) u  c-32 u  u  u  u  u  u  u  u  u  u  u  u  u  u  c-8 u  u  u
--16395-- summarise_context(loc_start = 0x0): cannot summarise(why=1):
0x2E: 0(r4) u  c0 u  u  u  u  c8 c48 u  u  u  u  c16 c24 c32 c40 c56 u  u  u
--16395-- Reading syms from /home/garrison/local/lib/valgrind/amd64-linux/none (0x70000000)
--16395--    object doesn't have a dynamic symbol table
--16395-- Host CPU: arch = AMD64, subarch = NONE
--16395-- TT/TC: initialise sector 0
--16395-- Reading syms from /home/garrison/local/lib/valgrind/amd64-linux/vgpreload_core.so (0x491B000)
--16395-- Reading syms from /lib64/libm-2.3.90.so (0x4A1C000)
--16395-- Reading syms from /lib64/libz.so.1.2.3 (0x4BA1000)
--16395-- Reading syms from /lib64/libpthread-2.3.90.so (0x4CB5000)
--16395-- Reading syms from /usr/lib64/libSDL-1.2.so.0.7.2 (0x4DCA000)
--16395-- Reading syms from /lib64/libutil-2.3.90.so (0x4F2C000)
--16395-- Reading syms from /lib64/libc-2.3.90.so (0x502F000)
--16395-- summarise_context(loc_start = 0x0): cannot summarise(why=1):
0x2E: 0(r4) u  c0 u  u  u  u  c8 c48 u  u  u  u  c16 c24 c32 c40 c56 u  u  u
--16395-- summarise_context(loc_start = 0x35): cannot summarise(why=1):
0x7D: 0(r4) u  c128 u  u  u  u  c120 c160 u  u  u  u  c72 c80 c88 c96 c168 u  u  u
--16395-- Reading syms from /lib64/libdl-2.3.90.so (0x525A000)
--16395-- Reading syms from /usr/lib64/libX11.so.6.2 (0x535E000)
--16395-- Reading syms from /usr/lib64/libXext.so.6.4 (0x553E000)
--16395-- REDIR: 0xFFFFFFFFFF600000 (???) redirected to 0x7000CB07 (???)
--16395-- Reading syms from /usr/lib64/libXcursor.so.1.0.2 (0x6AA1000)
--16395-- Reading syms from /usr/lib64/libXrender.so.1.2.2 (0x6BAB000)
--16395-- Reading syms from /usr/lib64/X11/locale/lib64/common/xlcDef.so.2 (0x6CB4000)
--16395-- REDIR: 0xFFFFFFFFFF600400 (???) redirected to 0x7000CB11 (???)
vex amd64->IR: unhandled instruction bytes: 0xF3 0xB8 0x92 0xE0
==16395== Your program just tried to execute an instruction that Valgrind
==16395== did not recognise.  There are two possible reasons for this.
==16395== 1. Your program has a bug and erroneously jumped to a non-code
==16395==    location.  If you are running Memcheck and you just saw a
==16395==    warning about a bad jump, it's probably your program's fault.
==16395== 2. The instruction is legitimate but Valgrind doesn't handle it,
==16395==    i.e. it's Valgrind's fault.  If you think this is the case or
==16395==    you are not sure, please let us know.
==16395== Either way, Valgrind will now raise a SIGILL signal which will
==16395== probably kill your program.
==16395==
==16395== Process terminating with default action of signal 4 (SIGILL)
==16395==  Illegal opcode at address 0x17043D7
==16395==    at 0x17043D7: ???
==16395==
--16395--     tt/tc: 21,280 tt lookups requiring 23,373 probes
--16395--     tt/tc: 21,280 fast-cache updates, 2 flushes
--16395-- translate: new        10,051 (194,622 -> 2,156,596; ratio 110:10) [10,051 scs]
--16395-- translate: dumped     0 (0 -> ??)
--16395-- translate: discarded  0 (0 -> ??)
--16395-- scheduler: 1,481,042 jumps (bb entries).
--16395-- scheduler: 29/11,839 major/minor sched events.
--16395--    sanity: 30 cheap, 2 expensive checks.
--16395--    exectx: 30,011 lists, 0 contexts (avg 0 per list)
--16395--    exectx: 0 searches, 0 full compares (0 per 1000)
--16395--    exectx: 0 cmp2, 0 cmp4, 0 cmpAll
--16395--
--16395-- ------ Valgrind's internal memory use stats follow ------
--16395-- core    :  1 sbs,     2 bs,  1/1  free bs, 1048576 mmap,      32 loan
--16395-- tool    :  1 sbs,     6 bs,  3/3  free bs, 1048576 mmap,     280 loan
--16395-- symtab  :  6 sbs,   119 bs, 16/16 free bs, 11350016 mmap, 9503408 loan
--16395-- (null)  :  0 sbs,     0 bs,  0/0  free bs,       0 mmap,       0 loan
--16395-- demangle:  0 sbs,     0 bs,  0/0  free bs,       0 mmap,       0 loan
--16395-- exectxt :  0 sbs,     0 bs,  0/0  free bs,       0 mmap,       0 loan
--16395-- errors  :  0 sbs,     0 bs,  0/0  free bs,       0 mmap,       0 loan
--16395-- ttaux   :  1 sbs,   236 bs, 22/22 free bs,   65536 mmap,   26304 loan
--16395-- ------
--16395-- core    :  1048576 mmap'd,       96/      32 max/curr
--16395-- tool    :  1048576 mmap'd,    12352/     280 max/curr
--16395-- symtab  : 11350016 mmap'd,  9504736/ 9503408 max/curr
--16395-- (null)  :        0 mmap'd,        0/       0 max/curr
--16395-- demangle:        0 mmap'd,        0/       0 max/curr
--16395-- exectxt :        0 mmap'd,        0/       0 max/curr
--16395-- errors  :        0 mmap'd,        0/       0 max/curr
--16395-- ttaux   :    65536 mmap'd,    26336/   26304 max/curr
--16395--
Comment 4 Julian Seward 2005-11-08 00:56:02 UTC
> I already tried --smc-check=all, and --vex-guest-max-insns=1

You should only need the first one of those.

I still don't know what this instruction is.  Please try this:

1. run --smc-check=all --trace-flags=10000000 --trace-notbelow=99999
   This prints 1 line per bb translated, and gives each a number.

2. Replace 99999 with the highest bb number you saw.  This should
   print a disassembly of the bb where the problem occurred. 
   Send that.
Comment 5 Igor Kovalenko 2005-11-08 06:57:23 UTC
There seems to be some interaction with qemu timer - when I chase for BB it generally shows max BB 10007 but if I specify dumping from there it or move terminal windows with mouse it can raise to 10048:
--29864-- Startup, with flags:
--29864--    -v
--29864--    -v
--29864--    --smc-check=all
--29864--    --trace-flags=10000000
--29864--    --trace-notbelow=10000
--29864--    --log-file=LOGtrace1
--29864--    --tool=none
--29864--    --error-limit=no
...
==== BB 10047 (0x17043CF) BBs exec'd 1482299 ====

------------------------ Front end ------------------------

        0x17043CF:  movl -13631749(%%rip),%eax

              ------ IMark(0x17043CF, 6) ------
              t0 = Add64(0x17043D5:I64,0xFFFFFFFFFF2FFEFB:I64)
              PUT(0) = 32Uto64(LDle:I32(t0))

        0x17043D5:  jmp*  %rax

              ------ IMark(0x17043D5, 2) ------
              PUT(168) = 0x17043D5:I64
              t1 = GET:I32(0)
              t2 = GET:I64(0)
              goto {Boring} t2

. 0 17043CF 8
. 8B 50 FB FE 2F FF FF E0

==== BB 10048 (0x17043D7) BBs exec'd 1482300 ====

------------------------ Front end ------------------------

        0x17043D7:  vex amd64->IR: unhandled instruction bytes: 0xF3 0xB8 0x92 0xE0
              ------ IMark(0x17043D7, 0) ------
              PUT(168) = 0x17043D7:I64
              goto {NoDecode} 0x17043D7:I64

. 0 17043D7 0
.

==29864== Your program just tried to execute an instruction that Valgrind
==29864== did not recognise.  There are two possible reasons for this.
Comment 6 Julian Seward 2005-11-08 14:34:06 UTC
> There seems to be some interaction with qemu timer - when I chase for BB it
> generally shows max BB 10007


There's some interaction with whether output is directed to a file or not.

>         0x17043D5:  jmp*  %rax
>         0x17043D7:  vex amd64-IR: unhandled instruction bytes: 0xF3 0xB8


Er, I just don't know.  This is consistent with V jumping off into outer
space, but I don't know why it happened.

Maybe you were using V to look for a bug in qemu, and this is part of
the bug?  Were you looking for some specific bug, or just doing a general
bug-check?
Comment 7 Igor Kovalenko 2005-11-08 20:10:57 UTC
I'm just curious actually.

I do have a little problem with linux terminal emulation under qemu - but that should be much later than qemu bootstrap stage that I check with valgrind in the test which leaded to this case. The test is to load qemu with just created virtual disk; no OS, just BIOS et al.

What do you mean by this is consistent with jumping to outer space? Are there other similar looking problems?

I can try to check with other compiler/qemu/whatever-you-can-tell combinations, if you want me to; I'd like this little problem to pass.
Comment 8 Igor Kovalenko 2005-11-08 23:40:56 UTC
Last good BB seems like good code at 0x017043d5 but there is a problem decoding instruction starting at 0x017043d7. I asked qemu to output dyn-gen-code; qemu insists that it is a correct repz mov sequence:

###V's BB 10047
0x017043cf:  mov    -13631749(%rip),%eax        # 0xa042d0
0x017043d5:  jmpq   *%eax

###V's BB 10048
0x017043d7:  repz mov $0xe092,%eax
###V's choke here

0x017043dd:  mov    %eax,0x20(%rbp)
0x017043e0:  lea    -13631837(%rip),%ebx        # 0xa04289
0x017043e6:  retq
Comment 9 Tom Hughes 2005-11-09 00:04:59 UTC
In message <20051108224058.4889.qmail@ktown.kde.org> you wrote:

> Last good BB seems like good code at 0x017043d5 but there is a problem
> decoding instruction starting at 0x017043d7. I asked qemu to output
> dyn-gen-code; qemu insists that it is a correct repz mov sequence:


Except there is no such instruction in the AMD manual... I did see
one reference on google that suggested the rep prefix might just be
ignored on a mov but then why would qemu generate it?

Tom
Comment 10 Tom Hughes 2005-11-09 01:09:56 UTC
The Intel manual agrees with the AMD manual that the rep/repz prefixes can only be used on the string instructions (movs, cmps, scas, lods, stos, ins and outs) and as part of SSE instructions when the byte after the prefix is 0xf.

So although the disassembly produced by qemu is literally correct - it is a mov instruction with a repz prefix - it seems that it is not a valid instruction. Obviously processors are prepared to execute it though, presumably just ignoring the prefix.
Comment 11 Igor Kovalenko 2005-11-09 01:48:39 UTC
Ohhh this valgrind session was inspiring :)
It really must be qemu dynamic generator problem with gcc-produced code.
The original code was generated by gcc (3.4.x) from 'do { some code } while (0);' construct, which gives the assembly ending in 'repz retq' sequence.
qemu dynamic code generator parses that and strips the retq; but it leaves the repz in place. Than it appends other code to that sequence, which unhides the problem. Seems like this worked out by pure coincidence. I'll try asking at qemu list.

I modified qemu generator to also strip the repz prefix from retq and now I'm able to run qemu under valgrind with --tool=memcheck.

Many thanks everyone!
Comment 12 Tom Hughes 2005-11-09 08:16:07 UTC
In message <20051109004840.28351.qmail@ktown.kde.org> you wrote:

> Ohhh this valgrind session was inspiring :)
> It really must be qemu dynamic generator problem with gcc-produced code.
> The original code was generated by gcc (3.4.x) from 'do { some code } while
> (0);' construct, which gives the assembly ending in 'repz retq' sequence.
> qemu dynamic code generator parses that and strips the retq; but it leaves
> the repz in place. Than it appends other code to that sequence, which
> unhides the problem. Seems like this worked out by pure coincidence. I'll
> try asking at qemu list.


Ah. There is some logic to repz retq - if you search the bug tracker
and/or mailing list archives you'll see that it has come up before and
is a wierd AMD suggested optimisation.

Tom
Comment 13 Julian Seward 2005-11-09 18:59:20 UTC
Ok, so it's a bug in qemu, yes?
Comment 14 Igor Kovalenko 2005-11-09 19:27:55 UTC
Confirmed this is not V's fault. Do not know if this is really a gcc bug.