Bug 283060 - unhandled instruction bytes: 0xF 0x3 0xC1 0x66
Status: REPORTED
Alias: None
Product: valgrind
Classification: Developer tools
Component: general (other bugs)
Version First Reported In: 3.7 SVN
Platform: Compiled Sources Microsoft Windows
Importance: NOR normal
Target Milestone: ---
Assignee: Julian Seward
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-09-30 03:38 UTC by Chris
Modified: 2013-10-28 15:06 UTC (History)

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:


Attachments
Proposed patch (LSL support in x86 VEX) (6.81 KB, patch)
2012-06-18 19:56 UTC, Jiří Hruška
Simple test case (2.20 KB, application/octet-stream)
2012-06-18 19:57 UTC, Jiří Hruška

Description Chris 2011-09-30 03:38:11 UTC
Version:           3.7 SVN
OS:                MS Windows

32 bit app compiled with MinGW. Running into unimplemented x86 instruction LSL.

I'm using current SVN trunk. Any chance you can implement this instruction?

thanks, Chris


Reproducible: Always

Steps to Reproduce:
(instruction in question)
77872680 0f03c1          lsl     eax,ecx

(instruction information)
http://siyobik.info/main/reference/instruction/LSL


Actual Results:  
vex x86->IR: unhandled instruction bytes: 0xF 0x3 0xC1 0x66
--3744--   SCHED[1]: TRC: NODECODE
==3744== valgrind: Unrecognised instruction at address 0x77872680.


Expected Results:  
no assertion

(assembly context)
ntdll_77840000!ExInterlockedPopEntrySList:
77872658 53              push    ebx
77872659 55              push    ebp
7787265a 8be9            mov     ebp,ecx
7787265c 83ec08          sub     esp,8
7787265f 33c0            xor     eax,eax
77872661 890424          mov     dword ptr [esp],eax
77872664 64a130000000    mov     eax,dword ptr fs:[00000030h]
7787266a f6804c02000004  test    byte ptr [eax+24Ch],4
77872671 7408            je      ntdll_77840000!ExpInterlockedPopEntrySListResume (7787267b)
77872673 55              push    ebp
77872674 e84bfafeff      call    ntdll_77840000!ZwWow64InterlockedPopEntrySList (778620c4)
77872679 eb4e            jmp     ntdll_77840000!ExpInterlockedPopEntrySListEnd+0x16 (778726c9)
ntdll_77840000!ExpInterlockedPopEntrySListResume:
7787267b b953000000      mov     ecx,53h
77872680 0f03c1          lsl     eax,ecx       <------- F A I L -------
77872683 668bd0          mov     dx,ax
77872686 6681e2ff03      and     dx,3FFh
7787268b c1e216          shl     edx,16h
7787268e c1e002          shl     eax,2
77872691 0bc2            or      eax,edx
77872693 8b5504          mov     edx,dword ptr [ebp+4]
77872696 8bc8            mov     ecx,eax
77872698 c1e810          shr     eax,10h
7787269b 668bca          mov     cx,dx
7787269e 3bca            cmp     ecx,edx
...
Comment 1 Julian Seward 2011-10-13 09:55:40 UTC
This must have been made by some mutant assembler.  There's surely
a shorter encoding of "shl %cl,%eax" ?
Comment 2 Tom Hughes 2011-10-13 09:59:39 UTC
You misparsed lsl I'm afraid - it's Load Segment Limit not Logical Shift Left ;-)
Comment 3 Julian Seward 2011-10-13 16:23:07 UTC
Oh, that's a good one :-)  /me is amused.

Ok, well, load segment limit .. off the top of my head I have no
idea whether we can support that.
Comment 4 Chris 2011-10-18 05:48:07 UTC
hmmm... what about a dirty helper like x86g_dirtyhelper_*, amd64g_dirtyhelper_* and the like... in/out guest state and 2 args denoting the register that holds the segment selector and the target register.

Still getting familiar with VEX, I might try myself. However, it is a low priority item.
Comment 5 Jiří Hruška 2012-06-18 19:55:27 UTC
tl;dr: I implemented the x86 LSL instruction according to the reference, but
there is only one reported usage and even there, the instruction is not used
as would seem appropriate (to me, anyway). Sane testing is hard and providing
support above "doesn't crash" is questionable.

---

  The attached patch is my take at implementing the LSL instruction; a second
attachment contains a possible sample program to test that it works.

  Adding support for the instruction was indeed quite simple, just following
the usual way of doing things as seen around for other opcodes on the
decoder/IR generation side, and reusing the code from x86g_use_seg_selector
to deal with the LDT/GDT at the helper side.
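
  For readers unfamiliar with what such a helper has to compute, here is a
minimal standalone C sketch of the core of LSL: extracting the limit from one
raw 8-byte descriptor and applying the granularity flag. It is not the code
from the attached patch and not a VEX API; the function name
limit_from_descriptor is hypothetical. The selector lookup, the descriptor
type and privilege checks, and the ZF update that the real instruction
performs are all left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of VEX: compute the limit that LSL would
   report for one raw 8-byte GDT/LDT descriptor, as laid out in memory.
   Validity, type and privilege checks are deliberately omitted. */
static uint32_t limit_from_descriptor(const uint8_t desc[8])
{
    assert(desc != NULL);

    /* Limit bits 0-15 are in bytes 0-1; bits 16-19 are in the low
       nibble of byte 6. */
    uint32_t limit = (uint32_t)desc[0]
                   | ((uint32_t)desc[1] << 8)
                   | ((uint32_t)(desc[6] & 0x0F) << 16);

    /* G flag (bit 7 of byte 6): the limit counts 4 KiB pages, so scale
       it and fill the low 12 bits with ones. */
    if (desc[6] & 0x80)
        limit = (limit << 12) | 0xFFFu;

    return limit;
}
```

  A flat 4 GiB descriptor (bytes FF FF 00 00 00 92 CF 00) yields 0xFFFFFFFF
here, while a byte-granular descriptor with a raw limit of 0xFFF yields 0xFFF
unchanged.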

  But this is my first IR transformation - feel free to correct/optimize the
way I implemented it. If it wasn't such a weird, uncommon instruction, I'd
be a bit worried about the amount of necessary IRops.

  Anyway, so far so good. But... The usability of the unit test lies mostly
in emitting the LSL instruction (both its variants, to be exact) and
verifying it is correctly decoded and executed.

  Testing if it behaves well is a hard task, because Linux simply sets all
segment registers to a default linear 0:ffffffff selector. Or, on some
distributions, all registers but CS, which is limited for security purposes,
so it's out of the way.

  Now, on Windows (which is relevant here due to it being the only platform
where this opcode has been seen and makes some sense), there is an interesting
limit of the FS segment, 0xFFF, which could be tested nicely. And it is
indeed returned as 0xFFF on a WinXP 32-bit machine.
  But on my Win7/x64 rig, the values returned don't make any sense to me,
like 0xBC00. Of course Valgrind4win fills the GDT with correct values using
Win32 API and returns 0xFFF as usual. I don't know if this is just some
consequence of mixing x86/x64 modes or whatever, but it sure is weird and
lessens the possibility of a good unit test again.

  Finally, the disassembly of the relevant user-land-side kernel function
where the LSL instruction is used shows it is meant to act as some kind of
a semaphore in a lock-free list implementation or something, which questions
the relevance of the instruction even more, at least in its implementation
according to the specs...
Comment 6 Jiří Hruška 2012-06-18 19:56:52 UTC
Created attachment 71924 [details]
Proposed patch (LSL support in x86 VEX)
Comment 7 Jiří Hruška 2012-06-18 19:57:23 UTC
Created attachment 71925 [details]
Simple test case
Comment 8 EG 2013-10-28 15:06:01 UTC
I know this thread is very old, but the following bit of information might be useful for both Valgrind4win and whoever lands on this page, as information about lsl and segment descriptor 53h is pretty scarce.

In short: 

The limit of segment 0x53 actually contains the "current processor number" (the ID of the logical processor executing that instruction).

In detail:

Afaik, on Windows 7 64-bit, segment 0x53 is a 32-bit r/w data segment with 1-byte granularity (as opposed to 4 KB, i.e. lsl won't shift the limit 12 bits left and fill with 1s before returning it). Windows 7 64-bit (maybe also 2008 Server and Win8, I don't know) does not load that descriptor into any of the segment registers (even fs and gs are equal to ds, although they can have a non-zero base address, which comes from an MSR instead of the segment descriptor). Also, the segment limit is not used by the CPU in 64-bit long mode, so basically this gives 20 free bits.

Windows uses them to store the "current processor number" (logical ID, accounting for cores and hyperthreading), as returned by the GetCurrentProcessorNumberEx Win32 API, albeit with a slight transformation:

bits 0-9: Processor group (always 0 on my machine)
bits 10-13: ???? Always all 1s on my system; my guess is that this guarantees a segment limit of at least 15360 bytes (should the two other fields be 0) when this segment is used by fs while running 32-bit apps under wow64.
bits 14-19: Processor number within group (for example, 0 to 7 on a desktop quad-core + hyper-threading CPU)
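
The layout above can be sketched as a small C decoder. This is only an illustration of the three fields as I described them from my own machine; the type and function names (seg53_fields, decode_seg53_limit) are hypothetical and nothing here is a documented Windows interface.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type: the three fields of the segment 0x53 limit,
   following the bit layout observed above (not a documented format). */
typedef struct {
    uint32_t group;   /* bits 0-9:   processor group */
    uint32_t filler;  /* bits 10-13: observed as all ones */
    uint32_t number;  /* bits 14-19: processor number within the group */
} seg53_fields;

/* Split a 20-bit, byte-granular segment limit into those fields. */
static seg53_fields decode_seg53_limit(uint32_t limit)
{
    seg53_fields f;

    assert(limit <= 0xFFFFFu);  /* a byte-granular limit has 20 bits */

    f.group  = limit & 0x3FFu;
    f.filler = (limit >> 10) & 0xFu;
    f.number = (limit >> 14) & 0x3Fu;
    return f;
}
```

Applied to the 0xBC00 value reported in comment 5, this gives group 0, filler 0xF and processor number 2, which is consistent with the layout. A limit with only the filler bits set is 0x3C00, i.e. the 15360 mentioned above.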

The same information can be obtained through the cpuid instruction, with EAX=1 (result in EBX bits 24-31) and probably EAX=0xb (result in EDX) too (I never tested the latter).

I can't tell for sure how this ends up in a segment limit. It is probably set early in the OS loader when each processor/core is activated and initializes its own copy of the GDT. And why does Windows do it this way? One might guess the GDT is one of the very few "processor-local" memory areas that can be read (at least partly) from user-land, on top of lsl being an atomic (a single lsl instead of two cpuid) and pretty quick operation.

Given the code from Chris above, it seems that some interlocked operation depends on the current processor it is executed on. Returning a "real limit" such as 0xfff (namely, always returning the same value) might break the locking logic, unless of course Valgrind4win assumes a single-processor environment.
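
To make the dependence on the returned value concrete, here is a minimal C sketch of the five instructions that follow lsl in the disassembly from the description (addresses 77872683 to 77872691). The function name pack_after_lsl is hypothetical, and the sketch covers only those five instructions; what the rest of the routine does with the result is not visible in the excerpt.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name: mirrors the instructions right after
   "lsl eax,ecx" in the quoted disassembly, with the lsl result
   passed in as `limit`. */
static uint32_t pack_after_lsl(uint32_t limit)
{
    assert(limit <= 0xFFFFFu);     /* byte-granular limit: 20 bits */

    uint32_t eax = limit;

    /* mov dx,ax ; and dx,3FFh
       The stale upper half of edx is shifted out by the shl below,
       so only the low 10 bits of the limit survive in edx. */
    uint32_t edx = eax & 0x3FFu;

    edx <<= 22;                    /* shl edx,16h */
    eax <<= 2;                     /* shl eax,2   */
    return eax | edx;              /* or  eax,edx */
}
```

With a constant limit of 0xFFF this returns 0xFFC03FFC on every processor, whereas limits such as 0xBC00 and 0xFC00 (processor numbers 2 and 3 under the layout above) give 0x2F000 and 0x3F000 respectively.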