Bug 328100

Summary: XABORT not implemented
Product: [Developer tools] valgrind Reporter: Joël Krähemann <joel>
Component: vexAssignee: Julian Seward <jseward>
Status: RESOLVED FIXED    
Severity: normal CC: mark, samuel.thibault
Priority: NOR Keywords: usability
Version: unspecified   
Target Milestone: ---   
Platform: Fedora RPMs   
OS: Linux   
Latest Commit: Version Fixed In:
Attachments: Patch to implement xabort as nop

Description Joël Krähemann 2013-11-26 09:38:07 UTC
From valgrind mailing list:

Am Dienstag, den 26.11.2013, 00:18 +0100 schrieb Mark Wielaard:
> On Mon, Nov 25, 2013 at 10:40:40PM +0000, Tom Hughes wrote:
> > On 25/11/13 22:06, Joël Krähemann wrote:
> > 
> > > Hi, I get this nice error message out of valgrind while starting or
> > > using my application and I don't know why.
> > >
> > > http://sf.net/p/ags
> > >
> > > Please help.
> > 
> > Given the location of the instruction I'm guessing you have a Haswell 
> > processor and you've built glibc with transactional memory support so 
> > that it is trying to do lock ellision using the transactional memory 
> > instructions.
I'm using g_atomic_int_set, g_atomic_int_get, g_atomic_int_xor and
g_atomic_int_and to implement thread synchronization. Could theses
functions be the problem or aren't they working proper?
> > 
> > The transactional memory instructions aren't fully supported in valgrind 
> > yet, so you'll probably have to rebuild glibc without that support for now.
> 
> This is somewhat unfortunate. We do have prelimenary support, but only
> implemented xstart and xtest to indicate the transaction always immediately
> fails. The instruction hit is xabort, which in the valgrind case would
> always just be a NOP since no transaction is (can be) active.
> 
> If there isn't a bug report for XABORT not being implemented, please do
> file one. We really should add it.

Here's the output of lscpu.

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 58
Model name:            Intel(R) Core(TM) i7-3740QM CPU @ 2.70GHz
Stepping:              9
CPU MHz:               2754.000
CPU max MHz:           3700.0000
CPU min MHz:           1200.0000
BogoMIPS:              5387.62
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              6144K
NUMA node0 CPU(s):     0-7

Reproducible: Always

Steps to Reproduce:
1. launch ags with valgrind
2.
3.
Actual Results:  
valgrind exits with:

--9862-- memcheck GC: 2827 nodes, 2355 survivors ( 83.3%)
--9862-- memcheck GC: 3997 new table size (stepup)
vex amd64->IR: unhandled instruction bytes: 0xC6 0xF8 0xFD 0xF 0xB7 0x6 0x66 0x85
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
==9862== valgrind: Unrecognised instruction at address 0x34f3411390.
==9862==    at 0x34F3411390: __lll_trylock_elision (in /usr/lib64/libpthread-2.18.so)
==9862==    by 0x34F340A22B: pthread_mutex_trylock (in /usr/lib64/libpthread-2.18.so)
==9862==    by 0x4A1264: ags_gui_thread_do_gtk_iteration.60959 (ags_gui_thread.c:199)
==9862==    by 0x4A1539: ags_gui_thread_run (ags_gui_thread.c:251)
==9862==    by 0x34F70104C6: ??? (in /usr/lib64/libgobject-2.0.so.0.3800.2)
==9862==    by 0x34F7029748: g_signal_emit_valist (in /usr/lib64/libgobject-2.0.so.0.3800.2)
==9862==    by 0x34F702A3AE: g_signal_emit (in /usr/lib64/libgobject-2.0.so.0.3800.2)
==9862==    by 0x49F625: ags_thread_run (ags_thread.c:1720)
==9862==    by 0x49F99E: ags_thread_loop (ags_thread.c:1643)
==9862==    by 0x34F3407F32: start_thread (in /usr/lib64/libpthread-2.18.so)
==9862==    by 0x34F2CF4EAC: clone (in /usr/lib64/libc-2.18.so)
==9862== Your program just tried to execute an instruction that Valgrind
==9862== did not recognise.  There are two possible reasons for this.
==9862== 1. Your program has a bug and erroneously jumped to a non-code
==9862==    location.  If you are running Memcheck and you just saw a
==9862==    warning about a bad jump, it's probably your program's fault.
==9862== 2. The instruction is legitimate but Valgrind doesn't handle it,
==9862==    i.e. it's Valgrind's fault.  If you think this is the case or
==9862==    you are not sure, please let us know and we'll try to fix it.
==9862== Either way, Valgrind will now raise a SIGILL signal which will
==9862== probably kill your program.
==9862== 
==9862== Process terminating with default action of signal 4 (SIGILL)
==9862==  Illegal opcode at address 0x34F3411390
==9862==    at 0x34F3411390: __lll_trylock_elision (in /usr/lib64/libpthread-2.18.so)
==9862==    by 0x34F340A22B: pthread_mutex_trylock (in /usr/lib64/libpthread-2.18.so)
==9862==    by 0x4A1264: ags_gui_thread_do_gtk_iteration.60959 (ags_gui_thread.c:199)
==9862==    by 0x4A1539: ags_gui_thread_run (ags_gui_thread.c:251)
==9862==    by 0x34F70104C6: ??? (in /usr/lib64/libgobject-2.0.so.0.3800.2)
==9862==    by 0x34F7029748: g_signal_emit_valist (in /usr/lib64/libgobject-2.0.so.0.3800.2)
==9862==    by 0x34F702A3AE: g_signal_emit (in /usr/lib64/libgobject-2.0.so.0.3800.2)
==9862==    by 0x49F625: ags_thread_run (ags_thread.c:1720)
==9862==    by 0x49F99E: ags_thread_loop (ags_thread.c:1643)
==9862==    by 0x34F3407F32: start_thread (in /usr/lib64/libpthread-2.18.so)
==9862==    by 0x34F2CF4EAC: clone (in /usr/lib64/libc-2.18.so)
==9862== 
==9862== HEAP SUMMARY:
==9862==     in use at exit: 2,241,959 bytes in 29,002 blocks
==9862==   total heap usage: 91,386 allocs, 62,384 frees, 7,205,816 bytes allocated
==9862== 
==9862== Searching for pointers to 27,363 not-freed blocks
==9862== Checked 36,580,456 bytes
==9862== 
==9862== LEAK SUMMARY:
==9862==    definitely lost: 2,909 bytes in 41 blocks
==9862==    indirectly lost: 14,861 bytes in 633 blocks
==9862==      possibly lost: 66,719 bytes in 921 blocks
==9862==    still reachable: 1,865,142 bytes in 25,768 blocks
==9862==         suppressed: 0 bytes in 0 blocks
==9862== Rerun with --leak-check=full to see details of leaked memory
==9862== 
==9862== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 2 from 2)
--9862-- 
--9862-- used_suppression:      2 glibc-2.5.x-on-SUSE-10.2-(PPC)-2a /usr/lib64/valgrind/default.supp:1286
==9862== 
==9862== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 2 from 2)



Expected Results:  
valgrind helping me to find the bug.
Comment 1 Mark Wielaard 2013-11-26 10:06:10 UTC
trylock seems to just want to try to abort any pending transaction using xabort. When there isn't a transaction (as is always the case with valgrind atm, since xbegin just immediately goes into the fallback path), then xabort acts as a NOP.
Comment 2 Mark Wielaard 2013-11-26 21:59:03 UTC
Created attachment 83780 [details]
Patch to implement xabort as nop

XABORT can be called even when there is no current transaction.
In such a case XABORT acts as a NOP. Implement xabort as nop.
Comment 3 Mark Wielaard 2013-12-09 12:56:01 UTC
Committed as VEX r2800 with the addition of a check for opc == 0xC7 (xbegin) and opc == 0xC6 (for xabort) as discussed on irc with Julian to make the front end simpler and clearer.
Comment 4 Samuel Thibault 2014-08-28 16:00:55 UTC
It seems to behave badly with glibc:

pthread_mutex_t m;
int main(int argc, char *argv[]) {
	int err;
	pthread_mutex_trylock(&m);
	pthread_mutex_unlock(&m);
	if ((err = pthread_mutex_destroy(&m)))
		*(int*)0 = 0;
	return 0;
}

will crash, this is the state of the mutex:

$2 = {__data = {__lock = 0x0, __count = 0x0, __owner = 0x0, __nusers = 0xffffffff, __kind = 0x0, 
    __spins = 0x0, __elision = 0x3, __list = {__prev = 0x0, __next = 0x0}}, __size = {
    0x0 <repeats 12 times>, 0xff, 0xff, 0xff, 0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x3, 
    0x0 <repeats 17 times>}, __align = 0x0}

so apparently glibc did avoid locking on trylock, and didn't avoid unlocking on unlock, thus ending up with a bogus nusers.
Comment 5 Mark Wielaard 2014-08-28 16:45:15 UTC
(In reply to Samuel Thibault from comment #4)
> It seems to behave badly with glibc:
> 
> pthread_mutex_t m;
> int main(int argc, char *argv[]) {
> 	int err;
> 	pthread_mutex_trylock(&m);
> 	pthread_mutex_unlock(&m);
> 	if ((err = pthread_mutex_destroy(&m)))
> 		*(int*)0 = 0;
> 	return 0;
> }
> 
> will crash, this is the state of the mutex:
> 
> $2 = {__data = {__lock = 0x0, __count = 0x0, __owner = 0x0, __nusers =
> 0xffffffff, __kind = 0x0, 
>     __spins = 0x0, __elision = 0x3, __list = {__prev = 0x0, __next = 0x0}},
> __size = {
>     0x0 <repeats 12 times>, 0xff, 0xff, 0xff, 0xff, 0x0, 0x0, 0x0, 0x0, 0x0,
> 0x0, 0x3, 
>     0x0 <repeats 17 times>}, __align = 0x0}
> 
> so apparently glibc did avoid locking on trylock, and didn't avoid unlocking
> on unlock, thus ending up with a bogus nusers.

That is consistent with running on actual hardware that has tsx. I am seeing the same on a i7-4600U with the rtm cpuid flag set and glibc-2.18-12 running that program directly and under valgrind svn.
Comment 6 Samuel Thibault 2014-08-28 16:51:40 UTC
So you get a crash on hardware supporting RTM?
Comment 7 Mark Wielaard 2014-08-28 17:14:28 UTC
Actually 2.18 glibc doesn't yet contain the TSX elission code. Let me try to find a hadware setup with TSX that has a newer glibc.

But given that the pthread_mutex_destroy seems to fail with EBUSY there might be something wrong with the program anyway.

Could you provide a bit more context where the program crashes exactly on which instruction and any differences with or without running under valgrind?
Comment 8 Samuel Thibault 2014-08-28 17:34:17 UTC
Well, in the execution within valgrind, what is wrong is that nusers has become -1, and thus destroy thinks the mutex is busy, while it's clearly not from the source code.  I believe I have found the bug, in glibc in the trylock case, where trylock does elision, but unlock doesn't (because it's not aware that trylock did). I'll report on the libc-alpha list.
Comment 9 Samuel Thibault 2014-08-28 17:36:02 UTC
(I know, in my test program, I didn't bother initializing the mutex, since PTHREAD_MUTEX_INITIALIZER is actually zeroes in nptl)
Comment 10 Samuel Thibault 2014-08-28 18:02:15 UTC
Reported and patch proposed on:

https://sourceware.org/ml/libc-alpha/2014-08/msg00464.html