From valgrind mailing list: Am Dienstag, den 26.11.2013, 00:18 +0100 schrieb Mark Wielaard: > On Mon, Nov 25, 2013 at 10:40:40PM +0000, Tom Hughes wrote: > > On 25/11/13 22:06, Joël Krähemann wrote: > > > > > Hi, I get this nice error message out of valgrind while starting or > > > using my application and I don't know why. > > > > > > http://sf.net/p/ags > > > > > > Please help. > > > > Given the location of the instruction I'm guessing you have a Haswell > > processor and you've built glibc with transactional memory support so > > that it is trying to do lock ellision using the transactional memory > > instructions. I'm using g_atomic_int_set, g_atomic_int_get, g_atomic_int_xor and g_atomic_int_and to implement thread synchronization. Could theses functions be the problem or aren't they working proper? > > > > The transactional memory instructions aren't fully supported in valgrind > > yet, so you'll probably have to rebuild glibc without that support for now. > > This is somewhat unfortunate. We do have prelimenary support, but only > implemented xstart and xtest to indicate the transaction always immediately > fails. The instruction hit is xabort, which in the valgrind case would > always just be a NOP since no transaction is (can be) active. > > If there isn't a bug report for XABORT not being implemented, please do > file one. We really should add it. Here's the output of lscpu. Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 8 On-line CPU(s) list: 0-7 Thread(s) per core: 2 Core(s) per socket: 4 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 58 Model name: Intel(R) Core(TM) i7-3740QM CPU @ 2.70GHz Stepping: 9 CPU MHz: 2754.000 CPU max MHz: 3700.0000 CPU min MHz: 1200.0000 BogoMIPS: 5387.62 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 6144K NUMA node0 CPU(s): 0-7 Reproducible: Always Steps to Reproduce: 1. launch ags with valgrind 2. 3. Actual Results: valgrind exits with: --9862-- memcheck GC: 2827 nodes, 2355 survivors ( 83.3%) --9862-- memcheck GC: 3997 new table size (stepup) vex amd64->IR: unhandled instruction bytes: 0xC6 0xF8 0xFD 0xF 0xB7 0x6 0x66 0x85 vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0 vex amd64->IR: VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE vex amd64->IR: PFX.66=0 PFX.F2=0 PFX.F3=0 ==9862== valgrind: Unrecognised instruction at address 0x34f3411390. ==9862== at 0x34F3411390: __lll_trylock_elision (in /usr/lib64/libpthread-2.18.so) ==9862== by 0x34F340A22B: pthread_mutex_trylock (in /usr/lib64/libpthread-2.18.so) ==9862== by 0x4A1264: ags_gui_thread_do_gtk_iteration.60959 (ags_gui_thread.c:199) ==9862== by 0x4A1539: ags_gui_thread_run (ags_gui_thread.c:251) ==9862== by 0x34F70104C6: ??? (in /usr/lib64/libgobject-2.0.so.0.3800.2) ==9862== by 0x34F7029748: g_signal_emit_valist (in /usr/lib64/libgobject-2.0.so.0.3800.2) ==9862== by 0x34F702A3AE: g_signal_emit (in /usr/lib64/libgobject-2.0.so.0.3800.2) ==9862== by 0x49F625: ags_thread_run (ags_thread.c:1720) ==9862== by 0x49F99E: ags_thread_loop (ags_thread.c:1643) ==9862== by 0x34F3407F32: start_thread (in /usr/lib64/libpthread-2.18.so) ==9862== by 0x34F2CF4EAC: clone (in /usr/lib64/libc-2.18.so) ==9862== Your program just tried to execute an instruction that Valgrind ==9862== did not recognise. There are two possible reasons for this. ==9862== 1. Your program has a bug and erroneously jumped to a non-code ==9862== location. If you are running Memcheck and you just saw a ==9862== warning about a bad jump, it's probably your program's fault. ==9862== 2. The instruction is legitimate but Valgrind doesn't handle it, ==9862== i.e. it's Valgrind's fault. If you think this is the case or ==9862== you are not sure, please let us know and we'll try to fix it. ==9862== Either way, Valgrind will now raise a SIGILL signal which will ==9862== probably kill your program. ==9862== ==9862== Process terminating with default action of signal 4 (SIGILL) ==9862== Illegal opcode at address 0x34F3411390 ==9862== at 0x34F3411390: __lll_trylock_elision (in /usr/lib64/libpthread-2.18.so) ==9862== by 0x34F340A22B: pthread_mutex_trylock (in /usr/lib64/libpthread-2.18.so) ==9862== by 0x4A1264: ags_gui_thread_do_gtk_iteration.60959 (ags_gui_thread.c:199) ==9862== by 0x4A1539: ags_gui_thread_run (ags_gui_thread.c:251) ==9862== by 0x34F70104C6: ??? (in /usr/lib64/libgobject-2.0.so.0.3800.2) ==9862== by 0x34F7029748: g_signal_emit_valist (in /usr/lib64/libgobject-2.0.so.0.3800.2) ==9862== by 0x34F702A3AE: g_signal_emit (in /usr/lib64/libgobject-2.0.so.0.3800.2) ==9862== by 0x49F625: ags_thread_run (ags_thread.c:1720) ==9862== by 0x49F99E: ags_thread_loop (ags_thread.c:1643) ==9862== by 0x34F3407F32: start_thread (in /usr/lib64/libpthread-2.18.so) ==9862== by 0x34F2CF4EAC: clone (in /usr/lib64/libc-2.18.so) ==9862== ==9862== HEAP SUMMARY: ==9862== in use at exit: 2,241,959 bytes in 29,002 blocks ==9862== total heap usage: 91,386 allocs, 62,384 frees, 7,205,816 bytes allocated ==9862== ==9862== Searching for pointers to 27,363 not-freed blocks ==9862== Checked 36,580,456 bytes ==9862== ==9862== LEAK SUMMARY: ==9862== definitely lost: 2,909 bytes in 41 blocks ==9862== indirectly lost: 14,861 bytes in 633 blocks ==9862== possibly lost: 66,719 bytes in 921 blocks ==9862== still reachable: 1,865,142 bytes in 25,768 blocks ==9862== suppressed: 0 bytes in 0 blocks ==9862== Rerun with --leak-check=full to see details of leaked memory ==9862== ==9862== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 2 from 2) --9862-- --9862-- used_suppression: 2 glibc-2.5.x-on-SUSE-10.2-(PPC)-2a /usr/lib64/valgrind/default.supp:1286 ==9862== ==9862== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 2 from 2) Expected Results: valgrind helping me to find the bug.
trylock seems to just want to try to abort any pending transaction using xabort. When there isn't a transaction (as is always the case with valgrind atm, since xbegin just immediately goes into the fallback path), then xabort acts as a NOP.
Created attachment 83780 [details] Patch to implement xabort as nop XABORT can be called even when there is no current transaction. In such a case XABORT acts as a NOP. Implement xabort as nop.
Committed as VEX r2800 with the addition of a check for opc == 0xC7 (xbegin) and opc == 0xC6 (for xabort) as discussed on irc with Julian to make the front end simpler and clearer.
It seems to behave badly with glibc: pthread_mutex_t m; int main(int argc, char *argv[]) { int err; pthread_mutex_trylock(&m); pthread_mutex_unlock(&m); if ((err = pthread_mutex_destroy(&m))) *(int*)0 = 0; return 0; } will crash, this is the state of the mutex: $2 = {__data = {__lock = 0x0, __count = 0x0, __owner = 0x0, __nusers = 0xffffffff, __kind = 0x0, __spins = 0x0, __elision = 0x3, __list = {__prev = 0x0, __next = 0x0}}, __size = { 0x0 <repeats 12 times>, 0xff, 0xff, 0xff, 0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x3, 0x0 <repeats 17 times>}, __align = 0x0} so apparently glibc did avoid locking on trylock, and didn't avoid unlocking on unlock, thus ending up with a bogus nusers.
(In reply to Samuel Thibault from comment #4) > It seems to behave badly with glibc: > > pthread_mutex_t m; > int main(int argc, char *argv[]) { > int err; > pthread_mutex_trylock(&m); > pthread_mutex_unlock(&m); > if ((err = pthread_mutex_destroy(&m))) > *(int*)0 = 0; > return 0; > } > > will crash, this is the state of the mutex: > > $2 = {__data = {__lock = 0x0, __count = 0x0, __owner = 0x0, __nusers = > 0xffffffff, __kind = 0x0, > __spins = 0x0, __elision = 0x3, __list = {__prev = 0x0, __next = 0x0}}, > __size = { > 0x0 <repeats 12 times>, 0xff, 0xff, 0xff, 0xff, 0x0, 0x0, 0x0, 0x0, 0x0, > 0x0, 0x3, > 0x0 <repeats 17 times>}, __align = 0x0} > > so apparently glibc did avoid locking on trylock, and didn't avoid unlocking > on unlock, thus ending up with a bogus nusers. That is consistent with running on actual hardware that has tsx. I am seeing the same on a i7-4600U with the rtm cpuid flag set and glibc-2.18-12 running that program directly and under valgrind svn.
So you get a crash on hardware supporting RTM?
Actually 2.18 glibc doesn't yet contain the TSX elission code. Let me try to find a hadware setup with TSX that has a newer glibc. But given that the pthread_mutex_destroy seems to fail with EBUSY there might be something wrong with the program anyway. Could you provide a bit more context where the program crashes exactly on which instruction and any differences with or without running under valgrind?
Well, in the execution within valgrind, what is wrong is that nusers has become -1, and thus destroy thinks the mutex is busy, while it's clearly not from the source code. I believe I have found the bug, in glibc in the trylock case, where trylock does elision, but unlock doesn't (because it's not aware that trylock did). I'll report on the libc-alpha list.
(I know, in my test program, I didn't bother initializing the mutex, since PTHREAD_MUTEX_INITIALIZER is actually zeroes in nptl)
Reported and patch proposed on: https://sourceware.org/ml/libc-alpha/2014-08/msg00464.html