helgrind on x86 triggers a crash in my application because the stack in the threads is not 16-bytes aligned. So the movdqa instruction will get SEGV. I changed the testcase none/tests/pth_stackalign so that it can check against a given alignment. Checking alignemnt results: native: ok memcheck: ok helgrind: bad Checking in gdb shows that mythread_wrapper from hg_intercepts.c is the root cause here. So gcc generates code that destroys the alignment. After changing "-mpreferred-stack-boundary=2" to "-mpreferred-stack-boundary=4" in configure.in everything works. But I am not sure if this is the correct thing to do as it might enlarge stack usage or make some assumptions in other code invalid. Or it should only be set for code that is running on the virtual cpu.
Created attachment 81937 [details] Improve testcase none/tests/pth_stackalign to check for given alignment With this change, stack alignment can be checked like this: # valgrind --tool=helgrind ./none/tests/pth_stackalign 16
This bug may be related to #254646
Thinking about this topic, I think for code that needs to cooperate with other code, we must keep the setting of preferred-stack-boundary identical to the other code, so most likely just not pass this option to the compiler for code linking against other libs or being executed on the virtual cpu.
This is what happens when running the modified test none/tests/pth_stackalign. It does not depend on gentoo, as the results on ubuntu are identical ubuntu without valgrind: alignment difference = 0 ubuntu with packaged memcheck (vg-3.7.0): alignment difference = 0 ubuntu with packaged helgrind (vg-3.7.0): alignment difference = 8 ubuntu with memcheck (SVN): alignment difference = 0 ubuntu with helgrind (SVN): alignment difference = 8 ubuntu with patch memcheck (SVN): alignment difference = 0 ubuntu with patch helgrind (SVN): alignment difference = 0 gentoo without valgrind: alignment difference = 0 gentoo with packaged memcheck (vg-3.8.1): alignment difference = 0 gentoo with packaged helgrind (vg-3.8.1): alignment difference = 8 gentoo with memcheck (SVN): alignment difference = 0 gentoo with helgrind (SVN): alignment difference = 8 gentoo with patch memcheck (SVN): alignment difference = 0 gentoo with patch helgrind (SVN): alignment difference = 0 If necessary I can create a testcase that calls movdqa on a stack variable
The gcc docs state that "libraries that use callbacks always use the default setting" From http://gcc.gnu.org/onlinedocs/gcc/i386-and-x86_002d64-Options.html: On Pentium and Pentium Pro, double and long double values should be aligned to an 8-byte boundary (see -malign-double) or suffer significant run time performance penalties. On Pentium III, the Streaming SIMD Extension (SSE) data type __m128 may not work properly if it is not 16-byte aligned. To ensure proper alignment of this values on the stack, the stack boundary must be as aligned as that required by any value stored on the stack. Further, every function must be generated such that it keeps the stack aligned. Thus calling a function compiled with a higher preferred stack boundary from a function compiled with a lower preferred stack boundary most likely misaligns the stack. It is recommended that libraries that use callbacks always use the default setting. This extra alignment does consume extra stack space, and generally increases code size. Code that is sensitive to stack space usage, such as embedded systems and operating system kernels, may want to reduce the preferred alignment to -mpreferred-stack-boundary=2.
Created attachment 83198 [details] Testcase demonstrating crash of movdqa in thread
This testcase (pth_stackalign_movdqa.c) demonstrates the problem (the sse part is copied from none/tests/x86/insn_sse2.c). To test just do: # gcc -g -msse -o pth_stackalign_movdqa pth_stackalign_movdqa.c -lpthread # valgrind --tool=helgrind ./pth_stackalign_movdqa ==24650== Helgrind, a thread error detector ==24650== Copyright (C) 2007-2013, and GNU GPL'd, by OpenWorks LLP et al. ==24650== Using Valgrind-3.9.0.TEST1 and LibVEX; rerun with -h for copyright info ==24650== Command: ./pth_stackalign_movdqa ==24650== ==24650== ==24650== Process terminating with default action of signal 11 (SIGSEGV): dumping core ==24650== General Protection Fault ==24650== at 0x8048682: movdqa_2 (pth_stackalign_movdqa.c:43) ==24650== by 0x8048764: ThreadFunction (pth_stackalign_movdqa.c:80) ==24650== by 0x400A566: mythread_wrapper (hg_intercepts.c:233) ==24650== by 0x4DA19F37: start_thread (pthread_create.c:305) ==24650== by 0x4D945C0D: clone (clone.S:133) ==24650== ==24650== For counts of detected and suppressed errors, rerun with: -v ==24650== Use --history-level=approx or =none to gain increased speed, at ==24650== the cost of reduced accuracy of conflicting-access information ==24650== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0) Memcheck is not affected: # valgrind --tool=memcheck ./pth_stackalign_movdqa ==25030== Memcheck, a memory error detector ==25030== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al. ==25030== Using Valgrind-3.9.0.TEST1 and LibVEX; rerun with -h for copyright info ==25030== Command: ./pth_stackalign_movdqa ==25030== movdqa_2 ... ok finished ==25030== ==25030== HEAP SUMMARY: ==25030== in use at exit: 0 bytes in 0 blocks ==25030== total heap usage: 1 allocs, 1 frees, 136 bytes allocated ==25030== ==25030== All heap blocks were freed -- no leaks are possible ==25030== ==25030== For counts of detected and suppressed errors, rerun with: -v ==25030== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0) Test works fine without valgrind: # ./pth_stackalign_movdqa movdqa_2 ... ok finished Should valgrind really exit, or call the SIGILL handler in case of misalignment?
Created attachment 83221 [details] Realign stack for preloaded libs This patch gets the idea of Philippe Waroquiers to keep the 4 byte stack alignment for valgrind code and only change it for preloaded libraries. It basically adds "-mincoming-stack-boundary=2 -mpreferred-stack-boundary=4" to the compilation of preload libs. But if there are no other unaligned calls in valgrind, it may even be enough to have the preload libs compiled without any stack alignment options. Second issue: the flag is only tested for the primary architecture (configure tests without -m32/-m64 ...) But this affects whole build system and not only this flag. I should create a new ticket for this.
The patch fixes the stack alignment problem for me. Don't forget to call autoregen.sh after applying it.
Created attachment 86852 [details] Fix stack alignment for preload libs Fix stack alignment for preloaded libs in a better way. The core point is to seperate the core CFLAGS and the ones for preloaded libs. These need a stack alignment that is compatible to the application and lib compiler settings. That means keep the stack settings unchanged for code that could call back into the application.
Has anybody had a look at this fix? Can I do something to get this commited? Is there anything to be improved?
Is it planned to merge this for valgrind-3.10?
Created attachment 88564 [details] Fix stack alignment for preload libs and add testcase This is the updated patch to compile again (based on revision 14458 from 2014/09/04).
(In reply to Matthias Schwarzott from comment #12) > Is it planned to merge this for valgrind-3.10? Yes. It's next on my todo list. This evening.
Committed (modified version of the patch, without the tests) as r14471. Thank you for the patch. I think it is correct but is hard for me to test. Can you test the result?
Fixed, r14471.