Summary: | Make stack unwinding in Valgrind wrappers more reliable | ||
---|---|---|---|
Product: | [Developer tools] valgrind | Reporter: | Evgeniy Stepanov <eugeni.stepanov> |
Component: | general | Assignee: | Julian Seward <jseward> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | bart.vanassche+kde, borntraeger, flo2030, jakub, konstantin.s.serebryany |
Priority: | NOR | ||
Version: | 3.6 SVN | ||
Target Milestone: | --- | ||
Platform: | Compiled Sources | ||
OS: | Linux | ||
Latest Commit: | Version Fixed In: | ||
Sentry Crash Report: | |||
Bug Depends on: | |||
Bug Blocks: | 243404 | ||
Attachments: |
update CFI when messing with the stack pointer
update CFI, try 2 update CFI, try 2 + a small fix Add CFI annotations around VALGRIND_CALL_NOREDIR_RAX for amd64 Force frame pointer in CALL_FN_* and reduce the lower limit on stack size Force frame pointer in CALL_FN_* and reduce the lower limit on stack size |
Evgeniy, interesting. I would like to ask: does this solve any specific problem? Or is it something that you just thought should be fixed? This is a *potential* problem for the platforms supported by official valgrind and a *real* problem (and a real fix) for Native Client (http://code.google.com/p/nativeclient/wiki/ValgrindMemcheck) Presumably this needs to happen (or at least be checked) also for arm and s390x. Hmm, the patch seems to make stack unwinding worse in some cases on Ubuntu 10.04 x86_64. eg for helgrind/tests/hg02_deadlock before: ==26793== Thread #3 was created ==26793== at 0x51346BE: clone (clone.S:77) ==26793== by 0x4E38172: pthread_create@@GLIBC_2.2.5 (createthread.c:75) ==26793== by 0x4C2D5FE: pthread_create_WRK (hg_intercepts.c:257) ==26793== by 0x4C2D6FE: pthread_create@* (hg_intercepts.c:288) ==26793== by 0x40075B: main (hg02_deadlock.c:36) after: ==26788== Thread #3 was created ==26788== at 0x51346BE: clone (clone.S:77) ==26788== by 0x4E38172: pthread_create@@GLIBC_2.2.5 (createthread.c:75) ==26788== by 0x4C2D5FE: pthread_create_WRK (hg_intercepts.c:257) so now the unwinder stops in the wrapper, when it didn't before. Any ideas? I think you are correct, the wrappers need some kind of CFI annotations, but maybe the patch needs further attention? (In reply to comment #4) > Hmm, the patch seems to make stack unwinding worse in some cases > on Ubuntu 10.04 x86_64. eg for helgrind/tests/hg02_deadlock > > before: > > ==26793== Thread #3 was created > ==26793== at 0x51346BE: clone (clone.S:77) > ==26793== by 0x4E38172: pthread_create@@GLIBC_2.2.5 (createthread.c:75) > ==26793== by 0x4C2D5FE: pthread_create_WRK (hg_intercepts.c:257) > ==26793== by 0x4C2D6FE: pthread_create@* (hg_intercepts.c:288) > ==26793== by 0x40075B: main (hg02_deadlock.c:36) > > after: > > ==26788== Thread #3 was created > ==26788== at 0x51346BE: clone (clone.S:77) > ==26788== by 0x4E38172: pthread_create@@GLIBC_2.2.5 (createthread.c:75) > ==26788== by 0x4C2D5FE: pthread_create_WRK (hg_intercepts.c:257) > > so now the unwinder stops in the wrapper, when it didn't before. > > Any ideas? I think you are correct, the wrappers need some kind of > CFI annotations, but maybe the patch needs further attention? It's more complex than that. The short version of the stacktrace is also produced by the stack walking code that is currently on the trunk on some of the most recent Linux distro's. Examples of such distro's are Fedora 13 and openSUSE 11.3. I'm not sure whether or not this is related to the fact that both distro's contain gcc 4.5.0. Yes, it's not that simple. In your example (hg02_deadlock) gcc picked %rbp as the CFA register, with zero offset. In this case moving %rsp does not change anything, and CFA offset must stay the same. In short, these directives are needed when (and only when) %rsp is the CFA register. At this point I don't see a clean fix for this problem. There is a GCC-specific solution. We can create our own frame pointer by calling __builtin_frame_address(0), writing it in some free register, and adding an appropriate CFI annotation. But that will take one extra register for what is essentially a compile-time constant. As far as I can see s390 would need a similar annotation, if the CFA is based on the content r15. But we also have the same issue that we dont know if CFA is based on r15 or not (e.g. frame pointer r11) Hmm, maybe it is possible to create a hand written eh_frame section for this code. Created attachment 50973 [details]
update CFI, try 2
The second patch allocates one more register for the frame pointer and uses it a base for CFA. In fact, it may use as many as 2 extra registers: __builtin_frame_address(0) makes gcc keep a frame pointer in %rbp. Alternatively, we can try tricks like calling __builtin_frame_address(0) or alloca(0) and casting it to something like (void)(volatile*) to prevent optimization. Any of these forces a frame pointer for the current function and gcc then defines CFA as %rbp+0. Of course, this may change in the future, but it seems unlikely. Another bonus: this works for gcc<4.3, which don't support cfi directives at all. Created attachment 50974 [details]
update CFI, try 2 + a small fix
Includes a fix for the stack frame being disposed _before_ the call in CALL_FN_W_8W
We should at least take the CALL_FN_W_8W fix right away, whilst we figure out what to do with the CFI issue. (In reply to comment #10) Evgeniy, > The second patch allocates one more register for the frame pointer and uses it > a base for CFA. In fact, it may use as many as 2 extra registers: > __builtin_frame_address(0) makes gcc keep a frame pointer in %rbp. > > Alternatively, we can try tricks like calling __builtin_frame_address(0) or > alloca(0) and casting it to something like (void)(volatile*) to prevent > optimization. Any of these forces a frame pointer for the current function and > gcc then defines CFA as %rbp+0. Of course, this may change in the future, but > it seems unlikely. Another bonus: this works for gcc<4.3, which don't support > cfi directives at all. Thanks for the patch. However, /me is confused. Which of the following are true? 1. the patch works for all gcc, not just gcc >= 4.3 2. the patch is gcc specific, because it uses __builtin_frame_address If (2) is true, that's a bit of a problem because Bart wants to be able to compile this file with MSVC now. Can we solve that by #ifdef __MSC_VER (or whatever the right symbol is) ? These macros haven't been ported yet to the Microsoft C compiler, so I don't think that these changes do not have to be protected by #ifdef _MSC_VER. 1. CFI directives are only supported in gcc 4.4 For older versions, we should map VALGRIND_PROLOGUE and _EPILOGUE to empty strings, and keep the __builtin_frame_address call. This should, at least, give a better chance of having a frame pointer than the current code. 2. Yes, all of this is gcc specific. I think the right defines are __GNUC__ and __GNUC_MINOR__. (In reply to comment #15) > 1. CFI directives are only supported in gcc 4.4 > For older versions, we should map VALGRIND_PROLOGUE and _EPILOGUE to empty > strings, and keep the __builtin_frame_address call. This should, at least, give > a better chance of having a frame pointer than the current code. > > 2. Yes, all of this is gcc specific. > > I think the right defines are __GNUC__ and __GNUC_MINOR__. As far as I know support for the CFI assembler directives only depends on the binutils version and not on the gcc version. So whether or not these are supported has to be detected by a configure test. What I'm currently wondering about is: * The header file config.h generated by Valgrind's configure script is not installed to $PREFIX/include/valgrind. Should we keep it that way or should we rename that header file to valgrind-config.h and install it ? * Should config.h/valgrind-config.h be included from valgrind.h ? * If not, how to make sure that while building Valgrind and the regression tests that config.h has been included before valgrind.h ? For more information about the CFI directives, see also: * http://sourceware.org/binutils/docs/as/CFI-directives.html * http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=arch/x86/include/asm/dwarf2.h;hb=HEAD (In reply to comment #4) > Hmm, the patch seems to make stack unwinding worse in some cases > on Ubuntu 10.04 x86_64. eg for helgrind/tests/hg02_deadlock > > before: > > ==26793== Thread #3 was created > ==26793== at 0x51346BE: clone (clone.S:77) > ==26793== by 0x4E38172: pthread_create@@GLIBC_2.2.5 (createthread.c:75) > ==26793== by 0x4C2D5FE: pthread_create_WRK (hg_intercepts.c:257) > ==26793== by 0x4C2D6FE: pthread_create@* (hg_intercepts.c:288) > ==26793== by 0x40075B: main (hg02_deadlock.c:36) > > after: > > ==26788== Thread #3 was created > ==26788== at 0x51346BE: clone (clone.S:77) > ==26788== by 0x4E38172: pthread_create@@GLIBC_2.2.5 (createthread.c:75) > ==26788== by 0x4C2D5FE: pthread_create_WRK (hg_intercepts.c:257) > > so now the unwinder stops in the wrapper, when it didn't before. > > Any ideas? I think you are correct, the wrappers need some kind of > CFI annotations, but maybe the patch needs further attention? This is probably because the code for interpreting .cfi_def_cfa in m_debuginfo is incomplete: it only supports the stack pointer and the frame pointer on x86/AMD64 but no other registers. I'm not familiar with this code so it's not clear to me which enums/structures/functions have to be updated to implement this properly. (In reply to comment #16) > (In reply to comment #15) > > 1. CFI directives are only supported in gcc 4.4 > > For older versions, we should map VALGRIND_PROLOGUE and _EPILOGUE to empty > > strings, and keep the __builtin_frame_address call. This should, at least, give > > a better chance of having a frame pointer than the current code. > > > > 2. Yes, all of this is gcc specific. > > > > I think the right defines are __GNUC__ and __GNUC_MINOR__. > > As far as I know support for the CFI assembler directives only depends on the > binutils version and not on the gcc version. So whether or not these are > supported has to be detected by a configure test. We also need gcc to emit .cfi_startproc/.cfi_endproc. This requires a recent enough _gcc_ version. Also, there is -fno-dwarf2-cfi-asm option that will suppress .cfi_startproc/.cfi_endproc. Yes, it's complicated enough to require a configure test. Perhaps one for CFI support and one for __builtin_frame_address. > What I'm currently wondering > about is: > * The header file config.h generated by Valgrind's configure script is not > installed to $PREFIX/include/valgrind. Should we keep it that way or should we > rename that header file to valgrind-config.h and install it ? > * Should config.h/valgrind-config.h be included from valgrind.h ? > * If not, how to make sure that while building Valgrind and the regression > tests that config.h has been included before valgrind.h ? > > For more information about the CFI directives, see also: > * http://sourceware.org/binutils/docs/as/CFI-directives.html > * > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=arch/x86/include/asm/dwarf2.h;hb=HEAD (In reply to comment #17) > (In reply to comment #4) > > Hmm, the patch seems to make stack unwinding worse in some cases > > on Ubuntu 10.04 x86_64. eg for helgrind/tests/hg02_deadlock > > > > before: > > > > ==26793== Thread #3 was created > > ==26793== at 0x51346BE: clone (clone.S:77) > > ==26793== by 0x4E38172: pthread_create@@GLIBC_2.2.5 (createthread.c:75) > > ==26793== by 0x4C2D5FE: pthread_create_WRK (hg_intercepts.c:257) > > ==26793== by 0x4C2D6FE: pthread_create@* (hg_intercepts.c:288) > > ==26793== by 0x40075B: main (hg02_deadlock.c:36) > > > > after: > > > > ==26788== Thread #3 was created > > ==26788== at 0x51346BE: clone (clone.S:77) > > ==26788== by 0x4E38172: pthread_create@@GLIBC_2.2.5 (createthread.c:75) > > ==26788== by 0x4C2D5FE: pthread_create_WRK (hg_intercepts.c:257) > > > > so now the unwinder stops in the wrapper, when it didn't before. > > > > Any ideas? I think you are correct, the wrappers need some kind of > > CFI annotations, but maybe the patch needs further attention? > > This is probably because the code for interpreting .cfi_def_cfa in m_debuginfo > is incomplete: it only supports the stack pointer and the frame pointer on > x86/AMD64 but no other registers. I'm not familiar with this code so it's not > clear to me which enums/structures/functions have to be updated to implement > this properly. With the latest patch, stack traces are fine in all cases. But you make a good point: there is no sense in trashing another register (%ebx) for the frame pointer if valgrind is not going to use it anyway. I'm not sure if this is 100% correct, but adding a "__builtin_frame_address(0);" statement anywhere inside a function forces a frame pointer for that function. This happens with any combination of flags I could think of. In this case we don't need any CFI directives at all. Or we can try and support other registers in the cfi reader. (In reply to comment #19) > With the latest patch, stack traces are fine in all cases. But you make a good > point: there is no sense in trashing another register (%ebx) for the frame > pointer if valgrind is not going to use it anyway. It would help if you could specify which cases (gcc version + CPU type) you have tested. Many stack traces are wrong even with the latest patch (2010-08-26) when building Valgrind with gcc 4.5.0 on an 64-bit x86 system. The patch doesn't seem to have any effect on the system I have tested it on. Also, your patch only addresses the AMD64 platform. Which other platforms than AMD64 do need CFI annotations with recent gcc versions ? Being too lazy to build gcc myself, I only tested gcc <= 4.4 :) Now that I've overcome my laziness, it looks like the problems with gcc 4.5 are unrelated to this issue. At least some of them are due to the check in m_stacktrace.c:267 : if (fp_min + 512 >= fp_max) { With gcc 4.5, we are getting closer to fp_max than with older versions, it seems. Changing the constant to anything (strictly) less then 440 fixes the problem for me. I'm not sure about other platforms. (In reply to comment #21) > Being too lazy to build gcc myself, I only tested gcc <= 4.4 :) > Now that I've overcome my laziness, it looks like the problems with gcc 4.5 are > unrelated to this issue. At least some of them are due to the check in > m_stacktrace.c:267 : > > if (fp_min + 512 >= fp_max) { > > With gcc 4.5, we are getting closer to fp_max than with older versions, it > seems. Changing the constant to anything (strictly) less then 440 fixes the > problem for me. Well done. But on my setup (amd64 + gcc 4.5.0) I had to decrease that value even further. I only got the same call stacks as with earlier gcc versions with the value 380. Any idea why ? Created attachment 51436 [details]
Add CFI annotations around VALGRIND_CALL_NOREDIR_RAX for amd64
Regenerated patch against trunk and added coregrind/m_stacktrace.c modification.
Evgeniy, a question about your patch: why have you added the VALGRIND_CFI_PROLOGUE / VALGRIND_CFI_EPILOGUE macros in the CALL_FN_W_* macros instead of at the beginning and the end of the VALGRIND_CALL_NOREDIR_RAX macro ? Would that also work ? With this patch (v4) I'm also seeing a crash that does not occur without this patch: $ ./vg-in-place --tool=drd drd/tests/pth_cancel_locked ==23471== drd, a thread error detector ==23471== Copyright (C) 2006-2010, and GNU GPL'd, by Bart Van Assche. ==23471== Using Valgrind-3.6.0.SVN and LibVEX; rerun with -h for copyright info ==23471== Command: drd/tests/pth_cancel_locked ==23471== ==23471== ==23471== Process terminating with default action of signal 11 (SIGSEGV) ==23471== Access not within mapped region at address 0x3 ==23471== at 0x5FC7573: ??? (in /lib64/libgcc_s.so.1) ==23471== by 0x5FC79BA: ??? (in /lib64/libgcc_s.so.1) ==23471== by 0x5FC7E82: _Unwind_Resume (in /lib64/libgcc_s.so.1) ==23471== by 0x4E455DD: __condvar_cleanup1 (pthread_cond_wait.S:462) ==23471== by 0x4C2D35A: pthread_cond_wait@* (drd_pthread_intercepts.c:691) ==23471== by 0x40094C: thread (pth_cancel_locked.c:21) ==23471== by 0x4C29650: vgDrd_thread_wrapper (drd_pthread_intercepts.c:272) ==23471== by 0x4E40A4E: start_thread (pthread_create.c:297) ==23471== If you believe this happened as a result of a stack ==23471== overflow in your program's main thread (unlikely but ==23471== possible), you can try to increase the size of the ==23471== main thread stack using the --main-stacksize= flag. ==23471== The main thread stack size used in this run was 8388608. ==23471== ==23471== For counts of detected and suppressed errors, rerun with: -v ==23471== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 154 from 125) ./vg-in-place: line 28: 23471 Killed VALGRIND_LIB="$vgbasedir/.in_place" VALGRIND_LIB_INNER="$vgbasedir/.in_place" "$vgbasedir/coregrind/valgrind" "$@" > Well done. But on my setup (amd64 + gcc 4.5.0) I had to decrease that value
> even further. I only got the same call stacks as with earlier gcc versions with
> the value 380. Any idea why ?
Maybe you have newer glibc/libpthread? I compiled gcc 4.5 on ubuntu lucid, everything else (glibc, even binutils) are older versions. Perhaps this number shuld be decreased even more, as long as tests pass. As I understand the comment above, even values very close to zero will work.
Adding CFI directives to VALGRIND_CALL_NOREDIR_RAX would not work in a case when stack trace starts at one of the instructions after "subq $128,%%rsp" and before VALGRIND_CALL_NOREDIR_RAX. For example, in functions without a frame pointer, gcc adds a .cfi_adjust_cfa_offset after every stack push to make CFA exact at all times.
If there is a frame pointer already, it does not matter.
As I see it, there are two possibilities:
- We support frame pointer in arbitrary register in m_debug_info, put it in %rbx, and add appropriate cfi directives.
- We simply force gcc into having a frame pointer with __builtin_frame_address(0) call. Then we don't need any cfi directives. But gcc does not really have to have a frame pointer, it just does so in current version. It will probably keep doing so in future, as it's a very reasonable thing to do.
I'll take a look at the crashing test later.
A second strange behavior change caused by this patch is: --- origin5-bz2.stderr.exp-glibc25-amd64 2010-09-18 13:43:07.000000000 +0200 +++ origin5-bz2.stderr.out 2010-09-18 13:48:04.000000000 +0200 @@ -117,6 +117,12 @@ Conditional jump or move depends on uninitialised value(s) at 0x........: main (origin5-bz2.c:6512) - Uninitialised value was created by a client request - at 0x........: main (origin5-bz2.c:6479) + Uninitialised value was created by a heap allocation + at 0x........: malloc (vg_replace_malloc.c:...) + by 0x........: g_serviceFn (origin5-bz2.c:6429) + by 0x........: default_bzalloc (origin5-bz2.c:4470) + by 0x........: BZ2_decompress (origin5-bz2.c:1578) + by 0x........: BZ2_bzDecompress (origin5-bz2.c:5192) + by 0x........: BZ2_bzBuffToBuffDecompress (origin5-bz2.c:5678) + by 0x........: main (origin5-bz2.c:6498) Bart, I can not reproduce either the crash or the behavior change. Actually, with the newest valgrind and the latest patch the set of failing tests is the same for clean and patched versions. gcc-4.5 (GCC) 4.5.1 (self-compiled) binutils 2.20.1-3ubuntu7 libc6 2.11.1-0ubuntu7.2 Valgrind synched at r11392 (In reply to comment #26) I'm confused (this is a confusing problem :-) > As I see it, there are two possibilities: > - We support frame pointer in arbitrary register in m_debug_info, put it in > %rbx, and add appropriate cfi directives. That would make the stack unwinder slower, and it's already one of the most performance critical parts of the system (in unwind-intensive scenarios). > - We simply force gcc into having a frame pointer with > __builtin_frame_address(0) call. Then we don't need any cfi directives. But gcc > does not really have to have a frame pointer, it just does so in current > version. It will probably keep doing so in future, as it's a very reasonable > thing to do. So I like this one better. What I can't figure out is: do any of the attached patches implement exactly this, and nothing else? If yes, does it actually work correctly? Created attachment 52231 [details]
Force frame pointer in CALL_FN_* and reduce the lower limit on stack size
This patch leaves the following tests failing:
memcheck/tests/linux/stack_switch (stderr)
memcheck/tests/origin5-bz2 (stderr)
helgrind/tests/pth_barrier3 (stderr)
helgrind/tests/tc06_two_races_xml (stderr)
helgrind/tests/tc09_bad_unlock (stderr)
helgrind/tests/tc20_verifywrap (stderr)
helgrind/tests/tc23_bogus_condwait (stderr)
drd/tests/pth_barrier3 (stderr)
exp-ptrcheck/tests/bad_percentify (stderr)
exp-ptrcheck/tests/supp (stderr)
It does not add any new failures and fixes 21 out of 31 tests that are currently failing with gcc 4.5.1.
Julian, I've attached a patch that does not clobber any new registers (except for %rbp, if the enclosing function did not have a frame pointer already). Bart, I was wrong, the issue with origin5-bz2 does reproduce over here. But it occurs both with or without the patch. (In reply to comment #31) > Julian, I've attached a patch that does not clobber any new registers (except > for %rbp, if the enclosing function did not have a frame pointer already). This looks like the cleanest solution so far. On further thought and reading of the gcc docs, though, I have more questions. AFAICS, the gcc-4.5 docs for __builtin_frame_address(0) do not give any assurance that it really forces %rbp to be used as a frame register -- what stops gcc from giving us the frame address as (eg) %rsp + offset? What would happen if we used a modified version of this: > - We support frame pointer in arbitrary register in m_debug_info, put it in > %rbx, and add appropriate cfi directives. instead of %rbx, use %rbp: add an assembly instruction to copy __builtin_frame_address(0) into %rbp, and add cfi directives. Then (eg) CALL_FN_W_v would look like this (added/changed lines marked with 'x') #define CALL_FN_W_v(lval, orig) \ do { \ volatile OrigFn _orig = (orig); \ volatile unsigned long _argvec[1]; \ volatile unsigned long _res; \ _argvec[0] = (unsigned long)_orig.nraddr; \ __asm__ volatile( \ ".cfi_remember_state\n\t" \ x ".cfi_def_cfa rbp, 0\n\t" \ x "movq %0,%%rbp\n\t" \ x "subq $128,%%rsp\n\t" \ "movq (%%rax), %%rax\n\t" /* target->%rax */ \ VALGRIND_CALL_NOREDIR_RAX \ "addq $128,%%rsp\n\t" \ ".cfi_restore_state\n\t" \ x : /*out*/ "=a" (_res) \ : /*in*/ "a" (&_argvec[0]), \ "r" (__builtin_frame_address(0)) \ x : /*trash*/ "cc", "memory", __CALLER_SAVED_REGS, "rbp" \ x ); \ lval = (__typeof__(lval)) _res; \ } while (0) Is that feasible? Effects (+,-) are: - it might generate a "movq %rbp,%rbp". I don't care about that, and anyway the JIT's optimiser makes that essentially free. - it might trash a second register (?) I don't care about that either; these wrappers aren't performance critical. + it means we make no assumption about how/what gcc does with __builtin_frame_address(0), only that it produces a correct value + it doesn't require the unwinder to handle %rbx, only %rbp, which it already does, hence no unwinder performance loss + it doesn't require the JIT to maintain precise values for %rbx, only %rbp, which it already does, hence no runtime performance loss - it makes it a bit more complex to handle gccs that don't know about __builtin_frame_address Sorry to drag this out. Considering we also need to solve the same problem for arm-linux and s390x-linux, I want to be sure we get the most robust solution we can. (In reply to comment #32) > (In reply to comment #31) > > > Julian, I've attached a patch that does not clobber any new registers (except > > for %rbp, if the enclosing function did not have a frame pointer already). > > This looks like the cleanest solution so far. > > On further thought and reading of the gcc docs, though, I have more > questions. AFAICS, the gcc-4.5 docs for __builtin_frame_address(0) do > not give any assurance that it really forces %rbp to be used as a > frame register -- what stops gcc from giving us the frame address as > (eg) %rsp + offset? > > > What would happen if we used a modified version of this: > > > - We support frame pointer in arbitrary register in m_debug_info, put it in > > %rbx, and add appropriate cfi directives. > > instead of %rbx, use %rbp: add an assembly instruction to copy > __builtin_frame_address(0) into %rbp, and add cfi directives. Then (eg) > CALL_FN_W_v would look like this (added/changed lines marked with 'x') > > > #define CALL_FN_W_v(lval, orig) \ > do { \ > volatile OrigFn _orig = (orig); \ > volatile unsigned long _argvec[1]; \ > volatile unsigned long _res; \ > _argvec[0] = (unsigned long)_orig.nraddr; \ > __asm__ volatile( \ > ".cfi_remember_state\n\t" \ x > ".cfi_def_cfa rbp, 0\n\t" \ x > "movq %0,%%rbp\n\t" \ x > "subq $128,%%rsp\n\t" \ > "movq (%%rax), %%rax\n\t" /* target->%rax */ \ > VALGRIND_CALL_NOREDIR_RAX \ > "addq $128,%%rsp\n\t" \ > ".cfi_restore_state\n\t" \ x > : /*out*/ "=a" (_res) \ > : /*in*/ "a" (&_argvec[0]), \ > "r" (__builtin_frame_address(0)) \ x > : /*trash*/ "cc", "memory", __CALLER_SAVED_REGS, "rbp" \ x > ); \ > lval = (__typeof__(lval)) _res; \ > } while (0) > > Is that feasible? Effects (+,-) are: > > - it might generate a "movq %rbp,%rbp". I don't care about that, > and anyway the JIT's optimiser makes that essentially free. > > - it might trash a second register (?) I don't care about that > either; these wrappers aren't performance critical. > > + it means we make no assumption about how/what gcc does with > __builtin_frame_address(0), only that it produces a correct value > > + it doesn't require the unwinder to handle %rbx, only %rbp, which > it already does, hence no unwinder performance loss > > + it doesn't require the JIT to maintain precise values for %rbx, > only %rbp, which it already does, hence no runtime performance loss > > - it makes it a bit more complex to handle gccs that don't know > about __builtin_frame_address > > > Sorry to drag this out. Considering we also need to solve the same > problem for arm-linux and s390x-linux, I want to be sure we get the > most robust solution we can. Great idea, but I've already had it before I came up with this %rbx nonsense :) Apparently gcc does not allow clobbering %rbp in inline assembly. int main() { __asm__ volatile( "nop \n\t" : /*out*/ : /*in*/ : /*trash*/ "rbp" ); return 0; } $ gcc -c 1.c 1.c: In function ‘main’: 1.c:9: error: bp cannot be used in asm here > Apparently gcc does not allow clobbering %rbp in inline assembly.
Yeah, I wondered if that would be a problem. So, can't we
temporarily park the value of rbp in r15 (or whatever) and
tell gcc that r15 is trashed by the asm?
Created attachment 52278 [details]
Force frame pointer in CALL_FN_* and reduce the lower limit on stack size
Nice idea.
(In reply to comment #35) > Created an attachment (id=52278) [details] > Force frame pointer in CALL_FN_* and reduce the lower limit on stack size Thanks for the patch. This actually works properly (iow, the gcc-4.5 stack traces are OK), is that correct? If so I'll commit it. Yeah, the same as the previous patch. Committed, with comments, without the 512 -> 380 change. Thanks! r11402. I'll commit the 512->380 change separately. Stack lower limit size reduced from 512 to 256 in r11403. Thanks all for the work in analysis, making & trying out patches. Closing, although as per comments above, will need to port this also to arm and s390x at some point. You should add && defined (__GCC_HAVE_DWARF2_CFI_ASM) to the conditions guarding the use of .cfi_* directives in inline-asm. When compiled with -fno-dwarf2-cfi-asm or with gcc configured against old gas, it won't otherwise compile. The precondition of using .cfi_* directives in inline-asm is that GCC emits .cfi_* directives for the function as well, otherwise you'll have .cfi_* directives without preceeding .cfi_startproc, and the above mentioned macro is a builtin GCC macro that says .cfi_* directives are being emitted. (In reply to comment #41) > You should add && defined (__GCC_HAVE_DWARF2_CFI_ASM) to the conditions > guarding the use of .cfi_* directives in inline-asm. Jakub, is that before or after r11414 (committed about 4 hours ago) ? Assuming after, then the condition would be #if defined(__GNUC__) && defined(__GCC_HAVE_DWARF2_CFI_ASM) \ && !defined(PLAT_amd64_darwin) Is that correct? It applies both to code before and after. #if defined(__GNUC__) && defined(__GCC_HAVE_DWARF2_CFI_ASM) \ && !defined(PLAT_amd64_darwin) looks correct. I guess you wouldn't need the darwin special case with it, but it doesn't hurt. (In reply to comment #43) > It applies both to code before and after. > #if defined(__GNUC__) && defined(__GCC_HAVE_DWARF2_CFI_ASM) \ > && !defined(PLAT_amd64_darwin) > looks correct. I guess you wouldn't need the darwin special case with it, but > it doesn't hurt. Which value does __GCC_HAVE_DWARF2_CFI_ASM have for gcc 4.3 ? The CFI stuff must be disabled for gcc 4.3 because gcc 4.3 does not emit .cfi_startproc and hence causes the assembler to fail. See also https://bugs.kde.org/show_bug.cgi?id=253496. It is not defined there. This is a compiler built in macro which is defined in 4.4+ iff .cfi_startproc etc. is being emitted. Normally on x86-64 it is defined in 4.4+, but e.g. when using -fno-dwarf2-cfi-asm or -fno-asynchronous-unwind-tables, it is not (nor the .cfi_* directives). Hmm, something's not right with the scheme that got committed (independent of the gating conditions). Trying with '#if 1' on Ubuntu 10.04 x86_64 (gcc (Ubuntu 4.4.3-4ubuntu5) 4.4.3) I get broken stack traces for (eg) ./vg-in-place --tool=helgrind ./helgrind/tests/hg04_race ==14830== Thread #3 was created ==14830== at 0x51346BE: clone (clone.S:77) ==14830== by 0x4E38172: pthread_create@@GLIBC_2.2.5 (createthread.c:75) ==14830== by 0x4C2D961: pthread_create_WRK (hg_intercepts.c:257) ==14830== That stack trace should have end in main (from the .stderr.exp file): Thread #x was created ... by 0x........: pthread_create@* (hg_intercepts.c:...) by 0x........: main (hg04_race.c:21) I just looked in detail at the interleaved assembly and CFI for a function using these wrappers, "QT4_FUNC(void*, _ZN6QMutexD1Ev, void* mutex)". It's simpler than the wrapper for pthread_create_WRK (hg_intercepts.c:257) and shows the same problem. The assembly is from objdump, and the CFI from valgrind via "--trace-symtab-patt=*vgpreload_helgrind*" --trace-cfi=yes The CFI looks correct to me. At 51db: and 51de: we save the old rbp in r15 and install the new CFA in rbp, as per the plan in comment 34 and comment 35. Problem is, it looks like GCC has given us a completely bogus value for __builtin_frame_address(0): 000000000000514f <_vgwZU_libQtCoreZdsoZa__ZN6QMutexD1Ev>: 514f: 55 push %rbp 5150: 48 89 e5 mov %rsp,%rbp .. at this point, the CFA = %rbp + 16 .. .. no assignments to %rbp here .. .. 51a3: 48 8d 45 90 lea -0x70(%rbp),%rax .. .. no writes to %rax .. .. 51db: 49 89 ef mov %rbp,%r15 51de: 48 89 c5 mov %rax,%rbp So __builtin_frame_address(0) gives us %rbp - 0x70 whereas I think the correct CFA is %rbp + 16. Anybody have any ideas? Is this analysis correct? This bug is turning into a disaster area :-( The complete combined assembly + CFI follows: 000000000000514f <_vgwZU_libQtCoreZdsoZa__ZN6QMutexD1Ev>: [0x4c2a14f .. 0x4c2a14f]: let cfa=oldSP+8 in RA=*(cfa+-8) SP=cfa+0 BP=Same 514f: 55 push %rbp [0x4c2a150 .. 0x4c2a152]: let cfa=oldSP+16 in RA=*(cfa+-8) SP=cfa+0 BP=Same 5150: 48 89 e5 mov %rsp,%rbp [0x4c2a153 .. 0x4c2a1da]: let cfa=oldBP+16 in RA=*(cfa+-8) SP=cfa+0 BP=*(cfa+-16) 5153: 41 57 push %r15 5155: 48 8d 55 e0 lea -0x20(%rbp),%rdx 5159: 48 c1 c7 03 rol $0x3,%rdi 515d: 48 c1 c7 0d rol $0xd,%rdi 5161: 48 c1 c7 3d rol $0x3d,%rdi 5165: 48 c1 c7 33 rol $0x33,%rdi 5169: 48 87 c9 xchg %rcx,%rcx 516c: 48 89 45 c8 mov %rax,-0x38(%rbp) 5170: 48 8b 45 c8 mov -0x38(%rbp),%rax 5174: 48 89 02 mov %rax,(%rdx) 5177: 48 c7 45 90 04 01 47 movq $0x48470104,-0x70(%rbp) 517e: 48 517f: 48 89 7d 98 mov %rdi,-0x68(%rbp) 5183: 48 c7 45 a0 00 00 00 movq $0x0,-0x60(%rbp) 518a: 00 518b: 48 c7 45 a8 00 00 00 movq $0x0,-0x58(%rbp) 5192: 00 5193: 48 c7 45 b0 00 00 00 movq $0x0,-0x50(%rbp) 519a: 00 519b: 48 c7 45 b8 00 00 00 movq $0x0,-0x48(%rbp) 51a2: 00 51a3: 48 8d 45 90 lea -0x70(%rbp),%rax 51a7: ba 00 00 00 00 mov $0x0,%edx 51ac: 48 c1 c7 03 rol $0x3,%rdi 51b0: 48 c1 c7 0d rol $0xd,%rdi 51b4: 48 c1 c7 3d rol $0x3d,%rdi 51b8: 48 c1 c7 33 rol $0x33,%rdi 51bc: 48 87 db xchg %rbx,%rbx 51bf: 48 89 55 c8 mov %rdx,-0x38(%rbp) 51c3: 48 8b 55 c8 mov -0x38(%rbp),%rdx 51c7: 48 8b 55 e0 mov -0x20(%rbp),%rdx 51cb: 48 89 55 d0 mov %rdx,-0x30(%rbp) 51cf: 48 8b 55 d0 mov -0x30(%rbp),%rdx 51d3: 48 89 55 90 mov %rdx,-0x70(%rbp) 51d7: 48 89 7d 98 mov %rdi,-0x68(%rbp) [0x4c2a1db .. 0x4c2a1e0]: let cfa=oldBP+16 in RA=*(cfa+-8) SP=cfa+0 BP=*(cfa+-16) 51db: 49 89 ef mov %rbp,%r15 51de: 48 89 c5 mov %rax,%rbp [0x4c2a1e1 .. 0x4c2a20b]: let cfa=oldBP+0 in RA=*(cfa+-8) SP=cfa+0 BP=*(cfa+-16) 51e1: 48 81 ec 80 00 00 00 sub $0x80,%rsp 51e8: 48 8b 78 08 mov 0x8(%rax),%rdi 51ec: 48 8b 00 mov (%rax),%rax 51ef: 48 c1 c7 03 rol $0x3,%rdi 51f3: 48 c1 c7 0d rol $0xd,%rdi 51f7: 48 c1 c7 3d rol $0x3d,%rdi 51fb: 48 c1 c7 33 rol $0x33,%rdi 51ff: 48 87 d2 xchg %rdx,%rdx 5202: 48 81 c4 80 00 00 00 add $0x80,%rsp 5209: 4c 89 fd mov %r15,%rbp [0x4c2a20c .. 0x4c2a215]: let cfa=oldBP+16 in RA=*(cfa+-8) SP=cfa+0 BP=*(cfa+-16) 520c: 48 89 45 c8 mov %rax,-0x38(%rbp) 5210: 48 8b 45 c8 mov -0x38(%rbp),%rax 5214: 41 5f pop %r15 [0x4c2a216 .. 0x4c2a217]: let cfa=oldBP+16 in RA=*(cfa+-8) SP=cfa+0 BP=*(cfa+-16) 5216: c9 leaveq 5217: c3 retq > 51a3: 48 8d 45 90 lea -0x70(%rbp),%rax
> ..
> .. no writes to %rax ..
> ..
> 51db: 49 89 ef mov %rbp,%r15
> 51de: 48 89 c5 mov %rax,%rbp
>
> So __builtin_frame_address(0) gives us %rbp - 0x70
> whereas I think the correct CFA is %rbp + 16.
This is due to my mistake in defining VALGRIND_CFI_PROLOGUE,
referring to __builtin_frame_address(0) as %0
rather than %2. With that fixed, the relevant fragment looks more
plausible but is still wrong:
[0x4c2bdce .. 0x4c2bdd3]: let cfa=oldBP+16 in RA=*(cfa+-8) SP=cfa+0
BP=*(cfa+-16)
7dce: 49 89 ef mov %rbp,%r15
7dd1: 48 89 ed mov %rbp,%rbp
The CFI is correct to claim that cfa=oldBP+16, but gcc has given us
%rbp for the CFA when it should be %rbp+16.
Analysis of the combined assembly/annotations below shows that the
"let cfa=" values consistently state it to be the word above the
return address. This is also consistent with the AMD64 ABI Draft 0.98
(September 27, 2006) Sec 6.2.5 "_Unwind_GetCFA", which states that the
CFA is "the value of %rsp at the call site in the previous frame".
However, the value computed by gcc for __canonical_frame_address(0)
appears to be wrong. It's as if the CFI generator inside gcc knows
correctly that the CFA is at rbp+16 but the part of gcc that creates
the expression for __canonical_frame_address(0) ignores the +16 bit.
This is with vanilla gcc-4.5.1 (same with gcc-4.4.3). If
__canonical_frame_address(0) is not going to give us reliable values
then this scheme will have to be devnulled.
0000000000007d45 <_vgwZU_libQtCoreZdsoZa__ZN6QMutexD1Ev>:
[0x4c2bd45 .. 0x4c2bd45]: let cfa=oldSP+8 in RA=*(cfa+-8) SP=cfa+0 BP=Same
7d45: 55 push %rbp
[0x4c2bd46 .. 0x4c2bd48]: let cfa=oldSP+16 in RA=*(cfa+-8) SP=cfa+0 BP=Same
7d46: 48 89 e5 mov %rsp,%rbp
[0x4c2bd49 .. 0x4c2bdcd]: let cfa=oldBP+16 in RA=*(cfa+-8) SP=cfa+0 BP=*(cfa+-16)
7d49: 41 57 push %r15
7d4b: 48 c1 c7 03 rol $0x3,%rdi
7d4f: 48 c1 c7 0d rol $0xd,%rdi
7d53: 48 c1 c7 3d rol $0x3d,%rdi
7d57: 48 c1 c7 33 rol $0x33,%rdi
7d5b: 48 87 c9 xchg %rcx,%rcx
7d5e: 48 89 45 c8 mov %rax,-0x38(%rbp)
7d62: 48 8b 45 c8 mov -0x38(%rbp),%rax
7d66: 48 89 45 e0 mov %rax,-0x20(%rbp)
7d6a: 48 c7 45 90 04 01 47 movq $0x48470104,-0x70(%rbp)
7d71: 48
7d72: 48 89 7d 98 mov %rdi,-0x68(%rbp)
7d76: 48 c7 45 a0 00 00 00 movq $0x0,-0x60(%rbp)
7d7d: 00
7d7e: 48 c7 45 a8 00 00 00 movq $0x0,-0x58(%rbp)
7d85: 00
7d86: 48 c7 45 b0 00 00 00 movq $0x0,-0x50(%rbp)
7d8d: 00
7d8e: 48 c7 45 b8 00 00 00 movq $0x0,-0x48(%rbp)
7d95: 00
7d96: 48 8d 45 90 lea -0x70(%rbp),%rax
7d9a: ba 00 00 00 00 mov $0x0,%edx
7d9f: 48 c1 c7 03 rol $0x3,%rdi
7da3: 48 c1 c7 0d rol $0xd,%rdi
7da7: 48 c1 c7 3d rol $0x3d,%rdi
7dab: 48 c1 c7 33 rol $0x33,%rdi
7daf: 48 87 db xchg %rbx,%rbx
7db2: 48 89 55 c8 mov %rdx,-0x38(%rbp)
7db6: 48 8b 55 c8 mov -0x38(%rbp),%rdx
7dba: 48 8b 55 e0 mov -0x20(%rbp),%rdx
7dbe: 48 89 55 d0 mov %rdx,-0x30(%rbp)
7dc2: 48 8b 55 d0 mov -0x30(%rbp),%rdx
7dc6: 48 89 55 90 mov %rdx,-0x70(%rbp)
7dca: 48 89 7d 98 mov %rdi,-0x68(%rbp)
[0x4c2bdce .. 0x4c2bdd3]: let cfa=oldBP+16 in RA=*(cfa+-8) SP=cfa+0 BP=*(cfa+-16)
7dce: 49 89 ef mov %rbp,%r15
7dd1: 48 89 ed mov %rbp,%rbp
[0x4c2bdd4 .. 0x4c2bdfe]: let cfa=oldBP+0 in RA=*(cfa+-8) SP=cfa+0 BP=*(cfa+-16)
7dd4: 48 81 ec 80 00 00 00 sub $0x80,%rsp
7ddb: 48 8b 78 08 mov 0x8(%rax),%rdi
[0x4c2bdff .. 0x4c2be08]: let cfa=oldBP+16 in RA=*(cfa+-8) SP=cfa+0 BP=*(cfa+-16)
7ddf: 48 8b 00 mov (%rax),%rax
7de2: 48 c1 c7 03 rol $0x3,%rdi
7de6: 48 c1 c7 0d rol $0xd,%rdi
7dea: 48 c1 c7 3d rol $0x3d,%rdi
7dee: 48 c1 c7 33 rol $0x33,%rdi
7df2: 48 87 d2 xchg %rdx,%rdx
7df5: 48 81 c4 80 00 00 00 add $0x80,%rsp
7dfc: 4c 89 fd mov %r15,%rbp
7dff: 48 89 45 c8 mov %rax,-0x38(%rbp)
7e03: 48 8b 45 c8 mov -0x38(%rbp),%rax
7e07: 41 5f pop %r15
[0x4c2be09 .. 0x4c2be09]: let cfa=oldBP+16 in RA=*(cfa+-8) SP=cfa+0 BP=*(cfa+-16)
7e09: c9 leaveq
[0x4c2be0a .. 0x4c2be0a]: let cfa=oldSP+8 in RA=*(cfa+-8) SP=cfa+0 BP=*(cfa+-16)
7e0a: c3 retq
Try __builtin_dwarf_cfa () instead of __builtin_frame_address (0) ? (In reply to comment #48) > Try __builtin_dwarf_cfa () instead of __builtin_frame_address (0) ? Hmm, that does seem to help, yes. Thanks. It's also totally undocumented (in the gcc manual). It would be nice to see some documentation, given that __builtin_dwarf_cfa() seems to do what __builtin_frame_address(0) claims to do, but doesn't. From googling it has been around at least since gcc-4.2 times, if not earlier. Correct? __builtin_dwarf_cfa () is supported by the compiler for more than 12 years, I bet even 2.95 had it and perhaps some of the latest egcs releases. I updated our s390x valgrind repository to use the same mechanism. Florian or myself will post updated patches soon in 243404 Btw, on Darwin gcc does not emit .cfi_startproc for a really stupid reason: its configure check for CFI support in GAS fails because it depends on readelf, and there is no readelf there (by default, at least). I guess this can not be fixed or worked around in Valgrind. |
Created attachment 48491 [details] update CFI when messing with the stack pointer The attached patch adds ".cfi_adjust_cfa_offset" annotations to the assembler blocks in CALL_FN_W_* macros. This way one don't have to wonder why stack unwinding works in Valgrind wrappers. Also, in one of the assembler blocks the stack frame of the OrigFn is disposed _before_ the actual call is done. Probably this code was never used ;) This is also fixed in the patch.