Bug 243270

Summary: Make stack unwinding in Valgrind wrappers more reliable
Product: valgrind Reporter: Evgeniy Stepanov <eugeni.stepanov>
Component: generalAssignee: Julian Seward <jseward>
Status: RESOLVED FIXED    
Severity: normal CC: bart.vanassche+kde, borntraeger, florian, jakub, konstantin.s.serebryany
Priority: NOR    
Version: 3.6 SVN   
Target Milestone: ---   
Platform: Compiled Sources   
OS: Linux   
Latest Commit: Version Fixed In:
Bug Depends on:    
Bug Blocks: 243404    
Attachments: update CFI when messing with the stack pointer
update CFI, try 2
update CFI, try 2 + a small fix
Add CFI annotations around VALGRIND_CALL_NOREDIR_RAX for amd64
Force frame pointer in CALL_FN_* and reduce the lower limit on stack size
Force frame pointer in CALL_FN_* and reduce the lower limit on stack size

Description Evgeniy Stepanov 2010-06-30 18:58:14 UTC
Created attachment 48491 [details]
update CFI when messing with the stack pointer

The attached patch adds ".cfi_adjust_cfa_offset" annotations to the assembler blocks in CALL_FN_W_* macros. This way one don't have to wonder why stack unwinding works in Valgrind wrappers.

Also, in one of the assembler blocks the stack frame of the OrigFn is disposed _before_ the actual call is done. Probably this code was never used ;) This is also fixed in the patch.
Comment 1 Julian Seward 2010-07-21 12:40:52 UTC
Evgeniy, interesting.  I would like to ask: does this solve any specific
problem?  Or is it something that you just thought should be fixed?
Comment 2 Konstantin Serebryany 2010-07-22 10:35:36 UTC
This is a *potential* problem for the platforms supported by official valgrind and a *real* problem (and a real fix) for Native Client (http://code.google.com/p/nativeclient/wiki/ValgrindMemcheck)
Comment 3 Julian Seward 2010-08-21 13:24:01 UTC
Presumably this needs to happen (or at least be checked) 
also for arm and s390x.
Comment 4 Julian Seward 2010-08-21 13:41:00 UTC
Hmm, the patch seems to make stack unwinding worse in some cases
on Ubuntu 10.04 x86_64.  eg for helgrind/tests/hg02_deadlock

before:

==26793== Thread #3 was created
==26793==    at 0x51346BE: clone (clone.S:77)
==26793==    by 0x4E38172: pthread_create@@GLIBC_2.2.5 (createthread.c:75)
==26793==    by 0x4C2D5FE: pthread_create_WRK (hg_intercepts.c:257)
==26793==    by 0x4C2D6FE: pthread_create@* (hg_intercepts.c:288)
==26793==    by 0x40075B: main (hg02_deadlock.c:36)

after:

==26788== Thread #3 was created
==26788==    at 0x51346BE: clone (clone.S:77)
==26788==    by 0x4E38172: pthread_create@@GLIBC_2.2.5 (createthread.c:75)
==26788==    by 0x4C2D5FE: pthread_create_WRK (hg_intercepts.c:257)

so now the unwinder stops in the wrapper, when it didn't before.

Any ideas?  I think you are correct, the wrappers need some kind of
CFI annotations, but maybe the patch needs further attention?
Comment 5 Bart Van Assche 2010-08-21 16:38:13 UTC
(In reply to comment #4)
> Hmm, the patch seems to make stack unwinding worse in some cases
> on Ubuntu 10.04 x86_64.  eg for helgrind/tests/hg02_deadlock
> 
> before:
> 
> ==26793== Thread #3 was created
> ==26793==    at 0x51346BE: clone (clone.S:77)
> ==26793==    by 0x4E38172: pthread_create@@GLIBC_2.2.5 (createthread.c:75)
> ==26793==    by 0x4C2D5FE: pthread_create_WRK (hg_intercepts.c:257)
> ==26793==    by 0x4C2D6FE: pthread_create@* (hg_intercepts.c:288)
> ==26793==    by 0x40075B: main (hg02_deadlock.c:36)
> 
> after:
> 
> ==26788== Thread #3 was created
> ==26788==    at 0x51346BE: clone (clone.S:77)
> ==26788==    by 0x4E38172: pthread_create@@GLIBC_2.2.5 (createthread.c:75)
> ==26788==    by 0x4C2D5FE: pthread_create_WRK (hg_intercepts.c:257)
> 
> so now the unwinder stops in the wrapper, when it didn't before.
> 
> Any ideas?  I think you are correct, the wrappers need some kind of
> CFI annotations, but maybe the patch needs further attention?

It's more complex than that. The short version of the stacktrace is also produced by the stack walking code that is currently on the trunk on some of the most recent Linux distro's. Examples of such distro's are Fedora 13 and openSUSE 11.3. I'm not sure whether or not this is related to the fact that both distro's contain gcc 4.5.0.
Comment 6 Evgeniy Stepanov 2010-08-21 22:21:35 UTC
Yes, it's not that simple. In your example (hg02_deadlock) gcc picked %rbp as the CFA register, with zero offset. In this case moving %rsp does not change anything, and CFA offset must stay the same.
In short, these directives are needed when (and only when) %rsp is the CFA register. At this point I don't see a clean fix for this problem.
Comment 7 Evgeniy Stepanov 2010-08-23 12:34:36 UTC
There is a GCC-specific solution.
We can create our own frame pointer by calling __builtin_frame_address(0), writing it in some free register, and adding an appropriate CFI annotation. But that will take one extra register for what is essentially a compile-time constant.
Comment 8 Christian Borntraeger 2010-08-25 14:10:11 UTC
As far as I can see s390 would need a similar annotation, if the CFA is based on the content r15. But we also have the same issue that we dont know if CFA is based on r15 or not (e.g. frame pointer r11)

Hmm, maybe it is possible to create a hand written eh_frame section for this code.
Comment 9 Evgeniy Stepanov 2010-08-26 11:56:15 UTC
Created attachment 50973 [details]
update CFI, try 2
Comment 10 Evgeniy Stepanov 2010-08-26 12:08:31 UTC
The second patch allocates one more register for the frame pointer and uses it a base for CFA. In fact, it may use as many as 2 extra registers: __builtin_frame_address(0) makes gcc keep a frame pointer in %rbp.

Alternatively, we can try tricks like calling __builtin_frame_address(0) or alloca(0) and casting it to something like (void)(volatile*) to prevent optimization. Any of these forces a frame pointer for the current function and gcc then defines CFA as %rbp+0. Of course, this may change in the future, but it seems unlikely. Another bonus: this works for gcc<4.3, which don't support cfi directives at all.
Comment 11 Evgeniy Stepanov 2010-08-26 12:21:23 UTC
Created attachment 50974 [details]
update CFI, try 2 + a small fix

Includes a fix for the stack frame being disposed _before_ the call in CALL_FN_W_8W
Comment 12 Julian Seward 2010-09-01 19:16:13 UTC
We should at least take the CALL_FN_W_8W fix right away, whilst
we figure out what to do with the CFI issue.
Comment 13 Julian Seward 2010-09-01 19:20:30 UTC
(In reply to comment #10)

Evgeniy,

> The second patch allocates one more register for the frame pointer and uses it
> a base for CFA. In fact, it may use as many as 2 extra registers:
> __builtin_frame_address(0) makes gcc keep a frame pointer in %rbp.
> 
> Alternatively, we can try tricks like calling __builtin_frame_address(0) or
> alloca(0) and casting it to something like (void)(volatile*) to prevent
> optimization. Any of these forces a frame pointer for the current function and
> gcc then defines CFA as %rbp+0. Of course, this may change in the future, but
> it seems unlikely. Another bonus: this works for gcc<4.3, which don't support
> cfi directives at all.

Thanks for the patch.  However, /me is confused.  Which of the
following are true?

1. the patch works for all gcc, not just gcc >= 4.3  

2. the patch is gcc specific, because it uses __builtin_frame_address

If (2) is true, that's a bit of a problem because Bart wants to
be able to compile this file with MSVC now.  Can we solve that
by #ifdef __MSC_VER (or whatever the right symbol is) ?
Comment 14 Bart Van Assche 2010-09-01 20:17:16 UTC
These macros haven't been ported yet to the Microsoft C compiler, so I don't think that these changes do not have to be protected by #ifdef _MSC_VER.
Comment 15 Evgeniy Stepanov 2010-09-01 22:12:22 UTC
1. CFI directives are only supported in gcc 4.4
For older versions, we should map VALGRIND_PROLOGUE and _EPILOGUE to empty strings, and keep the __builtin_frame_address call. This should, at least, give a better chance of having a frame pointer than the current code.

2. Yes, all of this is gcc specific.

I think the right defines are __GNUC__ and __GNUC_MINOR__.
Comment 16 Bart Van Assche 2010-09-05 11:28:04 UTC
(In reply to comment #15)
> 1. CFI directives are only supported in gcc 4.4
> For older versions, we should map VALGRIND_PROLOGUE and _EPILOGUE to empty
> strings, and keep the __builtin_frame_address call. This should, at least, give
> a better chance of having a frame pointer than the current code.
> 
> 2. Yes, all of this is gcc specific.
> 
> I think the right defines are __GNUC__ and __GNUC_MINOR__.

As far as I know support for the CFI assembler directives only depends on the binutils version and not on the gcc version. So whether or not these are supported has to be detected by a configure test. What I'm currently wondering about is:
* The header file config.h generated by Valgrind's configure script is not installed to $PREFIX/include/valgrind. Should we keep it that way or should we rename that header file to valgrind-config.h and install it ?
* Should config.h/valgrind-config.h be included from valgrind.h ?
* If not, how to make sure that while building Valgrind and the regression tests that config.h has been included before valgrind.h ?

For more information about the CFI directives, see also:
* http://sourceware.org/binutils/docs/as/CFI-directives.html
* http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=arch/x86/include/asm/dwarf2.h;hb=HEAD
Comment 17 Bart Van Assche 2010-09-05 15:23:17 UTC
(In reply to comment #4)
> Hmm, the patch seems to make stack unwinding worse in some cases
> on Ubuntu 10.04 x86_64.  eg for helgrind/tests/hg02_deadlock
> 
> before:
> 
> ==26793== Thread #3 was created
> ==26793==    at 0x51346BE: clone (clone.S:77)
> ==26793==    by 0x4E38172: pthread_create@@GLIBC_2.2.5 (createthread.c:75)
> ==26793==    by 0x4C2D5FE: pthread_create_WRK (hg_intercepts.c:257)
> ==26793==    by 0x4C2D6FE: pthread_create@* (hg_intercepts.c:288)
> ==26793==    by 0x40075B: main (hg02_deadlock.c:36)
> 
> after:
> 
> ==26788== Thread #3 was created
> ==26788==    at 0x51346BE: clone (clone.S:77)
> ==26788==    by 0x4E38172: pthread_create@@GLIBC_2.2.5 (createthread.c:75)
> ==26788==    by 0x4C2D5FE: pthread_create_WRK (hg_intercepts.c:257)
> 
> so now the unwinder stops in the wrapper, when it didn't before.
> 
> Any ideas?  I think you are correct, the wrappers need some kind of
> CFI annotations, but maybe the patch needs further attention?

This is probably because the code for interpreting .cfi_def_cfa in m_debuginfo is incomplete: it only supports the stack pointer and the frame pointer on x86/AMD64 but no other registers. I'm not familiar with this code so it's not clear to me which enums/structures/functions have to be updated to implement this properly.
Comment 18 Evgeniy Stepanov 2010-09-06 10:23:39 UTC
(In reply to comment #16)
> (In reply to comment #15)
> > 1. CFI directives are only supported in gcc 4.4
> > For older versions, we should map VALGRIND_PROLOGUE and _EPILOGUE to empty
> > strings, and keep the __builtin_frame_address call. This should, at least, give
> > a better chance of having a frame pointer than the current code.
> > 
> > 2. Yes, all of this is gcc specific.
> > 
> > I think the right defines are __GNUC__ and __GNUC_MINOR__.
> 
> As far as I know support for the CFI assembler directives only depends on the
> binutils version and not on the gcc version. So whether or not these are
> supported has to be detected by a configure test.

We also need gcc to emit .cfi_startproc/.cfi_endproc. This requires a recent enough _gcc_ version. Also, there is -fno-dwarf2-cfi-asm option that will suppress .cfi_startproc/.cfi_endproc.

Yes, it's complicated enough to require a configure test. Perhaps one for CFI support and one for __builtin_frame_address.

> What I'm currently wondering
> about is:
> * The header file config.h generated by Valgrind's configure script is not
> installed to $PREFIX/include/valgrind. Should we keep it that way or should we
> rename that header file to valgrind-config.h and install it ?
> * Should config.h/valgrind-config.h be included from valgrind.h ?
> * If not, how to make sure that while building Valgrind and the regression
> tests that config.h has been included before valgrind.h ?
> 
> For more information about the CFI directives, see also:
> * http://sourceware.org/binutils/docs/as/CFI-directives.html
> *
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=arch/x86/include/asm/dwarf2.h;hb=HEAD
Comment 19 Evgeniy Stepanov 2010-09-06 10:38:46 UTC
(In reply to comment #17)
> (In reply to comment #4)
> > Hmm, the patch seems to make stack unwinding worse in some cases
> > on Ubuntu 10.04 x86_64.  eg for helgrind/tests/hg02_deadlock
> > 
> > before:
> > 
> > ==26793== Thread #3 was created
> > ==26793==    at 0x51346BE: clone (clone.S:77)
> > ==26793==    by 0x4E38172: pthread_create@@GLIBC_2.2.5 (createthread.c:75)
> > ==26793==    by 0x4C2D5FE: pthread_create_WRK (hg_intercepts.c:257)
> > ==26793==    by 0x4C2D6FE: pthread_create@* (hg_intercepts.c:288)
> > ==26793==    by 0x40075B: main (hg02_deadlock.c:36)
> > 
> > after:
> > 
> > ==26788== Thread #3 was created
> > ==26788==    at 0x51346BE: clone (clone.S:77)
> > ==26788==    by 0x4E38172: pthread_create@@GLIBC_2.2.5 (createthread.c:75)
> > ==26788==    by 0x4C2D5FE: pthread_create_WRK (hg_intercepts.c:257)
> > 
> > so now the unwinder stops in the wrapper, when it didn't before.
> > 
> > Any ideas?  I think you are correct, the wrappers need some kind of
> > CFI annotations, but maybe the patch needs further attention?
> 
> This is probably because the code for interpreting .cfi_def_cfa in m_debuginfo
> is incomplete: it only supports the stack pointer and the frame pointer on
> x86/AMD64 but no other registers. I'm not familiar with this code so it's not
> clear to me which enums/structures/functions have to be updated to implement
> this properly.

With the latest patch, stack traces are fine in all cases. But you make a good point: there is no sense in trashing another register (%ebx) for the frame pointer if valgrind is not going to use it anyway.

I'm not sure if this is 100% correct, but adding a "__builtin_frame_address(0);" statement anywhere inside a function forces a frame pointer for that function. This happens with any combination of flags I could think of. In this case we don't need any CFI directives at all.

Or we can try and support other registers in the cfi reader.
Comment 20 Bart Van Assche 2010-09-06 12:31:16 UTC
(In reply to comment #19)
> With the latest patch, stack traces are fine in all cases. But you make a good
> point: there is no sense in trashing another register (%ebx) for the frame
> pointer if valgrind is not going to use it anyway.

It would help if you could specify which cases (gcc version + CPU type) you have tested. Many stack traces are wrong even with the latest patch (2010-08-26) when building Valgrind with gcc 4.5.0 on an 64-bit x86 system. The patch doesn't seem to have any effect on the system I have tested it on.

Also, your patch only addresses the AMD64 platform. Which other platforms than AMD64 do need CFI annotations with recent gcc versions ?
Comment 21 Evgeniy Stepanov 2010-09-07 10:20:53 UTC
Being too lazy to build gcc myself, I only tested gcc <= 4.4 :)
Now that I've overcome my laziness, it looks like the problems with gcc 4.5 are unrelated to this issue. At least some of them are due to the check in m_stacktrace.c:267 :

if (fp_min + 512 >= fp_max) {

With gcc 4.5, we are getting closer to fp_max than with older versions, it seems. Changing the constant to anything (strictly) less then 440 fixes the problem for me.

I'm not sure about other platforms.
Comment 22 Bart Van Assche 2010-09-08 19:23:05 UTC
(In reply to comment #21)
> Being too lazy to build gcc myself, I only tested gcc <= 4.4 :)
> Now that I've overcome my laziness, it looks like the problems with gcc 4.5 are
> unrelated to this issue. At least some of them are due to the check in
> m_stacktrace.c:267 :
> 
> if (fp_min + 512 >= fp_max) {
> 
> With gcc 4.5, we are getting closer to fp_max than with older versions, it
> seems. Changing the constant to anything (strictly) less then 440 fixes the
> problem for me.

Well done. But on my setup (amd64 + gcc 4.5.0) I had to decrease that value even further. I only got the same call stacks as with earlier gcc versions with the value 380. Any idea why ?
Comment 23 Bart Van Assche 2010-09-08 19:32:14 UTC
Created attachment 51436 [details]
Add CFI annotations around VALGRIND_CALL_NOREDIR_RAX for amd64

Regenerated patch against trunk and added coregrind/m_stacktrace.c modification.
Comment 24 Bart Van Assche 2010-09-08 19:33:59 UTC
Evgeniy, a question about your patch: why have you added the VALGRIND_CFI_PROLOGUE / VALGRIND_CFI_EPILOGUE macros in the CALL_FN_W_* macros instead of at the beginning and the end of the VALGRIND_CALL_NOREDIR_RAX macro ? Would that also work ?
Comment 25 Bart Van Assche 2010-09-08 19:43:49 UTC
With this patch (v4) I'm also seeing a crash that does not occur without this patch:

$ ./vg-in-place --tool=drd drd/tests/pth_cancel_locked
==23471== drd, a thread error detector
==23471== Copyright (C) 2006-2010, and GNU GPL'd, by Bart Van Assche.
==23471== Using Valgrind-3.6.0.SVN and LibVEX; rerun with -h for copyright info
==23471== Command: drd/tests/pth_cancel_locked
==23471== 
==23471== 
==23471== Process terminating with default action of signal 11 (SIGSEGV)
==23471==  Access not within mapped region at address 0x3
==23471==    at 0x5FC7573: ??? (in /lib64/libgcc_s.so.1)
==23471==    by 0x5FC79BA: ??? (in /lib64/libgcc_s.so.1)
==23471==    by 0x5FC7E82: _Unwind_Resume (in /lib64/libgcc_s.so.1)
==23471==    by 0x4E455DD: __condvar_cleanup1 (pthread_cond_wait.S:462)
==23471==    by 0x4C2D35A: pthread_cond_wait@* (drd_pthread_intercepts.c:691)
==23471==    by 0x40094C: thread (pth_cancel_locked.c:21)
==23471==    by 0x4C29650: vgDrd_thread_wrapper (drd_pthread_intercepts.c:272)
==23471==    by 0x4E40A4E: start_thread (pthread_create.c:297)
==23471==  If you believe this happened as a result of a stack
==23471==  overflow in your program's main thread (unlikely but
==23471==  possible), you can try to increase the size of the
==23471==  main thread stack using the --main-stacksize= flag.
==23471==  The main thread stack size used in this run was 8388608.
==23471== 
==23471== For counts of detected and suppressed errors, rerun with: -v
==23471== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 154 from 125)
./vg-in-place: line 28: 23471 Killed                  VALGRIND_LIB="$vgbasedir/.in_place" VALGRIND_LIB_INNER="$vgbasedir/.in_place" "$vgbasedir/coregrind/valgrind" "$@"
Comment 26 Evgeniy Stepanov 2010-09-09 10:54:50 UTC
> Well done. But on my setup (amd64 + gcc 4.5.0) I had to decrease that value
> even further. I only got the same call stacks as with earlier gcc versions with
> the value 380. Any idea why ?

Maybe you have newer glibc/libpthread? I compiled gcc 4.5 on ubuntu lucid, everything else (glibc, even binutils) are older versions. Perhaps this number shuld be decreased even more, as long as tests pass. As I understand the comment above, even values very close to zero will work.

Adding CFI directives to VALGRIND_CALL_NOREDIR_RAX would not work in a case when stack trace starts at one of the instructions after "subq $128,%%rsp" and before VALGRIND_CALL_NOREDIR_RAX. For example, in functions without a frame pointer, gcc adds a .cfi_adjust_cfa_offset after every stack push to make CFA exact at all times.

If there is a frame pointer already, it does not matter.

As I see it, there are two possibilities:
- We support frame pointer in arbitrary register in m_debug_info, put it in %rbx, and add appropriate cfi directives.
- We simply force gcc into having a frame pointer with __builtin_frame_address(0) call. Then we don't need any cfi directives. But gcc does not really have to have a frame pointer, it just does so in current version. It will probably keep doing so in future, as it's a very reasonable thing to do.

I'll take a look at the crashing test later.
Comment 27 Bart Van Assche 2010-09-18 13:59:45 UTC
A second strange behavior change caused by this patch is:

--- origin5-bz2.stderr.exp-glibc25-amd64        2010-09-18 13:43:07.000000000 +0200
+++ origin5-bz2.stderr.out      2010-09-18 13:48:04.000000000 +0200
@@ -117,6 +117,12 @@
 
 Conditional jump or move depends on uninitialised value(s)
    at 0x........: main (origin5-bz2.c:6512)
- Uninitialised value was created by a client request
-   at 0x........: main (origin5-bz2.c:6479)
+ Uninitialised value was created by a heap allocation
+   at 0x........: malloc (vg_replace_malloc.c:...)
+   by 0x........: g_serviceFn (origin5-bz2.c:6429)
+   by 0x........: default_bzalloc (origin5-bz2.c:4470)
+   by 0x........: BZ2_decompress (origin5-bz2.c:1578)
+   by 0x........: BZ2_bzDecompress (origin5-bz2.c:5192)
+   by 0x........: BZ2_bzBuffToBuffDecompress (origin5-bz2.c:5678)
+   by 0x........: main (origin5-bz2.c:6498)
Comment 28 Evgeniy Stepanov 2010-10-04 14:41:50 UTC
Bart, I can not reproduce either the crash or the behavior change. Actually, with the newest valgrind and the latest patch the set of failing tests is the same for clean and patched versions.

gcc-4.5 (GCC) 4.5.1 (self-compiled)
binutils 2.20.1-3ubuntu7
libc6 2.11.1-0ubuntu7.2

Valgrind synched at r11392
Comment 29 Julian Seward 2010-10-04 23:00:22 UTC
(In reply to comment #26)

I'm confused (this is a confusing problem :-)

> As I see it, there are two possibilities:
> - We support frame pointer in arbitrary register in m_debug_info, put it in
> %rbx, and add appropriate cfi directives.

That would make the stack unwinder slower, and it's already one of the
most performance critical parts of the system (in unwind-intensive
scenarios).

> - We simply force gcc into having a frame pointer with
> __builtin_frame_address(0) call. Then we don't need any cfi directives. But gcc
> does not really have to have a frame pointer, it just does so in current
> version. It will probably keep doing so in future, as it's a very reasonable
> thing to do.

So I like this one better.

What I can't figure out is: do any of the attached patches implement
exactly this, and nothing else?  If yes, does it actually work
correctly?
Comment 30 Evgeniy Stepanov 2010-10-05 13:00:27 UTC
Created attachment 52231 [details]
Force frame pointer in CALL_FN_* and reduce the lower limit on stack size

This patch leaves the following tests failing:
memcheck/tests/linux/stack_switch        (stderr)
memcheck/tests/origin5-bz2               (stderr)
helgrind/tests/pth_barrier3              (stderr)
helgrind/tests/tc06_two_races_xml        (stderr)
helgrind/tests/tc09_bad_unlock           (stderr)
helgrind/tests/tc20_verifywrap           (stderr)
helgrind/tests/tc23_bogus_condwait       (stderr)
drd/tests/pth_barrier3                   (stderr)
exp-ptrcheck/tests/bad_percentify        (stderr)
exp-ptrcheck/tests/supp                  (stderr)

It does not add any new failures and fixes 21 out of 31 tests that are currently failing with gcc 4.5.1.
Comment 31 Evgeniy Stepanov 2010-10-05 13:05:53 UTC
Julian, I've attached a patch that does not clobber any new registers (except for %rbp, if the enclosing function did not have a frame pointer already).

Bart, I was wrong, the issue with origin5-bz2 does reproduce over here. But it occurs both with or without the patch.
Comment 32 Julian Seward 2010-10-05 15:34:12 UTC
(In reply to comment #31)

> Julian, I've attached a patch that does not clobber any new registers (except
> for %rbp, if the enclosing function did not have a frame pointer already).

This looks like the cleanest solution so far.

On further thought and reading of the gcc docs, though, I have more
questions.  AFAICS, the gcc-4.5 docs for __builtin_frame_address(0) do
not give any assurance that it really forces %rbp to be used as a
frame register -- what stops gcc from giving us the frame address as
(eg) %rsp + offset?


What would happen if we used a modified version of this:

> - We support frame pointer in arbitrary register in m_debug_info, put it in
> %rbx, and add appropriate cfi directives.

instead of %rbx, use %rbp: add an assembly instruction to copy
__builtin_frame_address(0) into %rbp, and add cfi directives.  Then (eg)
CALL_FN_W_v would look like this (added/changed lines marked with 'x')


#define CALL_FN_W_v(lval, orig)                                   \
   do {                                                           \
      volatile OrigFn        _orig = (orig);                      \
      volatile unsigned long _argvec[1];                          \
      volatile unsigned long _res;                                \
      _argvec[0] = (unsigned long)_orig.nraddr;                   \
      __asm__ volatile(                                           \
         ".cfi_remember_state\n\t"                                \ x
         ".cfi_def_cfa rbp, 0\n\t"                                \ x
         "movq %0,%%rbp\n\t"                                      \ x
         "subq $128,%%rsp\n\t"                                    \
         "movq (%%rax), %%rax\n\t"  /* target->%rax */            \
         VALGRIND_CALL_NOREDIR_RAX                                \
         "addq $128,%%rsp\n\t"                                    \
         ".cfi_restore_state\n\t"                                 \ x
         : /*out*/   "=a" (_res)                                  \
         : /*in*/    "a" (&_argvec[0]),                           \
                     "r" (__builtin_frame_address(0))             \ x
         : /*trash*/ "cc", "memory", __CALLER_SAVED_REGS, "rbp"   \ x
      );                                                          \
      lval = (__typeof__(lval)) _res;                             \
   } while (0)

Is that feasible?  Effects (+,-) are:

- it might generate a "movq %rbp,%rbp".  I don't care about that,
  and anyway the JIT's optimiser makes that essentially free.

- it might trash a second register (?)  I don't care about that
  either; these wrappers aren't performance critical.

+ it means we make no assumption about how/what gcc does with
  __builtin_frame_address(0), only that it produces a correct value

+ it doesn't require the unwinder to handle %rbx, only %rbp, which
  it already does, hence no unwinder performance loss

+ it doesn't require the JIT to maintain precise values for %rbx,
  only %rbp, which it already does, hence no runtime performance loss

- it makes it a bit more complex to handle gccs that don't know
  about __builtin_frame_address


Sorry to drag this out.  Considering we also need to solve the same
problem for arm-linux and s390x-linux, I want to be sure we get the
most robust solution we can.
Comment 33 Evgeniy Stepanov 2010-10-05 17:59:42 UTC
(In reply to comment #32)
> (In reply to comment #31)
> 
> > Julian, I've attached a patch that does not clobber any new registers (except
> > for %rbp, if the enclosing function did not have a frame pointer already).
> 
> This looks like the cleanest solution so far.
> 
> On further thought and reading of the gcc docs, though, I have more
> questions.  AFAICS, the gcc-4.5 docs for __builtin_frame_address(0) do
> not give any assurance that it really forces %rbp to be used as a
> frame register -- what stops gcc from giving us the frame address as
> (eg) %rsp + offset?
> 
> 
> What would happen if we used a modified version of this:
> 
> > - We support frame pointer in arbitrary register in m_debug_info, put it in
> > %rbx, and add appropriate cfi directives.
> 
> instead of %rbx, use %rbp: add an assembly instruction to copy
> __builtin_frame_address(0) into %rbp, and add cfi directives.  Then (eg)
> CALL_FN_W_v would look like this (added/changed lines marked with 'x')
> 
> 
> #define CALL_FN_W_v(lval, orig)                                   \
>    do {                                                           \
>       volatile OrigFn        _orig = (orig);                      \
>       volatile unsigned long _argvec[1];                          \
>       volatile unsigned long _res;                                \
>       _argvec[0] = (unsigned long)_orig.nraddr;                   \
>       __asm__ volatile(                                           \
>          ".cfi_remember_state\n\t"                                \ x
>          ".cfi_def_cfa rbp, 0\n\t"                                \ x
>          "movq %0,%%rbp\n\t"                                      \ x
>          "subq $128,%%rsp\n\t"                                    \
>          "movq (%%rax), %%rax\n\t"  /* target->%rax */            \
>          VALGRIND_CALL_NOREDIR_RAX                                \
>          "addq $128,%%rsp\n\t"                                    \
>          ".cfi_restore_state\n\t"                                 \ x
>          : /*out*/   "=a" (_res)                                  \
>          : /*in*/    "a" (&_argvec[0]),                           \
>                      "r" (__builtin_frame_address(0))             \ x
>          : /*trash*/ "cc", "memory", __CALLER_SAVED_REGS, "rbp"   \ x
>       );                                                          \
>       lval = (__typeof__(lval)) _res;                             \
>    } while (0)
> 
> Is that feasible?  Effects (+,-) are:
> 
> - it might generate a "movq %rbp,%rbp".  I don't care about that,
>   and anyway the JIT's optimiser makes that essentially free.
> 
> - it might trash a second register (?)  I don't care about that
>   either; these wrappers aren't performance critical.
> 
> + it means we make no assumption about how/what gcc does with
>   __builtin_frame_address(0), only that it produces a correct value
> 
> + it doesn't require the unwinder to handle %rbx, only %rbp, which
>   it already does, hence no unwinder performance loss
> 
> + it doesn't require the JIT to maintain precise values for %rbx,
>   only %rbp, which it already does, hence no runtime performance loss
> 
> - it makes it a bit more complex to handle gccs that don't know
>   about __builtin_frame_address
> 
> 
> Sorry to drag this out.  Considering we also need to solve the same
> problem for arm-linux and s390x-linux, I want to be sure we get the
> most robust solution we can.

Great idea, but I've already had it before I came up with this %rbx nonsense :)
Apparently gcc does not allow clobbering %rbp in inline assembly.

int main() {
  __asm__ volatile(
      "nop \n\t"
      : /*out*/
      : /*in*/
      : /*trash*/ "rbp"
      );
  return 0;
}


$ gcc -c 1.c
1.c: In function ‘main’:
1.c:9: error: bp cannot be used in asm here
Comment 34 Julian Seward 2010-10-05 18:22:12 UTC
> Apparently gcc does not allow clobbering %rbp in inline assembly.

Yeah, I wondered if that would be a problem.  So, can't we
temporarily park the value of rbp in r15 (or whatever) and
tell gcc that r15 is trashed by the asm?
Comment 35 Evgeniy Stepanov 2010-10-06 15:05:54 UTC
Created attachment 52278 [details]
Force frame pointer in CALL_FN_* and reduce the lower limit on stack size

Nice idea.
Comment 36 Julian Seward 2010-10-06 15:20:08 UTC
(In reply to comment #35)
> Created an attachment (id=52278) [details]
> Force frame pointer in CALL_FN_* and reduce the lower limit on stack size

Thanks for the patch.  This actually works properly (iow, the gcc-4.5
stack traces are OK), is that correct?  If so I'll commit it.
Comment 37 Evgeniy Stepanov 2010-10-06 16:00:40 UTC
Yeah, the same as the previous patch.
Comment 38 Julian Seward 2010-10-07 00:22:23 UTC
Committed, with comments, without the 512 -> 380 change.  Thanks!
r11402.

I'll commit the 512->380 change separately.
Comment 39 Julian Seward 2010-10-07 00:46:23 UTC
Stack lower limit size reduced from 512 to 256 in r11403.

Thanks all for the work in analysis, making & trying out patches.
Comment 40 Julian Seward 2010-10-07 00:47:20 UTC
Closing, although as per comments above, will need to port this
also to arm and s390x at some point.
Comment 41 Jakub Jelinek 2010-10-08 11:16:44 UTC
You should add && defined (__GCC_HAVE_DWARF2_CFI_ASM) to the conditions guarding the use of .cfi_* directives in inline-asm.  When compiled with -fno-dwarf2-cfi-asm or with gcc configured against old gas, it won't otherwise compile.
The precondition of using .cfi_* directives in inline-asm is that GCC emits .cfi_* directives for the function as well, otherwise you'll have .cfi_* directives without preceeding .cfi_startproc, and the above mentioned macro
is a builtin GCC macro that says .cfi_* directives are being emitted.
Comment 42 Julian Seward 2010-10-08 11:35:36 UTC
(In reply to comment #41)

> You should add && defined (__GCC_HAVE_DWARF2_CFI_ASM) to the conditions
> guarding the use of .cfi_* directives in inline-asm.

Jakub, is that before or after r11414 (committed about 4 hours ago) ?

Assuming after, then the condition would be

  #if defined(__GNUC__) && defined(__GCC_HAVE_DWARF2_CFI_ASM) \
      && !defined(PLAT_amd64_darwin)

Is that correct?
Comment 43 Jakub Jelinek 2010-10-08 11:42:46 UTC
It applies both to code before and after.
#if defined(__GNUC__) && defined(__GCC_HAVE_DWARF2_CFI_ASM) \
      && !defined(PLAT_amd64_darwin)
looks correct.  I guess you wouldn't need the darwin special case with it, but it doesn't hurt.
Comment 44 Bart Van Assche 2010-10-08 12:11:25 UTC
(In reply to comment #43)
> It applies both to code before and after.
> #if defined(__GNUC__) && defined(__GCC_HAVE_DWARF2_CFI_ASM) \
>       && !defined(PLAT_amd64_darwin)
> looks correct.  I guess you wouldn't need the darwin special case with it, but
> it doesn't hurt.

Which value does __GCC_HAVE_DWARF2_CFI_ASM have for gcc 4.3 ? The CFI stuff must be disabled for gcc 4.3 because gcc 4.3 does not emit .cfi_startproc and hence causes the assembler to fail. See also https://bugs.kde.org/show_bug.cgi?id=253496.
Comment 45 Jakub Jelinek 2010-10-08 12:24:25 UTC
It is not defined there.  This is a compiler built in macro which is defined in 4.4+ iff .cfi_startproc etc. is being emitted.
Normally on x86-64 it is defined in 4.4+, but e.g. when using -fno-dwarf2-cfi-asm or -fno-asynchronous-unwind-tables, it is not (nor the .cfi_* directives).
Comment 46 Julian Seward 2010-10-08 14:33:01 UTC
Hmm, something's not right with the scheme that got committed
(independent of the gating conditions).  Trying with '#if 1' on Ubuntu
10.04 x86_64 (gcc (Ubuntu 4.4.3-4ubuntu5) 4.4.3) I get broken stack
traces for (eg)

./vg-in-place --tool=helgrind ./helgrind/tests/hg04_race

==14830== Thread #3 was created
==14830==    at 0x51346BE: clone (clone.S:77)
==14830==    by 0x4E38172: pthread_create@@GLIBC_2.2.5 (createthread.c:75)
==14830==    by 0x4C2D961: pthread_create_WRK (hg_intercepts.c:257)
==14830== 

That stack trace should have end in main (from the .stderr.exp file):

Thread #x was created
   ...
   by 0x........: pthread_create@* (hg_intercepts.c:...)
   by 0x........: main (hg04_race.c:21)


I just looked in detail at the interleaved assembly and CFI for a
function using these wrappers, "QT4_FUNC(void*, _ZN6QMutexD1Ev, void*
mutex)".  It's simpler than the wrapper for pthread_create_WRK
(hg_intercepts.c:257) and shows the same problem.

The assembly is from objdump, and the CFI from valgrind via
"--trace-symtab-patt=*vgpreload_helgrind*" --trace-cfi=yes

The CFI looks correct to me.  At 51db: and 51de: we save the old
rbp in r15 and install the new CFA in rbp, as per the plan in
comment 34 and comment 35.

Problem is, it looks like GCC has given us a completely bogus
value for __builtin_frame_address(0):

000000000000514f <_vgwZU_libQtCoreZdsoZa__ZN6QMutexD1Ev>:
    514f:       55                      push   %rbp
    5150:       48 89 e5                mov    %rsp,%rbp
    .. at this point, the CFA = %rbp + 16
    ..
    .. no assignments to %rbp here ..
    ..
    51a3:       48 8d 45 90             lea    -0x70(%rbp),%rax
    ..
    .. no writes to %rax ..
    ..
    51db:       49 89 ef                mov    %rbp,%r15
    51de:       48 89 c5                mov    %rax,%rbp

So __builtin_frame_address(0) gives us %rbp - 0x70
whereas I think the correct CFA is %rbp + 16.

Anybody have any ideas?  Is this analysis correct?  This bug is
turning into a disaster area :-(




The complete combined assembly + CFI follows:

000000000000514f <_vgwZU_libQtCoreZdsoZa__ZN6QMutexD1Ev>:

[0x4c2a14f .. 0x4c2a14f]: let cfa=oldSP+8 in RA=*(cfa+-8) SP=cfa+0 BP=Same
    514f:       55                      push   %rbp

[0x4c2a150 .. 0x4c2a152]: let cfa=oldSP+16 in RA=*(cfa+-8) SP=cfa+0 BP=Same
    5150:       48 89 e5                mov    %rsp,%rbp

[0x4c2a153 .. 0x4c2a1da]: let cfa=oldBP+16 in RA=*(cfa+-8) SP=cfa+0
                                              BP=*(cfa+-16)
    5153:       41 57                   push   %r15
    5155:       48 8d 55 e0             lea    -0x20(%rbp),%rdx
    5159:       48 c1 c7 03             rol    $0x3,%rdi
    515d:       48 c1 c7 0d             rol    $0xd,%rdi
    5161:       48 c1 c7 3d             rol    $0x3d,%rdi
    5165:       48 c1 c7 33             rol    $0x33,%rdi
    5169:       48 87 c9                xchg   %rcx,%rcx
    516c:       48 89 45 c8             mov    %rax,-0x38(%rbp)
    5170:       48 8b 45 c8             mov    -0x38(%rbp),%rax
    5174:       48 89 02                mov    %rax,(%rdx)
    5177:       48 c7 45 90 04 01 47    movq   $0x48470104,-0x70(%rbp)
    517e:       48 
    517f:       48 89 7d 98             mov    %rdi,-0x68(%rbp)
    5183:       48 c7 45 a0 00 00 00    movq   $0x0,-0x60(%rbp)
    518a:       00 
    518b:       48 c7 45 a8 00 00 00    movq   $0x0,-0x58(%rbp)
    5192:       00 
    5193:       48 c7 45 b0 00 00 00    movq   $0x0,-0x50(%rbp)
    519a:       00 
    519b:       48 c7 45 b8 00 00 00    movq   $0x0,-0x48(%rbp)
    51a2:       00 
    51a3:       48 8d 45 90             lea    -0x70(%rbp),%rax
    51a7:       ba 00 00 00 00          mov    $0x0,%edx
    51ac:       48 c1 c7 03             rol    $0x3,%rdi
    51b0:       48 c1 c7 0d             rol    $0xd,%rdi
    51b4:       48 c1 c7 3d             rol    $0x3d,%rdi
    51b8:       48 c1 c7 33             rol    $0x33,%rdi
    51bc:       48 87 db                xchg   %rbx,%rbx
    51bf:       48 89 55 c8             mov    %rdx,-0x38(%rbp)
    51c3:       48 8b 55 c8             mov    -0x38(%rbp),%rdx
    51c7:       48 8b 55 e0             mov    -0x20(%rbp),%rdx
    51cb:       48 89 55 d0             mov    %rdx,-0x30(%rbp)
    51cf:       48 8b 55 d0             mov    -0x30(%rbp),%rdx
    51d3:       48 89 55 90             mov    %rdx,-0x70(%rbp)
    51d7:       48 89 7d 98             mov    %rdi,-0x68(%rbp)

[0x4c2a1db .. 0x4c2a1e0]: let cfa=oldBP+16 in RA=*(cfa+-8) SP=cfa+0
                                              BP=*(cfa+-16)
    51db:       49 89 ef                mov    %rbp,%r15
    51de:       48 89 c5                mov    %rax,%rbp

[0x4c2a1e1 .. 0x4c2a20b]: let cfa=oldBP+0 in RA=*(cfa+-8) SP=cfa+0
                                             BP=*(cfa+-16)
    51e1:       48 81 ec 80 00 00 00    sub    $0x80,%rsp
    51e8:       48 8b 78 08             mov    0x8(%rax),%rdi
    51ec:       48 8b 00                mov    (%rax),%rax
    51ef:       48 c1 c7 03             rol    $0x3,%rdi
    51f3:       48 c1 c7 0d             rol    $0xd,%rdi
    51f7:       48 c1 c7 3d             rol    $0x3d,%rdi
    51fb:       48 c1 c7 33             rol    $0x33,%rdi
    51ff:       48 87 d2                xchg   %rdx,%rdx
    5202:       48 81 c4 80 00 00 00    add    $0x80,%rsp
    5209:       4c 89 fd                mov    %r15,%rbp

[0x4c2a20c .. 0x4c2a215]: let cfa=oldBP+16 in RA=*(cfa+-8) SP=cfa+0
                                              BP=*(cfa+-16)
    520c:       48 89 45 c8             mov    %rax,-0x38(%rbp)
    5210:       48 8b 45 c8             mov    -0x38(%rbp),%rax
    5214:       41 5f                   pop    %r15

[0x4c2a216 .. 0x4c2a217]: let cfa=oldBP+16 in RA=*(cfa+-8) SP=cfa+0
                                              BP=*(cfa+-16)
    5216:       c9                      leaveq 
    5217:       c3                      retq
Comment 47 Julian Seward 2010-10-08 17:02:07 UTC
>     51a3:       48 8d 45 90             lea    -0x70(%rbp),%rax
>     ..
>     .. no writes to %rax ..
>     ..
>     51db:       49 89 ef                mov    %rbp,%r15
>    51de:       48 89 c5                mov    %rax,%rbp
> 
> So __builtin_frame_address(0) gives us %rbp - 0x70
> whereas I think the correct CFA is %rbp + 16.

This is due to my mistake in defining VALGRIND_CFI_PROLOGUE,
referring to __builtin_frame_address(0) as %0
rather than %2.  With that fixed, the relevant fragment looks more
plausible but is still wrong:

[0x4c2bdce .. 0x4c2bdd3]: let cfa=oldBP+16 in RA=*(cfa+-8) SP=cfa+0
                                              BP=*(cfa+-16)
    7dce:       49 89 ef                mov    %rbp,%r15
    7dd1:       48 89 ed                mov    %rbp,%rbp

The CFI is correct to claim that cfa=oldBP+16, but gcc has given us
%rbp for the CFA when it should be %rbp+16.

Analysis of the combined assembly/annotations below shows that the
"let cfa=" values consistently state it to be the word above the
return address.  This is also consistent with the AMD64 ABI Draft 0.98
(September 27, 2006) Sec 6.2.5 "_Unwind_GetCFA", which states that the
CFA is "the value of %rsp at the call site in the previous frame".

However, the value computed by gcc for __canonical_frame_address(0)
appears to be wrong.  It's as if the CFI generator inside gcc knows
correctly that the CFA is at rbp+16 but the part of gcc that creates
the expression for __canonical_frame_address(0) ignores the +16 bit.

This is with vanilla gcc-4.5.1 (same with gcc-4.4.3).  If
__canonical_frame_address(0) is not going to give us reliable values
then this scheme will have to be devnulled.




0000000000007d45 <_vgwZU_libQtCoreZdsoZa__ZN6QMutexD1Ev>:

[0x4c2bd45 .. 0x4c2bd45]: let cfa=oldSP+8 in RA=*(cfa+-8) SP=cfa+0 BP=Same
    7d45:       55                      push   %rbp

[0x4c2bd46 .. 0x4c2bd48]: let cfa=oldSP+16 in RA=*(cfa+-8) SP=cfa+0 BP=Same
    7d46:       48 89 e5                mov    %rsp,%rbp

[0x4c2bd49 .. 0x4c2bdcd]: let cfa=oldBP+16 in RA=*(cfa+-8) SP=cfa+0 BP=*(cfa+-16)
    7d49:       41 57                   push   %r15
    7d4b:       48 c1 c7 03             rol    $0x3,%rdi
    7d4f:       48 c1 c7 0d             rol    $0xd,%rdi
    7d53:       48 c1 c7 3d             rol    $0x3d,%rdi
    7d57:       48 c1 c7 33             rol    $0x33,%rdi
    7d5b:       48 87 c9                xchg   %rcx,%rcx
    7d5e:       48 89 45 c8             mov    %rax,-0x38(%rbp)
    7d62:       48 8b 45 c8             mov    -0x38(%rbp),%rax
    7d66:       48 89 45 e0             mov    %rax,-0x20(%rbp)
    7d6a:       48 c7 45 90 04 01 47    movq   $0x48470104,-0x70(%rbp)
    7d71:       48 
    7d72:       48 89 7d 98             mov    %rdi,-0x68(%rbp)
    7d76:       48 c7 45 a0 00 00 00    movq   $0x0,-0x60(%rbp)
    7d7d:       00 
    7d7e:       48 c7 45 a8 00 00 00    movq   $0x0,-0x58(%rbp)
    7d85:       00 
    7d86:       48 c7 45 b0 00 00 00    movq   $0x0,-0x50(%rbp)
    7d8d:       00 
    7d8e:       48 c7 45 b8 00 00 00    movq   $0x0,-0x48(%rbp)
    7d95:       00 
    7d96:       48 8d 45 90             lea    -0x70(%rbp),%rax
    7d9a:       ba 00 00 00 00          mov    $0x0,%edx
    7d9f:       48 c1 c7 03             rol    $0x3,%rdi
    7da3:       48 c1 c7 0d             rol    $0xd,%rdi
    7da7:       48 c1 c7 3d             rol    $0x3d,%rdi
    7dab:       48 c1 c7 33             rol    $0x33,%rdi
    7daf:       48 87 db                xchg   %rbx,%rbx
    7db2:       48 89 55 c8             mov    %rdx,-0x38(%rbp)
    7db6:       48 8b 55 c8             mov    -0x38(%rbp),%rdx
    7dba:       48 8b 55 e0             mov    -0x20(%rbp),%rdx
    7dbe:       48 89 55 d0             mov    %rdx,-0x30(%rbp)
    7dc2:       48 8b 55 d0             mov    -0x30(%rbp),%rdx
    7dc6:       48 89 55 90             mov    %rdx,-0x70(%rbp)
    7dca:       48 89 7d 98             mov    %rdi,-0x68(%rbp)

[0x4c2bdce .. 0x4c2bdd3]: let cfa=oldBP+16 in RA=*(cfa+-8) SP=cfa+0 BP=*(cfa+-16)
    7dce:       49 89 ef                mov    %rbp,%r15
    7dd1:       48 89 ed                mov    %rbp,%rbp

[0x4c2bdd4 .. 0x4c2bdfe]: let cfa=oldBP+0 in RA=*(cfa+-8) SP=cfa+0 BP=*(cfa+-16)
    7dd4:       48 81 ec 80 00 00 00    sub    $0x80,%rsp
    7ddb:       48 8b 78 08             mov    0x8(%rax),%rdi

[0x4c2bdff .. 0x4c2be08]: let cfa=oldBP+16 in RA=*(cfa+-8) SP=cfa+0 BP=*(cfa+-16)
    7ddf:       48 8b 00                mov    (%rax),%rax
    7de2:       48 c1 c7 03             rol    $0x3,%rdi
    7de6:       48 c1 c7 0d             rol    $0xd,%rdi
    7dea:       48 c1 c7 3d             rol    $0x3d,%rdi
    7dee:       48 c1 c7 33             rol    $0x33,%rdi
    7df2:       48 87 d2                xchg   %rdx,%rdx
    7df5:       48 81 c4 80 00 00 00    add    $0x80,%rsp
    7dfc:       4c 89 fd                mov    %r15,%rbp
    7dff:       48 89 45 c8             mov    %rax,-0x38(%rbp)
    7e03:       48 8b 45 c8             mov    -0x38(%rbp),%rax
    7e07:       41 5f                   pop    %r15

[0x4c2be09 .. 0x4c2be09]: let cfa=oldBP+16 in RA=*(cfa+-8) SP=cfa+0 BP=*(cfa+-16)
    7e09:       c9                      leaveq 

[0x4c2be0a .. 0x4c2be0a]: let cfa=oldSP+8 in RA=*(cfa+-8) SP=cfa+0 BP=*(cfa+-16)
    7e0a:       c3                      retq
Comment 48 Jakub Jelinek 2010-10-08 17:07:49 UTC
Try __builtin_dwarf_cfa () instead of __builtin_frame_address (0) ?
Comment 49 Julian Seward 2010-10-08 18:53:46 UTC
(In reply to comment #48)
> Try __builtin_dwarf_cfa () instead of __builtin_frame_address (0) ?

Hmm, that does seem to help, yes.  Thanks.  It's also totally
undocumented (in the gcc manual).  It would be nice to see some
documentation, given that __builtin_dwarf_cfa() seems to do what
__builtin_frame_address(0) claims to do, but doesn't.

From googling it has been around at least since gcc-4.2 times, if
not earlier.  Correct?
Comment 50 Jakub Jelinek 2010-10-08 23:11:21 UTC
__builtin_dwarf_cfa () is supported by the compiler for more than 12 years, I bet even 2.95 had it and perhaps some of the latest egcs releases.
Comment 51 Christian Borntraeger 2010-10-12 12:58:41 UTC
I updated our s390x valgrind repository to use the same mechanism.
Florian or myself will post updated patches soon in  243404
Comment 52 Evgeniy Stepanov 2010-10-13 11:53:30 UTC
Btw, on Darwin gcc does not emit .cfi_startproc for a really stupid reason:
its configure check for CFI support in GAS fails because it depends on readelf, and there is no readelf there (by default, at least). I guess this can not be fixed or worked around in Valgrind.