Bug 252091 - Callgrind on ARM does not detect function returns correctly
Summary: Callgrind on ARM does not detect function returns correctly
Status: REPORTED
Alias: None
Product: valgrind
Classification: Developer tools
Component: callgrind (show other bugs)
Version: 3.6 SVN
Platform: Compiled Sources Linux
: NOR normal
Target Milestone: ---
Assignee: Julian Seward
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-09-23 03:07 UTC by Timothy B. Terriberry
Modified: 2012-07-31 08:31 UTC (History)
3 users (show)

See Also:
Latest Commit:
Version Fixed In:


Attachments
First stab at identifying ARM procedure return instructions. (3.62 KB, patch)
2010-09-23 03:07 UTC, Timothy B. Terriberry
Details
Proposed fix to allow Callgrind to handle conditional calls/returns (13.50 KB, patch)
2011-11-11 11:21 UTC, Josef Weidendorfer
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Timothy B. Terriberry 2010-09-23 03:07:01 UTC
Created attachment 51889 [details]
First stab at identifying ARM procedure return instructions.

Version:           3.6 SVN
OS:                Linux

It seems (at least judging by the display in kcachegrind) to be identifying instructions such as
   LDMFD r13!,{r4,r5,r10,pc}
as calls instead of returns, and instructions such as
   MOV pc,r14
as ordinary jumps.
This generates huge numbers of cycles which make it very difficult to actually isolate regions of code for performance, especially as the latter means it does not identify function boundaries correctly, so even their Self measurements are wrong.

The Cortex A8 TRM Section 5.2.1 and the ARMv7-a ARM Section C9.10.1 define the following instructions as calls:

BL immediate
BLX
HBL (ThumbEE state)
HBLP (ThumbEE state)

and the following instructions as returns:

BX r14
MOV pc, r14
LDM r13!, {... PC} (in ARM or Thumb state)
LDM r9!, {... PC} (in ThumbEE state)
LDR pc, [r13], #imm (in ARM or Thumb state)
LDR pc, [r9], #imm (in ThumbEE state)

I took a stab at adding what I thought was the necessary code in the attached patch, but while this definitely changes the result, it does not fix the problem. In fact, after applying this patch, callgrind starts to interpret MOV PC,r14 as a call rather than a return in places (and other places does not). This is consistent with the behavior of LDM above, which looked like it was already being properly marked as a return in guest_arm_toIR.c, so either I've badly misunderstood things, or something else must be going on here.

Reproducible: Didn't try
Comment 1 Julian Seward 2010-09-23 03:22:03 UTC
I'll take it; it's a front end problem, not a Callgrind problem.

Josef, it would be cool to get this fixed, though, if only because
perhaps that would show a way to make CG work better on PowerPC.
Comment 2 Timothy B. Terriberry 2010-10-09 05:19:34 UTC
Okay, I think I've tracked down a good bit of the problem.

First, a simple change, in addition to the attached patch:
@@ -12885,7 +12911,7 @@
             stmt( IRStmt_Exit( unop(Iop_32to1, mkexpr(condT)),
                                jk, IRConst_U32(dst) ));
             irsb->next     = mkU32(guest_R15_curr_instr_notENC + 4);
-            irsb->jumpkind = jk;
+            irsb->jumpkind = Ijk_Boring;
             dres.whatNext  = Dis_StopHere;
          }
          DIP("b%s%s 0x%x %s\n", link ? "l" : "", nCC(INSN_COND),

This stops non-taken conditional jumps from sometimes incorrectly being treated as calls. However, this creates more problems.

Consider the following trace:
Call 04811008 -> 0481cfb0, SP bdf10b18
This is the entry point to oc_loop_filter_frag_rows_neon().

None 0481cfc0 -> 0481cfc4, SP bdf10af0
None 0481cfc8 -> 0481cfcc, SP bdf10af0
None 0481cff4 -> 0481cff8, SP bdf10af0
None 0481d000 -> 0481d004, SP bdf10af0
These jumps are not taken, which is fine.

Conditional 0481d008 -> 0481cf3c, SP bdf10af0
This is supposed to be a call to loop_filter_v_neon().
It's properly marked in the VEX output (with my current patch and the above change):

        (arm) 0x481D008:  bl{gt} 0x481CF3C 

              ------ IMark(0x481D008, 4) ------
              PUT(60) = 0x481D008:I32
              t4 = armg_calculate_condition[mcx=0x9]{0x3815127c}(Or32(GET:I32(64),0xC0:I32),GET:I32(68),GET:I32(72),GET:I32(76)):I32
              PUT(56) = Mux0X(32to8(t4),GET:I32(56),0x481D00C:I32)
              if (32to1(t4)) goto {Call} 0x481CF3C:I32
              goto {Boring} 0x481D00C:I32

Here's what happens on the return:

sp (0xBDF10AF0) < top_ce->sp (0xBDF10B18)
sp here is the updated stack pointer from inside oc_loop_filter_frag_rows_neon(), as loop_filter_v_neon() does not use the stack at all. top_ce->sp is still from outside oc_loop_filter_frag_rows_neon(), because the call to loop_filter_v_neon() wasn't detected properly.

Forcing jump to Ijk_Boring.
     JMP: 0x00012f3c[/home/user/theorarm/libtheoradec.so.1] to 0x0001300c[/home/user/theorarm/libtheoradec.so.1] (RET w/o CALL)!
     Pop on Jump!
Then the jump gets made boring because it failed the sp check.

Converting jump to Ijk_Call.
Call 0481cfac -> 0481d00c, SP bdf10b18
And after that, converted into a call instead of a return, which is what creates the cycle. It doesn't take much to envelop half the program in such cycles.

There's an
#if defined(VGA_ppc32)
here that should probably be extended to include ARM to make this conversion more conservative, but that's not the real problem.

The real problem here (or at least one of them) is this code in setup_bbcc:

      if (passed == last_bb->cjmp_count) {
...
      }
      else
          jmpkind = JmpCond;

It looks like the code is designed such that conditional jumps can never be calls or returns. That might be true on x86, but certainly isn't true on ARM. At this point I'm getting out of my depth. AFAICT, callgrind only looks at jmpkind on the BB level, which means a conditional jump in the middle of a BB can't be marked as a call or return. I'm not sure how a conditional call is expected to be marked at all, as it has two jmpkinds: one if taken (Ijk_Call), and one if not (Ijk_Boring), and there is special code to convert a not-taken (Ijk_Boring) jump into a JmpNone for the last jump in the BB, which means it can't also have the other tag.
Comment 3 Josef Weidendorfer 2010-10-11 12:06:39 UTC
Thanks for the analysis. This wrong assertion should be fixed.

And as you found out, Callgrind checks any return for a matching
call before. If that is not found, it assumes a jump, which
is then converted to a call.
I remember that this was necessary for the calls to the runtime
linker to show up correctly (at least on x86).

Of course, the best would be to have regressions tests on detected
call hierarchy of a simple example code. To be independent from the
compiler output/optimization, it best is written in assembler, but
should be taken from typical compiler output.
I never came around doing that stuff. With "--ct-verbose=1", it
prints out the detected call sequence, which can be used for the
regression test.

I big problem is how to check correct behavior with regard to the
runtime linker. How would you do a regression test here?
Comment 4 Julian Seward 2011-10-03 19:50:27 UTC
Josef, do you have any suggestions how I could move forwards
with this?  I can put some time into trying to improve the situation,
but am not sure how to proceed.
Comment 5 Josef Weidendorfer 2011-10-04 07:58:54 UTC
I am not sure this is completely solvable purely from VEX.
There are heuristics in callgrind based on properties of the
client platform. Two I can now think of are
(1) the assumption that SP is changing with CALL/RET instructions.
E.g., based on the SP content, unwinding stops.
(2) there are assumptions on how runtime linking for the given
platform is done.
The best is to refactor the heuristics into platform specific code.

The first step is to come up with tests together with expected outcome. 
I have access to a beagleboard, and  try to come up with such tests
first, to understand the problematic cases.
Comment 6 Julian Seward 2011-10-04 10:24:57 UTC
> The best is to refactor the heuristics into platform specific code.

More generally, is it possible to modularise the fn entry/exit detection
code so as to be more standalone?  Then perhaps it could monitor 
execution and feed a sequence of function entry/exit (or, shadow-stack
update events) to the rest of the tool.
Comment 7 Julian Seward 2011-10-04 10:26:19 UTC
(related to previous comment)  That kind of modularisation might make it
possible to try out different entry/exit detectors, and/or have completely
different implementations for different architectures.
Comment 8 Julian Seward 2011-10-14 15:49:05 UTC
VEX-side changes committed as r2216.  Thanks.
Comment 9 Julian Seward 2011-11-08 21:18:19 UTC
> The real problem here (or at least one of them) is this code in setup_bbcc:
> 
>       if (passed == last_bb->cjmp_count) {
> ...
>       }
>       else
>           jmpkind = JmpCond;

After staring at this for hours and finally figuring out what
setup_bbcc is doing, I agree.


> It looks like the code is designed such that conditional jumps can never be
> calls or returns. That might be true on x86, but certainly isn't
> true on ARM.

I find the terminology a bit confusing.  It'd be easier to follow if
they were called side exits, following traditional trace-based system
terminology.  But anyway, not a big deal.


> At this point I'm getting out of my depth. AFAICT, callgrind only looks at
> jmpkind on the BB level, which means a conditional jump in the middle of a BB
> can't be marked as a call or return.

Well, presumably we'd have to augment struct _BB so that it records
the jmpkind of all the side exits, rather than simply assuming they
are Ijk_Boring.  But that's certainly doable.


I will study the code further.
Comment 10 Julian Seward 2011-11-08 21:37:04 UTC
Looking at setup_bbcc more ..

Josef, what is the reason for having JmpCond and JmpNone in addition
to the Ijk_* values?
Comment 11 Julian Seward 2011-11-08 22:15:24 UTC
> It looks like the code is designed such that conditional jumps can
> never be calls or returns. That might be true on x86, but certainly
> isn't true on ARM.

Indeed, on ARM the question of whether a transfer is conditional or
not is orthogonal to its call/return/boring classification: transfers
can be any of the 6 combinations drawn from 

  {cond, uncond} x {call, ret, boring}
Comment 12 Josef Weidendorfer 2011-11-10 19:14:21 UTC
> > The real problem here (or at least one of them) is this code in setup_bbcc:
> > 
> >       if (passed == last_bb->cjmp_count) {
> > ...
> >       }
> >       else
> >           jmpkind = JmpCond;
> 
> After staring at this for hours and finally figuring out what
> setup_bbcc is doing, I agree.

Sorry about making you so confused.

The global "passed" variable gets written by instrumented code before any
side exit and at the end of a BB. If "passed == last_bb->cjmp_count" holds
true, this means that we passed trough the previously executed BB without
leaving at a side exit, and the first guess for the type of control flow
change between the previous and the current BB is the jmpkind we got from
VEX for the previously executed BB.

If we left the previous BB via a side exit, I always assume that this
has to be a conditional jump. This is only true for x86, and has to be
fixed.

(In reply to comment #10)
> Looking at setup_bbcc more ..
> 
> Josef, what is the reason for having JmpCond and JmpNone in addition
> to the Ijk_* values?

I want to detect different kinds of control flow changes, and encode this
in just one value. It will be passed from one invocation of setup_bbcc
to the next in the same thread (ignoring signal handlers).
The Ijk_* values are not enough for this.

The thing is that Callgrind not only tries to build a call graph, but
also wants to collect statistics about (conditional) jumps (with
"--collect-jumps=yes"). Thus, if I get Ijk_Boring from VEX, I still do
not know whether this is a real jump or just passing through a BB to the next one (this is marked by using JmpNone instead of Ijk_Boring).
Similar for side exits, which I assume to be conditional jumps, using
JmpCond. Again, for x86, VEX marks all side exits as Ijk_Boring.

Hmm. Instead of misusing the IRJumpKind enum, we should use a new
enum here to allow for the 8 cases:

  {cond, uncond} x {call, ret, jump, boring}

With the meaning of "boring" here (in contrast to "jump") being
"passing trough to next BB without explicit guest control flow change".

Another question is what kind of data should be collected for conditional
calls. For conditional jumps (and using "--collect-jumps=yes"),
kcachgrind provides information such as "jumped x times on y executions".
It would be nice to have such information also for calls, but this needs
a change in the dump format and the GUI. I think it is enough to see the
number of calls without the information of how often the condition for
the conditional call did not hold true.

I'll try to come up with a patch for that.
Comment 13 Josef Weidendorfer 2011-11-11 11:21:11 UTC
Created attachment 65519 [details]
Proposed fix to allow Callgrind to handle conditional calls/returns

Attached a patch that makes Callgrind allows conditional calls/returns
for ARM. Does this work for you?
Comment 14 Sergio Martins 2012-07-30 15:05:29 UTC
I'm seeing callgraphs where some functions have bogus callers.

Could it be related to this bug ?
Comment 15 Josef Weidendorfer 2012-07-31 08:31:55 UTC
(In reply to comment #14)

> I'm seeing callgraphs where some functions have bogus callers.
> Could it be related to this bug ?

Probably. The fix discussed in comment 12 of this bug report
is going in the direction of supporting ARM in callgrind, but the
problem is tricky.