Created attachment 51889 [details]
First stab at identifying ARM procedure return instructions.

Version: 3.6 SVN
OS: Linux

It seems (at least judging by the display in kcachegrind) to be identifying instructions such as LDMFD r13!,{r4,r5,r10,pc} as calls instead of returns, and instructions such as MOV pc,r14 as ordinary jumps. This generates huge numbers of cycles, which makes it very difficult to actually isolate regions of code for performance work, especially as the latter means it does not identify function boundaries correctly, so even their Self measurements are wrong.

The Cortex-A8 TRM Section 5.2.1 and the ARMv7-A ARM Section C9.10.1 define the following instructions as calls:

  BL immediate
  BLX
  HBL (ThumbEE state)
  HBLP (ThumbEE state)

and the following instructions as returns:

  BX r14
  MOV pc, r14
  LDM r13!, {... PC} (in ARM or Thumb state)
  LDM r9!, {... PC} (in ThumbEE state)
  LDR pc, [r13], #imm (in ARM or Thumb state)
  LDR pc, [r9], #imm (in ThumbEE state)

I took a stab at adding what I thought was the necessary code in the attached patch, but while this definitely changes the result, it does not fix the problem. In fact, after applying this patch, callgrind starts to interpret MOV pc,r14 as a call rather than a return in some places (and in other places does not). This is consistent with the behavior of LDM above, which looked like it was already being properly marked as a return in guest_arm_toIR.c, so either I've badly misunderstood things, or something else must be going on here.

Reproducible: Didn't try
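As an aside, the ARM-state returns listed above can be recognized purely from the instruction encoding. The following is a minimal illustrative sketch (not the actual guest_arm_toIR.c logic) of matching three of those forms with bit masks, ignoring the condition field in bits 31-28; the LDR-to-pc and Thumb/ThumbEE forms are deliberately omitted:

```c
#include <stdint.h>

/* Illustrative classifier for a few ARM-state return encodings.
 * Matches for any condition code; not the real VEX front-end code. */
static int looks_like_arm_return(uint32_t insn)
{
    uint32_t body = insn & 0x0FFFFFFFu;   /* strip the condition field */

    /* MOV pc, r14 : data-processing MOV, S=0, Rd = pc, Rm = r14 */
    if (body == 0x01A0F00Eu)
        return 1;

    /* BX r14 */
    if (body == 0x012FFF1Eu)
        return 1;

    /* LDM r13!, {..., pc} : load-multiple with writeback (W=1, L=1),
     * Rn = r13, pc (bit 15) in the register list; the P/U addressing
     * mode bits are treated as don't-cares here. */
    if ((insn & 0x0E3F8000u) == 0x083D8000u)
        return 1;

    return 0;   /* LDR pc, [r13], #imm and Thumb/ThumbEE forms omitted */
}
```

For example, LDMFD r13!,{r4,r5,r10,pc} encodes (with AL condition) as 0xE8BD8430 and matches the third pattern, while the same LDMFD without pc in the list does not.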
I'll take it; it's a front end problem, not a Callgrind problem. Josef, it would be cool to get this fixed, though, if only because perhaps that would show a way to make CG work better on PowerPC.
Okay, I think I've tracked down a good bit of the problem.

First, a simple change, in addition to the attached patch:

@@ -12885,7 +12911,7 @@
          stmt( IRStmt_Exit( unop(Iop_32to1, mkexpr(condT)),
                             jk, IRConst_U32(dst) ));
          irsb->next = mkU32(guest_R15_curr_instr_notENC + 4);
-         irsb->jumpkind = jk;
+         irsb->jumpkind = Ijk_Boring;
          dres.whatNext = Dis_StopHere;
       }
       DIP("b%s%s 0x%x %s\n", link ? "l" : "", nCC(INSN_COND),

This stops non-taken conditional jumps from sometimes incorrectly being treated as calls. However, this creates more problems. Consider the following trace:

  Call 04811008 -> 0481cfb0, SP bdf10b18

This is the entry point to oc_loop_filter_frag_rows_neon().

  None 0481cfc0 -> 0481cfc4, SP bdf10af0
  None 0481cfc8 -> 0481cfcc, SP bdf10af0
  None 0481cff4 -> 0481cff8, SP bdf10af0
  None 0481d000 -> 0481d004, SP bdf10af0

These jumps are not taken, which is fine.

  Conditional 0481d008 -> 0481cf3c, SP bdf10af0

This is supposed to be a call to loop_filter_v_neon(). It's properly marked in the VEX output (with my current patch and the above change):

  (arm) 0x481D008:  bl{gt} 0x481CF3C

  ------ IMark(0x481D008, 4) ------
  PUT(60) = 0x481D008:I32
  t4 = armg_calculate_condition[mcx=0x9]{0x3815127c}(Or32(GET:I32(64),0xC0:I32),GET:I32(68),GET:I32(72),GET:I32(76)):I32
  PUT(56) = Mux0X(32to8(t4),GET:I32(56),0x481D00C:I32)
  if (32to1(t4)) goto {Call} 0x481CF3C:I32
  goto {Boring} 0x481D00C:I32

Here's what happens on the return:

  sp (0xBDF10AF0) < top_ce->sp (0xBDF10B18)

sp here is the updated stack pointer from inside oc_loop_filter_frag_rows_neon(), as loop_filter_v_neon() does not use the stack at all. top_ce->sp is still from outside oc_loop_filter_frag_rows_neon(), because the call to loop_filter_v_neon() wasn't detected properly.

  Forcing jump to Ijk_Boring.
  JMP: 0x00012f3c[/home/user/theorarm/libtheoradec.so.1] to 0x0001300c[/home/user/theorarm/libtheoradec.so.1] (RET w/o CALL)!
  Pop on Jump!

Then the jump gets made boring because it failed the sp check.

  Converting jump to Ijk_Call.
  Call 0481cfac -> 0481d00c, SP bdf10b18

And after that, the return is converted into a call, which is what creates the cycle. It doesn't take much to envelop half the program in such cycles. There's an #if defined(VGA_ppc32) here that should probably be extended to include ARM to make this conversion more conservative, but that's not the real problem.

The real problem here (or at least one of them) is this code in setup_bbcc:

  if (passed == last_bb->cjmp_count) {
      ...
  }
  else
      jmpkind = JmpCond;

It looks like the code is designed such that conditional jumps can never be calls or returns. That might be true on x86, but it certainly isn't true on ARM.

At this point I'm getting out of my depth. AFAICT, callgrind only looks at jmpkind at the BB level, which means a conditional jump in the middle of a BB can't be marked as a call or return. I'm not sure how a conditional call is expected to be marked at all, as it has two jmpkinds: one if taken (Ijk_Call), and one if not (Ijk_Boring), and there is special code to convert a not-taken (Ijk_Boring) jump into a JmpNone for the last jump in the BB, which means it can't also have the other tag.
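The SP check that misfires in the trace above can be modeled with a small shadow call stack. Below is a hedged sketch (names and structure are hypothetical, greatly simplified from Callgrind's actual call-stack code): a call pushes the current SP; a return is only accepted if SP has not shrunk below the pushed value, otherwise it is demoted to a plain jump, which is exactly what happens when the conditional call to loop_filter_v_neon() goes undetected:

```c
#include <stdint.h>

#define MAX_DEPTH 64

/* Hypothetical shadow call-stack entry: SP at call time plus target. */
typedef struct { uint32_t sp; uint32_t target; } CallEntry;

static CallEntry call_stack[MAX_DEPTH];
static int depth = 0;

/* On a detected call: remember the stack pointer. */
static void on_call(uint32_t sp, uint32_t target)
{
    if (depth < MAX_DEPTH)
        call_stack[depth++] = (CallEntry){ sp, target };
}

/* On a detected return: pair it with the top-of-stack call only if the
 * SP check passes.  Returns 1 if paired, 0 if the jump must be demoted
 * ("RET w/o CALL" -> forced to Ijk_Boring). */
static int on_return(uint32_t sp)
{
    if (depth == 0 || sp < call_stack[depth - 1].sp)
        return 0;   /* SP below the caller's SP: cannot be this return */
    depth--;
    return 1;
}
```

Replaying the trace: only the outer call (SP 0xBDF10B18) is pushed, the inner conditional call is missed, so the inner return arrives with SP 0xBDF10AF0 < 0xBDF10B18 and fails the check; had the inner call been recorded with SP 0xBDF10AF0, the same return would pair cleanly.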
Thanks for the analysis. This wrong assumption should be fixed.

And as you found out, Callgrind checks any return for a matching call beforehand. If none is found, it assumes a jump, which is then converted to a call. I remember that this was necessary for the calls to the runtime linker to show up correctly (at least on x86).

Of course, the best thing would be to have regression tests on the detected call hierarchy of a simple example code. To be independent of compiler output/optimization, it is best written in assembler, but should be taken from typical compiler output. I never got around to doing that. With "--ct-verbose=1", it prints out the detected call sequence, which can be used for the regression test.

A big problem is how to check correct behavior with regard to the runtime linker. How would you do a regression test here?
Josef, do you have any suggestions how I could move forwards with this? I can put some time into trying to improve the situation, but am not sure how to proceed.
I am not sure this is completely solvable purely from VEX. There are heuristics in Callgrind based on properties of the client platform. Two I can think of right now are:
(1) the assumption that SP changes with CALL/RET instructions; e.g., unwinding stops based on the SP content;
(2) assumptions about how runtime linking is done on the given platform.

The best approach is to refactor the heuristics into platform-specific code. The first step is to come up with tests together with the expected outcome. I have access to a beagleboard, and will try to come up with such tests first, to understand the problematic cases.
> The best is to refactor the heuristics into platform specific code. More generally, is it possible to modularise the fn entry/exit detection code so as to be more standalone? Then perhaps it could monitor execution and feed a sequence of function entry/exit (or, shadow-stack update events) to the rest of the tool.
(related to previous comment) That kind of modularisation might make it possible to try out different entry/exit detectors, and/or have completely different implementations for different architectures.
VEX-side changes committed as r2216. Thanks.
> The real problem here (or at least one of them) is this code in setup_bbcc:
>
>   if (passed == last_bb->cjmp_count) {
>       ...
>   }
>   else
>       jmpkind = JmpCond;

After staring at this for hours and finally figuring out what setup_bbcc is doing, I agree.

> It looks like the code is designed such that conditional jumps can never be
> calls or returns. That might be true on x86, but certainly isn't
> true on ARM.

I find the terminology a bit confusing. It'd be easier to follow if they were called side exits, following traditional trace-based system terminology. But anyway, not a big deal.

> At this point I'm getting out of my depth. AFAICT, callgrind only looks at
> jmpkind on the BB level, which means a conditional jump in the middle of a BB
> can't be marked as a call or return.

Well, presumably we'd have to augment struct _BB so that it records the jmpkind of all the side exits, rather than simply assuming they are Ijk_Boring. But that's certainly doable. I will study the code further.
Looking at setup_bbcc more .. Josef, what is the reason for having JmpCond and JmpNone in addition to the Ijk_* values?
> It looks like the code is designed such that conditional jumps can
> never be calls or returns. That might be true on x86, but certainly
> isn't true on ARM.

Indeed, on ARM the question of whether a transfer is conditional or not is orthogonal to its call/return/boring classification: transfers can be any of the 6 combinations drawn from {cond, uncond} x {call, ret, boring}.
> > The real problem here (or at least one of them) is this code in setup_bbcc:
> >
> >   if (passed == last_bb->cjmp_count) {
> >       ...
> >   }
> >   else
> >       jmpkind = JmpCond;
>
> After staring at this for hours and finally figuring out what
> setup_bbcc is doing, I agree.

Sorry about making you so confused. The global "passed" variable gets written by instrumented code before any side exit and at the end of a BB. If "passed == last_bb->cjmp_count" holds true, this means that we passed through the previously executed BB without leaving at a side exit, and the first guess for the type of control flow change between the previous and the current BB is the jmpkind we got from VEX for the previously executed BB. If we left the previous BB via a side exit, I always assume that this has to be a conditional jump. This is only true for x86, and has to be fixed.

(In reply to comment #10)
> Looking at setup_bbcc more ..
>
> Josef, what is the reason for having JmpCond and JmpNone in addition
> to the Ijk_* values?

I want to detect different kinds of control flow changes and encode this in just one value. It will be passed from one invocation of setup_bbcc to the next in the same thread (ignoring signal handlers). The Ijk_* values are not enough for this.

The thing is that Callgrind not only tries to build a call graph, but also wants to collect statistics about (conditional) jumps (with "--collect-jumps=yes"). Thus, if I get Ijk_Boring from VEX, I still do not know whether this is a real jump or just passing through a BB to the next one (this is marked by using JmpNone instead of Ijk_Boring). Similarly for side exits, which I assume to be conditional jumps, using JmpCond. Again, for x86, VEX marks all side exits as Ijk_Boring. Hmm.
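The "passed" bookkeeping described above can be condensed into a toy model. This is only a sketch with hypothetical names (not the real setup_bbcc), showing the decision the comment walks through and the x86-only assumption at its core:

```c
/* Toy model of Callgrind's "passed" bookkeeping (hypothetical names). */

enum JmpKind { JmpNone, JmpCond, JmpCall, JmpReturn };

/* A basic block with cjmp_count conditional jumps and the jumpkind VEX
 * reported for its final exit. */
typedef struct { int cjmp_count; enum JmpKind bb_jmpkind; } BB;

/* Written by instrumented code: how many conditional jumps were passed
 * in the previously executed BB before it was left. */
static int passed;

/* First guess at the control-flow change between the previous BB and
 * the current one, as made at the start of the next setup_bbcc call. */
static enum JmpKind classify_entry(const BB *last_bb)
{
    if (passed == last_bb->cjmp_count)
        return last_bb->bb_jmpkind;  /* fell through to the final exit */
    else
        return JmpCond;              /* left via a side exit: always
                                        assumed conditional -- the
                                        x86-only assumption this bug
                                        report is about */
}
```

On ARM, a side exit can be a conditional BL, so the `JmpCond` branch is exactly where a conditional call gets misclassified.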
Instead of misusing the IRJumpKind enum, we should use a new enum here to allow for the 8 cases:

  {cond, uncond} x {call, ret, jump, boring}

with the meaning of "boring" here (in contrast to "jump") being "passing through to the next BB without an explicit guest control flow change".

Another question is what kind of data should be collected for conditional calls. For conditional jumps (and using "--collect-jumps=yes"), kcachegrind provides information such as "jumped x times on y executions". It would be nice to have such information also for calls, but this needs a change in the dump format and the GUI. I think it is enough to see the number of calls without the information of how often the condition for the conditional call did not hold true.

I'll try to come up with a patch for that.
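One possible shape for such an enum, sketched below with illustrative names (this is not the actual Callgrind patch): a base kind plus a conditional flag cleanly covers all 8 of the {cond, uncond} x {call, ret, jump, boring} combinations:

```c
/* Illustrative replacement for the misused IRJumpKind values
 * (hypothetical names, not the actual Callgrind patch). */
typedef enum {
    ClgJk_Boring,   /* passing through to the next BB, no explicit jump */
    ClgJk_Jump,     /* explicit jump that is neither call nor return */
    ClgJk_Call,
    ClgJk_Return
} ClgJumpBase;

typedef struct {
    ClgJumpBase base;
    int         conditional;   /* 0 = unconditional, 1 = conditional */
} ClgJumpKind;

/* e.g. the ARM side exit "bl{gt}" from the earlier trace would be
 * classified as a conditional call: */
static ClgJumpKind classify_blgt(void)
{
    return (ClgJumpKind){ ClgJk_Call, 1 };
}
```

Keeping the conditional bit separate from the base kind also fits the per-arch observation above that conditionality is orthogonal to the call/return/boring classification.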
Created attachment 65519 [details]
Proposed fix to allow Callgrind to handle conditional calls/returns

Attached is a patch that makes Callgrind allow conditional calls/returns for ARM. Does this work for you?
I'm seeing callgraphs where some functions have bogus callers. Could it be related to this bug?
(In reply to comment #14)
> I'm seeing callgraphs where some functions have bogus callers.
> Could it be related to this bug?

Probably. The fix discussed in comment 12 of this bug report is going in the direction of supporting ARM in callgrind, but the problem is tricky.