Newer glibc have alternate ld.so _ld_runtime_resolve functions. Namely _dl_runtime_resolve_xsave and _dl_runtime_resolve_xsave'2 Those exist to save the full AVX registers, for applications that somewhat use an 'extended' x86_64 ABI. The problem is that applications need to run with LD_BIND_NOW set for kcachegrind to have some more meaningful usage. I believe callgrind/fn.c should be extended to handle these new ld.so symbols. (BTW, it appears there is a minor typo in the comments of callgrind/fn.c, that tell about _ld_runtime_resolve, but check for _dl_runtime_resolve).
As an extra note, there are several different _dl_runtime_resolve{EXT} variants. I believe only one is used in an application, but there are multiple names. A possible fix is to use the logic in callgrind/fn.c so that when a function matching the initial _dl_runtime_resolve is called, use the existing code. A quick grep shows "EXT" as _xsave, _xsavec, _fxsave, _old, _new, _0, _1, _avx, _avx512, _sse, _sse_vex, _opt... On different versions of recent (2 or less years) glibc trees.
Created attachment 126591 [details] valgrind-3.14.0-rhbz1784541.patch This is a suggested patch or possible way to handle the issue. The user that had the issue now has it working on rhel7 using the attached patch and the command line option --dl-runtime-resolve-pattern=PATTERN where PATTERN is either: amd64-def amd64-xsavec amd64-xsave amd64-fxsave and the most common should be amd64-xsave, so, testing it before amd64-xsavec would make it work most times without the option. This is specific to rhel7 or centos7 and x86_64. The patch checks for /usr/lib/ld and /usr/lib64/ld patterns, otherwise it would not even check the patterns, then, checks patterns in the order above, but because it needs to use only one, the option was required.
What would be a good way to test the different variants? Do you have a simple example that shows the original issue? Can we add something to the testsuite?
Created attachment 128770 [details] sample_callgrind.tgz This a reproducer from a Red Hat user. The repro.sh script works with or without devtoolset installed. On a example run: $ ls *_out.out callgrind_cacheprofiling_7020_out.out callgrind_cacheprofiling_7021_out.out $ grep dl_runtime_resolve *.out callgrind_cacheprofiling_7020_out.out:cfn=(422) _dl_runtime_resolve_xsave callgrind_cacheprofiling_7020_out.out:fn=(423) _dl_runtime_resolve_xsave'2 only the sample output without LD_BIND_NOW set shows _dl_runtime_resolve*.
It should be tricky to figure what version is being used. Could look into the 'the elf_machine_runtime_setup' function in sysdeps/${ARCH}/dl-machine.h glibc source. Sample chunk from rhel7 x86_64 source: ... if (HAS_ARCH_FEATURE (AVX512F_Usable)) *(ElfW(Addr) *) (got + 2) = (ElfW(Addr)) &_dl_runtime_profile_avx512; else if (HAS_ARCH_FEATURE (AVX_Usable)) *(ElfW(Addr) *) (got + 2) = (ElfW(Addr)) &_dl_runtime_profile_avx; else *(ElfW(Addr) *) (got + 2) = (ElfW(Addr)) &_dl_runtime_profile_sse; ... because of check for cpu features, a reproducer environment to validate all entries, or experiment with automatic detection might need custom VMs.
Created attachment 130439 [details] Patch to search/check all possible dl_runtime_resolve variants I tweaked your patch a little so we detect all variants, all variants are then, when detected, squashed to the same name ("_dl_runtime_resolve"). This way we don't need the user to supply a variant name on the command line. Does this work for you?
Appears to work for me. I will confirm with the person that pointed out the problem if it works on their environment, but unlikely to not work.
(In reply to Mark Wielaard from comment #6) > Created attachment 130439 [details] > Patch to search/check all possible dl_runtime_resolve variants So this approach seems to work as intended and I would like to push it to git. Any comments/reviews before I do that?
commit 86277041a296d3ca34d938b347c246ff4485c3a0 Author: Mark Wielaard <mark@klomp.org> Date: Mon Jul 27 22:43:28 2020 +0200 Incorrect call-graph tracking due to new _dl_runtime_resolve_xsave* Newer glibc have alternate ld.so _ld_runtime_resolve functions. Namely _dl_runtime_resolve_xsave and _dl_runtime_resolve_xsavec This patch recognizes the xsave, xsvec and fxsave variants and changes callgrind so that any variant counts as _dl_runtime_resolve. Original patch by paulo.cesar.pereira.de.andrade@gmail.com https://bugs.kde.org/show_bug.cgi?id=415293
Original Callgrind author here... Sorry for not checking my old email address for open-source projects obviously for almost 1 1/2 years... just see this now, as I was checking out callgrind on recent Ubuntu and found it not working any more (partly due to this issue). First: thanks for this very elaborated patch! Two remarks: - there is no "_dl_runtime_resolve_xsave'2" - the ending "'2" is generated by callgrind notifying "recursion level 2" - ie it's the same function, just found for the 2nd time on the call stack. As by default the recursion level is max'ed at 2, it probably includes all higher levels - the complex "check_code" function really only should be required when the symbols are stripped from the runtime linker, as last resort to still be able to do the special handling... Trying to make this work on more architectures and recent glibc's, my suggestion is to add this patch for callgrind/fn.c: - if (VG_(strcmp)(fn->name, "_dl_runtime_resolve")==0) { + if (VG_(strncmp)(fn->name, "_dl_runtime_resolve", 19)==0) { While the heuristic that every function symbol starting with the prefix "_dl_runtime_resolve" as being an entry point into the runtime linker for resolving a function address may be a bit rough, I do not think this prefix is used often in other source code for anything else. The worst case is a slightly misleading call graph only visible in a very specific situation: if the wrongly-detected function does a tail call (ie instead of returning, jumping to another function), it will be shown as 2 calls in a row from the original caller. The whole issue to solve here is that the call chain A -(call)-> B -(tail call)-> C (with B being "_dl_runtime_resolve*") should be interpreted as A calling B, returning to A, calling C. Why? Because "_dl_runtime_resolve*" is some kind of dispatcher function, with a large number of calls routed through it. And this scenario is bad for profile analysis when only having a single caller as context. The replacement shows (1) better information, (2) reduces the probability of a recursive function cycle - which obscures profiling results, and (3) is less confusing to users not aware of the runtime linker. Note: the patch applied as fix can stay, as it allows for a stripped runtime linker.