SUMMARY

I found a bug in the ARM64 version of valgrind (in both versions 3.16.1 and 3.19.0) that causes an infinite loop in some instrumentation code. The lackey tool is one example that triggers it. The bug is reproducible with lackey on ARM64 by running:

    valgrind --tool=lackey --trace-superblocks=yes ./a.out

I can reproduce it on every example C program I've tried, even the simplest. For example, this triggers it:

    int main(int argc, char *argv[]) { int x; x = 6; return 0; }

The run gets stuck repeating the same two superblocks over and over again in an infinite loop. I suspect it is a bug with getting the correct return address when instrumenting at the granularity of superblocks (and of individual instructions). More specifically, the return address seems to be wrong when the instrumentation calls certain functions: VG_printf, other VG_ output functions in certain cases (described in the ADDITIONAL INFORMATION section), VG_message_flush, and possibly others.

This bug does not occur with lackey's superblock tracing on x86.

STEPS TO REPRODUCE

1. Compile with debugging: gcc -g prog.c (other gcc command-line options I tried are listed in ADDITIONAL INFORMATION)
2. Run lackey with the trace-superblocks option: valgrind --tool=lackey --trace-superblocks=yes ./a.out

OBSERVED RESULT

Infinite loop of the same two superblocks (always SB 04954ecc and SB 04954ed8 on my system) repeated in VEX instrumented code on ARM64.

EXPECTED RESULT

The instrumentation does not get into an infinite loop, and the program is traced through all its superblocks until completion (a.out itself does not contain an infinite loop).

SOFTWARE/OS VERSIONS

Linux: Linux 5.15.69-rockchip64 #22.08.2 SMP PREEMPT Wed Sep 21 19:28:26 UTC 2022 aarch64 GNU/Linux
gcc: gcc (Debian 10.2.1-6)
processor: ARM v8.4
valgrind: 3.19.0 built from source (also occurs on the Debian-installed version 3.16.1)

ADDITIONAL INFORMATION

I've done some experimentation with the lackey code, and this is what I've discovered about what more specifically seems to trigger the bug:

* Calling VG_printf in the instrumentation function always causes this problem.
* Calls to VG_emit, VG_message, and VG_umsg work if the format string does not contain a '\n' character. If the format string does contain '\n', the instrumentation gets into the infinite loop.
* Explicitly calling VG_message_flush triggers the infinite loop.
* I can trigger the bug with only one function call at each instrumentation point, so it is not caused by adding more than one call at a single instrumentation point. For example, it is not the combination of add_one_SB_entered and trace_superblock that causes the bug in lackey: a call to just one of these, plus an added call to VG_printf inside the instrumentation function, is enough.
* It is also not a problem with passing an Addr parameter to an instrumentation function (as in trace_superblock in lackey), so parameter passing in general is unlikely to be the cause.
* I've also discovered that calls to VG_lseek in instrumentation code fail on ARM (they work fine on x86). This may be related or a different bug.

I'm trying to write a valgrind tool that instruments at the instruction level. My tool works fine on x86, but has this infinite loop issue on ARM64 in a similar way to lackey.

I've also tried compiling with these different gcc flags, and all of them trigger the bug:

* gcc -g
* gcc -ggdb
* gcc -O0 -ggdb -fno-omit-frame-pointer
* gcc -Wall -ggdb -O0 -fno-asynchronous-unwind-tables -fno-dwarf2-cfi-asm -fno-pic -no-pie -fno-omit-frame-pointer

I don't know an easy way to debug valgrind instrumented code at runtime, so I have not looked into this further, but I'd really like to use this functionality in the valgrind tool I'm building (again, it works fine on x86, but has this bug on ARM).

Perhaps the problem is some call optimization (possibly specific to system call code, such as write calling a flush function that could be tail-call optimized), with the valgrind ARM instrumentation code then not finding the right return address and getting into an infinite loop. My guess is that the bug is somewhere in the VEX code for ARM and has to do with return addresses in VG_ functions that make system calls, but this is only a guess. I'm hoping someone can fix it.

Thank you for your help!

System/SW version details:

$ cat /proc/cpuinfo
...
processor : 5
BogoMIPS : 48.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd08
CPU revision : 2

$ inst/bin/valgrind --version   # version I built from source
valgrind-3.19.0

$ valgrind --version   # system version installed as part of Debian
valgrind-3.16.1

$ uname -a
Linux 5.15.69-rockchip64 #22.08.2 SMP PREEMPT Wed Sep 21 19:28:26 UTC 2022 aarch64 GNU/Linux

$ gcc --version
gcc (Debian 10.2.1-6) 10.2.1 20210110
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
The bug is also present in valgrind version 3.20.0.
Still a bug in 3.21.0.

Run lackey on an executable with no infinite loop:

    valgrind --tool=lackey --trace-superblocks=yes ./a.out

valgrind gets stuck in an infinite loop of the same 2 superblocks:

    SB 04954ecc
    SB 04954ed8
    ... forever
Have you tried printing out the problematic blocks as described in README_DEVELOPERS?

Printing out problematic blocks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you want to print out a disassembly of a particular block that
causes a crash, do the following.

Try running with "--vex-guest-chase=no --trace-flags=10000000
--trace-notbelow=999999". This should print one line for each block
translated, and that includes the address.

Then re-run with 999999 changed to the highest bb number shown.
This will print the one line per block, and also will print a
disassembly of the block in which the fault occurred.

See also valgrind --help-debug for a description of --trace-flags:

  Tracing and profile control:
    --trace-flags and --profile-flags values (omit the middle space):
       1000 0000   show conversion into IR
       0100 0000   show after initial opt
       0010 0000   show after instrumentation
       0001 0000   show after second opt
       0000 1000   show after tree building
       0000 0100   show selecting insns
       0000 0010   show after reg-alloc
       0000 0001   show final assembly
       0000 0000   show summary profile only
    (Nb: you need --trace-notbelow and/or --trace-notabove
     with --trace-flags for full details)
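For anyone combining several of these bits, here is a minimal C sketch that maps an 8-digit --trace-flags value onto the stage names from the table above. The helper name decode_trace_flags is hypothetical and is not part of Valgrind; it only mirrors the bit order shown by --help-debug (leftmost digit is "conversion into IR").

```c
#include <stddef.h>
#include <string.h>

/* Stage names in the order listed by `valgrind --help-debug`,
   leftmost digit first. */
static const char *const trace_stage[8] = {
    "conversion into IR",
    "after initial opt",
    "after instrumentation",
    "after second opt",
    "after tree building",
    "selecting insns",
    "after reg-alloc",
    "final assembly",
};

/* Hypothetical helper (not a Valgrind API): given an 8-digit binary
   string such as "10000000", store the selected stage names in `out`
   and return how many were selected. Returns -1 if the string is not
   exactly eight '0'/'1' characters. */
int decode_trace_flags(const char *flags, const char *out[8])
{
    if (flags == NULL || strlen(flags) != 8)
        return -1;

    int n = 0;
    for (int i = 0; i < 8; i++) {
        if (flags[i] == '1')
            out[n++] = trace_stage[i];
        else if (flags[i] != '0')
            return -1;
    }
    return n;
}
```

With this, "10000000" selects only the IR conversion stage, while "00100001" selects the post-instrumentation IR and the final assembly, which may be the more useful pair when the suspicion is about what the instrumentation adds.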
This is what we get when running with those flags (the loop is in libc code called from _start):

valgrind -v --vex-guest-chase=no --trace-flags=10000000 --trace-notbelow=999999 --tool=lackey --trace-superblocks=yes ./a.out

... a lot of output ...
SB 0400dffc
SB 0400e008
SB 04014080
==== SB 1427 (evchecks 6793) [tid 1] 0x4866d30 (below main) /usr/lib/aarch64-linux-gnu/libc-2.31.so+0x20d30
SB 04866d30
==== SB 1428 (evchecks 6794) [tid 1] 0x4866d78 (below main)+72 /usr/lib/aarch64-linux-gnu/libc-2.31.so+0x20d78
SB 04866d78
==== SB 1429 (evchecks 6795) [tid 1] 0x4866d84 (below main)+84 /usr/lib/aarch64-linux-gnu/libc-2.31.so+0x20d84
SB 04866d84
==== SB 1430 (evchecks 6796) [tid 1] 0x487cc40 __cxa_atexit /usr/lib/aarch64-linux-gnu/libc-2.31.so+0x36c40
SB 0487cc40
==== SB 1431 (evchecks 6797) [tid 1] 0x487cb30 __internal_atexit /usr/lib/aarch64-linux-gnu/libc-2.31.so+0x36b30
SB 0487cb30
==== SB 1432 (evchecks 6798) [tid 1] 0x487cb48 __internal_atexit+24 /usr/lib/aarch64-linux-gnu/libc-2.31.so+0x36b48
SB 0487cb48
==== SB 1433 (evchecks 6799) [tid 1] 0x4954eb0 __aarch64_cas4_acq /usr/lib/aarch64-linux-gnu/libc-2.31.so+0x10eeb0
SB 04954eb0
==== SB 1434 (evchecks 6800) [tid 1] 0x4954ec8 __aarch64_cas4_acq+24 /usr/lib/aarch64-linux-gnu/libc-2.31.so+0x10eec8
SB 04954ec8
==== SB 1435 (evchecks 6801) [tid 1] 0x4954ed8 __aarch64_cas4_acq+40 /usr/lib/aarch64-linux-gnu/libc-2.31.so+0x10eed8
SB 04954ed8
==== SB 1436 (evchecks 6802) [tid 1] 0x4954ecc __aarch64_cas4_acq+28 /usr/lib/aarch64-linux-gnu/libc-2.31.so+0x10eecc
SB 04954ecc
SB 04954ed8
SB 04954ecc
SB 04954ed8
SB 04954ecc
SB 04954ed8
SB 04954ecc
SB 04954ed8
SB 04954ecc
... continues on forever
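Since the trace never ends, it can help to cut it off automatically once the two-block alternation is established. The following is a rough C sketch, not part of lackey or Valgrind, with hypothetical helper names: parse_sb_line extracts the address from a lackey "SB <hex>" line (and ignores the "==== SB ..." translation lines), and period2_tail counts how many trailing entries repeat the entry two positions earlier.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: parse a lackey superblock trace line of the
   form "SB 04954ecc". Returns 1 and stores the address on success,
   0 for any other line (including "==== SB ..." translation lines). */
int parse_sb_line(const char *line, unsigned long *addr)
{
    return sscanf(line, "SB %lx", addr) == 1;
}

/* Hypothetical helper: count how many entries at the end of the trace
   are equal to the entry two positions before them. A large value
   means the trace has settled into an A, B, A, B, ... pattern (a
   single block repeating forever also counts). */
size_t period2_tail(const unsigned long *addrs, size_t n)
{
    size_t len = 0;
    size_t i = n;
    while (i > 2 && addrs[i - 1] == addrs[i - 3]) {
        len++;
        i--;
    }
    return len;
}
```

A driver could read lackey's output line by line, append each parsed address to a buffer, and stop the run once period2_tail exceeds some threshold; the threshold is a judgment call, since short legitimate loops also alternate between two blocks for a while.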