Bug 460951

Summary: infinite loop in ARM-64 version of instrumentation with output VG_ calls at superblock and instruction level
Product: [Developer tools] valgrind Reporter: newhall
Component: vex Assignee: Julian Seward <jseward>
Status: REPORTED ---    
Severity: normal CC: mark, pjfloyd
Priority: NOR    
Version First Reported In: 3.19.0   
Target Milestone: ---   
Platform: unspecified   
OS: Linux   
Latest Commit: Version Fixed/Implemented In:
Sentry Crash Report:

Description newhall 2022-10-24 16:31:10 UTC
SUMMARY
I found a bug in the ARM64 version of valgrind (in both versions 3.16.1 and 3.19.0) that causes an infinite loop in some instrumentation code. The lackey tool is one example that triggers it. To reproduce with lackey on ARM64, run: valgrind --tool=lackey --trace-superblocks=yes ./a.out
I can reproduce it with every example C program I've tried, even the simplest (for example, int main(int argc, char *argv[]) { int x;  x = 6; return 0; } triggers it).

The run gets stuck repeating the same two superblocks over and over in an infinite loop. I suspect a bug in obtaining the correct return address when instrumenting at the granularity of superblocks (and of individual instructions). More specifically, the return address may be wrong when the instrumentation calls certain functions: VG_printf, other VG_ output functions in certain cases (described below in the ADDITIONAL INFORMATION section), VG_message_flush, and possibly others. The x86 versions of the lackey tool's superblock tracing do not have this bug.

STEPS TO REPRODUCE
1. compile with debugging: gcc -g prog.c       # (the other gcc command-line options I tried are listed in ADDITIONAL INFORMATION)
2. run lackey with trace-superblocks option:  valgrind --tool=lackey --trace-superblocks=yes ./a.out  

OBSERVED RESULT

infinite loop of the same two superblocks (always SB 04954ecc and SB 04954ed8 on my system) in VEX-instrumented code on ARM-64

EXPECTED RESULT

the instrumentation does not get into an infinite loop, and lackey traces through all of the program's superblocks until the program exits (a.out does not itself contain an infinite loop)

SOFTWARE/OS VERSIONS

Linux: Linux 5.15.69-rockchip64 #22.08.2 SMP PREEMPT Wed Sep 21 19:28:26 UTC 2022 aarch64 GNU/Linux
gcc: gcc (Debian 10.2.1-6) 
processor: ARM  v8.4
valgrind:  3.19.0 built from source (also occurs on debian installed version 3.16.1) 

ADDITIONAL INFORMATION

I've done some experimentation with lackey code, and this is what I've discovered about what more specifically seems to trigger the bug:
* Calling VG_printf in the instrumentation function always causes this problem.  
* Calls to VG_emit, VG_message, and VG_umsg work if the format string does not contain a '\n' character, but if it does contain '\n', the instrumentation gets into this infinite loop.
* Explicitly calling VG_message_flush triggers the infinite loop.
* I can trigger the bug with only one function call at each instrumentation point, so it is not caused by adding more than one call at a single instrumentation point. (For example, in lackey it is not the combination of add_one_SB_entered and trace_superblock that causes the bug; a call to just one of these, with a call to VG_printf added inside the instrumentation function, triggers it.)
* It is also not a problem with passing an Addr parameter to an instrumentation function (as in trace_superblock in lackey), so parameter passing in general is unlikely to be the problem.
* I've also discovered that calls to VG_lseek in instrumentation code fail on ARM (they work fine on x86).  This may be related or a different bug.

I'm trying to write a valgrind tool that instruments at the instruction-level.  My valgrind tool works fine for x86, but has this infinite loop issue on ARM-64 in a similar way to lackey's.  

I've also tried compiling with these different gcc flags, and all trigger the bug:
* gcc -g
* gcc -ggdb
* gcc -O0 -ggdb -fno-omit-frame-pointer
* gcc -Wall -ggdb -O0 -fno-asynchronous-unwind-tables -fno-dwarf2-cfi-asm -fno-pic -no-pie -fno-omit-frame-pointer

I don't know an easy way to debug valgrind-instrumented code at runtime, so I have not looked into this further, but I'd really like to use this functionality in a valgrind tool I'm building (again, it works fine on x86, but has this bug on ARM). Perhaps the problem involves a call optimization (possibly specific to system call code, for example write calling a flush function that could be tail-call optimized), with the valgrind ARM instrumentation code then not finding the right return address and getting into an infinite loop.

I'm hoping someone can fix this bug (my guess is it is somewhere in the VEX code for ARM, and something about return addresses in VG_ functions that make system calls, but this is a guess).

Thank you for your help!

system/SW version details:

$ cat /proc/cpuinfo
...
processor	: 5
BogoMIPS	: 48.00
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer	: 0x41
CPU architecture: 8
CPU variant	: 0x0
CPU part	: 0xd08
CPU revision	: 2

$ inst/bin/valgrind --version     # version I built from source
valgrind-3.19.0
$ valgrind --version    # system installed version as part of debian install
valgrind-3.16.1

$  uname -a
Linux 5.15.69-rockchip64 #22.08.2 SMP PREEMPT Wed Sep 21 19:28:26 UTC 2022 aarch64 GNU/Linux

$ gcc --version
gcc (Debian 10.2.1-6) 10.2.1 20210110
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Comment 1 newhall 2022-10-27 17:22:24 UTC
the bug is also in valgrind version 3.20.0
Comment 2 newhall 2023-06-23 14:59:15 UTC
still a bug in 3.21.0

run lackey on an executable with no infinite loop:
valgrind --tool=lackey --trace-superblocks=yes ./a.out

valgrind stuck in infinite loop of  same 2 superblocks:

SB 04954ecc
SB 04954ed8
...
forever
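
To flag this hang automatically (for example in a script that captures the lackey output with a timeout), one option is to check whether the tail of the SB trace has become periodic. The following is a minimal C sketch, not part of Valgrind: tail_period is a hypothetical helper, and it assumes the SB addresses have already been parsed into an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a Valgrind function).
   Returns the smallest period p in [1, max_period] such that the last
   p * min_repeats entries of trace repeat with period p, or 0 if the
   tail of the trace is not periodic for any such p. */
size_t tail_period(const uint64_t *trace, size_t n,
                   size_t max_period, size_t min_repeats)
{
    assert(trace != NULL || n == 0);
    assert(min_repeats >= 2);   /* a single repeat proves nothing */

    for (size_t p = 1; p <= max_period; p++) {
        size_t window = p * min_repeats;
        if (window > n)
            break;              /* not enough trace left to judge */

        int periodic = 1;
        /* Each entry in the window, except the first p, must equal
           the entry p positions before it. */
        for (size_t i = n - window + p; i < n; i++) {
            if (trace[i] != trace[i - p]) {
                periodic = 0;
                break;
            }
        }
        if (periodic)
            return p;
    }
    return 0;
}
```

With a trace that ends in 04954ed8, 04954ecc, 04954ed8, 04954ecc, ... the function reports a period of 2, which matches the two-superblock loop described here. The thresholds (max_period, min_repeats) are arbitrary choices; a legitimately tight loop in the traced program would also be reported, so a large min_repeats is advisable.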
Comment 3 Mark Wielaard 2023-08-22 12:17:10 UTC
Have you tried printing out the problematic blocks as described in README_DEVELOPERS:

Printing out problematic blocks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you want to print out a disassembly of a particular block that
causes a crash, do the following.

Try running with "--vex-guest-chase=no --trace-flags=10000000
--trace-notbelow=999999".  This should print one line for each block
translated, and that includes the address.

Then re-run with 999999 changed to the highest bb number shown.
This will print the one line per block, and also will print a
disassembly of the block in which the fault occurred.

See also valgrind --help-debug for a description of --trace-flags

    Tracing and profile control:
      --trace-flags and --profile-flags values (omit the middle space):
         1000 0000   show conversion into IR
         0100 0000   show after initial opt
         0010 0000   show after instrumentation
         0001 0000   show after second opt
         0000 1000   show after tree building
         0000 0100   show selecting insns
         0000 0010   show after reg-alloc
         0000 0001   show final assembly
         0000 0000   show summary profile only
        (Nb: you need --trace-notbelow and/or --trace-notabove
             with --trace-flags for full details)
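
The --trace-flags value is those eight bits written out as a string of 0s and 1s, so several stages can be requested at once. Below is a minimal C sketch of composing the string; the enum names and trace_flags_string are hypothetical helpers for illustration only and do not exist in Valgrind.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical names for the eight stages listed by --help-debug. */
enum trace_stage {
    TRACE_IR_CONVERSION   = 0x80, /* 1000 0000  show conversion into IR    */
    TRACE_INITIAL_OPT     = 0x40, /* 0100 0000  show after initial opt     */
    TRACE_INSTRUMENTATION = 0x20, /* 0010 0000  show after instrumentation */
    TRACE_SECOND_OPT      = 0x10, /* 0001 0000  show after second opt      */
    TRACE_TREE_BUILDING   = 0x08, /* 0000 1000  show after tree building   */
    TRACE_INSN_SELECTION  = 0x04, /* 0000 0100  show selecting insns       */
    TRACE_REG_ALLOC       = 0x02, /* 0000 0010  show after reg-alloc       */
    TRACE_FINAL_ASSEMBLY  = 0x01  /* 0000 0001  show final assembly        */
};

/* Writes the 8-digit value for --trace-flags into out (9 bytes needed,
   including the terminating NUL). Only the low 8 bits of mask are used. */
void trace_flags_string(unsigned mask, char out[9])
{
    assert(out != NULL);
    memset(out, '0', 8);
    for (int bit = 0; bit < 8; bit++) {
        if (mask & (1u << bit))
            out[7 - bit] = '1';   /* most significant bit goes first */
    }
    out[8] = '\0';
}
```

For example, TRACE_IR_CONVERSION | TRACE_INSTRUMENTATION yields 10100000. For this bug that combination may be a useful starting point, since it should show both the original IR and what lackey added to the looping blocks.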
Comment 4 newhall 2024-06-07 18:24:51 UTC
This is what we get when we run with those flags (it is in libc code called from _start):

valgrind -v --vex-guest-chase=no --trace-flags=10000000 --trace-notbelow=999999 --tool=lackey  --trace-superblocks=yes ./a.out
...
a lot of output
...
SB 0400dffc
SB 0400e008
SB 04014080
==== SB 1427 (evchecks 6793) [tid 1] 0x4866d30 (below main) /usr/lib/aarch64-linux-gnu/libc-2.31.so+0x20d30
SB 04866d30
==== SB 1428 (evchecks 6794) [tid 1] 0x4866d78 (below main)+72 /usr/lib/aarch64-linux-gnu/libc-2.31.so+0x20d78
SB 04866d78
==== SB 1429 (evchecks 6795) [tid 1] 0x4866d84 (below main)+84 /usr/lib/aarch64-linux-gnu/libc-2.31.so+0x20d84
SB 04866d84
==== SB 1430 (evchecks 6796) [tid 1] 0x487cc40 __cxa_atexit /usr/lib/aarch64-linux-gnu/libc-2.31.so+0x36c40
SB 0487cc40
==== SB 1431 (evchecks 6797) [tid 1] 0x487cb30 __internal_atexit /usr/lib/aarch64-linux-gnu/libc-2.31.so+0x36b30
SB 0487cb30
==== SB 1432 (evchecks 6798) [tid 1] 0x487cb48 __internal_atexit+24 /usr/lib/aarch64-linux-gnu/libc-2.31.so+0x36b48
SB 0487cb48
==== SB 1433 (evchecks 6799) [tid 1] 0x4954eb0 __aarch64_cas4_acq /usr/lib/aarch64-linux-gnu/libc-2.31.so+0x10eeb0
SB 04954eb0
==== SB 1434 (evchecks 6800) [tid 1] 0x4954ec8 __aarch64_cas4_acq+24 /usr/lib/aarch64-linux-gnu/libc-2.31.so+0x10eec8
SB 04954ec8
==== SB 1435 (evchecks 6801) [tid 1] 0x4954ed8 __aarch64_cas4_acq+40 /usr/lib/aarch64-linux-gnu/libc-2.31.so+0x10eed8
SB 04954ed8
==== SB 1436 (evchecks 6802) [tid 1] 0x4954ecc __aarch64_cas4_acq+28 /usr/lib/aarch64-linux-gnu/libc-2.31.so+0x10eecc
SB 04954ecc
SB 04954ed8
SB 04954ecc
SB 04954ed8
SB 04954ecc
SB 04954ed8
SB 04954ecc
SB 04954ed8
SB 04954ecc
...
continues on forever
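
To look at the looping blocks in a disassembler, it helps to have the object-relative offsets from the "==== SB" header lines. The following C sketch parses one such line; parse_sb_header and struct sb_header are hypothetical names, and the line layout is assumed from the output above rather than taken from any documented format.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record for one "==== SB" header line. */
struct sb_header {
    unsigned long long addr;    /* guest address of the superblock   */
    unsigned long long obj_off; /* offset inside the mapped object   */
};

/* Returns 1 on success, 0 if the line does not look like a header line
   of the form shown in the trace above (assumed layout). */
int parse_sb_header(const char *line, struct sb_header *out)
{
    assert(line != NULL && out != NULL);

    /* The address is the first field after "[tid N]". */
    unsigned long long addr = 0;
    if (sscanf(line, "==== SB %*u (evchecks %*u) [tid %*u] %llx", &addr) != 1)
        return 0;

    /* The object offset follows the last '+' and is written as 0x...;
       a symbol offset such as "+28" appears earlier in the line. */
    const char *plus = strrchr(line, '+');
    if (plus == NULL || strncmp(plus, "+0x", 3) != 0)
        return 0;

    out->addr = addr;
    out->obj_off = strtoull(plus + 1, NULL, 16);
    return 1;
}
```

Subtracting obj_off from addr gives the load base of the object. For the lines above this works out to 0x4846000 for every libc block (for example 0x4954ecc - 0x10eecc), so the two looping superblocks are at libc offsets 0x10eecc and 0x10eed8, inside __aarch64_cas4_acq.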