Created attachment 170032 [details] assembly-language test program for brk() system call SUMMARY On Linux, when the brk() system call adds new pages to the address space, then those pages are guaranteed to be all-zero, but memcheck thinks the bytes are Undefined. STEPS TO REPRODUCE 1. Build the attached short assembly-language program brk-test.S for x86_64. The program finds the next page boundary at or above "the .brk", then uses system call brk() to append 345 bytes to "the .brk". Linux must add one entire page, and guarantees that the page is all-zero. 2. Run under valgrind(memcheck) and note the complaint about Uninit value. 3. Comment-out the "jmp no_clear" instruction, re-build and re-run. 4. Note that the complaint disapears. 5. Use a debugger to single-step the two programs, examining memory to convince yourself that the code to zero the added address space on a new page is effectively a no-op. OBSERVED RESULT Step 2: $ valgrind --track-origins=yes ./brk-test ==5737== Using Valgrind-3.23.0 and LibVEX; rerun with -h for copyright info ==5737== Command: ./brk-test ==5737== ==5737== Syscall param exit(status) contains uninitialised byte(s) ==5737== at 0x401051: ??? (in /home/jreiser/brk-test) ==5737== Uninitialised value was created ==5737== at 0x401039: ??? (brk-test.S:38) ## the third brk() syscall Step 4: Note no complaint from memcheck $ valgrind --track-origins=yes ./brk-test ==5768== Using Valgrind-3.23.0 and LibVEX; rerun with -h for copyright info ==5768== Command: ./brk-test ==5768== ==5768== ==5768== HEAP SUMMARY: EXPECTED RESULT No complaint from Step 2 (and thus no complaint in either case.) SOFTWARE/OS VERSIONS Windows: macOS: Linux/KDE Plasma: $ uname -a Linux fedora 6.8.10-300.fc40.x86_64 #1 SMP PREEMPT_DYNAMIC Fri May 17 21:20:54 UTC 2024 x86_64 GNU/Linux $ rpm -q valgrind valgrind-3.23.0-1.fc40.x86_64 (available in About System) KDE Plasma Version: KDE Frameworks Version: Qt Version: ADDITIONAL INFORMATION The Go-lang runtime _dl_early_allocate uses initialization code like this when -static binding. Thus memcheck can generate many many false-positive complaints from many different tracebacks.
brk() is fairly old, removed from posix accordingly to the Linux manpage. Increasingly I see it getting removed from platforms. Nevertheless, I just downloaded a copy of SUSv1, and indeed it does say The brk( ) and sbrk( ) functions are used to change the amount of space allocated for the calling process. The change is made by resetting the process’ break value and allocating the appropriate amount of space. The amount of allocated space increases as the break value increases. The newly-allocated space is set to 0. However, if the application first decrements and then increments the break value, the contents of the reallocated space are unspecified.
(In reply to Paul Floyd from comment #1) > brk() is fairly old, removed from posix accordingly to the Linux manpage. > Increasingly I see it getting removed from platforms. In this case "old" means good, well-documented, well-debugged, reliable. Please name explicitly some platforms which are removing brk(), and their typical usage environments. The brk() system call has been present in most *nix-like systems since 1970, which is more than 50 years. brk() is present in Linux today, and Linux has a strong policy and implementation record for backwards compatibility of system calls; brk() will not disappear from Linux anytime soon. brk() is particularly useful during process startup when a call to malloc() might not yet be supported;. A programming language runtime can use brk() via an inline system call. See _dl_early_allocate() in glibc, for example. brk() and sbrk() enable an easy implementation of mark() and release() for Pascal run-time support. brk() is quite handy for use by other in-process emulators, which must deal with arbitrary mmap() anyway. For instance, qemu relies on brk() and sbrk(). brk() is multi-thread safe and async-signal safe without requiring any user-mode locks. It can be used safely in a signal handler. Thread managers may rely on brk() regardless of the implementation of malloc().
(In reply to John Reiser from comment #2) > (In reply to Paul Floyd from comment #1) > > brk() is fairly old, removed from posix accordingly to the Linux manpage. > > Increasingly I see it getting removed from platforms. > > In this case "old" means good, well-documented, well-debugged, reliable. > Please name explicitly some platforms which are removing brk(), and their > typical usage environments. The fist thing in the FreeBSD man page, in BOLD letters is DESCRIPTION The brk() and sbrk() functions are legacy interfaces from before the advent of modern virtual memory management. They are deprecated and not present on the arm64 or riscv architectures. The mmap(2) interface should be used to allocate pages instead. FreeBSD has (or had) two system calls, brk and sbrk. sbrk was recently removed from FreeBSD 15 (probably because it has just returned EOPNOTSUPP for as long as I can remember). > brk() is multi-thread safe and async-signal safe without requiring The FreeBSD man page says brk() and sbrk() are not thread-safe. Some other juicy bits from SUSv1 "The behaviour of brk( ) and sbrk( ) is unspecified if an application also uses any other memory functions (such as malloc ( ), mmap( ), free( )). Other functions may use these other memory functions silently." and "It is unspecified whether the pointer returned by sbrk( ) is aligned suitably for any purpose." So, it's not portable, not thread-safe, not safe to use with malloc and provides no alignment guarantees. "removed in POSIX.1-2001" twenty three years ago isn't what I'd call well documented.
(In reply to John Reiser from comment #0) > On Linux, when the brk() system call adds new pages to the address space, > then those pages are guaranteed to be all-zero, but memcheck thinks the > bytes are Undefined. I think this is the basic problem. It says if it adds "new pages" then those pages are guaranteed to be all-zero. But if the pages already existed and brk() just gives them back those aren't guaranteed to be all-zero but can contain whatever was there before. Also note that the man7 manpage for brk() nor the glibc manual make any claims about the memory area being initialized: https://man7.org/linux/man-pages/man2/brk.2.html https://sourceware.org/glibc/manual/latest/html_node/Resizing-the-Data-Segment.html So it seems reasonable for memcheck to assume the area exposed by brk() is undefined.
(In reply to Mark Wielaard from comment #4) > So it seems reasonable for memcheck to assume the area exposed by brk() is > undefined. That assumption is incorrect for new pages on Linux, and there are applications that rely on the actual behavior, which is that on new pages all bytes are zero. For instance: qemu relies on this to hold data that tracks threads. > But if the pages already existed and brk() just gives them back those > aren't guaranteed to be all-zero but can contain whatever was there before. On Linux this is only partially true. If __cur_brk (the top of the sys_brk() region) varies within the same page, then the bytes retain their values. If __cur_brk decreases enough so that one or more whole pages are not contained in the allocated address space, then those pages are removed from the address space, and a subsequent sys_brk() which increases __cur_brk to include such addresses does assign new pages, and they are guaranteed to be zero. If an increasing sys_brk() would overlap with a different VMA (Linux-internal Virtual Memory Area) that already is assigned for some other purpose, then that sys_brk() fails and returns an error value. So for any successful sys_brk(), then new pages are guaranteed to be zero. Memcheck certainly knows which pages are new, and should mark then as Defined.
BTW it seems this whole discussion is duplicated in the comments of memcheck/mc_main.c mc_post_clo_init () // We assume that brk()/sbrk() does not initialise new memory. ... With as conclusion: // So we should arguably observe all this. However: // - The current inaccuracy has caused maybe one complaint in seven years(?) // - Relying on the zeroed-ness of whole brk'd pages is pretty grotty... I // doubt most programmers know the above information. // So I'm not terribly unhappy with marking it as undefined. --njn. Which was written in 2009. So maybe times have changed since then. But I don't really remember any false positive report about this.
As to implementing this, my idea is to add a new variable VG_(brk_high_tide) that starts off the same as VG_(brk_limit), If ever VG_(brk_limit) is increased beyond VG_(brk_high_tide) then a new VG_(track_new_init_mem_brk) gets used (and the high tide updated). If VG_(brk_limit) gets increased but doesn't exceed VG_(brk_high_tide) (because VG_(brk_limit) got decreased in the meantime) then the existing VG_(track_new_mem_brk) gets used. VG_(track_new_init_mem_brk) will use make_mem_defined_w_tid. So instead of if (brk_new > brk_limit) { /* successfully grew the data segment */ VG_TRACK( new_mem_brk, brk_limit, ARG1-brk_limit, tid ); } it would be if (brk_new > brk_limit) { /* successfully grew the data segment */ if (brk_new > brk_high_tide) { VG_TRACK( new_init_mem_brk, brk_limit, ARG1-brk_limit, tid ); } else { VG_TRACK( new_mem_brk, brk_limit, ARG1-brk_limit, tid ); } } This doesn't model what is happening to physical memory (initialized pages being mapped) but I do think that it more closely represents what the guest should be seeing.
(In reply to Mark Wielaard from comment #6) > BTW it seems this whole discussion is duplicated in the comments of > memcheck/mc_main.c mc_post_clo_init () > // - The current inaccuracy has caused maybe one complaint in seven > years(?) > // - Relying on the zeroed-ness of whole brk'd pages is pretty grotty... I > // doubt most programmers know the above information. > // So I'm not terribly unhappy with marking it as undefined. --njn. > > Which was written in 2009. So maybe times have changed since then. But I > don't really remember any false positive report about this. What has changed is very large apps such as qemu-user-* which depend on dozens of "random" libraries, and the system builders/maintainers/administrators who tired of their own version of "dll hell" due to dependency tracking, and the users (including "systematic" users with scripts) who got tired of multi-second initialization of all those shared libraries; all gained new appreciation for -static builds of ET_EXEC (no shared libraries at all!) which initialize in a flash and have robust configuration (none!) for end users. Disk space is cheap for many such users; time is not. qemu-user has grown into a predictable evolution which avoids nearly all "security" incidents which otherwise might argue for using shared libraries. But then "corner case" users of qemu-user-* (etc), the users who just might find a bug or two, discover that memcheck is much less useful. Gdb and debuginfo have no trouble identifying malloc/free/sbrk+sys_brk etc., but memcheck does not recognize them. Re-building the app from source (and avoiding -static+ET_EXEC) takes hours of time and almost 10 GB of disk space, uses 4 different build systems (make, cmake, meson, ninja) and two runtime support systems (Go-lang and C-lang). Being highly motivated for the particular case of qemu-mips*-static, I have developed binary patches that zero the new region of {_dl_early_allocate, malloc} + {expanding sbrk/sys_brk}. This is not as good as proper memcheck accounting, but it suits my purpose of finding uninit local variables in what would otherwise be a blizzard of Uninit complaints caused by memcheck not recognizing when to mark bytes from sys_brk as Defined. Without the patches, memcheck finds ==375110== ERROR SUMMARY: 9646 errors from 125 contexts (suppressed: 0 from 0) before the process aborts with ==375110== Warning: ignored attempt to set SIGRT32 handler in sigaction(); ==375110== the SIGRT32 signal is used internally by Valgrind fatal error: sigaction failed [from Go-lang] Memcheck should find some way to avoid relying on SIGRT32, perhaps using pipefs. The histogram of memcheck Uninit complaints (before patching) begins: 108 ==375110== at 0x64E006: _dl_early_allocate (in /usr/bin/qemu-mipsel-static) 16 ==375110== at 0x63B5AB: brk (in /usr/bin/qemu-mipsel-static) 15 ==375110== at 0x5FDC36: __pthread_enable_asynccancel (in /usr/bin/qemu-mipsel-static) 15 ==375110== at 0x5FDC2B: __pthread_enable_asynccancel (in /usr/bin/qemu-mipsel-static) 14 ==375110== at 0x5FDCB9: __pthread_disable_asynccancel (in /usr/bin/qemu-mipsel-static) 14 ==375110== at 0x5FDCB1: __pthread_disable_asynccancel (in /usr/bin/qemu-mipsel-static) 14 ==375110== at 0x5FDC87: __pthread_disable_asynccancel (in /usr/bin/qemu-mipsel-static) 10 ==375110== at 0x61BA21: __strcmp_avx2 (in /usr/bin/qemu-mipsel-static) 5 ==375110== at 0x4E6CE6: mmap_unlock (mmap.c:39) 5 ==375110== at 0x4E6A7F: mmap_lock (mmap.c:32) 3 ==375110== at 0x606D85: _int_free (in /usr/bin/qemu-mipsel-static) So far there is one uninit local: ==375110== Uninitialised value was created by a stack allocation ==375110== at 0x4808D0: process_string_cmd.constprop.0 (gdbstub.c:1368)
Hi John, (In reply to John Reiser from comment #8) > What has changed is very large apps such as qemu-user-* which depend on > dozens of "random" libraries, and the system > builders/maintainers/administrators > who tired of their own version of "dll hell" due to dependency tracking, > and the users (including "systematic" users with scripts) who got tired of > multi-second initialization of all those shared libraries; all gained new > appreciation > for -static builds of ET_EXEC (no shared libraries at all!) which > initialize in a flash > and have robust configuration (none!) for end users. Disk space is cheap > for many > such users; time is not. qemu-user has grown into a predictable evolution > which > avoids nearly all "security" incidents which otherwise might argue for using > shared libraries. Interesting, but is this still about how to handle the brk syscall? It sound like a separate issue that valgrind doesn't support static linked binaries (because we cannot use the LD_PRELOAD trick to get our alternative malloc/mem/str implementations loaded). There seem to be some alternate bugs about that. https://bugs.kde.org/show_bug.cgi?id=198248 https://bugs.kde.org/show_bug.cgi?id=406434