Bug 487862 - memcheck does not believe that bytes on new pages are Defined by brk() system call
Summary: memcheck does not believe that bytes on new pages are Defined by brk() system...
Status: REPORTED
Alias: None
Product: valgrind
Classification: Developer tools
Component: memcheck (show other bugs)
Version: 3.23.0
Platform: Fedora RPMs Linux
: NOR normal
Target Milestone: ---
Assignee: Paul Floyd
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-06-01 04:05 UTC by John Reiser
Modified: 2024-06-10 10:12 UTC (History)
2 users (show)

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:


Attachments
assembly-language test program for brk() system call (1.11 KB, text/x-csrc)
2024-06-01 04:05 UTC, John Reiser
Details

Note You need to log in before you can comment on or make changes to this bug.
Description John Reiser 2024-06-01 04:05:28 UTC
Created attachment 170032 [details]
assembly-language test program for brk() system call

SUMMARY
On Linux, when the brk() system call adds new pages to the address space, then those pages are guaranteed to be all-zero, but memcheck thinks the bytes are Undefined.

STEPS TO REPRODUCE
1. Build the attached short assembly-language program brk-test.S  for x86_64.
The program finds the next page boundary at or above "the .brk",
then uses system call brk() to append 345 bytes to "the .brk".
Linux must add one entire page, and guarantees that the page is all-zero.
2. Run under valgrind(memcheck) and note the complaint about Uninit value.
3. Comment-out the "jmp no_clear" instruction, re-build and re-run.
4. Note that the complaint disapears.
5. Use a debugger to single-step the two programs, examining memory to convince yourself that the code to zero the added address space on a new page is effectively a no-op.

OBSERVED RESULT
Step 2: 
$ valgrind --track-origins=yes ./brk-test
==5737== Using Valgrind-3.23.0 and LibVEX; rerun with -h for copyright info
==5737== Command: ./brk-test
==5737== 
==5737== Syscall param exit(status) contains uninitialised byte(s)
==5737==    at 0x401051: ??? (in /home/jreiser/brk-test)
==5737==  Uninitialised value was created
==5737==    at 0x401039: ??? (brk-test.S:38)   ## the third brk() syscall


Step 4:  Note no complaint from memcheck
$ valgrind --track-origins=yes ./brk-test
==5768== Using Valgrind-3.23.0 and LibVEX; rerun with -h for copyright info
==5768== Command: ./brk-test
==5768== 
==5768== 
==5768== HEAP SUMMARY:



EXPECTED RESULT
No complaint from Step 2 (and thus no complaint in either case.)

SOFTWARE/OS VERSIONS
Windows: 
macOS: 
Linux/KDE Plasma:
$ uname -a
Linux fedora 6.8.10-300.fc40.x86_64 #1 SMP PREEMPT_DYNAMIC Fri May 17 21:20:54 UTC 2024 x86_64 GNU/Linux
$ rpm -q valgrind
valgrind-3.23.0-1.fc40.x86_64

(available in About System)
KDE Plasma Version: 
KDE Frameworks Version: 
Qt Version: 

ADDITIONAL INFORMATION
The Go-lang runtime _dl_early_allocate uses initialization code like this when -static binding.  Thus memcheck can generate many many false-positive complaints from many different tracebacks.
Comment 1 Paul Floyd 2024-06-02 07:57:02 UTC
brk() is fairly old, removed from posix accordingly to the Linux manpage. Increasingly I see it getting removed from platforms.

Nevertheless, I just downloaded a copy of SUSv1, and indeed it does say

The brk( ) and sbrk( ) functions are used to change the amount of space allocated for the calling
process. The change is made by resetting the process’ break value and allocating the appropriate
amount of space. The amount of allocated space increases as the break value increases. The
newly-allocated space is set to 0. However, if the application first decrements and then
increments the break value, the contents of the reallocated space are unspecified.
Comment 2 John Reiser 2024-06-06 20:37:16 UTC
(In reply to Paul Floyd from comment #1)
> brk() is fairly old, removed from posix accordingly to the Linux manpage.
> Increasingly I see it getting removed from platforms.

In this case "old" means good, well-documented, well-debugged, reliable.
Please name explicitly some platforms which are removing brk(), and their
typical usage environments.

The brk() system call has been present in most *nix-like systems since 1970,
which is more than 50 years.  brk() is present in Linux today, and Linux
has a strong policy and implementation record for backwards compatibility
of system calls; brk() will not disappear from Linux anytime soon.

brk() is particularly useful during process startup when a call to malloc()
might not yet be supported;.  A programming language runtime can
use brk() via an inline system call.  See _dl_early_allocate() in glibc, for example.

brk() and sbrk() enable an easy implementation of  mark() and release()
for Pascal run-time support.

brk() is quite handy for use by other in-process emulators, which must
deal with arbitrary mmap() anyway.  For instance, qemu relies on brk() and sbrk().

brk() is multi-thread safe and async-signal safe without requiring
any user-mode locks.  It can be used safely in a signal handler.
Thread managers may rely on brk() regardless of the implementation of malloc().
Comment 3 Paul Floyd 2024-06-07 06:25:59 UTC
(In reply to John Reiser from comment #2)
> (In reply to Paul Floyd from comment #1)
> > brk() is fairly old, removed from posix accordingly to the Linux manpage.
> > Increasingly I see it getting removed from platforms.
> 
> In this case "old" means good, well-documented, well-debugged, reliable.
> Please name explicitly some platforms which are removing brk(), and their
> typical usage environments.

The fist thing in the FreeBSD man page, in BOLD letters is

DESCRIPTION
     The brk() and sbrk() functions are legacy interfaces from before the
     advent of modern virtual memory management.  They are deprecated and not
     present on the arm64 or riscv architectures.  The mmap(2) interface
     should be used to allocate pages instead.

FreeBSD has (or had) two system calls, brk and sbrk. sbrk was recently removed from FreeBSD 15 (probably because it has just returned EOPNOTSUPP for as long as I can remember).

> brk() is multi-thread safe and async-signal safe without requiring

The FreeBSD man page says

     brk() and sbrk() are not thread-safe.

Some other juicy bits from SUSv1

"The behaviour of brk( ) and sbrk( ) is unspecified if an application also uses any other memory
functions (such as malloc ( ), mmap( ), free( )). Other functions may use these other memory
functions silently."

and

"It is unspecified whether the pointer returned by sbrk( ) is aligned suitably for any purpose."

So, it's not portable, not thread-safe, not safe to use with malloc and provides no alignment guarantees. "removed in POSIX.1-2001" twenty three years ago isn't what I'd call well documented.
Comment 4 Mark Wielaard 2024-06-07 11:58:45 UTC
(In reply to John Reiser from comment #0)
> On Linux, when the brk() system call adds new pages to the address space,
> then those pages are guaranteed to be all-zero, but memcheck thinks the
> bytes are Undefined.

I think this is the basic problem. It says if it adds "new pages" then those pages are guaranteed to be all-zero. But if the pages already existed and brk() just gives them back those aren't guaranteed to be all-zero but can contain whatever was there before.

Also note that the man7 manpage for brk() nor the glibc manual make any claims about the memory area being initialized:
https://man7.org/linux/man-pages/man2/brk.2.html
https://sourceware.org/glibc/manual/latest/html_node/Resizing-the-Data-Segment.html

So it seems reasonable for memcheck to assume the area exposed by brk() is undefined.
Comment 5 John Reiser 2024-06-07 14:57:08 UTC
(In reply to Mark Wielaard from comment #4)

> So it seems reasonable for memcheck to assume the area exposed by brk() is
> undefined.

That assumption is incorrect for new pages on Linux, and there are applications
 that rely on the actual behavior, which is that on new pages all bytes
are zero.  For instance: qemu relies on this to hold data that tracks threads.

>  But if the pages already existed and brk() just gives them back those
> aren't guaranteed to be all-zero but can contain whatever was there before.

On Linux this is only partially true.  If __cur_brk (the top of the sys_brk() region) varies within the same page, then the bytes retain their values. If __cur_brk decreases enough so that one or more whole pages are not contained in the allocated address space, then those pages are removed from the address space, and a subsequent sys_brk() which increases __cur_brk to include such addresses does assign new pages, and they are guaranteed to be zero.
If an increasing sys_brk() would overlap with a different VMA (Linux-internal Virtual Memory Area) that already is assigned for some other purpose, then that sys_brk() fails and returns an error value.
So for any successful sys_brk(), then new pages are guaranteed to be zero.  Memcheck certainly knows which pages are new, and should mark then as Defined.
Comment 6 Mark Wielaard 2024-06-07 15:33:55 UTC
BTW it seems this whole discussion is duplicated in the comments of memcheck/mc_main.c mc_post_clo_init ()

  // We assume that brk()/sbrk() does not initialise new memory.  ...

With as conclusion:

   // So we should arguably observe all this.  However:
   // - The current inaccuracy has caused maybe one complaint in seven years(?)
   // - Relying on the zeroed-ness of whole brk'd pages is pretty grotty... I
   //   doubt most programmers know the above information.
   // So I'm not terribly unhappy with marking it as undefined. --njn.

Which was written in 2009. So maybe times have changed since then. But I don't really remember any false positive report about this.
Comment 7 Paul Floyd 2024-06-07 15:39:52 UTC
As to implementing this, my idea is to add a new variable VG_(brk_high_tide) that starts off the same as VG_(brk_limit),

If ever VG_(brk_limit) is increased beyond VG_(brk_high_tide) then a new VG_(track_new_init_mem_brk) gets used (and the high tide updated). If VG_(brk_limit) gets increased but doesn't exceed VG_(brk_high_tide) (because VG_(brk_limit) got decreased in the meantime) then the existing VG_(track_new_mem_brk) gets used.

VG_(track_new_init_mem_brk) will use make_mem_defined_w_tid.

So instead of

      if (brk_new > brk_limit) {
         /* successfully grew the data segment */
         VG_TRACK( new_mem_brk, brk_limit,
                   ARG1-brk_limit, tid );
      }

it would be

      if (brk_new > brk_limit) {
         /* successfully grew the data segment */
         if (brk_new > brk_high_tide) {
            VG_TRACK( new_init_mem_brk, brk_limit,
                      ARG1-brk_limit, tid );
         } else {
            VG_TRACK( new_mem_brk, brk_limit,
                      ARG1-brk_limit, tid );
         }
      }

This doesn't model what is happening to physical memory (initialized pages being mapped) but I do think that it more closely represents what the guest should be seeing.
Comment 8 John Reiser 2024-06-09 19:58:47 UTC
(In reply to Mark Wielaard from comment #6)
> BTW it seems this whole discussion is duplicated in the comments of
> memcheck/mc_main.c mc_post_clo_init ()

>    // - The current inaccuracy has caused maybe one complaint in seven
> years(?)
>    // - Relying on the zeroed-ness of whole brk'd pages is pretty grotty... I
>    //   doubt most programmers know the above information.
>    // So I'm not terribly unhappy with marking it as undefined. --njn.
> 
> Which was written in 2009. So maybe times have changed since then. But I
> don't really remember any false positive report about this.

What has changed is very large apps such as qemu-user-* which depend on
dozens of "random" libraries, and the system builders/maintainers/administrators
who tired of their own version of "dll hell" due to dependency tracking,
and the users (including "systematic" users with scripts) who got tired of
multi-second initialization of all those shared libraries; all gained new appreciation
for -static builds of  ET_EXEC (no shared libraries at all!) which initialize in a flash
and have robust configuration (none!) for end users.  Disk space is cheap for many
such users; time is not. qemu-user has grown into a predictable evolution which
avoids nearly all "security" incidents which otherwise might argue for using shared libraries.

But then "corner case" users of qemu-user-* (etc), the users who just might find a bug or two,
discover that memcheck is much less useful.  Gdb and debuginfo have no trouble
identifying malloc/free/sbrk+sys_brk etc., but memcheck does not recognize them.
Re-building the app from source (and avoiding -static+ET_EXEC) takes hours of time
and almost 10 GB of disk space, uses 4 different build systems (make, cmake, meson, ninja)
and two runtime support systems (Go-lang and C-lang).

Being highly motivated for the particular case of qemu-mips*-static, I have developed
binary patches that zero the new region of {_dl_early_allocate, malloc} + {expanding sbrk/sys_brk}.
This is not as good as proper memcheck accounting, but it suits my purpose
of finding uninit local variables in what would otherwise be a blizzard of Uninit complaints
caused by memcheck not recognizing when to mark bytes from sys_brk as Defined.

Without the patches, memcheck finds
   ==375110== ERROR SUMMARY: 9646 errors from 125 contexts (suppressed: 0 from 0)
before the process aborts with
   ==375110== Warning: ignored attempt to set SIGRT32 handler in sigaction();
   ==375110==          the SIGRT32 signal is used internally by Valgrind
   fatal error: sigaction failed  [from Go-lang]
Memcheck should find some way to avoid relying on SIGRT32, perhaps using pipefs.

The histogram of memcheck Uninit complaints (before patching) begins:
    108 ==375110==    at 0x64E006: _dl_early_allocate (in /usr/bin/qemu-mipsel-static)
     16 ==375110==    at 0x63B5AB: brk (in /usr/bin/qemu-mipsel-static)
     15 ==375110==    at 0x5FDC36: __pthread_enable_asynccancel (in /usr/bin/qemu-mipsel-static)
     15 ==375110==    at 0x5FDC2B: __pthread_enable_asynccancel (in /usr/bin/qemu-mipsel-static)
     14 ==375110==    at 0x5FDCB9: __pthread_disable_asynccancel (in /usr/bin/qemu-mipsel-static)
     14 ==375110==    at 0x5FDCB1: __pthread_disable_asynccancel (in /usr/bin/qemu-mipsel-static)
     14 ==375110==    at 0x5FDC87: __pthread_disable_asynccancel (in /usr/bin/qemu-mipsel-static)
     10 ==375110==    at 0x61BA21: __strcmp_avx2 (in /usr/bin/qemu-mipsel-static)
      5 ==375110==    at 0x4E6CE6: mmap_unlock (mmap.c:39)
      5 ==375110==    at 0x4E6A7F: mmap_lock (mmap.c:32)
      3 ==375110==    at 0x606D85: _int_free (in /usr/bin/qemu-mipsel-static)

So far there is one uninit local:
==375110==  Uninitialised value was created by a stack allocation
==375110==    at 0x4808D0: process_string_cmd.constprop.0 (gdbstub.c:1368)
Comment 9 Mark Wielaard 2024-06-10 10:12:31 UTC
Hi John,

(In reply to John Reiser from comment #8)
> What has changed is very large apps such as qemu-user-* which depend on
> dozens of "random" libraries, and the system
> builders/maintainers/administrators
> who tired of their own version of "dll hell" due to dependency tracking,
> and the users (including "systematic" users with scripts) who got tired of
> multi-second initialization of all those shared libraries; all gained new
> appreciation
> for -static builds of  ET_EXEC (no shared libraries at all!) which
> initialize in a flash
> and have robust configuration (none!) for end users.  Disk space is cheap
> for many
> such users; time is not. qemu-user has grown into a predictable evolution
> which
> avoids nearly all "security" incidents which otherwise might argue for using
> shared libraries.

Interesting, but is this still about how to handle the brk syscall?
It sound like a separate issue that valgrind doesn't support static linked binaries
(because we cannot use the LD_PRELOAD trick to get our alternative malloc/mem/str
implementations loaded). There seem to be some alternate bugs about that.

https://bugs.kde.org/show_bug.cgi?id=198248
https://bugs.kde.org/show_bug.cgi?id=406434