Bug 480405 - valgrind 3.22.0 "m_debuginfo/image.c:586 (set_CEnt): Assertion '!sr_isError(sr)' failed."
Summary: valgrind 3.22.0 "m_debuginfo/image.c:586 (set_CEnt): Assertion '!sr_isError(s...
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: memcheck (show other bugs)
Version: 3.22.0
Platform: Other Linux
: NOR normal
Target Milestone: ---
Assignee: Julian Seward
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-01-27 20:16 UTC by bugskde
Modified: 2024-01-31 06:29 UTC (History)
1 user (show)

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description bugskde 2024-01-27 20:16:09 UTC
SUMMARY
***
Valgrind seems to throw an error when checking my code. Error is as follows:

***


STEPS TO REPRODUCE
1. Download and build https://github.com/sbromberger/finddups-cpp
2. run `valgrind --leak-check=yes ./finddups`


OBSERVED RESULT

https://dpaste.org/OTor0:
```
seth@hydrogen:~/dev/cpp/finddups master $ valgrind --leak-check=yes ./finddups .
==6333== Memcheck, a memory error detector
==6333== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==6333== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==6333== Command: ./finddups .
==6333==
min size: 0, max size: 9223372036854775807, includes: [], excludes: []
sizemap size = 181
finished sizemap

valgrind: m_debuginfo/image.c:586 (set_CEnt): Assertion '!sr_isError(sr)' failed.

host stacktrace:
==6333== at 0x58065812: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==6333== by 0x58065867: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==6333== by 0x5806887B: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==6333== by 0x580C4DA5: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==6333== by 0x5813A08A: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==6333== by 0x5813E232: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==6333== by 0x58144CAC: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==6333== by 0x581B4456: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==6333== by 0x581B5CB8: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)

sched status:
running_tid=1

Thread 1: status = VgTs_Runnable syscall 9 (lwpid 6333)
==6333== at 0x4D0F0A2: __mmap64 (mmap64.c:58)
==6333== by 0x4D0F0A2: mmap (mmap64.c:46)
==6333== by 0x11292A: hashmap(std::unordered_map<unsigned long, std::vector<std::filesystem::__cxx11::directory_entry, std
::allocator<std::filesystem::__cxx11::directory_entry> >, std::hash<unsigned long>, std::equal_to<unsigned long>, std::alloca
tor<std::pair<unsigned long const, std::vector<std::filesystem::__cxx11::directory_entry, std::allocator<std::filesystem::__c
xx11::directory_entry> > > > > const&, Config const&) (in /home/seth/dev/cpp/finddups/finddups)
==6333== by 0x110073: main (in /home/seth/dev/cpp/finddups/finddups)
client stack range: [0x1FFEFFE000 0x1FFF000FFF] client SP: 0x1FFEFFFA58
valgrind stack range: [0x1002CB2000 0x1002DB1FFF] top usage: 13144 of 1048576

<Note and other boilerplate text elided>
```


EXPECTED RESULT
Something else (unknown as this is my first time running valgrind).

SOFTWARE/OS VERSIONS
Linux:  Void Linux `Void 6.6.11_1 x86_64 AuthenticAMD uptodate rFFF`

ADDITIONAL INFORMATION
valgrind on system utilities (`df` and `ls` tested) seems to work fine.
Comment 1 Paul Floyd 2024-01-28 07:43:26 UTC
First problem, your code didn't compile straight away. Here is a diff.

diff --git a/finddups.cpp b/finddups.cpp
index ec44c03..f22dba5 100644
--- a/finddups.cpp
+++ b/finddups.cpp
@@ -11,6 +11,7 @@
 #include <iostream>
 #include <sys/mman.h>
 #include <unordered_map>
+#include <unistd.h>
 
 constexpr unsigned long PAGESZ = 4096UL;
 
@@ -194,7 +195,7 @@ int main(int argc, char *argv[]) {
   }
   std::cerr << "Total time: "
             << std::chrono::duration_cast<std::chrono::milliseconds>(
-                   std::chrono::high_resolution_clock::now() - start)
+                   std::chrono::high_resolution_clock::now() - start).count()
             << "\n";
   return 0;
 }

There are also a few warnings

/home/paulf/scratch/finddups-cpp/finddups.cpp:49:31: warning: missing field 'includes' initializer [-Wmissing-field-initializers]
  return Config{min_sz, max_sz};
                              ^
/home/paulf/scratch/finddups-cpp/finddups.cpp:93:69: warning: unused parameter 'cfg' [-Wunused-parameter]
entry_hash_map hashmap(const entry_size_map &sizemap, const Config &cfg) {
                                                                    ^
/home/paulf/scratch/finddups-cpp/finddups.cpp:16:25: warning: unused variable 'PAGESZ' [-Wunused-const-variable]
constexpr unsigned long PAGESZ = 4096UL;
                        ^


Second problem, from your report

host stacktrace:
==6333== at 0x58065812: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)

It looks like you are using a crappy distro that has a package maintainer that believes that they know better than the Valgrind developers and has ignored our advice not to strip the Valgrind binaries. Your distro may have a "valgrind-dbg" or "valgrind-dbgsym" package or something like that which would contain the split debuginfo files.

Reproducing the error may depend where you are running the exe. When I run it on the project build directory I get no problem. When I run it as non-root on "/usr" I get an abort.  I had to make several changes to use the overloads that use std::error_code since you don't catch exceptions in your directory iterating code. However, in your case "sizemap size" is only 181 (for my /usr it is 24699, my build directory is 176). So I assume that you're running it in your project build directory.

On my system, no problem (FreeBSD 13.2, clang 14.0.5).

Now looking at the source code. There are 5 calls to set_Cent. I can't tell which one it is in your case because of your stripped binary. 

Can you try again, either with the Valgrind debuginfo package (if your distro has one and it works) or with Valgrind built from source?
Comment 2 bugskde 2024-01-28 09:41:39 UTC
Thanks for the tips. On my system it compiles fine (using the CMakeFile.txt with clang, also with `clang++ -o finddups -g --std=c++20 *.cpp` but I have the dependencies built already). I don't think you need to call a `count()` method on the difference: https://en.cppreference.com/w/cpp/chrono/high_resolution_clock/now. I also don't seem to need `unistd.h` on my system.

The warnings are innocuous, as they represent functionality that exists in other implementations (PAGESZ: I'm benchmarking the file accesses) and future options (ability to specify include/excludes).

As far as the version of valgrind, this is simply what void packages: https://github.com/void-linux/void-packages/tree/master/srcpkgs/valgrind) - I didn't do any customization. I'll look around for one with debug symbols. Failing that, I'll try to build from source.
Comment 3 bugskde 2024-01-28 09:45:36 UTC
I found the package with debug symbols. Here's the new output. Does this help?

```
valgrind: m_debuginfo/image.c:586 (set_CEnt): Assertion '!sr_isError(sr)' failed.

host stacktrace:
==21768==    at 0x58065812: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==21768==    by 0x58065867: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==21768==    by 0x5806887B: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==21768==    by 0x580C4DA5: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==21768==    by 0x5813A08A: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==21768==    by 0x5813E232: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==21768==    by 0x58144CAC: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==21768==    by 0x581B4456: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==21768==    by 0x581B5CB8: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable syscall 9 (lwpid 21768)
==21768==    at 0x4D5C0A2: __mmap64 (mmap64.c:58)
==21768==    by 0x4D5C0A2: mmap (mmap64.c:46)
==21768==    by 0x112127: hashmap(std::unordered_map<unsigned long, std::vector<std::filesystem::__cxx11::directory_entry, std::allocator<std::filesystem::__cxx11::directory_entry> >, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, std::vector<std::filesystem::__cxx11::directory_entry, std::allocator<std::filesystem::__cxx11::directory_entry> > > > > const&, Config const&) (finddups.cpp:111)
==21768==    by 0x112FF0: main (finddups.cpp:183)
client stack range: [0x1FFEFFE000 0x1FFF000FFF] client SP: 0x1FFEFFF818
valgrind stack range: [0x1002DB2000 0x1002EB1FFF] top usage: 17080 of 1048576
```
Comment 4 Paul Floyd 2024-01-28 16:46:10 UTC
No, it's a proper stacktrace with file names and line numbers that I need.

Can you build from source and retry?
Comment 5 bugskde 2024-01-28 18:23:35 UTC
(In reply to Paul Floyd from comment #4)
> No, it's a proper stacktrace with file names and line numbers that I need.
> 
> Can you build from source and retry?

The one built from source does not error. It ends with

```
=19375==End of process memory map.
==19375== 
==19375== HEAP SUMMARY:
==19375==     in use at exit: 0 bytes in 0 blocks
==19375==   total heap usage: 88 allocs, 88 frees, 2,951 bytes allocated
==19375== 
==19375== All heap blocks were freed -- no leaks are possible
==19375== 
==19375== For lists of detected and suppressed errors, rerun with: -s
==19375== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
```
Comment 6 bugskde 2024-01-28 18:28:45 UTC
I found the problem: the finddups code was incorrectly checking `open` for error using `0` - it needed to be `-1`. When this changed, valgrind (both the package version and the one built from source) started working.
Comment 7 bugskde 2024-01-29 19:22:09 UTC
I spoke too soon - it started crashing again. (I don't think it's my code). Here's the full output from a custom build of valgrind with the symbols:

```
valgrind: m_debuginfo/image.c:586 (set_CEnt): Assertion '!sr_isError(sr)' failed.                                            
                                                                                                                             
host stacktrace:                                                                                                             
==27068==    at 0x5804383A: show_sched_status_wrk (m_libcassert.c:407)                                                       
==27068==    by 0x58043957: report_and_quit (m_libcassert.c:478)                                                             
==27068==    by 0x58043AE7: vgPlain_assert_fail (m_libcassert.c:544)                                                         
==27068==    by 0x580C4B9F: set_CEnt (image.c:586)                                                                           
==27068==    by 0x580C4DD1: vgModuleLocal_img_from_fd (image.c:913)                                                          
==27068==    by 0x58083B9F: vgModuleLocal_check_elf_and_get_rw_loads (readelf.c:3820)                               
==27068==    by 0x58070609: vgPlain_di_notify_mmap (debuginfo.c:1393)                                               
==27068==    by 0x580A0031: vgModuleLocal_generic_PRE_sys_mmap (syswrap-generic.c:2466)                             
==27068==    by 0x580ABA6F: vgSysWrap_amd64_linux_sys_mmap_before (syswrap-amd64-linux.c:413)
==27068==    by 0x5809C6A9: vgPlain_client_syscall (syswrap-main.c:2240)                                            
==27068==    by 0x580981EA: handle_syscall (scheduler.c:1206)
==27068==    by 0x5809A1C6: vgPlain_scheduler (scheduler.c:1552)
==27068==    by 0x580E5279: thread_wrapper (syswrap-linux.c:102)
==27068==    by 0x580E5279: run_a_thread_NORETURN (syswrap-linux.c:155)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable syscall 9 (lwpid 27068)
==27068==    at 0x4D5C0A2: __mmap64 (mmap64.c:58)
==27068==    by 0x4D5C0A2: mmap (mmap64.c:46)
==27068==    by 0x112017: hashmap(std::unordered_map<unsigned long, std::vector<std::filesystem::__cxx11::directory_entry, st
d::allocator<std::filesystem::__cxx11::directory_entry> >, std::hash<unsigned long>, std::equal_to<unsigned long>, std::alloc
ator<std::pair<unsigned long const, std::vector<std::filesystem::__cxx11::directory_entry, std::allocator<std::filesystem::__
cxx11::directory_entry> > > > > const&, Config const&) (finddups.cpp:111)
==27068==    by 0x112F05: main (finddups.cpp:183)
client stack range: [0x1FFEFFE000 0x1FFF000FFF] client SP: 0x1FFEFFF6C8
valgrind stack range: [0x1002EF3000 0x1002FF2FFF] top usage: 12616 of 1048576
```
Comment 8 Paul Floyd 2024-01-30 05:04:11 UTC
I need to check the generic mmap wrapper to see if it handles the case of an mmap with a failed open fd.
Comment 9 bugskde 2024-01-30 05:10:23 UTC
(In reply to Paul Floyd from comment #8)
> I need to check the generic mmap wrapper to see if it handles the case of an
> mmap with a failed open fd.

I don't really know how any of this works, but in my tests I didn't have any mmap / open failures. Does valgrind simulate them?
Comment 10 Paul Floyd 2024-01-30 07:07:01 UTC
Loading of shared libraries is done with mmap. So whenever Valgrind sees an mmap it reads the file to see if is an ELF binary and to determine if it needs to read the DWARF debuginfo. The assert that you are seeing happens during this check (vgModuleLocal_check_elf_and_get_rw_loads). It looks like the mmap syscall is succeeding (which would also mean that the open succeeded).

The failure is here

      SysRes sr = VG_(pread)(img->source.fd, &ce->data[0], (Int)len, off);
      vg_assert(!sr_isError(sr))

I'll see if I can reproduce the problem on Linux.
Comment 11 Paul Floyd 2024-01-30 12:53:34 UTC
OK I've found the problem.

You're opening the files with O_DIRECT. But then Valgrind is just using a pread based on the size to see if it is an ELF load.

That's causing a EINVAL because of this

       EINVAL fd is attached to an object which is unsuitable for reading; or  the  file  was
              opened  with  the  O_DIRECT  flag, and either the address specified in buf, the
              value specified in count, or the file offset is not suitably aligned.

As a workaround, you can just remove the O_DIRECT.

In Valgrind, handling the memory restrictions looks complicated. Instead I'll look at 1) getting the fd flags with fcntl, 2) checking if O_DIRECT is set 3) turning it off if it is, 4) doing the pread and 5) restoring the flag if needed

As a reminder to myself, the code in question is

   if (img->source.is_local) {
      // Simple: just read it
      SysRes sr = VG_(pread)(img->source.fd, &ce->data[0], (Int)len, off);
      vg_assert(!sr_isError(sr));

in image.c
Comment 12 bugskde 2024-01-30 16:33:25 UTC
Removing O_DIRECT does indeed appear to fix the problem. Since there's no significant performance impact to removing it, I'll do that for now. Thank you for helping me track this down.

Also, thanks for the suggestion re: debug symbols. I've opened up an issue with my distribution: https://github.com/void-linux/void-packages/issues/48427
Comment 13 Paul Floyd 2024-01-30 21:26:42 UTC
commit 2c8ecd648204a8be01714fa6878373f57e073d25 (HEAD -> master, origin/users/paulf/try-bug480405, origin/master, origin/HEAD, bug480405)
Author: Paul Floyd <pjfloyd@wanadoo.fr>
Date:   Tue Jan 30 22:02:21 2024 +0100

    Bug 480405 - valgrind 3.22.0 "m_debuginfo/image.c:586 (set_CEnt): Assertion '!sr_isError(sr)' failed."
Comment 14 bugskde 2024-01-31 03:21:44 UTC
This is a silly question: but it sort of sounds like Valgrind looks at any mmap-opened file to see if it's a shared library (that is used by the binary?). What if the file I'm opening happens to be a shared library that is NOT used by the binary?
Comment 15 Paul Floyd 2024-01-31 06:29:00 UTC
(In reply to bugskde from comment #14)
> This is a silly question: but it sort of sounds like Valgrind looks at any
> mmap-opened file to see if it's a shared library (that is used by the
> binary?). What if the file I'm opening happens to be a shared library that
> is NOT used by the binary?

Loading an executable binary doesn't just mmap the entire file. ELF (and macho on macOS) files are split into segments and sections. Segments like "NOTE" (info related to the OS and architecture) don't need to be mapped. It's only the LOAD segments that matter. And even then we aren't interested in the read-only segment as it should never be the source of errors.

So there is complex code to determine when the read-write (global data) and read-execute (the program text) has been loaded. Once it's been determined that all RW and RX LOAD segments have been mmaped then we trigger reading any DWARF debuginfo. Platforms vary a lot in how they do this, and linkers keep coming up with tweaks to try to improve the memory layout and/or make the binary files smaller.