Our nightly tester runs a few thousand testcases, and usually this all goes fine. However, some rare condition can seemingly trigger the following assert:

==10071== Memcheck, a memory error detector
==10071== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==10071== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==10071== Command: /data/schuetto/auto_regtesting/regtests/cp2k/exe/local_valgrind/cp2k.sdbg O-B97-q6.inp
==10071==
blockSane: fail -- redzone-hi

valgrind: m_mallocfree.c:2042 (vgPlain_arena_free): Assertion 'blockSane(a, b)' failed.

host stacktrace:
==10071==    at 0x38083F48: show_sched_status_wrk (m_libcassert.c:343)
==10071==    by 0x38084064: report_and_quit (m_libcassert.c:415)
==10071==    by 0x380841F1: vgPlain_assert_fail (m_libcassert.c:481)
==10071==    by 0x380925E6: vgPlain_arena_free (m_mallocfree.c:2042)
==10071==    by 0x3811B51E: vgModuleLocal_img_done (image.c:778)
==10071==    by 0x380BAFF0: vgModuleLocal_read_elf_debug_info (readelf.c:3027)
==10071==    by 0x380B343A: di_notify_ACHIEVE_ACCEPT_STATE (debuginfo.c:749)
==10071==    by 0x380B343A: vgPlain_di_notify_mmap (debuginfo.c:1067)
==10071==    by 0x380D963D: vgModuleLocal_generic_PRE_sys_mmap (syswrap-generic.c:2367)
==10071==    by 0x3810D6A1: vgSysWrap_amd64_linux_sys_mmap_before (syswrap-amd64-linux.c:637)
==10071==    by 0x380D60A4: vgPlain_client_syscall (syswrap-main.c:1905)
==10071==    by 0x380D2B9A: handle_syscall (scheduler.c:1118)
==10071==    by 0x380D424E: vgPlain_scheduler (scheduler.c:1435)
==10071==    by 0x380E37B6: thread_wrapper (syswrap-linux.c:102)
==10071==    by 0x380E37B6: run_a_thread_NORETURN (syswrap-linux.c:155)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable (lwpid 10071)
==10071==    at 0x3232A1761A: mmap (in /lib64/ld-2.12.so)
==10071==    by 0x3232A076B9: _dl_map_object_from_fd (in /lib64/ld-2.12.so)
==10071==    by 0x3232A08399: _dl_map_object (in /lib64/ld-2.12.so)
==10071==    by 0x3232A0C3A1: openaux (in /lib64/ld-2.12.so)
==10071==    by 0x3232A0E285: _dl_catch_error (in /lib64/ld-2.12.so)
==10071==    by 0x3232A0CA84: _dl_map_object_deps (in /lib64/ld-2.12.so)
==10071==    by 0x3232A0330F: dl_main (in /lib64/ld-2.12.so)
==10071==    by 0x3232A160AD: _dl_sysdep_start (in /lib64/ld-2.12.so)
==10071==    by 0x3232A014A3: _dl_start (in /lib64/ld-2.12.so)
==10071==    by 0x3232A00B07: ??? (in /lib64/ld-2.12.so)
==10071==    by 0x1: ???
==10071==    by 0xFFF000A32: ???
==10071==    by 0xFFF000A7C: ???

Note: see also the FAQ in the source distribution.

I have attempted to reproduce this (running exactly the same commands and binary, etc.), but it didn't fail. The error appears to happen before our executable is really running, so maybe something goes wrong at startup?

Reproducible: Couldn't Reproduce
The "Assertion 'blockSane(a, b)' failed" might indicate a bug in Valgrind itself (a buffer overrun while reading the debug information of an mmap-ed library?). However, without a small reproducer and/or more details, it is unlikely much can be done.

Here are a few things you could try, easiest first :). If this is a buffer overrun in Valgrind, the 2nd action is most likely to find the problem.

* Run with -v -v -v -d -d -d and see which library is loaded just before the corruption. This might give a hint about how to reproduce the problem and make a small test case (see the sketch after this list).
* Recompile valgrind after having uncommented
    // #define DEBUG_MALLOC
  in m_mallocfree.c, and rerun your test case.
* Compile valgrind as an 'inner' valgrind, and then run valgrind <your test case> under valgrind itself (see the section on self-hosting in README_DEVELOPERS). This might detect buffer overruns in valgrind's heap-allocated blocks.
* Run with --vgdb-stop-at=valgrindabexit till the problem reproduces. You can then attach with gdb and debug valgrind itself (but that will not be easy, and moreover it will be after the corruption has happened).
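A minimal sketch of the first action, assuming the failing binary and input file from the original report (the log file name is an illustrative choice):

  # Run the failing test with full verbosity and core debug output; the last
  # "Reading syms from ..." line before the assert names the suspect library.
  valgrind -v -v -v -d -d -d \
      /data/schuetto/auto_regtesting/regtests/cp2k/exe/local_valgrind/cp2k.sdbg \
      O-B97-q6.inp 2> valgrind-vvvddd.log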
(In reply to Philippe Waroquiers from comment #1)
> in Valgrind, the 2nd action is most likely to find the problem.

The 3rd action (self-hosting) is in fact the most likely to detect the buffer overrun.
Just happened again, but it is really rare (this is a 12-core server running valgrind +-12h a day... and this seems to happen every +-10 days). Are any of the suggestions mentioned above possible without runtime overhead and excessive I/O?

==25277== Memcheck, a memory error detector
==25277== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==25277== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==25277== Command: /data/schuetto/auto_regtesting/regtests/cp2k/exe/local_valgrind/cp2k.pdbg Pa.inp
==25277==
blockSane: fail -- redzone-hi

valgrind: m_mallocfree.c:2042 (vgPlain_arena_free): Assertion 'blockSane(a, b)' failed.
I'm experiencing exactly the same behavior, but the interval is +-4 days and I'm running multiple servers 24h.

==247219== Memcheck, a memory error detector
==247219== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==247219== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==247219== Command: some test.elf
==247219== Parent PID: 247218
==247219==
blockSane: fail -- redzone-hi

valgrind: m_mallocfree.c:2047 (vgPlain_arena_free): Assertion 'blockSane(a, b)' failed.
(In reply to Joost VandeVondele from comment #3)
> Just happened again, but it is really rare (this is a 12-core server
> running valgrind +-12h a day... and this seems to happen every +-10 days).
> Are any of the suggestions mentioned above possible without runtime
> overhead and excessive I/O?

Assuming you know which executable/test causes the bug, the first thing to try is 'self-hosting': run your test executable under a valgrind that itself runs under valgrind.

And/or run valgrind in a loop on the executable that gave the problem, with the options
  -v -v -v -d -d -d --vgdb-stop-at=valgrindabexit
A failing run will then stop and let you examine valgrind's debug output (see the sketch below).

The above will for sure consume CPU, but you can sleep during that time :)
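A minimal sketch of such a loop, assuming the binary and input of the failing run above (the log naming scheme is illustrative):

  # Re-run the suspect test until valgrind hits the assert; with
  # --vgdb-stop-at=valgrindabexit the failing run pauses and waits for gdb,
  # so the iteration that hangs is the one to examine.
  i=0
  while true; do
      i=$((i+1))
      valgrind -v -v -v -d -d -d --vgdb-stop-at=valgrindabexit \
          /data/schuetto/auto_regtesting/regtests/cp2k/exe/local_valgrind/cp2k.pdbg \
          Pa.inp > run-$i.log 2>&1
  done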
(In reply to Kim Rosberg from comment #4)
> I'm experiencing exactly the same behavior, but the interval is +-4 days
> and I'm running multiple servers 24h.
>
> ==247219== Memcheck, a memory error detector
> ==247219== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
> ==247219== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
> ==247219== Command: some test.elf
> ==247219== Parent PID: 247218
> ==247219==
> blockSane: fail -- redzone-hi
>
> valgrind: m_mallocfree.c:2047 (vgPlain_arena_free): Assertion 'blockSane(a,
> b)' failed.

The above line number is strange: there is no assertion at line 2047; there is one at line 2042.

It would be good if you could (both) indicate which distribution and libraries your executables are using: this might give a hint about the origin.

I am now re-running all valgrind tests under itself on an amd64 Debian 7, just in case.
I'm sorry, the assertion was at line 2047 in Valgrind 3.10.0 and at line 2042 in 3.11.0 (I copy-pasted and changed only the version number in the error output). I'm mainly running on Red Hat 6 with a custom-built Valgrind (almost default config). I cannot give you information about the libraries.
The failures are observed on RedHatEnterpriseServer Release 6.7.

Over the weekend, I have been running valgrind with, as an argument, essentially just a start of the relevant binary (having it print the version number). With >200000 runs (10s each) I had no failure. This was on a different machine with Red Hat 7.2. I'll try something similar on the other machine, but the failure is seemingly not so easy to trigger.

Dynamic libraries in my case are few, and standard, I suppose:

> ldd /data/vjoost/clean/cp2k/cp2k/exe/local_valgrind/cp2k.sdbg
        linux-vdso.so.1 => (0x00007ffe09f0d000)
        libstdc++.so.6 => /data/vjoost/toolchain-r16447/install/lib64/libstdc++.so.6 (0x00007f7d5a890000)
        libgfortran.so.3 => /data/vjoost/toolchain-r16447/install/lib64/libgfortran.so.3 (0x00007f7d5a56f000)
        libm.so.6 => /lib64/libm.so.6 (0x0000003233200000)
        libgcc_s.so.1 => /data/vjoost/toolchain-r16447/install/lib64/libgcc_s.so.1 (0x00007f7d5a33c000)
        libquadmath.so.0 => /data/vjoost/toolchain-r16447/install/lib64/libquadmath.so.0 (0x00007f7d5a0fd000)
        libc.so.6 => /lib64/libc.so.6 (0x0000003232e00000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003232a00000)

There are many more static libraries involved, and all are compiled with debug info. The binary is also large (~142 MB).
(In reply to Joost VandeVondele from comment #8)
> I'll try something similar on the other machine, but the failure is
> seemingly not so easy to trigger.
...
> There are many more static libraries involved, and all are compiled with
> debug info. The binary is also large (~142 MB).

According to the guest stacktrace, the corruption happens while mmap-ing a shared lib (i.e. while valgrind is reading the debug info of this library), so it is probably related to the shared lib being loaded.

When self-hosting, you will increase the chance of detecting a possible buffer overrun by using --core-redzone-size=xxx on the inner valgrind, with xxx being e.g. 100 bytes, or even 1000 bytes (if that does not give an out-of-memory).
Created attachment 96782 [details] self-hosting output 2
Created attachment 96783 [details] self hosting output 3
Since the error is recurring, I have now tried the self-hosting. Running:

  /data/vjoost/test/outer/install/bin/valgrind --sim-hints=enable-outer \
      --trace-children=yes --smc-check=all-non-file --run-libc-freeres=no \
      --tool=memcheck -v \
    /data/vjoost/test/inner/install/bin/valgrind \
      --suppressions=/data/vjoost/toolchain-r16494/install/valgrind.supp \
      --max-stackframe=2168152 --error-exitcode=42 --vgdb-prefix=./inner \
      --core-redzone-size=1000 --tool=memcheck -v \
    /data/schuetto/auto_regtesting/regtests/cp2k/exe/local_valgrind/cp2k.sdbg \
      ethanol_both_rcut10.0_e1-1_v1-4_RSR.inp

(i.e. self-hosting with an added redzone, on our executable, with the arguments and parameters of a failed run), I get a seemingly correct run. The output will be attached as out.innerouter.2; maybe it is worthwhile to look at it with expert eyes.

However, after observing in that output a warning on stack switching, I added --max-stackframe=68009224472 (as suggested; seems a bit large ;-), and that led to a run with a different error (Memcheck: the 'impossible' happened: create_MC_Chunk: shadow area is accessible).
(In reply to Joost VandeVondele from comment #12)
Thanks for this data.

The warning about the stack switch is normal: valgrind has a heuristic to detect stack switches. If a program uses huge stack frames, a call can be confused with a stack switch, and this warning indicates what to do if it is *not* a stack switch (in a self-hosting setup, such a message is expected: it really is a stack switch, not a huge frame).

Sadly, no error is detected by the self-hosting. So, there are a few more things you could try.

Run normally (no need to self-host) but add the option --sanity-level=4. Valgrind will then do (more) sanity checks while running, and maybe this will give a hint.

Another thing is to add the option --vgdb-stop-at=valgrindabexit when you run all your regression tests. Then, when the problem reproduces, valgrind will stop and wait for a gdb to connect. Attach with gdb to the valgrind process (a sketch follows below) and do e.g.
  bt full
Also go to the frame at image.c:778 and do
  print img->ces[i]->off
  print img->ces[i]->used
You might also print all the non-null ces entries. But we are really trying to kill this bug by shooting in the dark :(
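A sketch of that gdb session; the PID, frame number, and cache-entry index are illustrative assumptions:

  # Attach to the valgrind tool process that is waiting after the assert
  # (its PID is the ==NNNNN== prefix in the log).
  gdb -p 10071
  (gdb) bt full                     # full host backtrace with local variables
  (gdb) frame 4                     # the vgModuleLocal_img_done frame (image.c:778)
  (gdb) print img->ces[0]->off      # inspect one image cache entry
  (gdb) print img->ces[0]->used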
Also no luck with --sanity-level=4.

The fact that it is not reproducible on demand is indeed not simplifying this. I wonder if something external to valgrind could be triggering this.
(In reply to Joost VandeVondele from comment #14)
> Also no luck with --sanity-level=4.
>
> The fact that it is not reproducible on demand is indeed not simplifying
> this. I wonder if something external to valgrind could be triggering this.

Yes, this bug is quite mysterious. The only remaining thing I see to try is to add --vgdb-stop-at=valgrindabexit to the valgrind args you use for your regression tests. When the error happens, valgrind will then wait for a gdb to connect via gdb+vgdb; you can then examine e.g. which library is being mmap-ed (see the sketch below). You might also use gdb to attach directly to valgrind and examine valgrind's internals.
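A sketch of connecting once a run has stopped; the binary name and PID are illustrative assumptions:

  # List the valgrind gdbserver processes waiting for a debugger.
  vgdb -l

  # Connect gdb to the stopped run through vgdb.
  gdb ./cp2k.sdbg
  (gdb) target remote | vgdb --pid=10071
  (gdb) monitor v.info scheduler    # guest stacktraces: shows which library load was in progress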
Created attachment 101649 [details] Valgrind output for a failing test.
(Copied from the valgrind-users list following Julian Seward's request.)

Hi,

I recently upgraded to valgrind 3.11.0 (was 3.8.1 before). I also upgraded Intel MKL to the 2015 version.

We have been using valgrind since 2011-10-27 to test our code nightly with about 2200 tests. Since this upgrade, some tests (not always the same) hit the same problem (see the full log in the attachment):

================================
...
blockSane: fail -- redzone-hi

valgrind: m_mallocfree.c:2042 (vgPlain_arena_free): Assertion 'blockSane(a, b)' failed.

host stacktrace:
...
================================

I would like some help to debug this. It is quite annoying since, for the same test, the problem appears or disappears from one day to the next... Today we got the problem on 6 tests (out of 2200); yesterday, only 1 test failed; the day before, none failed...

Thanks for any insights or clues!

Eric
Some other information:

Distribution: openSUSE 12.3

Kernel:
cat /proc/version
Linux version 3.7.10-1.45-desktop (geeko@buildhost) (gcc version 4.7.2 20130108 [gcc-4_7-branch revision 195012] (SUSE Linux) ) #1 SMP PREEMPT Tue Dec 16 20:27:58 UTC 2014 (4c885a1)

g++:
g++ (SUSE Linux) 4.7.2 20130108 [gcc-4_7-branch revision 195012]

When launching the 2200 tests, we run at most 12 at the same time, i.e. we almost never run a test "alone"... so there is some concurrency for the resources.

I just tried relaunching a faulty test, but I am unable to reproduce.

Looking at /var/log/messages, I found just one entry like this:

2016-10-19T03:37:51.503636-04:00 melkor kernel: [1032084.013381] userif-3: sent link up event.
2016-10-19T03:37:51.503655-04:00 melkor kernel: [2920320.962816] memcheck-amd64-[26315]: segfault at 1c04 ip 0000000038055b90 sp 0000000039553ed8 error 4 in memcheck-amd64-linux[38000000+21f000]

which corresponds to one of the 6 failing tests of last night.

Eric
(In reply to Eric Chamberland from comment #18)
> which corresponds to one of the 6 failing tests of last night.

For these 6 failing tests (or the failing tests of the other runs), does it always (or often?) happen just after the line telling that it is loading the syms of libgiref_opt_Interface.so? (I am wondering if this bug is linked to some specific debug info/symbols in a library.) Or does it come after a seemingly random library load?

Also, a recent commit has added a check on valgrind's own heap to detect double frees (which could potentially cause such an assert). So it would be good if you could try the valgrind svn version, just in case this is a double free; a build sketch follows below.
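A sketch of building the svn version, assuming the valgrind trunk URL of the time and an illustrative install prefix:

  # Check out and build the development version of valgrind.
  svn co svn://svn.valgrind.org/valgrind/trunk valgrind-trunk
  cd valgrind-trunk
  ./autogen.sh
  ./configure --prefix=$HOME/valgrind-svn
  make && make install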
Yet another thing you can try (if you build the svn version) is to activate DEBUG_MALLOC in the coregrind/m_mallocfree.c file, as sketched below. This might give a chance to detect a possible corruption closer to its origin. Thanks.
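A sketch of activating it, reusing the checkout from the previous sketch; the sed pattern assumes the define is commented out exactly as quoted in comment #1:

  # Uncomment the DEBUG_MALLOC define, then rebuild and reinstall.
  cd valgrind-trunk
  sed -i 's|// #define DEBUG_MALLOC|#define DEBUG_MALLOC|' coregrind/m_mallocfree.c
  make && make install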
Finally I got not 6 but 8 failing tests (I wrote the report before they had all completed...). There are 7 different libraries on which valgrind aborts:

#1:
--17999-- Reading syms from /pmi/cmpbib/compilation_BIB_gcc-4.5.1_64bit_valgrind/COMPILE_AUTO/GIREF/lib/libgiref_dev_ChampsUtil.so
blockSane: fail -- redzone-hi

#2:
--17515-- Reading syms from /pmi/cmpbib/compilation_BIB_gcc-4.5.1_64bit_valgrind/COMPILE_AUTO/GIREF/lib/libgiref_dev_Contact.so
blockSane: fail -- redzone-hi

#3:
--26315-- Reading syms from /usr/local/valgrind-3.11.0/lib64/valgrind/memcheck-amd64-linux
--26315-- object doesn't have a dynamic symbol table
blockSane: fail -- redzone-hi

#4:
--21183-- Reading syms from /pmi/cmpbib/compilation_BIB_gcc-4.5.1_64bit_valgrind/COMPILE_AUTO/GIREF/lib/libgiref_opt_Interface.so
blockSane: fail -- redzone-hi

#5:
--2349-- Reading syms from /pmi/cmpbib/compilation_BIB_gcc-4.5.1_64bit_valgrind/COMPILE_AUTO/GIREF/lib/libgiref_opt_Adaptation.so
blockSane: fail -- redzone-hi

#6:
--2929-- Reading syms from /pmi/cmpbib/compilation_BIB_gcc-4.5.1_64bit_valgrind/COMPILE_AUTO/GIREF/lib/libgiref_opt_Geometrie.so
blockSane: fail -- redzone-hi

#7:
--19038-- Reading syms from /pmi/cmpbib/compilation_BIB_gcc-4.5.1_64bit_valgrind/COMPILE_AUTO/GIREF/lib/libgiref_dev_LecteurDeclaration.so
blockSane: fail -- redzone-hi

#8:
--31828-- Reading syms from /pmi/cmpbib/compilation_BIB_gcc-4.5.1_64bit_valgrind/COMPILE_AUTO/GIREF/lib/libgiref_dev_LecteurDeclaration.so
blockSane: fail -- redzone-hi

I launched a test 165 minutes ago with #define DEBUG_MALLOC enabled (but not the svn version), and it has not terminated yet... but it has passed the library-loading phase... :/

Eric
In my case, the issue has disappeared, and the 'only' thing that changed is that the server has been updated and is now running Red Hat Enterprise Linux Server release 7.2, which for example includes a newer kernel (3.10.0-327.13.1.el7.x86_64). Valgrind, gcc, etc. are still the same versions. So I would suspect some interaction with the OS is causing this.
Thanks Joost for this information: in fact, we will upgrade the OS when possible...

This morning, I have 4 (different) failing tests. I launched 3.11.0 with -v -v -v -d -d -d; please see valgrind_with_vvvddd.txt.

I will check out the svn version, and tomorrow night we will see if the error is still there...

Thanks,

Eric
Created attachment 101660 [details] valgrind launched with -v -v -v -d -d -d
FWIW, the svn version of valgrind didn't produce any blockSane assertion last night, but since the failure was happening randomly, we have to let it run for some days to see whether this bug is definitely gone.

Thanks,

Eric
(In reply to Eric Chamberland from comment #25)
> FWIW, the svn version of valgrind didn't produce any blockSane assertion
> last night, but since the failure was happening randomly, we have to let it
> run for some days to see whether this bug is definitely gone.

Everything has been fine for 21 days now. I think this can be "closed".

Eric