Running cvs head. Different Linux 2.6 kernels following bk. This is a longstanding problem.

I run my regression tests where a server stays up and a few hundred clients execute a test; the server launches a thread for each client. After a few hundred clients have run successfully, the server aborts with:

  VG_(get_memory_from_mmap): newSuperblock's request for 1048576 bytes failed.
  VG_(get_memory_from_mmap): 33599270 bytes already allocated.
  Sorry.
  You could try using a tool that uses less memory; eg. addrcheck instead of memcheck.

The recent memory management improvement had no effect. The failure is always at exactly the same point in the process, and no unusual errors were logged before the failure.

Options used:

  valgrind --tool=memcheck \
           --leak-check=yes \
           --show-reachable=no \
           --num-callers=50 \
           --error-limit=no \
           --run-libc-freeres=no \
           --time-stamp=yes \
           --trace-syscalls=no \
           --quiet \
           MyProgram MyOptions

I did not find an option combination that changes the situation. A log of the run is prohibitively large (it takes a few hours to reach the failure point); I am now generating one.

As an aside, should we have Severity 'abort' to differentiate it from 'crash'? The first indicates VG was in control, the latter means it lost it.
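For context, this is a minimal sketch of the thread-per-client pattern described above; the names (handle_client, spawn_client_thread, client_ctx) are hypothetical stand-ins, not the reporter's actual code.

```c
/* Sketch of a server that spawns one short-lived thread per accepted
 * client, as in the regression test above. Hypothetical names. */
#include <pthread.h>
#include <stdlib.h>

typedef struct {
    int client_fd;   /* connected socket for this client */
} client_ctx;

static void *handle_client(void *arg)
{
    client_ctx *ctx = arg;
    /* ... run one (often very short) client interaction ... */
    (void)ctx->client_fd;
    free(ctx);
    return NULL;
}

/* Spawn a detached worker for one client; returns 0 on success. */
int spawn_client_thread(int client_fd)
{
    client_ctx *ctx = malloc(sizeof *ctx);
    if (!ctx)
        return -1;
    ctx->client_fd = client_fd;

    pthread_t tid;
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    /* Detached, so each exiting worker's resources are reclaimed by the
     * library rather than accumulating until a pthread_join. */
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    int rc = pthread_create(&tid, &attr, handle_client, ctx);
    pthread_attr_destroy(&attr);
    if (rc != 0) {
        free(ctx);
        return -1;
    }
    return 0;
}
```

Even when the application itself cleans up like this, the abort above happens inside Valgrind's own allocator, which is what the rest of this thread digs into.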
Hm, it looks like it might be heap fragmentation. There may be only 32 MB of actually allocated stuff, but it's scattered around so much that there isn't space for whatever it's trying to allocate. Do you know what it is trying to do at the time? Load a new .so maybe? Report an error?
Only 32MB of allocated stuff - you mean malloc()ed stuff? I surely use more than that. Not sure what it was doing when it failed; I will investigate.
My investigation is progressing slowly (log files up to 1GB and single runs of a few hours). I could use a status dump for the above error, though. Having a stack trace of all the threads active at the time, and knowing which one led to the abort, would be very useful. Is this something simple to add? I think that any abort should provide a state report.
Created attachment 9854 [details]
tail of VG log

A 'tail -n 100000' of the VG log.
Bloody Hell! I wrote a nice post, then clicked "Create a New Attachment", which worked well except that it lost my actual "Additional Comment". Here I go again...

I now have a log and my own stack trace. The situation at the time of failure has 13 active threads, 5 of them clients. Only one thread (a client) seems to be active at this point:

   0  pthread_join   main
   1  pthread_join   servers create/join
   2  accept         server 1
   3  accept         server 2
   4  accept         server 3
   5  select         server 1 housekeeping, sleeping
   6  select         server 2 housekeeping, sleeping
   7  select         server 3 housekeeping, sleeping
   8  (exited)
   9  ssamutex_lock  client, rulebase_commit
  10  recv           client, idle
  11  (exited)
  12  recv           client, idle
  13  (exited)
  14  (exited)
  15  recv           client, idle
  16  recv           client, idle
  17  (exited)

The active thread (#9) is executing a complex request (a database commit) and, according to my stack trace, it was executing a pthread_getspecific() call. The tail of the VG log is attached (the full log size is 447,094,059 bytes).
Try putting a VG_(core_panic)("get mem failed") in vg_mylibc.c:VG_(get_memory_from_mmap)(), just before the "VG_(exit)(1)".
I have now run with the core_panic() added. I will attach the log.
Created attachment 9860 [details]
Stack trace at abort

The end of the VG log that has the extra trace.
Looking at the active thread, I see that it is exiting and producing a usage report (usage_line). The crash indicates a line that I would not expect to be the failure: it simply converts a test number to 'unsigned long'. This means it reads a buffer, and while I would accept a 'read violation' there, I cannot see how it leads to the mmap request. I build with -O0 for accurate line numbers.

  614    for (n = 0;; ++i) {
  615        if (len && i >= len)
  616            break;
  617        d = (int)((uuchar *)v)[i];
Well, it's trying to allocate memory while doing a sanity check, so it doesn't have anything to do with what your program was specifically doing at that moment. The memory it's allocating is for reading /proc/self/maps, which is presumably quite large. Could you sample the contents of /proc/<pid>/maps for your process as it runs under Valgrind, to see what it looks like around the time of the crash?
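For anyone wanting to do that sampling, here is a minimal sketch (assuming Linux, where /proc/<pid>/maps is one text line per mapping) of a helper that counts the mappings; the function name count_mappings is my own invention.

```c
/* Count the mappings (lines) in /proc/<pid>/maps, the file Valgrind is
 * reading when the failing allocation happens. Pass 0 for the current
 * process. Returns -1 if the file cannot be opened. */
#include <stdio.h>

long count_mappings(long pid)
{
    char path[64];
    if (pid > 0)
        snprintf(path, sizeof path, "/proc/%ld/maps", pid);
    else
        snprintf(path, sizeof path, "/proc/self/maps");

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    long lines = 0;
    int c;
    while ((c = fgetc(f)) != EOF)
        if (c == '\n')
            lines++;
    fclose(f);
    return lines;
}
```

Calling this once a second for the valgrinded process (the pid printed in the ==NNNN== prefix) shows whether the map is growing without bound.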
6346 lines in /proc/self/maps as it crashes. Log attached.
Created attachment 9879 [details]
'cat /proc/self/maps' at crash time
Created attachment 9880 [details]
memory map growth

The number of lines in the map, measured once a second from test start until the crash.
Hm, a huge number of 64k mmaps in the valgrind part of the address space... I wonder what can do that? Ah, it looks like thread stacks. Your app creates lots of (short-lived?) threads as it runs, doesn't it?
Created attachment 9882 [details]
Don't leak valgrind's thread stacks

I think this should fix you up.
Yes, the run tests many (hundreds of) clients, each of which is allocated its own thread for the duration of the interaction, which is often very short. I am now testing the patch.
The test has now been running for 20 minutes; it has finished about 15% of the work and the size of /proc/self/maps is stable. So the fix has a positive effect.
CVS commit by fitzhardinge:

Fix leak of Valgrind-internal thread stacks. This was leaking 64k each
time a thread exited.

BUGS: 100036

  M +1 -1  vg_scheduler.c  1.225

--- valgrind/coregrind/vg_scheduler.c  #1.224:1.225
@@ -513,5 +513,5 @@ void mostly_clear_thread_record ( Thread
    VG_(sigemptyset)(&VG_(threads)[tid].eff_sig_mask);

-   VGA_(os_state_init)(&VG_(threads)[tid]);
+   VGA_(os_state_clear)(&VG_(threads)[tid]);

    /* start with no altstack */
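The one-line fix swaps an "init" call, which sets up a fresh per-thread record including its stack, for a "clear" call that only resets fields. As a toy model of why that matters on thread *teardown* (these names and behaviours are simplified stand-ins, not Valgrind's actual implementation):

```c
/* Toy model of the leak the commit fixes: an *_init routine allocates a
 * brand-new 64k stack, so calling it when a thread exits leaks the old
 * stack and allocates yet another one. Hypothetical names throughout. */
#include <stdlib.h>

#define STACK_SIZE (64 * 1024)

static long live_stacks;   /* stands in for the 64k mmaps piling up */

typedef struct {
    char *stack;
    int   flags;
} os_state;

/* Allocate a fresh stack and reset the record. */
void os_state_init(os_state *st)
{
    st->stack = malloc(STACK_SIZE);
    st->flags = 0;
    live_stacks++;
}

/* Reset the record's fields without touching the stack. */
void os_state_clear(os_state *st)
{
    st->flags = 0;
}

/* Buggy teardown: "inits" the record, leaking one stack per exit. */
void teardown_buggy(os_state *st) { os_state_init(st); }

/* Fixed teardown, as in the commit: just clear the record. */
void teardown_fixed(os_state *st) { os_state_clear(st); }
```

With hundreds of short-lived threads, the buggy path adds one 64k mapping per exited thread, which matches the steady /proc/self/maps growth measured in the attachment.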
I have to reopen this; I'm still seeing that bug. For me it's much easier to reproduce, as it always strikes when valgrinding a simple kde3 application.

A few specs: 512 MB RAM, 2 GB swap, kernel 2.4.24, glibc 2.2, gcc 2.95.2. With valgrind CVS >= 18.3.2005 it seems to have improved, i.e. it takes longer for the request to fail.

Here's the transcript:
----------------------------------------------------------------------
> valgrind --tool=memcheck kwrite regression.nachher-2
==13947== Memcheck, a memory error detector for x86-linux.
==13947== Copyright (C) 2002-2005, and GNU GPL'd, by Julian Seward et al.
==13947== Using valgrind-2.3.0.CVS, a program supervision framework for x86-linux.
==13947== Copyright (C) 2000-2005, and GNU GPL'd, by Julian Seward et al.
==13947== For more details, rerun with: -v
==13947==
==13947== Conditional jump or move depends on uninitialised value(s)
==13947==    at 0x1B8E617E: _dl_start (in /lib/ld-2.2.so)
==13947==    by 0x1B8E5E85: (within /lib/ld-2.2.so)
==13947==
==13947== Conditional jump or move depends on uninitialised value(s)
==13947==    at 0x1B8E61FE: _dl_start (in /lib/ld-2.2.so)
==13947==    by 0x1B8E5E85: (within /lib/ld-2.2.so)
==13947==
==13947== Conditional jump or move depends on uninitialised value(s)
==13947==    at 0x1CF267BA: strstr (in /lib/libc.so.6)
==13947==    by 0x1CE1C69E: is_smp_system (in /lib/libpthread.so.0)
==13947==    by 0x1CE1C868: pthread_initialize (in /lib/libpthread.so.0)
==13947==    by 0x1CE21986: (within /lib/libpthread.so.0)
==13947==    by 0x1CE188ED: (within /lib/libpthread.so.0)
==13947==    by 0x1B8F161E: _dl_init (in /lib/ld-2.2.so)
==13947==    by 0x1B8E5EC0: (within /lib/ld-2.2.so)
Fontconfig warning: line 247: invalid edit binding "same"
Fontconfig warning: line 259: invalid edit binding "same"
==13947==
==13947== Invalid read of size 1
==13947==    at 0x1B8F5923: memcpy (in /lib/ld-2.2.so)
==13947==    by 0x1B8E91B4: decompose_rpath (in /lib/ld-2.2.so)
==13947==    by 0x1B8EB757: _dl_map_object (in /lib/ld-2.2.so)
==13947==    by 0x1B8F0480: _dl_map_object_deps (in /lib/ld-2.2.so)
==13947==    by 0x1CFA2F20: dl_open_worker (in /lib/libc.so.6)
==13947==    by 0x1B8F1341: _dl_catch_error (in /lib/ld-2.2.so)
==13947==    by 0x1CFA31BD: _dl_open (in /lib/libc.so.6)
==13947==    by 0x1CCE9353: dlopen_doit (in /lib/libdl.so.2)
==13947==    by 0x1B8F1341: _dl_catch_error (in /lib/ld-2.2.so)
==13947==    by 0x1CCE978F: _dlerror_run (in /lib/libdl.so.2)
==13947==    by 0x1CCE9392: dlopen@@GLIBC_2.1 (in /lib/libdl.so.2)
==13947==    by 0x1C8E67DE: QLibraryPrivate::loadLibrary(void) (tools/qlibrary_unix.cpp:114)
==13947==  Address 0x1D2F83F3 is 0 bytes after a block of size 99 alloc'd
==13947==    at 0x1B904641: malloc (vg_replace_malloc.c:130)
==13947==    by 0x1B8E8B56: expand_dynamic_string_token (in /lib/ld-2.2.so)
==13947==    by 0x1B8E8F1C: decompose_rpath (in /lib/ld-2.2.so)
==13947==    by 0x1B8EB757: _dl_map_object (in /lib/ld-2.2.so)
==13947==    by 0x1B8F0480: _dl_map_object_deps (in /lib/ld-2.2.so)
==13947==    by 0x1CFA2F20: dl_open_worker (in /lib/libc.so.6)
==13947==    by 0x1B8F1341: _dl_catch_error (in /lib/ld-2.2.so)
==13947==    by 0x1CFA31BD: _dl_open (in /lib/libc.so.6)
==13947==    by 0x1CCE9353: dlopen_doit (in /lib/libdl.so.2)
==13947==    by 0x1B8F1341: _dl_catch_error (in /lib/ld-2.2.so)
==13947==    by 0x1CCE978F: _dlerror_run (in /lib/libdl.so.2)
==13947==    by 0x1CCE9392: dlopen@@GLIBC_2.1 (in /lib/libdl.so.2)
kio (KSycoca): Trying to open ksycoca from /var/tmp/kdecache-kderun/ksycoca
VG_(get_memory_from_mmap): newSuperblock's request for 1048576 bytes failed.
VG_(get_memory_from_mmap): 241057574 bytes already allocated.
Sorry.
You could try using a tool that uses less memory; eg. addrcheck instead of memcheck.

valgrind: the `impossible' happened: get mem failed
Basic block ctr is approximately 114836780
==13947==    at 0xB00328BB: (within /leo/apps/valgrind-2.3/lib/valgrind/stage2)
==13947==    by 0xB00328BA: panic (../../coregrind/vg_mylibc.c:1185)
==13947==    by 0xB00328D7: vgPlain_core_panic (../../coregrind/vg_mylibc.c:1190)
==13947==    by 0xB003336D: vgPlain_get_memory_from_mmap (../../coregrind/vg_mylibc.c:1728)
==13947==    by 0xB002CCBC: newSuperblock (../../coregrind/vg_malloc2.c:479)
==13947==    by 0xB002DD83: vgPlain_arena_malloc (../../coregrind/vg_malloc2.c:983)
==13947==    by 0xB003E143: getStabType (../../coregrind/vg_stabs.c:293)
==13947==    by 0xB003E517: stabtype_parser (../../coregrind/vg_stabs.c:544)
==13947==    by 0xB003ED1B: stabtype_parser (../../coregrind/vg_stabs.c:877)
==13947==    by 0xB003E5F7: stabtype_parser (../../coregrind/vg_stabs.c:587)
==13947==    by 0xB003E822: stabtype_parser (../../coregrind/vg_stabs.c:669)
==13947==    by 0xB003F2A8: initSym (../../coregrind/vg_stabs.c:1076)

sched status: running_tid=1

Thread 1: status = VgTs_Runnable
==13947==    at 0x1B8F688D: mmap (in /lib/ld-2.2.so)
==13947==    by 0x1B8EBE12: _dl_map_object (in /lib/ld-2.2.so)
==13947==    by 0x1CFA2E0A: dl_open_worker (in /lib/libc.so.6)
==13947==    by 0x1B8F1341: _dl_catch_error (in /lib/ld-2.2.so)
==13947==    by 0x1CFA31BD: _dl_open (in /lib/libc.so.6)
==13947==    by 0x1CCE9353: dlopen_doit (in /lib/libdl.so.2)
==13947==    by 0x1B8F1341: _dl_catch_error (in /lib/ld-2.2.so)
==13947==    by 0x1CCE978F: _dlerror_run (in /lib/libdl.so.2)
==13947==    by 0x1CCE9392: dlopen@@GLIBC_2.1 (in /lib/libdl.so.2)
==13947==    by 0x1C281988: sys_dl_open (/leo/projekte/Fremde/kde/src/kdelibs/libltdl/ltdl.c:614)
==13947==    by 0x1C282235: tryall_dlopen (/leo/projekte/Fremde/kde/src/kdelibs/libltdl/ltdl.c:1519)
==13947==    by 0x1C2823D0: find_module (/leo/projekte/Fremde/kde/src/kdelibs/libltdl/ltdl.c:1582)

Note: see also the FAQ.txt in the source distribution.
It contains workarounds to several common problems.

If that doesn't help, please report this bug to: valgrind.kde.org
In the bug report, send all the above text, the valgrind version, and
what Linux distro you are using. Thanks.
----------------------------------------------------------------------
This bug remains in valgrind 2.4.0, and I think I have a crude fix.

My project's executables are quite large, and so memcheck in both valgrind 2.2.0 and 2.4.0 fails with a message of the form

  VG_(get_memory_from_mmap): newSuperblock's request for 1048576 bytes failed.
  VG_(get_memory_from_mmap): 258608880 bytes already allocated.
  Sorry.
  You could try using a tool that uses less memory; eg. addrcheck instead of memcheck.

massif fails as well; addrcheck works fine, as does valgrind 2.0.0.

After a few hours of studying the part of the valgrind 2.4.0 source code that is responsible for its internal memory allocation - in fact, one of the first things that valgrind does when launched - I came up with a very crude hack that increases the amount of memory allocated to valgrind's own heap and stack from 256 MB to 512 MB. The crude hack is to change, in coregrind/vg_main.c,

   case AT_PHDR:
      VG_(valgrind_base) = PGROUNDDN(auxv->u.a_val);
      break;

to

   case AT_PHDR:
      VG_(valgrind_base) = 0xa0000000; /* boooo! */
      break;

A few months ago on the valgrind users mailing list there was a posting suggesting, as a solution to this problem in valgrind 2.3, an adjustment to the variable KICKSTART_BASE in the configure files. Unfortunately, this variable appears to be effectively ignored in valgrind 2.4.0. On the other hand, my crude hack does succeed in changing valgrind's memory layout from

  client_base      0         (937MB)
  client_mapbase   3A965000  (1877MB)
  client_end       AFF00000  (1MB)
  shadow_base      B0000000  (0MB)
  shadow_end       B0000000
  valgrind_base    B0000000  (255MB)
  valgrind_last    BFFFFFFF

to

  client_base      0         (852MB)
  client_mapbase   35425000  (1706MB)
  client_end       9FF00000  (1MB)
  shadow_base      A0000000  (0MB)
  shadow_end       A0000000
  valgrind_base    A0000000  (511MB)
  valgrind_last    BFFFFFFF

This seems to be working fine on Red Hat Enterprise Linux release 3 with 4 GB RAM and on Scientific Linux 3.0.3 with 2 GB RAM, both with 2.4.21-series kernels. I can't tell if this would work with newer kernels or in other Linux distributions.
One thing is certain: the amount of space allocated between valgrind_base and valgrind_last has to be increased.
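The layouts quoted above can be checked arithmetically: the region Valgrind keeps for itself runs from valgrind_base to valgrind_last, so moving the base from 0xB0000000 down to 0xA0000000 doubles it from 256 MB to 512 MB (the report prints 255MB/511MB, presumably rounding down by the amount in use). A minimal check, with a helper name of my own:

```c
/* Size in whole megabytes of the inclusive address range [base, last],
 * matching the valgrind_base..valgrind_last region quoted above. */
#include <stdint.h>

unsigned region_mb(uint32_t base, uint32_t last)
{
    /* Widen to 64 bits so last - base + 1 cannot overflow at 4 GB. */
    return (unsigned)(((uint64_t)last - base + 1) >> 20);
}
```

With the stock base: region_mb(0xB0000000, 0xBFFFFFFF) is 256; with the hacked base: region_mb(0xA0000000, 0xBFFFFFFF) is 512.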
With valgrind 2.4.0 running memcheck or calltree, I reliably get a similar abort:

===
==2005-07-12 20:59:04.538 21197== Reading syms from /home/msimons/libreallybig.so (0x3B127000)
VG_(get_memory_from_mmap): newSuperblock's request for 1499136 bytes failed.
VG_(get_memory_from_mmap): 214010961 bytes already allocated.
Sorry.
You could try using a tool that uses less memory; eg. addrcheck instead of memcheck.
===

Valgrind was still loading shared libraries for the target process at this point. Valgrind had been able to run the same code without debug symbols... the only change between binaries that work and fail is a -g at build time.

Using the workaround from Alexandre (0xa0...) gets us past this crash, although valgrind running these objects with debug symbols is at least a factor of twenty slower to boot our process... I wonder if there is a performance problem dealing with > 500,000 debug symbols.
This bug is complicated - it looks like the original problem was a memory leak, which got fixed. The bug was then reopened on the basis of a similar-looking error message which probably had a completely different cause, and since then any old memory exhaustion seems to have been tacked onto it.

Most of the memory exhaustions reported look like they arose while reading the debug information, which should be fixed with the new address space manager in the current SVN code. Given that, and that the original leak was fixed, I am going to close this now. If any of you are still experiencing memory exhaustion problems with the current SVN code, please open a new bug with the details.