Bug 100036 - "newSuperblock's request for 1048576 bytes failed"
Summary: "newSuperblock's request for 1048576 bytes failed"
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: memcheck
Version: 2.3 CVS
Platform: Debian testing Linux
Importance: NOR crash
Target Milestone: ---
Assignee: Julian Seward
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-02-22 23:22 UTC by Eyal Lebedinsky
Modified: 2005-10-06 01:15 UTC
2 users

See Also:
Latest Commit:
Version Fixed In:


Attachments
tail of VG log (38.53 KB, application/x-bzip2)
2005-02-27 05:26 UTC, Eyal Lebedinsky
Stack trace at abort (14.73 KB, text/plain)
2005-02-27 14:40 UTC, Eyal Lebedinsky
'cat /proc/self/maps' at crash time. (36.83 KB, application/x-bzip2)
2005-02-28 09:38 UTC, Eyal Lebedinsky
memory map growth (5.67 KB, application/x-bzip2)
2005-02-28 09:40 UTC, Eyal Lebedinsky
Don't leak valgrind's thread stacks (616 bytes, patch)
2005-02-28 11:29 UTC, Jeremy Fitzhardinge

Description Eyal Lebedinsky 2005-02-22 23:22:58 UTC
Running CVS head, with various Linux 2.6 kernels following bk. This is a longstanding
problem.
I run my regression tests, where a server stays up and a few hundred clients
execute a test. The server launches a thread for each client. After a few
hundred clients have run successfully, the server aborts with:
 
VG_(get_memory_from_mmap): newSuperblock's request for 1048576 bytes failed.
VG_(get_memory_from_mmap): 33599270 bytes already allocated.

Sorry.  You could try using a tool that uses less memory;
eg. addrcheck instead of memcheck.

The recent memory management improvement had no effect. The failure always
occurs at exactly the same point in the process.

No unusual errors were logged before the failure. Options used:

valgrind --tool=memcheck \
        --leak-check=yes \
        --show-reachable=no \
        --num-callers=50 \
        --error-limit=no \
        --run-libc-freeres=no \
        --time-stamp=yes \
        --trace-syscalls=no \
        --quiet \
        MyProgram MyOptions

I did not find an option combination that changes the situation. A log of the
run is prohibitively large (it takes a few hours to reach the failure point). I
am now generating one.

As an aside, should we have a Severity 'abort' to differentiate it from 'crash'?
The first indicates VG was in control; the latter means it lost it.
Comment 1 Jeremy Fitzhardinge 2005-02-22 23:40:45 UTC
Hm, it looks like it might be heap fragmentation.  There could be only 32MB of actually allocated stuff, but it's scattered around so much that there isn't space for whatever it is it's trying to allocate.

Do you know what it is trying to do at the time?  Load a new .so maybe?  Report an error?
Comment 2 Eyal Lebedinsky 2005-02-23 10:47:45 UTC
Only 32MB of allocated stuff - you mean malloc()ed stuff? I surely use more than
that. Not sure what it was doing as it failed; I will investigate.
Comment 3 Eyal Lebedinsky 2005-02-26 12:44:07 UTC
My investigation is progressing slowly (log files up to 1GB and single runs of
a few hours).

I could use a status dump for the above error, though. Having a stack trace of all
the threads active at the time, and knowing which one led to the abort, would be
very useful.

Is this something simple to add? I think that any abort should provide a state
report.
Comment 4 Eyal Lebedinsky 2005-02-27 05:26:45 UTC
Created attachment 9854 [details]
tail of VG log

A 'tail -n 100000' of the VG log.
Comment 5 Eyal Lebedinsky 2005-02-27 05:37:22 UTC
Bloody Hell! I wrote a nice post then clicked "Create a New Attachment" which
worked well, except it lost my actual "Additional Comment". Here I go again...

I now have a log and my own stack trace. The situation at the time of failure
has 13 active threads, 5 of them clients. Only one thread (a client) seems
to be active at this point:
 0 pthread_join         main
 1 pthread_join         servers create/join
 2 accept               server 1
 3 accept               server 2
 4 accept               server 3
 5 select               server 1 housekeeping, sleeping
 6 select               server 2 housekeeping, sleeping
 7 select               server 3 housekeeping, sleeping
 8 (exited)
 9 ssamutex_lock        client, rulebase_commit
10 recv                 client, idle
11 (exited)
12 recv                 client, idle
13 (exited)
14 (exited)
15 recv                 client, idle
16 recv                 client, idle
17 (exited)

The active thread (#9) is executing a complex request (a database commit), and
according to my stack trace it was executing a pthread_getspecific() call.

The tail of the VG log is attached (the full log is 447,094,059 bytes).
Comment 6 Jeremy Fitzhardinge 2005-02-27 10:53:50 UTC
Try putting a VG_(core_panic)("get mem failed") in vg_mylibc.c:VG_(get_memory_from_mmap)(), just before the "VG_(exit)(1)".
Comment 7 Eyal Lebedinsky 2005-02-27 14:39:02 UTC
I now ran with the core_panic() added. I will attach the log.
Comment 8 Eyal Lebedinsky 2005-02-27 14:40:34 UTC
Created attachment 9860 [details]
Stack trace at abort

The end of the VG log that has the extra trace.
Comment 9 Eyal Lebedinsky 2005-02-27 14:58:06 UTC
Looking at the active thread, I see that it is exiting and producing a usage
report (usage_line). The crash points at a line that I would not expect to be the
failure: it simply converts a test number to 'unsigned long'. This means that it
reads a buffer, and as such, while I would accept a 'read violation', I cannot see
how it leads to the mmap request. I build with -O0 for accurate line numbers.

614     for (n = 0;; ++i) {
615             if (len && i >= len)
616                     break;
617             d = (int)((uuchar *)v)[i];
Comment 10 Jeremy Fitzhardinge 2005-02-27 18:34:42 UTC
Well, it's trying to allocate memory while doing a sanity check, so it doesn't have anything to do with what your program was specifically doing at that moment.

The memory it's allocating is for reading /proc/self/maps, which is presumably quite large.  Could you sample the contents of /proc/<pid>/maps for your process as it runs under Valgrind to see what it looks like around the time of the crash?
Comment 11 Eyal Lebedinsky 2005-02-28 09:28:59 UTC
6346 lines in /proc/self/maps as it crashes. Log attached.
Comment 12 Eyal Lebedinsky 2005-02-28 09:38:34 UTC
Created attachment 9879 [details]
'cat /proc/self/maps' at crash time.
Comment 13 Eyal Lebedinsky 2005-02-28 09:40:13 UTC
Created attachment 9880 [details]
memory map growth

The number of lines in the map, measured once a second from test start until
the crash.
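
The sampling described above (a per-second count of the lines in the process's maps file) can be sketched in a few lines of C. This is a hypothetical illustration assuming a Linux /proc filesystem, not the reporter's actual measurement script; the function name is mine:

```c
#include <assert.h>
#include <stdio.h>

/* Count the lines (i.e. mappings) in a maps file such as
 * /proc/self/maps or /proc/<pid>/maps.
 * Returns -1 if the file cannot be opened. */
long count_map_lines(const char *path)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    long lines = 0;
    int c;
    while ((c = fgetc(f)) != EOF)
        if (c == '\n')
            lines++;

    fclose(f);
    return lines;
}
```

Called once a second (e.g. from a loop with sleep(1)) against the valgrinded process's pid, this reproduces the growth curve in the "memory map growth" attachment.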
Comment 14 Jeremy Fitzhardinge 2005-02-28 10:58:59 UTC
Hm, a huge number of 64k mmaps in the valgrind part of the address space...  I wonder what could do that?

Ah, it looks like thread stacks.  Your app creates lots of (short-lived?) threads as it runs, doesn't it?
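
The workload pattern Jeremy is describing (many threads, each created for one short client interaction and then joined) can be sketched as follows. This is a hypothetical minimal reproducer, not the reporter's server code; under the pre-fix Valgrind, each exited thread leaked a 64k internal stack mapping, so maps grew by roughly one entry per thread:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Trivial worker standing in for one short-lived client interaction. */
static void *client_worker(void *arg)
{
    (void)arg;
    return NULL;
}

/* Create and join n short-lived threads, one after another; return how
 * many were joined successfully.  Run under an affected Valgrind, this
 * loop grows /proc/self/maps steadily even though every thread exits. */
int spawn_and_join(int n)
{
    int joined = 0;
    for (int i = 0; i < n; i++) {
        pthread_t t;
        if (pthread_create(&t, NULL, client_worker, NULL) != 0)
            break;
        if (pthread_join(t, NULL) == 0)
            joined++;
    }
    return joined;
}
```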
Comment 15 Jeremy Fitzhardinge 2005-02-28 11:29:45 UTC
Created attachment 9882 [details]
Don't leak valgrind's thread stacks

I think this should fix you up.
Comment 16 Eyal Lebedinsky 2005-02-28 12:22:06 UTC
Yes, the run tests many (hundreds of) clients; each is allocated its own thread for
the duration of the interaction, which is often very short.

I am now testing the patch.
Comment 17 Eyal Lebedinsky 2005-02-28 12:50:39 UTC
The test has now been running for 20 minutes; it has finished about 15% of the work
and the size of /proc/self/maps is stable. So the fix has a positive effect.
Comment 18 Jeremy Fitzhardinge 2005-02-28 16:59:48 UTC
CVS commit by fitzhardinge: 

Fix leak of Valgrind-internal thread stacks.  This was leaking 64k
each time a thread exited.

BUGS: 100036


  M +1 -1      vg_scheduler.c   1.225


--- valgrind/coregrind/vg_scheduler.c  #1.224:1.225
@@ -513,5 +513,5 @@ void mostly_clear_thread_record ( Thread
    VG_(sigemptyset)(&VG_(threads)[tid].eff_sig_mask);
 
-   VGA_(os_state_init)(&VG_(threads)[tid]);
+   VGA_(os_state_clear)(&VG_(threads)[tid]);
 
    /* start with no altstack */


Comment 19 Leo Savernik 2005-03-18 19:31:02 UTC
I have to reopen this: I'm still seeing this bug.

For me it's much easier to reproduce, as it always strikes when valgrinding a simple kde3 application.

A few specs: 512MB RAM, 2GB swap, kernel 2.4.24, glibc 2.2, gcc 2.95.2

With valgrind CVS >= 18.3.2005 it seems to have improved, i.e. it takes longer for the request to fail.

Here's the transcript:

----------------------------------------------------------------------
> valgrind --tool=memcheck kwrite regression.nachher-2
==13947== Memcheck, a memory error detector for x86-linux.
==13947== Copyright (C) 2002-2005, and GNU GPL'd, by Julian Seward et al.
==13947== Using valgrind-2.3.0.CVS, a program supervision framework for x86-linux.
==13947== Copyright (C) 2000-2005, and GNU GPL'd, by Julian Seward et al.
==13947== For more details, rerun with: -v
==13947== 
==13947== Conditional jump or move depends on uninitialised value(s)
==13947==    at 0x1B8E617E: _dl_start (in /lib/ld-2.2.so)
==13947==    by 0x1B8E5E85: (within /lib/ld-2.2.so)
==13947== 
==13947== Conditional jump or move depends on uninitialised value(s)
==13947==    at 0x1B8E61FE: _dl_start (in /lib/ld-2.2.so)
==13947==    by 0x1B8E5E85: (within /lib/ld-2.2.so)
==13947== 
==13947== Conditional jump or move depends on uninitialised value(s)
==13947==    at 0x1CF267BA: strstr (in /lib/libc.so.6)
==13947==    by 0x1CE1C69E: is_smp_system (in /lib/libpthread.so.0)
==13947==    by 0x1CE1C868: pthread_initialize (in /lib/libpthread.so.0)
==13947==    by 0x1CE21986: (within /lib/libpthread.so.0)
==13947==    by 0x1CE188ED: (within /lib/libpthread.so.0)
==13947==    by 0x1B8F161E: _dl_init (in /lib/ld-2.2.so)
==13947==    by 0x1B8E5EC0: (within /lib/ld-2.2.so)
Fontconfig warning: line 247: invalid edit binding "same"
Fontconfig warning: line 259: invalid edit binding "same"
==13947== 
==13947== Invalid read of size 1
==13947==    at 0x1B8F5923: memcpy (in /lib/ld-2.2.so)
==13947==    by 0x1B8E91B4: decompose_rpath (in /lib/ld-2.2.so)
==13947==    by 0x1B8EB757: _dl_map_object (in /lib/ld-2.2.so)
==13947==    by 0x1B8F0480: _dl_map_object_deps (in /lib/ld-2.2.so)
==13947==    by 0x1CFA2F20: dl_open_worker (in /lib/libc.so.6)
==13947==    by 0x1B8F1341: _dl_catch_error (in /lib/ld-2.2.so)
==13947==    by 0x1CFA31BD: _dl_open (in /lib/libc.so.6)
==13947==    by 0x1CCE9353: dlopen_doit (in /lib/libdl.so.2)
==13947==    by 0x1B8F1341: _dl_catch_error (in /lib/ld-2.2.so)
==13947==    by 0x1CCE978F: _dlerror_run (in /lib/libdl.so.2)
==13947==    by 0x1CCE9392: dlopen@@GLIBC_2.1 (in /lib/libdl.so.2)
==13947==    by 0x1C8E67DE: QLibraryPrivate::loadLibrary(void) (tools/qlibrary_unix.cpp:114)
==13947==  Address 0x1D2F83F3 is 0 bytes after a block of size 99 alloc'd
==13947==    at 0x1B904641: malloc (vg_replace_malloc.c:130)
==13947==    by 0x1B8E8B56: expand_dynamic_string_token (in /lib/ld-2.2.so)
==13947==    by 0x1B8E8F1C: decompose_rpath (in /lib/ld-2.2.so)
==13947==    by 0x1B8EB757: _dl_map_object (in /lib/ld-2.2.so)
==13947==    by 0x1B8F0480: _dl_map_object_deps (in /lib/ld-2.2.so)
==13947==    by 0x1CFA2F20: dl_open_worker (in /lib/libc.so.6)
==13947==    by 0x1B8F1341: _dl_catch_error (in /lib/ld-2.2.so)
==13947==    by 0x1CFA31BD: _dl_open (in /lib/libc.so.6)
==13947==    by 0x1CCE9353: dlopen_doit (in /lib/libdl.so.2)
==13947==    by 0x1B8F1341: _dl_catch_error (in /lib/ld-2.2.so)
==13947==    by 0x1CCE978F: _dlerror_run (in /lib/libdl.so.2)
==13947==    by 0x1CCE9392: dlopen@@GLIBC_2.1 (in /lib/libdl.so.2)
kio (KSycoca): Trying to open ksycoca from /var/tmp/kdecache-kderun/ksycoca

VG_(get_memory_from_mmap): newSuperblock's request for 1048576 bytes failed.
VG_(get_memory_from_mmap): 241057574 bytes already allocated.

Sorry.  You could try using a tool that uses less memory;
eg. addrcheck instead of memcheck.


valgrind: the `impossible' happened:
   get mem failed
Basic block ctr is approximately 114836780
==13947==    at 0xB00328BB: (within /leo/apps/valgrind-2.3/lib/valgrind/stage2)
==13947==    by 0xB00328BA: panic (../../coregrind/vg_mylibc.c:1185)
==13947==    by 0xB00328D7: vgPlain_core_panic (../../coregrind/vg_mylibc.c:1190)
==13947==    by 0xB003336D: vgPlain_get_memory_from_mmap (../../coregrind/vg_mylibc.c:1728)
==13947==    by 0xB002CCBC: newSuperblock (../../coregrind/vg_malloc2.c:479)
==13947==    by 0xB002DD83: vgPlain_arena_malloc (../../coregrind/vg_malloc2.c:983)
==13947==    by 0xB003E143: getStabType (../../coregrind/vg_stabs.c:293)
==13947==    by 0xB003E517: stabtype_parser (../../coregrind/vg_stabs.c:544)
==13947==    by 0xB003ED1B: stabtype_parser (../../coregrind/vg_stabs.c:877)
==13947==    by 0xB003E5F7: stabtype_parser (../../coregrind/vg_stabs.c:587)
==13947==    by 0xB003E822: stabtype_parser (../../coregrind/vg_stabs.c:669)
==13947==    by 0xB003F2A8: initSym (../../coregrind/vg_stabs.c:1076)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable
==13947==    at 0x1B8F688D: mmap (in /lib/ld-2.2.so)
==13947==    by 0x1B8EBE12: _dl_map_object (in /lib/ld-2.2.so)
==13947==    by 0x1CFA2E0A: dl_open_worker (in /lib/libc.so.6)
==13947==    by 0x1B8F1341: _dl_catch_error (in /lib/ld-2.2.so)
==13947==    by 0x1CFA31BD: _dl_open (in /lib/libc.so.6)
==13947==    by 0x1CCE9353: dlopen_doit (in /lib/libdl.so.2)
==13947==    by 0x1B8F1341: _dl_catch_error (in /lib/ld-2.2.so)
==13947==    by 0x1CCE978F: _dlerror_run (in /lib/libdl.so.2)
==13947==    by 0x1CCE9392: dlopen@@GLIBC_2.1 (in /lib/libdl.so.2)
==13947==    by 0x1C281988: sys_dl_open (/leo/projekte/Fremde/kde/src/kdelibs/libltdl/ltdl.c:614)
==13947==    by 0x1C282235: tryall_dlopen (/leo/projekte/Fremde/kde/src/kdelibs/libltdl/ltdl.c:1519)
==13947==    by 0x1C2823D0: find_module (/leo/projekte/Fremde/kde/src/kdelibs/libltdl/ltdl.c:1582)


Note: see also the FAQ.txt in the source distribution.
It contains workarounds to several common problems.

If that doesn't help, please report this bug to: valgrind.kde.org

In the bug report, send all the above text, the valgrind
version, and what Linux distro you are using.  Thanks.

----------------------------------------------------------------------


Comment 20 Alexandre Telnov 2005-04-21 07:00:01 UTC
This bug remains in valgrind 2.4.0, and I think I have a crude fix.

My project's executables are quite large, and so memcheck in both valgrind 2.2.0 and 2.4.0 fails with a message of the form 

VG_(get_memory_from_mmap): newSuperblock's request for 1048576 bytes failed.
VG_(get_memory_from_mmap): 258608880 bytes already allocated.

Sorry.  You could try using a tool that uses less memory;
eg. addrcheck instead of memcheck.

massif fails as well; addrcheck works fine, as does valgrind 2.0.0.

After a few hours of studying the part of the valgrind 2.4.0
source code that is responsible for its internal memory allocation - in
fact, one of the first things that valgrind does when launched - I came up
with a very crude hack that increases the amount of memory allocated to
valgrind's own heap and stack from 256 MB to 512 MB:

The crude hack is to change in coregrind/vg_main.c

      case AT_PHDR:
         VG_(valgrind_base) = PGROUNDDN(auxv->u.a_val);
         break;

to

      case AT_PHDR:
         VG_(valgrind_base) = 0xa0000000; /* boooo! */
         break;

A few months ago on the valgrind users mailing list there was a posting suggesting, as a solution to this problem in valgrind 2.3, an adjustment to the KICKSTART_BASE variable in the configure files. Unfortunately, this variable
appears to be effectively ignored in valgrind 2.4.0. On the other hand, my crude hack does succeed in changing valgrind's memory allocation from

client_base               0 (937MB)
client_mapbase     3A965000 (1877MB)
client_end         AFF00000 (1MB)
shadow_base        B0000000 (0MB)
shadow_end         B0000000
valgrind_base      B0000000 (255MB)
valgrind_last      BFFFFFFF

to

client_base               0 (852MB)
client_mapbase     35425000 (1706MB)
client_end         9FF00000 (1MB)
shadow_base        A0000000 (0MB)
shadow_end         A0000000
valgrind_base      A0000000 (511MB)
valgrind_last      BFFFFFFF

This seems to work fine on Red Hat Enterprise Linux release 3 with 4 GB RAM and on Scientific Linux 3.0.3 with 2 GB RAM, both with 2.4.21-series kernels. I can't tell whether this would work with newer kernels or on other Linux distributions. One thing is certain: the amount of space allocated between valgrind_base and valgrind_last has to be increased.
Comment 21 Mike Simons 2005-07-13 06:42:36 UTC
With valgrind 2.4.0 running memcheck or calltree, I reliably get a similar
abort:
===
==2005-07-12 20:59:04.538 21197== Reading syms from /home/msimons/libreallybig.so (0x3B127000)

VG_(get_memory_from_mmap): newSuperblock's request for 1499136 bytes failed.
VG_(get_memory_from_mmap): 214010961 bytes already allocated.

Sorry.  You could try using a tool that uses less memory;
eg. addrcheck instead of memcheck.
===

Valgrind was still loading shared libraries for the target process at this
point.  Valgrind was able to run the same code without debug symbols;
the only change between the binaries that work and those that fail is -g at build time.

Using the workaround from Alexandre (0xa0...) gets us past this crash.

Although running these objects with debug symbols under valgrind makes booting
our process at least a factor of twenty slower... I wonder if there is a
performance problem dealing with > 500,000 debug symbols.
Comment 22 Tom Hughes 2005-10-06 01:15:19 UTC
This bug is complicated - it looks like the original problem was a memory leak, which got fixed. It was then reopened on the basis of a similar-looking error message which probably had a completely different cause, and since then every memory exhaustion seems to have been tacked on to this bug.

Most of the memory exhaustions reported look like they arose while reading the debug information, which should be fixed by the new address space manager in the current SVN code. Given that, and that the original leak was fixed, I am going to close this now.

If any of you are still experiencing memory exhaustion problems with the current SVN code, please open a new bug with the details.