Memory allocations of 34255421417 bytes and higher cause a std::bad_alloc exception to be thrown when running a process under valgrind, even though the process runs fine on its own. This is not a memory-overhead issue: I have run this on a machine with >250 GB of free memory and on a machine with <50 GB of free memory, with the same results. If the allocation is lowered to 34255421416 bytes, the exception is not thrown.

-bash-4.2$ cat test.cpp
/* Hello World program */
#include <iostream>
#include <stdio.h>
#include <unistd.h>

using namespace std;

typedef struct test test;
struct test {
    char arr[34255421417] = {};
};

int main() {
    cout << "Before Allocation\n";
    try {
        test *t1 = new test;
    } catch (const std::exception &exc) {
        std::string exception = exc.what();
        std::cerr << "Caught exception: " + exception + "\n";
    }
    cout << "After Allocation\n";
    cin.get();
    return 0;
}
-bash-4.2$ g++ -std=c++11 test.cpp
-bash-4.2$ valgrind --tool=cachegrind ./a.out
==43814== Cachegrind, a cache and branch-prediction profiler
==43814== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
==43814== Using Valgrind-3.10.0 and LibVEX; rerun with -h for copyright info
==43814== Command: ./a.out
==43814==
--43814-- warning: L3 cache found, using its data for the LL simulation.
Before Allocation
Caught exception: std::bad_alloc
After Allocation
==43814==
==43814== I   refs:      1,477,163
==43814== I1  misses:        1,756
==43814== LLi misses:        1,683
==43814== I1  miss rate:      0.11%
==43814== LLi miss rate:      0.11%
==43814==
==43814== D   refs:        499,289  (371,405 rd + 127,884 wr)
==43814== D1  misses:       12,494  ( 10,762 rd +   1,732 wr)
==43814== LLd misses:        7,472  (  6,209 rd +   1,263 wr)
==43814== D1  miss rate:       2.5% (    2.8%   +     1.3%  )
==43814== LLd miss rate:       1.4% (    1.6%   +     0.9%  )
==43814==
==43814== LL refs:          14,250  ( 12,518 rd +   1,732 wr)
==43814== LL misses:         9,155  (  7,892 rd +   1,263 wr)
==43814== LL miss rate:        0.4% (    0.4%   +     0.9%  )
-bash-4.2$ ./a.out
Before Allocation
After Allocation
-bash-4.2$ free -g
              total        used        free      shared  buff/cache   available
Mem:            251           2          33           0         215         248
Swap:             3           0           3
-bash-4.2$ valgrind --version
valgrind-3.10.0
Yes, there is a compiled-in memory limit imposed by the addressing scheme valgrind uses for its shadow memory. See https://stackoverflow.com/questions/8644234/why-is-valgrind-limited-to-32-gb-on-64-bit-architectures for more information, including how to patch valgrind to allow larger address spaces.
In the trunk right now we have N_PRIMARY_BITS = 20, which according to the svn log makes the maximum usable memory 64 GB. That change was made at end-Jan 2013 and should surely be in 3.10 and later. Maybe we should bump this up to 21 bits, hence giving 128 GB of usable memory on 64-bit targets? It would slow down startup a bit, because that array needs to be zeroed out, and would soak up a bit more memory, but otherwise seems harmless. Presumably at some point we can outrun (the ever decelerating) Moore's law with this game ;-)
The primary_map array, I mean. I didn't mean the whole 128GB needs to be zeroed out at startup.
I fully agree. Server systems these days even have TBs of memory to play with. In addition to initializing the primary map, N_PRIMARY_BITS also comes into play in mc_expensive_sanity_check(). Hopefully it won't be a big deal.
I had hoped to do this for 3.12.0, but after looking at the #ifdef swamp in VG_(am_startup) that sets aspacem_maxAddr, I think it is too risky, because of the number of different cases that need to be verified. So I'd propose to leave it till after the release. The number of users that this will affect is tiny and those that really need it in 3.12.x can cherry pick the trunk commit into their own custom 3.12.x build, once we fix it on the trunk.
Created attachment 105548 [details] test case (allocgig.c)
Created attachment 105549 [details] Proposed fix (so far, Linux only)
Fails at 32 GB without the patch:

trying for 31 GB .. ==12078== Warning: set address range perms: large range [0x3960c040, 0x7f960c040) (defined)
.. OK
==12078== Warning: set address range perms: large range [0x3960c028, 0x7f960c058) (noaccess)
trying for 32 GB .. allocgig: allocgig.c:15: main: Assertion `p' failed.
==12078==
==12078== Process terminating with default action of signal 6 (SIGABRT)
==12078==    at 0x4E6E428: raise (raise.c:54)
==12078==    by 0x4E70029: abort (abort.c:89)
==12078==    by 0x4E66BD6: __assert_fail_base (assert.c:92)
==12078==    by 0x4E66C81: __assert_fail (assert.c:101)
==12078==    by 0x400700: main (in /home/tomh/allocgig)

and at 64 GB with the patch:

trying for 63 GB .. ==14020== Warning: set address range perms: large range [0x10060c4040, 0x1fc60c4040) (defined)
.. OK
==14020== Warning: set address range perms: large range [0x10060c4028, 0x1fc60c4058) (noaccess)
trying for 64 GB .. allocgig: allocgig.c:15: main: Assertion `p' failed.
==14020==
==14020== Process terminating with default action of signal 6 (SIGABRT)
==14020==    at 0x4E6E428: raise (raise.c:54)
==14020==    by 0x4E70029: abort (abort.c:89)
==14020==    by 0x4E66BD6: __assert_fail_base (assert.c:92)
==14020==    by 0x4E66C81: __assert_fail (assert.c:101)
==14020==    by 0x400700: main (in /home/tomh/allocgig)
Created attachment 105561 [details]
Proposed fix (so far, Linux and Solaris only)

Solaris changes.
Unfortunately I do not have a machine with >32 GB of physical memory on which I can install Solaris and try it out. Note that Solaris does not overcommit when allocating memory. The regression tests passed OK.
Solaris and Linux limit increased to 128GB in r16381. OSX is so far unchanged. Rhys, do you want to change OSX too? I think nothing will break if OSX isn't changed. So, as you like.
Closing, for now. If we need to change the OSX limits later then, well, fine.
Any changes for macOS can come later.