The included test example shows a significant slow down when using valgrind. The example executes a series of malloc() and free() calls and nothing much else, so it is likely the memory allocator has some nonlinear behavior. The mm.log file was generated by a real application - the real application logs the malloc() and free calls to this file. The program mm_play will read the logfile file and play-back the malloc() and free calls to mimic the actual applications sequence, and amount of memory allocated. There are 2 experiments: 1. Run the playback with skipping the malloc() and free() calls but executing everything else (reading the file and looping) 2. Run the playback and actually calling malloc() and free() Here are the results: Experiment Version CPU Valgrind CPU Ratio #1 v2.0.0 0.86 22.98 26.7 #1 v2.4.0 0.86 21.95 25.5 #2 v2.0.0 4.91 4348 885.5 #2 v2.4.0 4.1 20048 4889.8 The first experiment shows the overhead of the non-malloc/free part of the play back program. The second experiment results show that malloc/free contribute a significant portion of the run-time during the play-back. The experiments using valgrind v2.4.0 were run on a machine with Redhat Linux 7.3, kernel 2.4.18-27.7.xsmp, dual processor 2.791mhz machine with 3.78gig ram.
Created attachment 10885 [details] Tar file contain results and test program Make.out.v2* - results running mm_play using valgrind v200 and v240 mm_play.c - playback program Makefile - Make file that builds mm_play and reproduce results
Bugzilla won't let me attach the mm.log (playback) file because it is too large - Compressed it is 14meg. I will email it to you upon request.
> Experiment Version CPU Valgrind CPU Ratio > #1 v2.0.0 0.86 22.98 26.7 > #1 v2.4.0 0.86 21.95 25.5 > #2 v2.0.0 4.91 4348 885.5 > #2 v2.4.0 4.1 20048 4889.8 Can you explain this table? What do the columns mean?
Column #1: The experiment that was run. Experiment #1 runs the mmplay option with the -n option. In the Makefile this is target "mmplay-it-malloc-disabled". This experiments is used to show the overhead of the mmplay to simply read the file and execute everything except the actual malloc() and free() calls. Experiment #2 runs mm_play without -n, and malloc's and free's are actually called. In the Makefile this is target mmplay-it Column#2: Version of valgrind used Column#3: Cpu time in seconds running mmplay without valgrind Column#4: Cpu time in seconds running mmplay with valgrind Column#5: Ratio of column #4/#3 Here's the table with better formating (I hope): Experiment Version CPU ValgrindCPU ValgrindCPU/CPU #1 v2.0.0 0.86 23 26.7 #1 v2.4.0 0.86 22 25.5 #2 v2.0.0 4.91 4348 885.5 #2 v2.4.0 4.1 20048 4889.8
I've had a look at this, I think I understand what's going on. Basically, we're getting hammered by paging slowdowns when Valgrind+client are larger than the memory size. It's not the memory allocator's fault. I had a test program that does a lot of mallocs and frees. Once the amount of memory allocated with malloc exceeded about 400MB, performance dropped precipitously as the page fault rate jumped. This makes sense as my machine has 1GB of RAM, Memcheck more than doubles the client's memory usage due to the V bits, and then Valgrind needs some more for normal operation. I tried Nulgrind on the same test -- no problem, it was very fast. Same was true for Addrcheck. So both of those corroborate the story. The best way to address this would be to implement a compressed representation of V bits in memory, as we've discussed previously, as this would reduce the memory footprint significantly, resulting in much-reduced paging.
I think in the mm_play test case there is something else going on. My machine has 4gig RAM and the process seemed to hold steady around 1.8gig (reported by top), and I don't believe it to be paging. Further, running -tool=addrcheck takes a significant time (20K seconds) and its size is only 490meg. Running valgrind on mm_play with -tool=none does run fast, it only takes 18 seconds.
Hmm, I just realised the --tool=none case is misleading, because in that case Valgrind uses the native malloc/free. I won't have time to look at this any time soon; perhaps you could do some more digging, Pete? You could try pulling Valgrind's allocator out into a user-space program and experiment with it. That should tell us whether Valgrind's allocator has some inherently bad behaviour, or whether it's something about the interaction with shadow memory in Memcheck/Addrcheck that is the problem. I would guess that the allocator itself is not the problem, since it seems to be paging that is causing the slowdown (as opposed to eg. slow algorithms in the allocator) but I could be wrong. If you check out the Subversion repository, the allocator is in coregrind/m_mallocfree.c. You'll need to write some scaffolding code to use it in a user-space program, but it shouldn't be hard. Any results you can report about this would be much appreciated.
Created attachment 11067 [details] Simple test program for stressing memcheck's malloc/free implementation.
I did some snooping around the 3.0 line. I did discover that memcheck behaves poorly when there are a huge number of blocks simultaneously live -- I tested with 500k and 1 million. See the attached program. NLIVE controls the number of simultaneously live blocks. NITERS controls the total amount of work (# mallocs and frees). Interestingly, as you jack up NLIVE through 500k, 1M, etc, the speed of the program basically halves each time, despite the total amount of malloc/free work staying the same. I traced this to the hashtable in coregrind/vg_hashtable.c (2.4.0) or m_hashtable (3.0 line). Changing the #define of VG_N_CHAINS from 4999 to 80021 (also a prime number) reduces the run time of the program when NLIVE is 1M from 315 seconds to 125 seconds. I would be interested to hear if this fixes the problem for you.
Hmm, interesting. Well found. I see you've increased the size of that hashtable type; a better fix would be to replace it with an expanding table that uses linear probing rather than separate chaining. Being able to specify an initial size would be useful too, for different situations. However, I wonder if this is the whole story, since I found that Addrcheck didn't suffer nearly as badly as Memcheck in some cases, and it shares the same hashtable structure.
I was thinking about replacing the hashtable with a skiplist in mem/addrcheck. Partly because it scales better, but also because it seems to be a better match for how that structure is used. It is already sorted, and it allows non-exact matches, which would simplify the leakcheck phase. Currently the leakcheck code has to create a sorted array from the hashtable in order to be able to scan memory in address order and to be able to find inner pointers.
The run-time of mm_play() is not significantly improved when I rebuilt valgrind with #define VG_N_CHAINS 500009. The runtime of addrcheck was 20608 seconds. It is possible that there are still significant collisions in the table - but it will require some modifications to track the chain lengths.
Bummer. How hard would it be for you to produce a test example that doesn't require a machine with 3.7GB of memory to run? I'd like to help but the most memory I have is 768M.
Julian Seward wrote: [bugs.kde.org quoted mail] If you run the addrcheck tool it only takes ~500meg and takes just as long.
I managed to build and profile a version of mm_play that calls the malloc/free/realloc api's defined in vg_malloc2.c. The profile revealed that most of the time was spent searching for a large enough block on a given freelist. mm_play has a lot of 8 byte and 16 byte allocations and I bet that there were a lot of one or the other at the beginning of the list. If you change the bins for the memory such that 8 byte (payload size) and 16 byte chunks go into their own individual bins, then the huge run-time goes away. With the change the test case that took 20,000+ seconds runs in 148 seconds. If you add in a larger hash table size of 500009 then it brings it down to 132seconds. The changes to vg_malloc2.c were: #define VG_N_MALLOC_LISTS 18 and modifying the function: static UInt pszB_to_listNo ( SizeT pszB ) { assert(0 == pszB % VG_MIN_MALLOC_SZB); pszB /= VG_MIN_MALLOC_SZB; if (pszB == 0) return 0; if (pszB == 1) return 1; if (pszB == 2) return 2; if (pszB <= 3) return 3; if (pszB <= 4) return 4; if (pszB <= 5) return 5; if (pszB <= 6) return 6; if (pszB <= 7) return 7; if (pszB <= 8) return 8; if (pszB <= 9) return 9; if (pszB <= 10) return 10; if (pszB <= 11) return 11; if (pszB <= 12) return 12; if (pszB <= 16) return 13; if (pszB <= 32) return 14; if (pszB <= 64) return 15; if (pszB <= 128) return 16; return 17; } Of course this even with my patch, this bin strategy will have similar problems if a program chooses one of the larger bins and there are lots and lots of allocations of one or more types.
> On Thursday 19 May 2005 08:10, pete moceyunas wrote: > If you change the bins for the memory such that 8 byte (payload size) and > 16 byte chunks go into their own individual bins, then the huge run-time > goes away. Yowza! It's not just with your program either. The same change reduces the startup/exit time for Qt designer on my old P3 from 564 to 418 seconds. That run does about 3.9 million mallocs and the same number of frees. Well spotted -- thanks. I'll commit this on the 3.0 line and I guess it should go in 2.4 just in case. > Of course this even with my patch, this bin strategy will have similar > problems if a program chooses one of the larger bins and there are lots and > lots of allocations of one or more types. That's true. But a program doing large allocs/frees would spend proportionally more time doing useful work with the allocated areas and so the problem would be less acute. J
Wow! Great work, Pete, thanks for tracking this down.
I'm closing this since it's now fixed.
Pete, can you email me the mm.log file, or put it on a website somewhere? I want to test how an AVL tree compares against the current hash table for holding heap block information. Thanks.
*** Removed by KDE Sysadmin - Contained Commercial Advertising ***