Bug 105039

Summary: long run-times probably due to memory manager
Product: [Developer tools] valgrind
Reporter: pete moceyunas <moce>
Component: memcheck
Assignee: Julian Seward <jseward>
Status: RESOLVED FIXED    
Severity: major
CC: liezielimbo92
Priority: NOR    
Version: 2.4   
Target Milestone: ---   
Platform: Compiled Sources   
OS: Linux   
Attachments:
  Tar file containing results and test program
  Simple test program for stressing memcheck's malloc/free implementation.

Description pete moceyunas 2005-05-03 18:50:26 UTC
The included test example shows a significant slowdown when run under Valgrind. The
example executes a series of malloc() and free() calls and not much else, so
it is likely the memory allocator has some nonlinear behavior.

The mm.log file was generated by a real application, which logs its malloc() and
free() calls to this file. The program mm_play reads the log file and plays back
the malloc() and free() calls to mimic the actual application's sequence and
amount of memory allocated.
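
For readers without the attachment, here is a minimal sketch of what a playback loop of this kind might look like. The log record format ("m <id> <size>" for a malloc, "f <id>" for a free), the option handling and the id table size are assumptions made purely for illustration; the real format is whatever mm_play.c defines.

/* Hypothetical playback loop (NOT the attached mm_play.c).
   Assumed log records: "m <id> <size>" for malloc, "f <id>" for free. */
#include <stdio.h>
#include <stdlib.h>

#define MAX_IDS 10000000        /* assumed upper bound on distinct block ids */

int main(int argc, char **argv)
{
    int skip_mm = 0;            /* stands in for mm_play's -n option */
    void **blocks = calloc(MAX_IDS, sizeof(void *));
    FILE *log = fopen("mm.log", "r");
    char op;
    unsigned long id, size;
    int i;

    for (i = 1; i < argc; i++)
        if (argv[i][0] == '-' && argv[i][1] == 'n')
            skip_mm = 1;

    if (!log || !blocks)
        return 1;

    while (fscanf(log, " %c %lu", &op, &id) == 2 && id < MAX_IDS) {
        if (op == 'm') {
            if (fscanf(log, "%lu", &size) != 1)
                break;
            if (!skip_mm)
                blocks[id] = malloc(size);
        } else if (op == 'f' && !skip_mm) {
            free(blocks[id]);
            blocks[id] = NULL;
        }
    }
    fclose(log);
    free(blocks);
    return 0;
}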

There are 2 experiments:
1. Run the playback, skipping the malloc() and free() calls but executing
everything else (reading the file and looping).
2. Run the playback, actually calling malloc() and free().

Here are the results:
Experiment	Version	CPU	Valgrind CPU	Ratio
#1	        v2.0.0	0.86	22.98	        26.7
#1	        v2.4.0	0.86	21.95	        25.5
#2	        v2.0.0	4.91	4348	        885.5
#2	        v2.4.0	4.1	20048	        4889.8

The first experiment shows the overhead of the non-malloc/free part of the
playback program. The second experiment shows that malloc/free contributes a
significant portion of the run-time during the playback.

The experiments using valgrind v2.4.0 were run on a dual-processor 2.79 GHz
machine with 3.78 GB of RAM, running Red Hat Linux 7.3, kernel 2.4.18-27.7.xsmp.
Comment 1 pete moceyunas 2005-05-03 19:52:10 UTC
Created attachment 10885 [details]
Tar file containing results and test program

Make.out.v2* - results of running mm_play under valgrind v2.0.0 and v2.4.0
mm_play.c - playback program
Makefile - Makefile that builds mm_play and reproduces the results
Comment 2 pete moceyunas 2005-05-03 20:00:29 UTC
Bugzilla won't let me attach the mm.log (playback) file because it is too large - compressed, it is 14 MB. I will email it to you upon request.
Comment 3 Nicholas Nethercote 2005-05-04 04:18:11 UTC
> Experiment	Version	CPU	Valgrind CPU	Ratio
> #1	        v2.0.0	0.86	22.98	        26.7
> #1	        v2.4.0	0.86	21.95	        25.5
> #2	        v2.0.0	4.91	4348	        885.5
> #2	        v2.4.0	4.1	20048	        4889.8


Can you explain this table?  What do the columns mean?
Comment 4 pete moceyunas 2005-05-04 07:44:38 UTC
Column #1: The experiment that was run.
    Experiment #1 runs mm_play with the -n option. In the Makefile this is the target "mmplay-it-malloc-disabled". This experiment shows the overhead of mm_play simply reading the file and executing everything except the actual malloc() and free() calls.
    Experiment #2 runs mm_play without -n, so malloc() and free() are actually called. In the Makefile this is the target "mmplay-it".

Column #2: Version of valgrind used
Column #3: CPU time in seconds running mm_play without valgrind
Column #4: CPU time in seconds running mm_play with valgrind
Column #5: Ratio of column #4 to column #3

Here's the table with better formatting (I hope):
                     
Experiment Version    CPU    ValgrindCPU   ValgrindCPU/CPU
#1         v2.0.0     0.86         23            26.7
#1         v2.4.0     0.86         22            25.5
#2         v2.0.0     4.91       4348           885.5
#2         v2.4.0     4.1        20048          4889.8 

Comment 5 Nicholas Nethercote 2005-05-09 19:34:08 UTC
I've had a look at this, and I think I understand what's going on.  Basically, we're
getting hammered by paging slowdowns when Valgrind plus the client are larger than physical memory.  It's not the memory allocator's fault.

I had a test program that does a lot of mallocs and frees.  Once the amount of memory allocated with malloc exceeded about 400MB, performance dropped precipitously as the page fault rate jumped.  This makes sense as my machine has 1GB of RAM, Memcheck more than doubles the client's memory usage due to the V bits, and then Valgrind needs some more for normal operation.

I tried Nulgrind on the same test -- no problem, it was very fast.  Same was true for Addrcheck.  So both of those corroborate the story.

The best way to address this would be to implement a compressed representation of V bits in memory, as we've discussed previously; this would reduce the
memory footprint significantly, resulting in much-reduced paging.
Comment 6 pete moceyunas 2005-05-10 17:51:33 UTC
I think in the mm_play test case there is something else going on. My machine has 4 GB of RAM and the process seemed to hold steady around 1.8 GB (reported by top), so I don't believe it is paging. Further, running --tool=addrcheck takes a significant time (20K seconds) and its size is only 490 MB.

Running valgrind on mm_play with --tool=none does run fast; it only takes 18 seconds.
Comment 7 Nicholas Nethercote 2005-05-10 20:51:28 UTC
Hmm, I just realised the --tool=none case is misleading, because in that 
case Valgrind uses the native malloc/free.

I won't have time to look at this any time soon;  perhaps you could do 
some more digging, Pete?  You could try pulling Valgrind's allocator out 
into a user-space program and experiment with it.  That should tell us 
whether Valgrind's allocator has some inherently bad behaviour, or whether 
it's something about the interaction with shadow memory in 
Memcheck/Addrcheck that is the problem.  I would guess that the allocator 
itself is not the problem, since it seems to be paging that is causing the 
slowdown (as opposed to eg. slow algorithms in the allocator) but I could 
be wrong.

If you check out the Subversion repository, the allocator is in 
coregrind/m_mallocfree.c.  You'll need to write some scaffolding code to 
use it in a user-space program, but it shouldn't be hard.  Any results you 
can report about this would be much appreciated.
Comment 8 Julian Seward 2005-05-17 04:03:00 UTC
Created attachment 11067 [details]
Simple test program for stressing memcheck's malloc/free implementation.
Comment 9 Julian Seward 2005-05-17 04:05:22 UTC
I did some snooping around the 3.0 line.  I did discover that
memcheck behaves poorly when there are a huge number of
blocks simultaneously live -- I tested with 500k and 1 million.
See the attached program.  NLIVE controls the number of simultaneously
live blocks.  NITERS controls the total amount of work (# mallocs
and frees).  Interestingly, as you jack up NLIVE through 500k, 1M, 
etc, the speed of the program basically halves each time, despite
the total amount of malloc/free work staying the same.
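
The attached program is not reproduced here; the following is a rough sketch of a stress test of the kind described, with NLIVE simultaneously live blocks and NITERS total mallocs and frees. The 16 byte block size and the round-robin replacement policy are assumptions, not the attached code.

/* Rough sketch of a malloc/free stress test along the lines described above
   (NOT the attached program).  NLIVE blocks are kept live at all times;
   NITERS controls the total amount of malloc/free work performed. */
#include <stdlib.h>

#define NLIVE  1000000
#define NITERS 10000000

int main(void)
{
    void **live = calloc(NLIVE, sizeof(void *));
    long i;
    if (!live) return 1;

    for (i = 0; i < NITERS; i++) {
        long slot = i % NLIVE;         /* round-robin replacement (assumed) */
        free(live[slot]);              /* free(NULL) is a no-op on the first pass */
        live[slot] = malloc(16);       /* small blocks, as in mm_play's 8/16 byte pattern */
    }
    for (i = 0; i < NLIVE; i++)
        free(live[i]);
    free(live);
    return 0;
}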

I traced this to the hashtable in coregrind/vg_hashtable.c (2.4.0)
or m_hashtable (3.0 line).  Changing the #define of VG_N_CHAINS
from 4999 to 80021 (also a prime number) reduces the run time of
the program when NLIVE is 1M from 315 seconds to 125 seconds.
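
To make the arithmetic behind that change concrete: in a fixed-size separately chained table, the average chain length is roughly the number of live blocks divided by the number of chains, and every malloc/free lookup walks such a chain. The sketch below is illustrative only; the node layout and hash function are simplified stand-ins for the real vg_hashtable.c code.

/* Illustrative fixed-size chained hash table in the style of vg_hashtable.c;
   the node fields and the hash function are simplified stand-ins. */
#include <stdio.h>
#include <stdlib.h>

#define VG_N_CHAINS 4999            /* the 2.4.0 value; 80021 in the experiment above */

typedef struct _Node {
    struct _Node *next;
    size_t        key;              /* block address */
} Node;

static Node *chains[VG_N_CHAINS];

static void insert(size_t key)
{
    Node *n = malloc(sizeof(Node));
    n->key  = key;
    n->next = chains[key % VG_N_CHAINS];
    chains[key % VG_N_CHAINS] = n;
}

static Node *lookup(size_t key)
{
    /* Average cost is (live blocks) / VG_N_CHAINS comparisons per call:
       roughly 1,000,000 / 4999 ~= 200 with the old value, versus
       1,000,000 / 80021 ~= 12.5 with the larger table. */
    Node *n = chains[key % VG_N_CHAINS];
    while (n && n->key != key)
        n = n->next;
    return n;
}

int main(void)
{
    size_t k;
    for (k = 0; k < 1000000; k++)   /* a million live "blocks", as in the test above */
        insert(k * 16);
    printf("found: %d\n", lookup(16 * 123456) != NULL);
    return 0;
}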

I would be interested to hear if this fixes the problem for you.
Comment 10 Nicholas Nethercote 2005-05-17 05:26:45 UTC
Hmm, interesting.  Well found.  I see you've increased the size of that 
hashtable type;  a better fix would be to replace it with an expanding 
table that uses linear probing rather than separate chaining.  Being able 
to specify an initial size would be useful too, for different situations.
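
A minimal sketch of the kind of expanding, linear-probing table being suggested; this is not Valgrind code, and the growth factor, load-factor threshold and hash are arbitrary choices for illustration.

/* Open-addressing (linear probing) table that doubles when 3/4 full.
   Keys are block addresses; 0 is reserved as the "empty slot" marker,
   and the table size is kept a power of two. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    size_t *slots;
    size_t  n_slots;    /* always a power of two */
    size_t  n_used;
} Table;

static void tbl_insert(Table *t, size_t key);   /* forward decl for tbl_grow() */

static void tbl_init(Table *t, size_t initial_pow2)
{
    t->n_slots = initial_pow2;      /* an initial size could be specified per use */
    t->n_used  = 0;
    t->slots   = calloc(initial_pow2, sizeof(size_t));
}

static void tbl_grow(Table *t)
{
    size_t *old = t->slots, old_n = t->n_slots, i;
    t->n_slots *= 2;
    t->n_used   = 0;
    t->slots    = calloc(t->n_slots, sizeof(size_t));
    for (i = 0; i < old_n; i++)
        if (old[i] != 0)
            tbl_insert(t, old[i]);
    free(old);
}

static void tbl_insert(Table *t, size_t key)
{
    size_t i;
    if (4 * (t->n_used + 1) > 3 * t->n_slots)   /* keep the load factor under 3/4 */
        tbl_grow(t);
    i = (key / 8) & (t->n_slots - 1);           /* addresses are 8-byte aligned */
    while (t->slots[i] != 0)
        i = (i + 1) & (t->n_slots - 1);         /* linear probe */
    t->slots[i] = key;
    t->n_used++;
}

int main(void)
{
    Table t;
    size_t a;
    tbl_init(&t, 1024);
    for (a = 0x1000; a < 0x1000 + 8 * 100000; a += 8)
        tbl_insert(&t, a);                      /* table grows as needed */
    printf("%zu slots holding %zu keys\n", t.n_slots, t.n_used);
    free(t.slots);
    return 0;
}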

However, I wonder if this is the whole story, since I found that Addrcheck 
didn't suffer nearly as badly as Memcheck in some cases, and it shares the 
same hashtable structure.
Comment 11 Jeremy Fitzhardinge 2005-05-17 09:09:06 UTC
I was thinking about replacing the hashtable with a skiplist in mem/addrcheck, partly because it scales better, but also because it seems to be a better match for how that structure is used: a skiplist is kept sorted, and it allows non-exact matches, which would simplify the leakcheck phase.  Currently the leakcheck code has to create a sorted array from the hashtable in order to scan memory in address order and to find inner pointers.
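
For illustration, here is a pared-down, generic skiplist with the kind of "floor" / non-exact lookup described above -- finding the block whose start is the greatest address <= a given pointer, which is what resolving inner pointers needs. This is a textbook sketch, not a proposed Valgrind patch, and the node fields are placeholders.

/* Generic skiplist sketch: keys are kept in sorted order, and sl_floor()
   finds the greatest key <= addr -- the non-exact match needed to map an
   inner pointer back to the block containing it. */
#include <stdio.h>
#include <stdlib.h>

#define MAX_LEVEL 16

typedef struct Node {
    size_t key;                       /* block start address */
    size_t size;                      /* block size */
    struct Node *next[MAX_LEVEL];
} Node;

static Node head;                     /* sentinel: key 0, all next[] NULL */
static int  level = 1;

static int random_level(void)
{
    int lvl = 1;
    while (lvl < MAX_LEVEL && (rand() & 1))
        lvl++;
    return lvl;
}

static void sl_insert(size_t key, size_t size)
{
    Node *update[MAX_LEVEL], *x = &head;
    int i, lvl = random_level();

    for (i = MAX_LEVEL - 1; i >= 0; i--) {
        while (x->next[i] && x->next[i]->key < key)
            x = x->next[i];
        update[i] = x;
    }
    if (lvl > level)
        level = lvl;
    x = calloc(1, sizeof(Node));
    x->key  = key;
    x->size = size;
    for (i = 0; i < lvl; i++) {
        x->next[i] = update[i]->next[i];
        update[i]->next[i] = x;
    }
}

/* Greatest key <= addr, or NULL if addr precedes every block. */
static Node *sl_floor(size_t addr)
{
    Node *x = &head;
    int i;
    for (i = level - 1; i >= 0; i--)
        while (x->next[i] && x->next[i]->key <= addr)
            x = x->next[i];
    return x == &head ? NULL : x;
}

int main(void)
{
    Node *b;
    sl_insert(0x1000, 64);
    sl_insert(0x2000, 32);
    b = sl_floor(0x1010);             /* an inner pointer into the first block */
    if (b && 0x1010 < b->key + b->size)
        printf("0x1010 points inside the block at %#lx\n", (unsigned long)b->key);
    return 0;
}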
Comment 12 pete moceyunas 2005-05-18 01:25:55 UTC
The run-time of mm_play is not significantly improved when I rebuilt valgrind with #define VG_N_CHAINS 500009. The runtime of addrcheck was 20608 seconds. It is possible that there are still significant collisions in the table, but confirming that would require some modifications to track the chain lengths.
Comment 13 Julian Seward 2005-05-18 01:40:17 UTC
Bummer.

How hard would it be for you to produce a test example that 
doesn't require a machine with 3.7GB of memory to run?
I'd like to help but the most memory I have is 768M.
Comment 14 pete moceyunas 2005-05-18 01:48:49 UTC
In reply to Julian Seward:

If you run the addrcheck tool it only takes ~500 MB and takes just as long.
Comment 15 pete moceyunas 2005-05-19 09:10:31 UTC
I managed to build and profile a version of mm_play that calls the malloc/free/realloc APIs defined in vg_malloc2.c. The profile revealed that most of the time was spent searching for a large enough block on a given freelist. mm_play has a lot of 8 byte and 16 byte allocations, and I bet there were a lot of one or the other at the beginning of the list. If you change the bins such that 8 byte (payload size) and 16 byte chunks go into their own individual bins, then the huge run-time goes away. With the change, the test case that took 20,000+ seconds runs in 148 seconds. If you add in a larger hash table size of 500009, it comes down to 132 seconds.
The changes to vg_malloc2.c were:

#define VG_N_MALLOC_LISTS     18

and modifying the function:

static
UInt pszB_to_listNo ( SizeT pszB )
{
   assert(0 == pszB % VG_MIN_MALLOC_SZB);
   pszB /= VG_MIN_MALLOC_SZB;
   if (pszB == 0)   return 0;
   if (pszB == 1)   return 1;
   if (pszB == 2)   return 2;
   if (pszB <= 3)   return 3;
   if (pszB <= 4)   return 4;
   if (pszB <= 5)   return 5;
   if (pszB <= 6)   return 6;
   if (pszB <= 7)   return 7;
   if (pszB <= 8)   return 8;
   if (pszB <= 9)   return 9;
   if (pszB <= 10)  return 10;
   if (pszB <= 11)  return 11;
   if (pszB <= 12)  return 12;
   if (pszB <= 16)  return 13;
   if (pszB <= 32)  return 14;
   if (pszB <= 64)  return 15;
   if (pszB <= 128) return 16;
   return 17;
}

Of course, even with my patch, this bin strategy will have similar problems if a program chooses one of the larger bins and there are lots and lots of allocations of one or more sizes.
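
To see what the modified mapping does in practice, here is a small standalone harness around the function above. VG_MIN_MALLOC_SZB is assumed here to be 8, its usual value; with that, an 8 byte payload gets list 1 to itself and a 16 byte payload gets list 2 to itself, instead of sharing a list with many other sizes.

/* Standalone harness for the modified pszB_to_listNo().  VG_MIN_MALLOC_SZB
   is assumed to be 8; requested sizes are rounded up to a multiple of it
   before the lookup, as the allocator itself does. */
#include <stdio.h>
#include <assert.h>

#define VG_MIN_MALLOC_SZB 8
typedef unsigned long SizeT;
typedef unsigned int  UInt;

static
UInt pszB_to_listNo ( SizeT pszB )
{
   /* body exactly as in the patch above */
   assert(0 == pszB % VG_MIN_MALLOC_SZB);
   pszB /= VG_MIN_MALLOC_SZB;
   if (pszB == 0)   return 0;
   if (pszB == 1)   return 1;    /* 8 byte payloads: a list of their own */
   if (pszB == 2)   return 2;    /* 16 byte payloads: a list of their own */
   if (pszB <= 3)   return 3;
   if (pszB <= 4)   return 4;
   if (pszB <= 5)   return 5;
   if (pszB <= 6)   return 6;
   if (pszB <= 7)   return 7;
   if (pszB <= 8)   return 8;
   if (pszB <= 9)   return 9;
   if (pszB <= 10)  return 10;
   if (pszB <= 11)  return 11;
   if (pszB <= 12)  return 12;
   if (pszB <= 16)  return 13;
   if (pszB <= 32)  return 14;
   if (pszB <= 64)  return 15;
   if (pszB <= 128) return 16;
   return 17;
}

int main(void)
{
    SizeT sizes[] = { 8, 16, 24, 64, 100, 128, 1000 };
    int i;
    for (i = 0; i < 7; i++) {
        SizeT pszB = (sizes[i] + VG_MIN_MALLOC_SZB - 1)
                     & ~(SizeT)(VG_MIN_MALLOC_SZB - 1);
        printf("request %4lu -> payload %4lu -> list %u\n",
               sizes[i], pszB, pszB_to_listNo(pszB));
    }
    return 0;
}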
Comment 16 Julian Seward 2005-05-19 12:42:11 UTC
> On Thursday 19 May 2005 08:10, pete moceyunas wrote:
> If you change the bins for the memory such that 8 byte (payload size) and
> 16 byte chunks go into their own individual bins, then the huge run-time
> goes away.


Yowza!  It's not just with your program either.  The same change
reduces the startup/exit time for Qt designer on my old P3 from 564 to
418 seconds.  That run does about 3.9 million mallocs and the same
number of frees.  Well spotted -- thanks.  I'll commit this on the
3.0 line and I guess it should go in 2.4 just in case.

> Of course this even with my patch, this bin strategy will have similar
> problems if a program chooses one of the larger bins and there are lots and
> lots of allocations of one or more types.


That's true.  But a program doing large allocs/frees would
spend proportionally more time doing useful work with the  
allocated areas and so the problem would be less acute.

J
Comment 17 Nicholas Nethercote 2005-05-19 14:59:39 UTC
Wow!  Great work, Pete, thanks for tracking this down.
Comment 18 Nicholas Nethercote 2005-05-19 15:57:02 UTC
I'm closing this since it's now fixed.
Comment 19 Nicholas Nethercote 2005-08-30 03:10:38 UTC
Pete, can you email me the mm.log file, or put it on a website somewhere?  I want to test how an AVL tree compares against the current hash table for holding heap block information.  Thanks.