Bug 282230 - memshrink: implement a group allocator for small fixed size objects, use it for MC_Chunk
Status: RESOLVED FIXED
Alias: None
Product: valgrind
Classification: Developer tools
Component: memcheck
Version: 3.7 SVN
Platform: Unlisted Binaries Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Julian Seward
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-09-17 17:14 UTC by Philippe Waroquiers
Modified: 2012-01-17 21:22 UTC
5 users

See Also:
Latest Commit:
Version Fixed In:


Attachments
group alloc implementation, use it for memcheck MC_Chunk (14.92 KB, text/plain)
2011-09-17 17:14 UTC, Philippe Waroquiers
patch for helgrind to use the new pub_tool_groupalloc.h to replace the local group allocator (6.46 KB, text/plain)
2011-09-20 21:51 UTC, Philippe Waroquiers
groupalloc for helgrind, memcheck MC_Chunk+SecVBitTable (35.50 KB, text/plain)
2011-09-24 20:53 UTC, Philippe Waroquiers
groupalloc for helgrind, memcheck MC_Chunk+SecVBitTable (upgraded to rev 12271) (33.61 KB, text/plain)
2011-11-17 21:36 UTC, Philippe Waroquiers
new interface for the pool alloc (134.68 KB, text/plain)
2011-12-29 01:28 UTC, Philippe Waroquiers
pool alloc for memcheck mc_chunk and sec vbit + helgrind (136.39 KB, text/plain)
2012-01-16 22:41 UTC, Philippe Waroquiers

Description Philippe Waroquiers 2011-09-17 17:14:53 UTC
Created attachment 63734 [details]
group alloc implementation, use it for memcheck MC_Chunk

The flexibility of m_mallocfree.c has a price in per-block overhead:
16 bytes on 32-bit platforms, 32 bytes on 64-bit platforms.
When many small fixed-size objects are allocated, this overhead
is significant, so allocating them in groups reduces memory usage
significantly.

Description:
1. pub_tool_groupalloc.h and m_groupalloc.c implement a group allocator
with reduced overhead on top of m_mallocfree.c
(this code is inspired by the libhb_core.c group allocator).
2. memcheck MC_Chunks are allocated using such a group allocator.
3. new argument 'freenode-fn' added to VG_(HT_destroy): the client of
   the hashtable is responsible for allocating the elements (so it can
   allocate them however it likes), but HT_destroy frees the elements
   using a hardcoded VG_(free), which is not ok when the elements were
   allocated with the group allocator.

Note: a possible followup is to replace the helgrind group allocator
with pub_tool_groupalloc.h. The only drawback of this is that the
alloc/free of the helgrind group allocator are marked inline, and the
pub_tool_groupalloc equivalents will not be inlined.

Results: perf/heap is both faster and uses significantly less memory.
Also tested on 64-bit firefox (together with the patch from bug 282105):
compared to trunk, a firefox startup opening google.com uses about
40MB less memory.
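For readers unfamiliar with the technique, the idea can be sketched roughly as follows. This is a hypothetical minimal version written for illustration; all names (PoolAlloc, pa_alloc, ...) are invented here and are not the actual m_groupalloc.c interface:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Sketch of a group allocator: fixed-size elements are carved out of
   large groups, so the per-block overhead of the underlying allocator
   is paid once per group instead of once per element. */
typedef struct Group { struct Group* next; } Group;

typedef struct {
   size_t elemSzB;      /* fixed element size (>= sizeof(void*)) */
   size_t nPerGroup;    /* elements carved out of each group     */
   Group* groups;       /* all groups, for final deallocation    */
   void*  freeList;     /* singly-linked list of free elements   */
} PoolAlloc;

static void pa_init(PoolAlloc* pa, size_t elemSzB, size_t nPerGroup)
{
   pa->elemSzB   = elemSzB < sizeof(void*) ? sizeof(void*) : elemSzB;
   pa->nPerGroup = nPerGroup;
   pa->groups    = NULL;
   pa->freeList  = NULL;
}

static void pa_add_group(PoolAlloc* pa)
{
   Group* g = malloc(sizeof(Group) + pa->nPerGroup * pa->elemSzB);
   char*  p = (char*)(g + 1);
   size_t i;
   g->next = pa->groups;
   pa->groups = g;
   /* Thread every element of the new group onto the free list. */
   for (i = 0; i < pa->nPerGroup; i++, p += pa->elemSzB) {
      *(void**)p = pa->freeList;
      pa->freeList = p;
   }
}

static void* pa_alloc(PoolAlloc* pa)
{
   void* p;
   if (pa->freeList == NULL) pa_add_group(pa);
   p = pa->freeList;
   pa->freeList = *(void**)p;      /* O(1) pop */
   return p;
}

static void pa_free(PoolAlloc* pa, void* p)
{
   *(void**)p = pa->freeList;      /* O(1) push back on the free list */
   pa->freeList = p;
}

static void pa_destroy(PoolAlloc* pa)
{
   /* Groups are only returned to the underlying allocator here. */
   while (pa->groups) {
      Group* g = pa->groups;
      pa->groups = g->next;
      free(g);
   }
}
```

Note that alloc and free never touch the underlying allocator except when a new group is needed, which is where the per-block overhead saving comes from.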
Comment 1 Philippe Waroquiers 2011-09-20 21:51:28 UTC
Created attachment 63806 [details]
patch for helgrind to use the new pub_tool_groupalloc.h to replace the local group allocator

Additional patch to have helgrind using the group allocator.
Comment 2 Julian Seward 2011-09-21 11:10:41 UTC
Philippe, none of the new files are in the patch.  You need to do 
'svn add' of them in your local tree, before svn diff will show them.
Can you redo the patch?
Comment 3 Philippe Waroquiers 2011-09-21 13:44:26 UTC
(In reply to comment #2)
> Philippe, none of the new files are in the patch.  You need to do 
> 'svn add' of them in your local tree, before svn diff will show them.
> Can you redo the patch?
I was not clear enough in my earlier comment.
There are two patches attached to this bug:
* first patch introduces the group allocator files, and uses them for MC_Chunk.
* second patch is just the part adding the usage of group allocator
  for helgrind. It can be applied as another commit (following the first patch).
Comment 4 Philippe Waroquiers 2011-09-24 20:53:32 UTC
Created attachment 63933 [details]
groupalloc for helgrind, memcheck MC_Chunk+SecVBitTable

* new files include/pub_tool_groupalloc.h and coregrind/m_groupalloc.c
  implementing a group allocator (based on helgrind group alloc).
* include/Makefile.am coregrind/Makefile.am : added pub_tool_groupalloc.h
  and m_groupalloc.c
* helgrind/libhb_core.c : use pub_tool_groupalloc.h/m_groupalloc.c
  instead of the local implementation.
* include/pub_tool_oset.h coregrind/m_oset.c : allow to set a group allocator
  to allocate and free the OSet nodes.
* memcheck/tests/unit_oset.c drd/tests/unit_bitmap.c : modified
  so that it compiles with the new m_oset.c
* memcheck/mc_main.c : use group alloc for MC_Chunk
  memcheck/mc_include.h : declare the MC_Chunk group alloc
* memcheck/mc_main.c : use group alloc for the nodes of the secVBitTable OSet
* lackey/lk_main.c coregrind/m_translate.c
  fixed buffer overrun due to too short char buffers
  when using VG_(percentify)
* coregrind/m_sparsewa.c : modified code counting the nr of elements
  in use to use the nInUse component instead of doing the count.
* include/pub_tool_hashtable.h coregrind/m_hashtable.c : pass the free node
  function in the VG_(HT_destruct).
  (needed as the hashtable user can allocate a node with its own alloc,
  the hash table destroy must be able to free the nodes with the user
  own free).
* coregrind/m_gdbserver/m_gdbserver.c : pass free function to VG_(HT_destruct)
* memcheck/mc_replace_strmem.c memcheck/mc_machine.c
  memcheck/mc_malloc_wrappers.c memcheck/mc_leakcheck.c
  memcheck/mc_errors.c memcheck/mc_translate.c : new include needed
  due to group alloc.
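The VG_(HT_destruct) change above can be illustrated with a minimal sketch. The types and names below (HtNode, ht_destruct, FreeNodeFn, counting_free) are invented stand-ins, not the real m_hashtable.c code:

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch: the table walks its chains but delegates the actual freeing
   to a caller-supplied function, so nodes allocated from a pool/group
   allocator can be returned to that allocator instead of VG_(free). */
typedef struct HtNode { struct HtNode* next; unsigned key; } HtNode;
typedef void (*FreeNodeFn)(void* node);

static void ht_destruct(HtNode** chains, unsigned nChains,
                        FreeNodeFn freeNode)
{
   unsigned i;
   for (i = 0; i < nChains; i++) {
      HtNode* n = chains[i];
      while (n) {
         HtNode* next = n->next;
         freeNode(n);          /* caller-supplied, not hardwired free() */
         n = next;
      }
      chains[i] = NULL;
   }
}

/* Example free function that just counts calls and uses malloc's free. */
static int nFreed = 0;
static void counting_free(void* p) { nFreed++; free(p); }
```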
Comment 5 Florian Krohm 2011-09-30 03:03:28 UTC
Your latest patch applies cleanly on top of r12071. I regtested it on x86 (Ubuntu 10.10) and on s390x (z900). No new regressions.
I did not make any performance measurements, though.
Comment 6 Philippe Waroquiers 2011-09-30 06:07:46 UTC
(In reply to comment #5)
> Your latest patch applies cleanly on top of r12071. I regtested it on x86
> (Ubuntu 10.10) and on s390x (z900). No new regressions.
> I did not make any performance measurements, though.

Thanks for this validation work.

The patch mostly slightly increases performance,
except for fbench (slight decrease) and heap (big increase).
Details below (trunk of a few days ago, measured on a Xeon X3480 @ 3.07GHz).

In terms of memory, for the tool arena,
perf/heap uses 71 MB less out of 157 MB, perf/bz2 uses 4 MB less out of 8 MB.
(Other arenas are not influenced by this patch.)


Note that the 'perf/vg_perf' measurements are not always very "stable".
I obtained the least variation by setting the max clock speed (using the
"performance" governor), by forcing the tests to run on one
single cpu (using "taskset"), and by using the --reps argument.
(But even with this setup, I e.g. once obtained better performance for fbench,
and twice lower performance.) So for very small differences, --reps=10
does not seem good enough to eliminate the variations.


taskset 4 perl perf/vg_perf --tools=memcheck --vg=../trunk_untouched  --vg=../sec_vbits_group_alloc --reps=10 perf
-- Running  tests in perf ----------------------------------------------
-- bigcode1 --
bigcode1 trunk_untouched:0.11s  me: 3.9s (35.6x, -----)
bigcode1 sec_vbits_group_alloc:0.11s  me: 3.9s (35.5x,  0.3%)
-- bigcode2 --
bigcode2 trunk_untouched:0.11s  me:10.0s (91.3x, -----)
bigcode2 sec_vbits_group_alloc:0.11s  me:10.0s (90.9x,  0.4%)
-- bz2 --
bz2      trunk_untouched:0.64s  me: 6.7s (10.5x, -----)
bz2      sec_vbits_group_alloc:0.64s  me: 6.6s (10.3x,  1.8%)
-- fbench --
fbench   trunk_untouched:0.23s  me: 4.2s (18.5x, -----)
fbench   sec_vbits_group_alloc:0.23s  me: 4.3s (18.6x, -0.7%)
-- ffbench --
ffbench  trunk_untouched:0.21s  me: 3.4s (16.1x, -----)
ffbench  sec_vbits_group_alloc:0.21s  me: 3.4s (16.0x,  0.3%)
-- heap --
heap     trunk_untouched:0.11s  me:10.4s (94.3x, -----)
heap     sec_vbits_group_alloc:0.11s  me: 7.9s (71.5x, 24.2%)
-- sarp --
sarp     trunk_untouched:0.01s  me: 2.4s (236.0x, -----)
sarp     sec_vbits_group_alloc:0.01s  me: 2.4s (235.0x,  0.4%)
-- tinycc --
tinycc   trunk_untouched:0.16s  me:10.4s (64.9x, -----)
tinycc   sec_vbits_group_alloc:0.16s  me:10.2s (63.7x,  1.9%)
-- Finished tests in perf ----------------------------------------------

== 8 programs, 16 timings =================
Comment 7 Julian Seward 2011-09-30 08:00:08 UTC
Looks good to me.  I'll try to work through it in the next day or so.

There are some bits I don't yet understand:

* why is there a new global variable, GroupAlloc* MC_(chunk_groupalloc);
  Any way to make it non-global?
  Or at least visible only in mc_malloc_wrappers.c?

* I didn't yet understand the interface changes for OSet or 
  VgHashTable.

* removal of the popcount loop in m_sparsewa, that needs to be
  committed separately; it is logically independent from this
  work.
Comment 8 Philippe Waroquiers 2011-09-30 10:20:21 UTC
(In reply to comment #7)
> Looks good to me.  I'll try to work through it in the next day or so.
> 
> There's some bits I don't yet understand:
> 
> * why is there a new global variable, GroupAlloc* MC_(chunk_groupalloc);
>   Any way to make it non-global?
>   Or at least visible only in mc_malloc_wrappers.c?

The group allocator must be created at initialisation time.
It is initialised together with, and similarly to, e.g. MC_(malloc_list).
It looks like a good idea to make it static inside mc_malloc_wrappers.c.
For this, we could add a function MC_(init_malloc_wrappers)() that does
the allocation of MC_(malloc_list), MC_(mempool_list) and the new
MC_(chunk_groupalloc) (which can then become static in mc_malloc_wrappers.c).


> 
> * I didn't yet understand the interface changes for OSet or 
>   VgHashTable.
* OSet: basically, it allows setting a group allocator in an OSet*,
so that the OSet will allocate its nodes using this group allocator.

* Hash table: this avoids the problem created when the nodes of
a hash table are allocated with a certain allocation technique (e.g.
in an arena X) and then the hash table is destroyed using
VG_(HT_destruct). VG_(HT_destruct) was hard-coding that nodes are freed
with VG_(free), which does not necessarily match the way the nodes were
allocated for this hashtable.


> * removal of the popcount loop in m_sparsewa, that needs to be
>   committed separately; it is logically independent from this
>   work.
Indeed.
Comment 9 Philippe Waroquiers 2011-11-17 21:36:49 UTC
Created attachment 65801 [details]
groupalloc for helgrind, memcheck MC_Chunk+SecVBitTable (upgraded to rev 12271)

(same description as comment 4, except that the sparse wa improvement
is no longer part of this patch, as it is logically unrelated).

* new files include/pub_tool_groupalloc.h and coregrind/m_groupalloc.c
  implementing a group allocator (based on helgrind group alloc).
* include/Makefile.am coregrind/Makefile.am : added pub_tool_groupalloc.h
  and m_groupalloc.c
* helgrind/libhb_core.c : use pub_tool_groupalloc.h/m_groupalloc.c
  instead of the local implementation.
* include/pub_tool_oset.h coregrind/m_oset.c : allow to set a group allocator
  to allocate and free the OSet nodes.
* memcheck/tests/unit_oset.c drd/tests/unit_bitmap.c : modified
  so that it compiles with the new m_oset.c
* memcheck/mc_main.c : use group alloc for MC_Chunk
  memcheck/mc_include.h : declare the MC_Chunk group alloc
* memcheck/mc_main.c : use group alloc for the nodes of the secVBitTable OSet
* lackey/lk_main.c coregrind/m_translate.c
  fixed buffer overrun due to too short char buffers
  when using VG_(percentify)
* include/pub_tool_hashtable.h coregrind/m_hashtable.c : pass the free node
  function in the VG_(HT_destruct).
  (needed as the hashtable user can allocate a node with its own alloc,
  the hash table destroy must be able to free the nodes with the user
  own free).
* coregrind/m_gdbserver/m_gdbserver.c : pass free function to VG_(HT_destruct)
* memcheck/mc_replace_strmem.c memcheck/mc_machine.c
  memcheck/mc_malloc_wrappers.c memcheck/mc_leakcheck.c
  memcheck/mc_errors.c memcheck/mc_translate.c : new include needed
  due to group alloc.
Comment 10 Julian Seward 2011-12-27 12:14:53 UTC
(In reply to comment #9)
> Created an attachment (id=65801) [details]
> groupalloc for helgrind, memcheck MC_Chunk+SecVBitTable (upgraded to rev 12271)
> 
> (same description as comment 4, except that the sparse wa improvement
> is not done as part of this patch, as there is no link).

The sparse wa stuff is already committed, yes?

---------

Comments on the patch:

Almost OK.  Still a bit worried about the changes to m_oset interface,
but maybe ok.  Specific points:


-      Char percbuf[6];
+      Char percbuf[7];

Please commit this separately -- +1 for fixing it, but it is not
logically part of this patch.


naming:
  VG_(Oset_XXX) -> VG_(OSet_XXX) to be consistent w/ existing fns


include/pub_tool_oset.h should not include pub_tool_groupalloc.h


+extern void  VG_(Oset_SetGroupAllocator) (OSet* os, GroupAlloc* node_group_alloc);
+// VG_(Oset_GroupAllocatorNodeElemSzB) returns the element size needed for

pls add a blank line here!


+// VG_(Oset_GroupAllocatorNodeElemSzB) returns the element size needed for
+// the nodes in the group allocator so that fixed size payload for the OSet
+// elements in elemSize.

not sure what this means.  (some part of the sentence missing?)


+extern void  VG_(Oset_SetGroupAllocator) (OSet* os, GroupAlloc* node_group_alloc);
+extern SizeT VG_(Oset_GroupAllocatorNodeElemSzB) (OSet* os, SizeT elemSize);

should these be OSetGen_ rather than OSet_ ?


--- coregrind/m_groupalloc.c    (revision 0)

Add the standard copyright notice, and include yourself on the
list of copyright holders/names.
Comment 11 Bart Van Assche 2011-12-27 12:43:57 UTC
Some further comments:
- Why the name "group allocator"? It seems to me that the rest of the world uses the name "block allocator" instead of "group allocator".
- Making an OSet the owner of a group allocator / block allocator object looks wrong to me. These should be allocated and freed outside of the OSet object.
- I'm not enthusiastic about the coupling between OSet objects and the block allocator implementation. Personally I would prefer an interface similar to what's below (and with better names) for telling the OSet implementation to use a block allocator:
typedef void* (*OSetAlloc2_t)       ( HChar* cc, SizeT szB, void* allocator_data );
typedef void  (*OSetFree2_t)        ( void* p, void* allocator_data );
extern OSet* VG_(OSetGen_Create2)   ( PtrdiffT keyOff, OSetCmp_t cmp,
                                      OSetAlloc2_t alloc, HChar* cc,
                                      OSetFree2_t _free, void *allocator_data );
Comment 12 Bart Van Assche 2011-12-27 13:05:45 UTC
Another note: have you noticed that there is already a block allocator implementation present in the Valgrind source tree, one that is used for allocating OSet nodes? See also drd/drd_bitmap2_node.c.
Comment 13 Julian Seward 2011-12-27 18:52:14 UTC
(In reply to comment #11)
> - Making an OSet owner of a group allocator / block allocator object looks
> wrong to me. These should be allocated and freed outside of the OSet object.
> [...]

Yeah, I'm not entirely happy about the changes to the OSet interface either,
but that said I don't completely understand them.  I'll study them more this
evening and comment more.
Comment 14 Philippe Waroquiers 2011-12-27 21:23:29 UTC
(In reply to comment #10)
> The sparse wa stuff is already committed, yes?
Yes, committed as rev 12272.

> -      Char percbuf[6];
> +      Char percbuf[7];
> Please commit this separately -- +1 for fixing it, but it is not
> logically part of this patch.
Committed as rev 12322.


... Will handle all minor comments (blank lines, naming, copyright, etc)


> include/pub_tool_oset.h should not include pub_tool_groupalloc.h
This is the biggest problem (also reported by Bart in comment 11).
I guess the approach suggested by Bart will solve this dependency problem.
Comment 15 Philippe Waroquiers 2011-12-27 21:46:55 UTC
(In reply to comment #11)
> Some further comments:
> - Why the name "group allocator" ? It seems to me that the rest of the world
> uses the name "block allocator" instead of "group allocator".
I started from the group allocator in helgrind.
So, Valgrind 3.7.0 currently has two "specialised" group/block allocators
(one in helgrind, one in drd).
Fine for me to use the name "block allocator" (unless Julian has a strong
preference for the helgrind 'group allocator' name).

> - Making an OSet owner of a group allocator / block allocator object looks
> wrong to me. These should be allocated and freed outside of the OSet object.
Giving ownership of the block allocator to the OSet allows the OSet
to destroy itself very efficiently and consistently
(see VG_(OSet_Destroy_OSet_And_GroupAlloc)).

The alternative would be to have a VG_(OSet_Destroy_but_dont_free_the_nodes)
followed or preceded by a call to the group allocator's destroy.
I find the ownership approach cleaner/safer.


> - I'm not enthusiast about the coupling between OSet objects and the block
> allocator implementation. Personally I would prefer an interface similar to
> what's below (and with better names) for telling the OSet implementation to use
> a block allocator:
> typedef void* (*OSetAlloc2_t)       ( HChar* cc, SizeT szB, void*
> allocator_data );
> typedef void  (*OSetFree2_t)        ( void* p, void* allocator_data );
> extern OSet* VG_(OSetGen_Create2)   ( PtrdiffT keyOff, OSetCmp_t cmp,
>                                       OSetAlloc2_t alloc, HChar* cc,
>                                       OSetFree2_t _free, void *allocator_data
> );
Yes, I think it will avoid including pub_tool_groupalloc.h
in multiple files (in particular, some files which do not need a group allocator
but become indirectly dependent on it).
Unless Julian's further study of the current patch results in an alternative
suggestion, I will rework the patch based on your suggestion.
Comment 16 Philippe Waroquiers 2011-12-27 21:50:07 UTC
(In reply to comment #12)
> Another note: have you noticed that there is already a block allocator
> implementation present in the Valgrind source tree, a block allocator that is
> used for allocating OSet nodes ? See also drd/drd_bitmap2_node.c.
I was not aware of this block allocator. From what I can see, it should be
possible to have a common block allocator, used instead of:
    the helgrind group allocator
    the drd drd_bitmap2_node.c block allocator
and used for
    the OSet of MC_Chunk.

Philippe
Comment 17 Julian Seward 2011-12-27 22:33:52 UTC
(In reply to comment #11)
> Some further comments:
> - Why the name "group allocator" ? It seems to me that the rest of the world
> uses the name "block allocator" instead of "group allocator".

I only used "group allocator" for the helgrind hackery because I did not know
the de-facto name was "block allocator".  I think perhaps the rest of the
world also uses "pool allocator"?  Maybe we can use "pool" ?  Although I
don't feel strongly, so "block" is also ok by me.


> - I'm not enthusiast about the coupling between OSet objects and the block
> allocator implementation. Personally I would prefer an interface similar to
> what's below (and with better names) for telling the OSet implementation to use
> a block allocator:
> typedef void* (*OSetAlloc2_t)       ( HChar* cc, SizeT szB, void*
> allocator_data );
> typedef void  (*OSetFree2_t)        ( void* p, void* allocator_data );
> extern OSet* VG_(OSetGen_Create2)   ( PtrdiffT keyOff, OSetCmp_t cmp,
>                                       OSetAlloc2_t alloc, HChar* cc,
>                                       OSetFree2_t _free, void *allocator_data
> );

I see two usage scenarios.  In the first, I say "please create me an OSet,
here are the allocation and free functions and the cc tag.  No group allocator."

Second scenario is the same as the first, except "please use a group allocator
with groups of size (eg) 1000".

This could easily be achieved thusly:

// (these are unchanged from present)
typedef void* (*OSetAlloc_t)       ( HChar* cc, SizeT szB );
typedef void  (*OSetFree_t)        ( void* p );

extern OSet* VG_(OSetGen_Create)    ( PtrdiffT keyOff, OSetCmp_t cmp,
                                      OSetAlloc_t alloc, HChar* cc,
                                      OSetFree_t _free,
                                      SizeT groupSize );  <---- the only change

When groupSize == 0, we have the first scenario (no group allocator).  Else
we have the second scenario (group allocator), with the underlying allocator
for the group being 'alloc'/'free'/'cc'.

This makes it very easy for users of OSet to switch to group allocation
(no need to understand anything about the group allocator interface)
and at the same time it means that the interface (headers) for OSet does
not depend on the interface for the group allocator (the implementation does,
but that's OK).

Bart -- your suggestion gives the maximum flexibility -- but it means users of
OSet now have to know about how to use the group allocator, and I just don't
think we need this level of flexibility right now.  I'd rather have a simpler
interface and let OSet's implementation depend on the group allocator 
implementation.
Comment 18 Julian Seward 2011-12-27 22:42:26 UTC
There's one other thing that concerns me, though.  Maybe I misunderstand ..

The description of AllocNode (pre-patch) is

// * AllocNode: Allocate and zero memory for a node to go into the OSet.
//   Uses the alloc function given to VG_(OSetGen_Create)() to allocated a
//   node which is big enough for both an element and the OSet metadata.
//   Not all elements in one OSet have to be the same size.

It's this last sentence .. doesn't that give a problem when using a 
group allocator that assumes fixed size objects?  I don't see how this
is compatible with using a group allocator.  Maybe I missed something
(I hope).
Comment 19 Philippe Waroquiers 2011-12-27 23:03:14 UTC
(In reply to comment #17)

> I only used "group allocator" for the helgrind hackery because I did not know
> the de-facto name was "block allocator".  I think perhaps the rest of the
> world also uses "pool allocator"?  Maybe we can use "pool" ?  Although I
> don't feel strongly, so "block" is also ok by me.
Ok for pool allocator (it is consistent with e.g. mempool naming convention
in memcheck).


> extern OSet* VG_(OSetGen_Create)    ( PtrdiffT keyOff, OSetCmp_t cmp,
>                                       OSetAlloc_t alloc, HChar* cc,
>                                       OSetFree_t _free,
>                                       SizeT groupSize );  <---- the only change
> 
> When groupSize == 0, we have the first scenario (no group allocator).  Else
> we have the second scenario (group allocator), with the underlying allocator
> for the group being 'alloc'/'free'/'cc'.
Looks simple and elegant to me. It however does not cover one use case
(which I believe is needed for drd):
   having a collection of OSets which share a common pool allocator.
Without being able to share a pool allocator, the OSets in the collection
might have a significant overhead of unused memory in their non-shared pool
allocators, either because an OSet never had many nodes
or because many nodes have been removed: the pool allocator memory cannot be
recovered unless the OSet is destroyed.

=> I think we need an interface which allows controlling the sharing of a
pool allocator between multiple OSets.
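The sharing requirement could be met with a simple reference count on the pool, along these lines. This is an invented sketch; SharedPool, pool_attach and pool_detach are not real Valgrind names, and the pool's actual allocation state is elided:

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch: the pool keeps a count of its users (OSets), and its memory
   is released only when the last user detaches. */
typedef struct {
   int nUsers;          /* OSets currently sharing this pool       */
   /* pool state (chunk list, free list, ...) elided for brevity   */
} SharedPool;

static SharedPool* pool_new(void)
{
   SharedPool* p = malloc(sizeof(SharedPool));
   p->nUsers = 0;
   return p;
}

static SharedPool* pool_attach(SharedPool* p) { p->nUsers++; return p; }

static void pool_detach(SharedPool* p)
{
   if (--p->nUsers == 0)
      free(p);           /* last OSet gone: pool memory is recovered */
}
```

Each OSet would call pool_attach at creation and pool_detach at destruction, so the last OSet destroyed releases the shared pool.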
Comment 20 Philippe Waroquiers 2011-12-27 23:06:43 UTC
(In reply to comment #18)
> There's one other thing that concerns me, though.  Maybe I misunderstand ..
> 
> The description of AllocNode (pre-patch) is
> 
> // * AllocNode: Allocate and zero memory for a node to go into the OSet.
> //   Uses the alloc function given to VG_(OSetGen_Create)() to allocated a
> //   node which is big enough for both an element and the OSet metadata.
> //   Not all elements in one OSet have to be the same size.
> 
> It's this last sentence .. doesn't that give a problem when using a 
> group allocator that assumes fixed size objects?  I don't see how this
> is compatible with using a group allocator.  Maybe I missed something
> (I hope).
We need to add something:
  // If a group allocator is used, all elements have a maximum size, which is
  // the element size of the group allocator.

And I see we need to have something like:
extern OSet* VG_(OSetGen_Create)    ( PtrdiffT keyOff, OSetCmp_t cmp,
                                      OSetAlloc_t alloc, HChar* cc,
                                      OSetFree_t _free,
                                      SizeT groupSize,  <--------- the only changes
                                      SizeT groupEltMaxSize );<-------------------|
Comment 21 Philippe Waroquiers 2011-12-29 01:28:56 UTC
Created attachment 67211 [details]
new interface for the pool alloc

New version of the group allocator.
Main changes are:
  * group allocator renamed to pool allocator
  * OSet interface changed to be simpler but still allow sharing
    of a pool allocator between multiple OSets.

Review welcome (mostly of pub_tool_poolalloc.h and pub_tool_oset.h)

Still to do:
* Verify that the drd bitmap block allocator for OSet can be replaced
  by the new interface + confirm a shared pool is needed for this usage.
* analyse the performance on the perf regtest in more detail:
  In 32 bits, the patch seems to be neutral or to improve performance,
  but in 64 bits, the memcheck sarp perf test degrades significantly
    (3.7 seconds instead of 3.1 seconds without the patch).
Comment 22 Philippe Waroquiers 2011-12-29 14:01:55 UTC
(In reply to comment #21)
> * analyse more in details the performances on the perf regtest:
>   In 32 bits, patch seems to be neutral or improve performances
>   but in 64 bits, memcheck sarp perf test decreases significantly 
>     (3.7 seconds instead of 3.1 seconds without the patch).

I used callgrind on memcheck running sarp, to try to understand.
From the callgrind numbers, I cannot see what could explain a 0.6 cpu-second
difference on 3.1 seconds.
(The numbers below are from ppc64.)

Callgrind stats of the untouched trunk:
==13988== Events    : Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw Bc Bcm Bi Bim Ge
==13988== Collected : 19334675273 1694556180 3478413913 2651026 5736657 4179635 522860 3716567 3451203 2855300150 35374174 66562138 25744035 20
==13988== 
==13988== I   refs:      19,334,675,273
==13988== I1  misses:         2,651,026
==13988== LLi misses:           522,860
==13988== I1  miss rate:            0.1%
==13988== LLi miss rate:            0.0%
==13988== 
==13988== D   refs:       5,172,970,093  (1,694,556,180 rd + 3,478,413,913 wr)
==13988== D1  misses:         9,916,292  (    5,736,657 rd +     4,179,635 wr)
==13988== LLd misses:         7,167,770  (    3,716,567 rd +     3,451,203 wr)
==13988== D1  miss rate:            0.1% (          0.3%   +           0.1%  )
==13988== LLd miss rate:            0.1% (          0.2%   +           0.0%  )
==13988== 
==13988== LL refs:           12,567,318  (    8,387,683 rd +     4,179,635 wr)
==13988== LL misses:          7,690,630  (    4,239,427 rd +     3,451,203 wr)
==13988== LL miss rate:             0.0% (          0.0%   +           0.0%  )
==13988== 
==13988== Branches:       2,921,862,288  (2,855,300,150 cond +    66,562,138 ind)
==13988== Mispredicts:       61,118,209  (   35,374,174 cond +    25,744,035 ind)
==13988== Mispred rate:             2.0% (          1.2%     +          38.6%   )

Callgrind stats of the pool alloc patch
==60249== Events    : Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw Bc Bcm Bi Bim Ge
==60249== Collected : 19342543593 1695598725 3478778973 2627781 5696574 4180368 520632 3709848 3467497 2856966909 34627155 66592401 25824091 24
==60249== 
==60249== I   refs:      19,342,543,593
==60249== I1  misses:         2,627,781
==60249== LLi misses:           520,632
==60249== I1  miss rate:            0.1%
==60249== LLi miss rate:            0.0%
==60249== 
==60249== D   refs:       5,174,377,698  (1,695,598,725 rd + 3,478,778,973 wr)
==60249== D1  misses:         9,876,942  (    5,696,574 rd +     4,180,368 wr)
==60249== LLd misses:         7,177,345  (    3,709,848 rd +     3,467,497 wr)
==60249== D1  miss rate:            0.1% (          0.3%   +           0.1%  )
==60249== LLd miss rate:            0.1% (          0.2%   +           0.0%  )
==60249== 
==60249== LL refs:           12,504,723  (    8,324,355 rd +     4,180,368 wr)
==60249== LL misses:          7,697,977  (    4,230,480 rd +     3,467,497 wr)
==60249== LL miss rate:             0.0% (          0.0%   +           0.0%  )
==60249== 
==60249== Branches:       2,923,559,310  (2,856,966,909 cond +    66,592,401 ind)
==60249== Mispredicts:       60,451,246  (   34,627,155 cond +    25,824,091 ind)
==60249== Mispred rate:             2.0% (          1.2%     +          38.7%   )
Comment 23 Philippe Waroquiers 2011-12-29 20:56:09 UTC
I did some measurements on f12/x86, deb5/amd64, f16/ppc64.
Globally, the patch improves the performance of many tests.

The sarp perf however decreases for unclear reasons (see comment 22).

By removing bits and pieces of the patch, I could determine
that creating the MC_Chunk pool allocator has a significant
effect on sarp performance.

What is bizarre:
creating the pool is a fast operation
(allocate 2 small structs and initialise them):
this should not have a 0.6 second effect.

Moreover, the (single) creation of this pool (unused by sarp)
seems to have a *linear* effect on the performance if I increase
the number of loops in sarp.

In summary, a mystery.
For the moment, I can only (again) conclude that the
perf tests are really difficult to work with.
(I already encountered such a mystery in the past:
http://permalink.gmane.org/gmane.comp.debugging.valgrind.devel/7704)

It would be nice if a more relevant, real-life, reproducible perf benchmark
could be run with the patch on a 64-bit system,
and/or to look more in depth at why sarp is (unexpectedly)
slower when an unused small data structure is created.
Comment 24 Philippe Waroquiers 2011-12-30 10:34:35 UTC
(In reply to comment #12)
> Another note: have you noticed that there is already a block allocator
> implementation present in the Valgrind source tree, a block allocator that is
> used for allocating OSet nodes ? See also drd/drd_bitmap2_node.c.

Bart, 
There is one difference between the drd_bitmap2_node.c block allocator
and m_poolalloc.c (derived from helgrind):
m_poolalloc.c only frees its chunks when the pool is destroyed.
The drd one maintains a counter of allocated elements per chunk, and
releases a chunk when the counter drops to 0. This counter is maintained
using a linear scan over the chunks when a free is done.

In summary, m_poolalloc.c does O(1) malloc/free; the drd block alloc does
O(n) frees but reclaims memory.

I put some tracing in the drd block allocator and ran all the drd tests
+ perf tests. Very few chunk frees happen (a lot fewer than the number
of chunk allocations), and the number of chunk frees is not
deterministic (it depends on the order in which nodes are allocated/freed).

Based on this, I believe a (shared) m_poolalloc.c would be ok to
replace drd_bitmap2_node.c (but I am not sure, confirmation by a drd
master is for sure needed :).

Is there a "real life" drd benchmark that would exercise this part
more significantly?
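The drd-style scheme described in this comment might be sketched as below. All names (DrdStylePool, ds_alloc, ds_free) are invented, and it is simplified: freed slots are not recycled within a still-live chunk, unlike the real drd_bitmap2_node.c:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Each chunk counts its live elements; free does a linear scan over the
   chunks (the O(n) part), and a chunk whose count drops to zero is given
   back to the underlying allocator (the memory-reclaiming part). */
typedef struct Chunk {
   struct Chunk* next;
   size_t nLive;    /* elements still allocated from this chunk */
   size_t nUsed;    /* elements handed out so far (bump pointer) */
   char   elems[];  /* nPerChunk * elemSzB bytes                 */
} Chunk;

typedef struct {
   size_t elemSzB, nPerChunk;
   Chunk* chunks;
} DrdStylePool;

static void* ds_alloc(DrdStylePool* pool)
{
   Chunk* c = pool->chunks;
   if (c == NULL || c->nUsed == pool->nPerChunk) {
      c = malloc(sizeof(Chunk) + pool->nPerChunk * pool->elemSzB);
      c->next = pool->chunks; c->nLive = 0; c->nUsed = 0;
      pool->chunks = c;
   }
   c->nLive++;
   return c->elems + (c->nUsed++) * pool->elemSzB;
}

static void ds_free(DrdStylePool* pool, void* p)
{
   Chunk** link = &pool->chunks;
   while (*link) {                          /* O(n) scan over chunks */
      Chunk* c = *link;
      char* lo = c->elems;
      char* hi = lo + pool->nPerChunk * pool->elemSzB;
      if ((char*)p >= lo && (char*)p < hi) {
         if (--c->nLive == 0) {             /* whole chunk now unused */
            *link = c->next;
            free(c);                        /* memory goes back       */
         }
         return;
      }
      link = &c->next;
   }
}
```

This makes the trade-off concrete: the scan in ds_free is what m_poolalloc.c avoids, and the free(c) is what m_poolalloc.c gives up until pool destruction.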
Comment 25 Bart Van Assche 2011-12-30 11:35:43 UTC
(In reply to comment #24)
> Is there "real life" benchmark of drd that would exercise this part 
> more significantly ?

The script drd/scripts/run-splash2 might be such a benchmark, but I'm not sure the information in that script is sufficient to rerun the benchmark. CONFIG_BSD_PROCESS_ACCT_V3 must have been enabled in the Linux kernel and the acct-v3 package must have been installed before that script produces meaningful output. Also, you will have to run drd/scripts/download-and-build-splash2. If your distro doesn't provide acct-v3, you can find it here: http://savannah.gnu.org/cvs/?group=acct. Also, please keep in mind that the table in that script has been generated on a 32-bit kernel.
Comment 26 Bart Van Assche 2011-12-30 14:02:24 UTC
(In reply to comment #21)
> Created an attachment (id=67211) [details]
> new interface for the pool alloc
> 
> New version of the group allocator.
> Main changes are:
>   * group allocator renamed to pool allocator
>   * OSet interface changed to be simpler but still allow sharing
>     of a pool allocator between multiple OSets.

I've had a look at the pub_core_oset.h and m_oset.c changes. The pool allocator object is still allocated from inside VG_(OSetGen_Create)(). That means that the OSet implementation is limited to using a single pool allocator implementation. Why not make the OSet interface a little more general, such that the OSet and pool implementations become independent of each other and such that alternative block allocator implementations can be added later on ? Or in other words, I'd like to see the #include "pub_tool_poolalloc.h" disappear from m_oset.c too.
Comment 27 Philippe Waroquiers 2012-01-08 20:57:14 UTC
(In reply to comment #26)

> I've had a look at the pub_core_oset.h and m_oset.c changes. The pool allocator
> object is still allocated from inside VG_(OSetGen_Create)(). That means that
> the OSet implementation is limited to using a single pool allocator
> implementation. Why not to make the OSet interface a little more general such
> that the OSet and pool implementations become independent of each other and
> such that alternative block allocator implementations can be added later on ?
> Or in other words, I'd like to see the #include "pub_tool_poolalloc.h"
> disappear from m_oset.c too.
Indeed, a more sophisticated interface allows different pool allocators.
However, at this moment it is not clear where we would need another
pool allocator.

Unless there is an identified need for another pool allocator, I would
be in favour of the simpler interface (which can always be made
more sophisticated if/when the need arises).
Comment 28 Bart Van Assche 2012-01-09 17:15:43 UTC
DRD currently uses one pool for multiple OSet objects. The proposed OSet API changes in the attached patch do not allow that (yet ?).
Comment 29 Philippe Waroquiers 2012-01-09 18:46:41 UTC
(In reply to comment #28)
> DRD currently uses one pool for multiple OSet objects. The proposed OSet API
> changes in the attached patch do not allow that (yet ?).

Sharing a pool is possible via the below new (part of the last patch)
OSetGen creation function:
  extern OSet* VG_(OSetGen_EmptyClone) (OSet* os);
  // Creates a new empty OSet.
  // The new OSet will have the same characteristics as os.
  // If os uses a pool allocator, this pool allocator will be shared with
  // the new OSet. A shared pool allocator is only deleted (and its memory is
  // released) when the last OSet using the shared pool is destroyed.
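A minimal sketch of how such sharing can work under the hood, with hypothetical names (oset_empty_clone, nrRef) standing in for the patch's actual identifiers: each OSet using a pool holds one reference on it, and the pool is only freed when the last referencing OSet is destroyed.

```c
#include <stdlib.h>
#include <assert.h>

/* Hypothetical sketch (not the actual patch) of reference-counting a
   shared pool: each OSet sharing the pool holds one reference, and the
   pool is freed only when the last referencing OSet is destroyed. */
typedef struct {
   unsigned nrRef;   /* number of OSets currently sharing this pool */
   /* ... chunk bookkeeping elided ... */
} PoolAlloc;

typedef struct {
   PoolAlloc* pool;  /* NULL if this OSet does not use a pool */
   /* ... AVL tree root etc. elided ... */
} OSet;

OSet* oset_empty_clone(OSet* os)
{
   OSet* clone = malloc(sizeof(OSet));
   clone->pool = os->pool;
   if (clone->pool)
      clone->pool->nrRef++;          /* the clone takes a reference */
   return clone;
}

void oset_destroy(OSet* os)
{
   if (os->pool && --os->pool->nrRef == 0)
      free(os->pool);                /* last user: release the shared pool */
   free(os);
}
```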
Comment 30 Bart Van Assche 2012-01-09 18:56:44 UTC
Sorry, but such an approach to sharing allocators doesn't make me enthusiastic. It means that the first OSet has to be allocated in a different way than the second and later created OSets. That's asymmetric.
Comment 31 Philippe Waroquiers 2012-01-09 21:08:46 UTC
(In reply to comment #30)
> Sorry but such an approach for sharing allocators doesn't make me enthusiast.
> That means that the first OSet has to be allocated in another way than the
> second and later created OSets. That's asymmetric.

If OSets sharing a pool can be destroyed, then an asymmetrical creation would
be difficult, as we would never know from which OSet the clone would
have to be made (that last OSet could itself already have been destroyed).

So, for DRD_(bm_init), something like the code below would be
relatively symmetrical for each "real OSet" to be created.
In the code below, creating the pool_holder is the logical equivalent of
creating the pool allocator.

   static OSet* pool_holder = NULL;
   ....
   if (NULL == pool_holder)
      pool_holder = VG_(OSetGen_Create)(0, 0, VG_(malloc),
                                         "drd.bitmap.bn.2", VG_(free),
                                         1000, bm2_node_size);

   bm->oset = VG_(OSetGen_EmptyClone)(pool_holder);
   ....
Comment 32 Bart Van Assche 2012-01-10 17:55:35 UTC
Do you realize that, with the above explanation and sample code, you just confirmed that the proposed interface leads to ugly code ?
Comment 33 Philippe Waroquiers 2012-01-10 20:54:31 UTC
(In reply to comment #32)
> Do you realize that with the above explanation and sample code that you just
> confirmed that the proposed interface leads to ugly code ?


If I have properly understood the proposal in comment 11, the code
of comment 31 would become something like the code below if the OSet creation
gets "pool arguments":

   static ASpecificPoolAllocatorType* pool_allocator = NULL;
   ....
   if (NULL == pool_allocator)
      pool_allocator =
          VG_(SpecificPoolAllocatorCreate)(VG_(malloc),
                                           "drd.bitmap.bn.2", VG_(free),
                                           1000, bm2_node_size);

   bm->oset = VG_(OSetGen_Create_With_Pool)
               (keyOff, cmp,
                VG_(malloc), "drd.bitmap.oset.2" ,
                VG_(free),
                (void *)pool_allocator,
                (OSetPoolAllocFn*) VG_(SpecificPoolAllocatorAlloc),
                (OSetPoolFreeFn*) VG_(SpecificPoolAllocatorFree));

This code structure is not different; we just have more
(loosely typed/cast) arguments in OSetGen_Create_With_Pool.

For other aspects (like when to destroy a shared pool), there is again
not much difference (or rather, the current patch gives better
support for the easy sharing use cases).

I understand the flexibility of the above, but there are also disadvantages:
e.g. any specific optimisation needed implies changing the interface of
OSetGen_Create_With_Pool.
For example, with the above interface, there is no way to implement
the "fast OSet destroy" of the current patch, used when the last (or single)
OSet using a pool is destroyed.

Note that I am not strongly opposed to the VG_(OSetGen_Create_With_Pool)
approach, even if I find this more general approach somewhat (but not
excessively :) over-engineered (as I currently see no use for the flexibility).
However, from the discussions with Julian (which have led to the
current patch) + the last paragraph of comment 17,
I understand he also favours simplicity over sophistication.

If there were a good (existing) use case for the flexibility of having
multiple pool allocator types, that would be a more convincing
argument.
Without such a use case, I am more in favour of simplicity:
   a single pool allocator in Valgrind,
   used by all the other data structures that need a pool allocator,
rather than giving these data structures a sophisticated interface
allowing any kind of pool allocator to be used, with multiple
pool allocators appearing in Valgrind here and there.


At least currently, the (I agree, not fully nice :)
OSetGen_EmptyClone would be at one single place in drd
(and is not that different from creating a pool).

I believe this single ugliness is an improvement compared to
the somewhat hacky drd allocator/helgrind allocator pair, which appeared
independently without a real reason.

A (at least initially) simpler approach looks better to me
than starting with a structured approach to "support and/or promote"
the creation of various pool allocators in Valgrind.
Comment 34 Bart Van Assche 2012-01-12 18:58:21 UTC
(In reply to comment #33)
> A (at least initially) simpler approach looks better to me
> than starting with a structured approach to "support and/or promote"
> the creation of various pool allocators in Valgrind.

I do not agree that the interface proposed in the attached patch is a good starting point. The cleanest software design is what should be preferred, and "clean software design" means decoupling different concepts as much as possible. The interface you proposed implies a much stronger coupling between OSet and memory allocators than necessary. The above makes me assume that you have not yet read the book "The Pragmatic Programmer" and that you are not familiar with the Law of Demeter (http://en.wikipedia.org/wiki/Demeter).
Comment 35 Bart Van Assche 2012-01-12 19:03:37 UTC
That last link should have been: http://en.wikipedia.org/wiki/Law_of_Demeter.
Comment 36 philippe.waroquiers 2012-01-15 14:58:06 UTC
(In reply to comment #23)

> In summary, a mystery.
> For the moment, I can only (again) conclude that the
> perf. tests are really difficult to work with.
> (already encountered such mystery in the past:
> http://permalink.gmane.org/gmane.comp.debugging.valgrind.devel/7704)
Conclusion re-confirmed now, at least on ppc64.
The sarp test degrades by about 10% with the pool alloc patch.
It does not degrade with exactly the same patch if I put an
  "always true" condition in front of the creation of the pool allocator:

   if (MC_(clo_mc_level) >= 1) <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< always true
   MC_(chunk_poolalloc) = VG_(newPA) (sizeof(MC_Chunk),
                                      1000,
                                      VG_(malloc),
                                      "mc.cMC.1 (MC_Chunk pools)",
                                      VG_(free));

I double checked that it was always true by using --profile-heap=yes and looking
at how much was allocated under the above cost centre.


In summary: an irrelevant single "if" somewhere can increase or decrease the
performance of a small performance test by a significant amount.
Comment 37 Josef Weidendorfer 2012-01-16 16:52:48 UTC
> In summary: an irrelevant single "if" somewhere...

Philippe, I have no idea what's going on here, but did you check with
"perf"? The sample profile must be different if something takes 0.6 secs
more time. Perhaps it's something in the kernel, so callgrind can not
give a hint.
Comment 38 Philippe Waroquiers 2012-01-16 21:23:05 UTC
(In reply to comment #37)
> > In summary: an irrelevant single "if" somewhere...
> 
> Philippe, I have no idea what's going on here, but did you check with
> "perf"? The sample profile must be different if something takes 0.6 secs
> more time. Perhaps it's something in the kernel, so callgrind can not
> give a hint.

Yes, I checked with perf but couldn't make any sense of the differences:
I could not see the link between the places shown as "heavier" by perf
and the patch.
But I do not master perf much, and I am not sure e.g. which kind of
events to ask for. Which events are recommended to look at ?
I used the default (cycles). The "perf annotate" output is also very
difficult to read for me (*far* from being a kcachegrind :).


What I start to suspect for the sarp test is that an irrelevant
change (adding an "if" in the valgrind code) and/or Address Space Layout
Randomisation can make some data (e.g. the sarp stack) be placed at another
address.
Then, depending on where the stack is, the numerous calls done inside
sarp slightly change what valgrind has to do, giving a 10 or 20%
difference on a small test such as sarp.

At work, with no ASLR and with the cpu always at max speed, I do obtain
good and understandable results from the patch:
First, for the perf heap test: the tool mmap-ed size
decreases from about 160MB to about 88MB, which is a nice improvement.
Second, with the patch, there is no slowdown when running a big test at work.
Third, a big application at work failed without the patch (out of memory)
and succeeded with the patch.

And the valgrind perf tests are ok (see below). In particular, the sarp
test (slowed down on a gcc compile farm amd64) is not influenced
by the patch. We also see the heap test is faster (as expected).

Philippe

model name      : Intel(R) Xeon(R) CPU           X3480  @ 3.07GHz

taskset 2 perl perf/vg_perf --vg=../trunk_untouched --vg=../euro_collection --reps=10 perf
-- Running  tests in perf ----------------------------------------------
-- bigcode1 --
bigcode1 trunk_untouched:0.11s  no: 2.2s (19.9x, -----)  me: 4.1s (37.1x, -----)
bigcode1 euro_collection:0.11s  no: 2.2s (20.2x, -1.4%)  me: 4.1s (37.4x, -0.7%)
-- bigcode2 --
bigcode2 trunk_untouched:0.12s  no: 5.0s (41.4x, -----)  me:10.4s (86.8x, -----)
bigcode2 euro_collection:0.12s  no: 4.9s (41.1x,  0.8%)  me:10.3s (86.0x,  0.9%)
-- bz2 --
bz2      trunk_untouched:0.67s  no: 2.3s ( 3.4x, -----)  me: 7.1s (10.5x, -----)
bz2      euro_collection:0.67s  no: 2.4s ( 3.5x, -3.5%)  me: 6.9s (10.3x,  2.5%)
-- fbench --
fbench   trunk_untouched:0.24s  no: 1.2s ( 5.0x, -----)  me: 4.5s (18.8x, -----)
fbench   euro_collection:0.24s  no: 1.2s ( 4.8x,  3.3%)  me: 4.6s (19.0x, -0.9%)
-- ffbench --
ffbench  trunk_untouched:0.23s  no: 1.2s ( 5.3x, -----)  me: 3.5s (15.4x, -----)
ffbench  euro_collection:0.23s  no: 1.2s ( 5.3x,  0.0%)  me: 3.5s (15.4x, -0.3%)
-- heap --
heap     trunk_untouched:0.11s  no: 1.2s (10.8x, -----)  me: 9.3s (84.8x, -----)
heap     euro_collection:0.11s  no: 1.1s ( 9.5x, 11.8%)  me: 7.3s (66.3x, 21.9%)
-- many-loss-records --
many-loss-records trunk_untouched:0.02s  no: 0.2s (12.5x, -----)  me: 1.7s (87.0x, -----)
many-loss-records euro_collection:0.02s  no: 0.2s (12.5x,  0.0%)  me: 1.6s (78.5x,  9.8%)
-- many-xpts --
many-xpts trunk_untouched:0.06s  no: 0.4s ( 7.2x, -----)  me: 2.3s (38.5x, -----)
many-xpts euro_collection:0.06s  no: 0.5s ( 7.7x, -7.0%)  me: 2.2s (36.8x,  4.3%)
-- sarp --
sarp     trunk_untouched:0.01s  no: 0.2s (20.0x, -----)  me: 2.5s (248.0x, -----)
sarp     euro_collection:0.01s  no: 0.2s (20.0x,  0.0%)  me: 2.5s (247.0x,  0.4%)
-- tinycc --
tinycc   trunk_untouched:0.18s  no: 2.3s (12.7x, -----)  me:10.9s (60.7x, -----)
tinycc   euro_collection:0.18s  no: 2.3s (12.7x,  0.0%)  me:10.8s (60.0x,  1.2%)
-- Finished tests in perf ----------------------------------------------

== 10 programs, 40 timings =================
Comment 39 Philippe Waroquiers 2012-01-16 22:41:17 UTC
Created attachment 67910 [details]
pool alloc for memcheck mc_chunk and sec vbit + helgrind

Small updates in pub_tool_oset.h compared to previous version.
Revalidated on f12/x86, amd64/deb5.
Comment 40 Bart Van Assche 2012-01-17 08:14:23 UTC
A few remarks (still haven't analyzed the patch entirely yet):
1. If I interpret the source code correctly, in VG_(OSetGen_Destroy)() the function VG_(freeEltPA)() can be invoked after the last VG_(releaseRefPA)(). Isn't that a use-after-free ?
2. Personally I would prefer to see different functions for non-pool OSet creation and pool-based OSet creation in the public OSet interface. That would not only allow leaving code that creates a non-pool OSet untouched, but would also make the intent clearer in code that creates an OSet that uses the pool allocator.
3. Even when using a memory pool, OSetGen_Create() and OSetGen_EmptyClone() allocate the root node via the non-pool allocator. Has it been considered to allocate the root node from the pool too ?
4. Passing alloc and free_fn function pointers to VG_(newPA) looks like an overgeneralization to me. My proposal is to drop these two arguments and also - if remark [2] is implemented - to drop the alloc and _free arguments from the pool-based OSet creation function.
5. In memcheck/mc_translate.c:
+#include "pub_tool_poolalloc.h"     // For mc_include.h
Is the comment next to that #include directive correct ?
Comment 41 Bart Van Assche 2012-01-17 08:27:35 UTC
Please ignore remark [1] in the above - I got confused by the name "releaseRefPA". Apparently releaseRefPA only decrements the PA reference count but does not delete the PA. BTW, did you know that the name "release" usually refers to the combined action of decrementing a reference count and deleting if that count reaches zero ? See e.g. http://msdn.microsoft.com/en-us/library/aa910902.aspx.
Comment 42 Bart Van Assche 2012-01-17 09:18:30 UTC
Please ignore [3] too - it's not possible because sizeof(AvlTree) != sizeof(AvlNode).
Comment 43 Philippe Waroquiers 2012-01-17 11:45:17 UTC
(In reply to comment #40)

> 2. Personally I would prefer to see different functions for non-pool OSet
> creation and pool-based OSet creation in the public OSet interface. That would
> not only allow to leave code that creates a non-pool OSet untouched but would
> also make the intent more clear in code that creates an OSet that uses the pool
> allocator.
A different OSetGen_Create_With_Pool might indeed be clearer.

There is one advantage to the single OSetGen_Create: it makes it easier to
#ifdef or #define a switch between an OSet with or without a PA: just
#define the number of elements for the PA to 0, and it switches to a non-PA
mode (among others, EmptyClone will then work properly on a non-PA OSet).
Julian, an opinion on this ?


> 4. Passing alloc and free_fn function pointers to VG_(newPA) looks like an
> overgeneralization to me. My proposal is to drop these two arguments and also -
> if remark [2] would be implemented - to drop the alloc and _free arguments from
> the pool-based OSet creation function.
It would indeed be simpler, but that is not compatible with the
concept of multiple arenas in valgrind. For example, when the debug info
code creates an OSet, the memory for this OSet (and so the underlying PA(s))
is supposed to be allocated from the dinfo arena.
Same for other OSets (e.g. execontext, or ...).
Otherwise, a hardcoded arena would have to be chosen in the OSet and PA.


> 5. In memcheck/mc_translate.c:
> +#include "pub_tool_poolalloc.h"     // For mc_include.h
> Is the comment next to that #include directive correct ?
Yes, I believe it is correct.
It is similar to the next line, which includes pub_tool_hashtable.h to make the
subsequent #include "mc_include.h" 'happy'.
I just followed the hashtable "pattern" for this (rather than putting the
#include inside mc_include.h).
Comment 44 Philippe Waroquiers 2012-01-17 11:48:01 UTC
(In reply to comment #41)
> Please ignore remark [1] in the above - I got confused by the name
> "releaseRefPA". Apparently releaseRefPA only decrements the PA reference count
> but does not delete the PA. BTW, did you know that the name "release" usually
> refers to the combined action of decrementing a reference count and deleting if
> that count reaches zero ? See e.g.
> http://msdn.microsoft.com/en-us/library/aa910902.aspx.
My initial proposal was something like "incRefCount" and "decRefCount".
Julian suggested this wording (I guess because it is used somewhere else).
If releaseRefPA is confusing, maybe the initial proposal is better.
Julian, feedback on this too ?
Comment 45 Bart Van Assche 2012-01-17 11:57:25 UTC
(In reply to comment #43)
> (In reply to comment #40)
> > 5. In memcheck/mc_translate.c:
> > +#include "pub_tool_poolalloc.h"     // For mc_include.h
> > Is the comment next to that #include directive correct ?
> Yes I believe it is correct.
> It is similar to the next line, including pub_tool_hashtable.h to make the
> subsequent #include "mc_include.h"  'happy'.
> I just followed the hashtable "pattern" for this (rather than to put the
> #include inside mc_include.h).

Have you considered changing "PoolAlloc" into "struct _PoolAlloc" in mc_include.h ? That would allow eliminating most of the new '#include "pub_tool_poolalloc.h"' directives. In addition to this change it will probably be necessary to add "struct _PoolAlloc;" on a line by itself before that extern declaration, to avoid a compiler warning.
Comment 46 Florian Krohm 2012-01-17 13:06:22 UTC
(In reply to comment #45)

> Have you considered to change "PoolAlloc" into "struct _PoolAlloc" in
> mc_include.h ?

Note: C99 section 7.1.3

  All identifiers that begin with an underscore and either an uppercase 
  letter or another underscore are always reserved for any use.

Yes, I know, existing code makes liberal use of tag names beginning with
an underscore.
Comment 47 Bart Van Assche 2012-01-17 13:20:09 UTC
(In reply to comment #46)
> Note: C99 section 7.1.3
> 
>   All identifiers that begin with an underscore and either an uppercase 
>   letter or another underscore are always reserved for any use.
> 
> Yes, I know, existing code is in violent use of tag names beginning with
> an underscore.

In the Net-SNMP project most structures are defined as follows:

typedef struct <name>_s {
   ...
} <name>;
Comment 48 Julian Seward 2012-01-17 14:48:25 UTC
(In reply to comment #43)
> (In reply to comment #40)
> 
> A different OSetGen_Create_With_Pool might effectively be more clear.
> 

yeah, +1 for OSetGen_Create_With_Pool.  Clearer than having special
interpretation for magic values of arguments.
Comment 49 Julian Seward 2012-01-17 14:53:28 UTC
(In reply to comment #44)
> > http://msdn.microsoft.com/en-us/library/aa910902.aspx.
> My initial proposal was something like "incRefCount" and "decRefCount".
> Julian suggested these wording (I guess because these are used somewhere else).
> If releaseRefPA is confusing, maybe the initial proposal is better.

I'd prefer to use exactly the microsoft wording (addRef/release or
AddRef/Release) as it's widely used.
Comment 50 Julian Seward 2012-01-17 15:12:15 UTC
+1 to commit from me.  Minor comments only:

* changes as per comment #48 and comment #49

* does this need to be in the interface?

  UWord VG_(nrRefPA) (PoolAlloc* pa)
  {
     return pa->nrRef;
  }

  I ask because, even in threaded scenarios, it is possible to make
  AddRef and Release work race-free by using atomic ref counts.  But
  then it is pretty meaningless to have a nrRef method because there's
  no way to use it race-free relative to later AddRef/Release by
  different threads.

  (This is not important right now, just a curiosity.  But if it isn't
  necessary in the interface, please remove it.)
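  For illustration, a minimal sketch of the race-free addRef/release idea
  using the GCC __sync builtins (hypothetical names, not the patch's actual
  code). As noted, a nrRef query cannot be used race-free, since the count
  may change between reading it and acting on it:

```c
#include <assert.h>

/* Hypothetical atomic ref counting: addRef/release are race-free even
   when called concurrently; only release's "was this the last ref?"
   result can safely drive pool deletion. */
typedef struct { int nrRef; } PoolAlloc;

static void addRef(PoolAlloc* pa)
{
   __sync_add_and_fetch(&pa->nrRef, 1);
}

static int release(PoolAlloc* pa)   /* returns 1 if this dropped the last ref */
{
   return __sync_sub_and_fetch(&pa->nrRef, 1) == 0;
}
```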

* copyright notices: please add an email address for you
  wherever you add your name.  The skynet one is ok.

* (general paranoia about block sizes): do you have assertions
  that the max alloc size <= the specified maxEltSize (or whatever
  is relevant?)  Just checking that you have assertions to catch as
  many incorrect use cases as possible.
Comment 51 Philippe Waroquiers 2012-01-17 16:39:02 UTC
(In reply to comment #50)
> +1 to commit from me.  Minor comments only:
Will handle the comments this evening.
 
> * changes as per comment #48 and comment #49
  extern void VG_(addRefPA) ( PoolAlloc* pa);
  extern void VG_(releaseRefPA) ( PoolAlloc* pa);
The above seems to match comment 49, doesn't it ?

> * does this need to be in the interface?
>   UWord VG_(nrRefPA) (PoolAlloc* pa)
>   (This is not important right now, just a curiosity.  But if it isn't
>   necessary in the interface, please remove it.)
It is needed now. Atomic handling of all that will for sure imply
a lot of changes.

Will do the below.
> * copyright notices: please add an email address for you
>   wherever you add your name.  The skynet one is ok.
> 
> * (general paranoia about block sizes): do you have assertions
>   that the max alloc size <= the specified maxEltSize (or whatever
>   is relevant?)  Just checking that you have assertions to catch as
>   many incorrect use cases as possible.

Thanks
Comment 52 Julian Seward 2012-01-17 20:14:12 UTC
(In reply to comment #51)
> (In reply to comment #50)
> > +1 to commit from me.  Minor comments only:
> Will handle the comments this evening.
> 
> > * changes as per comment #48 and comment #49
>   extern void VG_(addRefPA) ( PoolAlloc* pa);
>   extern void VG_(releaseRefPA) ( PoolAlloc* pa);
> The above looks to match the comment 49, does it ?

No .. the names should be "addRef" and "release", not
"releaseRef".  Yes I know it's not logical, but that's
what is widely used.
Comment 53 Philippe Waroquiers 2012-01-17 20:16:44 UTC
(In reply to comment #47)
> (In reply to comment #46)
> > Note: C99 section 7.1.3
> > 
> >   All identifiers that begin with an underscore and either an uppercase 
> >   letter or another underscore are always reserved for any use.
> > 
> > Yes, I know, existing code is in violent use of tag names beginning with
> > an underscore.
> 
> In the Net-SNMP project most structures are defined as follows:
> 
> typedef struct <name>_s {
>    ...
> } <name>;
Looks like a nice naming convention. We should maybe apply it systematically,
so as to e.g. ease the use of the 'struct .... *' approach rather than
the #include (currently, propagating _ outside the implementation makes me a
little bit nervous after Florian's comment :).
Comment 54 Philippe Waroquiers 2012-01-17 20:35:57 UTC
(In reply to comment #52)

> No .. the names should be "addRef" and "release", not
> "releaseRef".  Yes I know it's not logical, but that's
> what is widely used.
Ok, will use the wording addRef and release
(but I guess with a VG_ prefix and a PA suffix to avoid collision
of names with other potential usages in other data structures).
In other words, we will have
  extern void VG_(addRefPA) ( PoolAlloc* pa);
  extern void VG_(releasePA) ( PoolAlloc* pa);

(hope I got it this time :).
Comment 55 Philippe Waroquiers 2012-01-17 21:22:11 UTC
(In reply to comment #54)
> (hope I got it this time :).
Committed revision 12341.
(In case I did not get it, will fix it asap).

Thanks for the review.