Bug 297147

Summary: drd false positives on newly allocated memory
Product: [Developer tools] valgrind Reporter: Bradley C. Kuszmaul <kuszmaul>
Component: drdAssignee: Bart Van Assche <bart.vanassche+kde>
Status: RESOLVED FIXED    
Severity: normal    
Priority: NOR    
Version: 3.7.0   
Target Milestone: ---   
Platform: Fedora RPMs   
OS: Linux   
Latest Commit: Version Fixed In:
Sentry Crash Report:

Description Bradley C. Kuszmaul 2012-03-30 21:18:05 UTC
I see this with drd 3.6.1, drd 3.7.0 and the head of the svn repository (r12471).

I'm seeing problems with drd where I get
 "Conflicting store by thread X at Y size Z"
 where this operation is something like
      int *m=malloc(100*sizeof(int));
      for (int i=0; i<100; i++)
         m[i] = i;
the conflict is on the store which is initializing m.  The pointer to m hasn't been stored anywhere so no other thread should be able to conflict.

As is often the case with drd, the "other segment" is usually a bunch of code, so it's hard to track it down (it sure would be nice if drd had a mode where it kept exact information about the "other" segment).  It seems like it's unrelated code.  Once in a while I get very precise information that suggests that in the "other" segment, the address is a pthread mutex, and it's being locked or unlocked.

My theory (based on very thin evidence) is that there is a bug in drd when a memory location refers to a mutex in a malloced chunk of memory.  Somehow, after the mutex is destroyed and the memory is free'd, the memory's information should be erased but isn't.  My theory is that it has something to do with the fact that the location is a pthread_mutex_t.

The code is a big proprietary code , and I haven't been able to isolate the problem down to something small that I can share.

The code passes memcheck, so I doubt that the pointer is somehow being corrupted.  

I'm using the standard libc malloc on a fedora 16, but I've seen this behavior on other OSs such as various flavors of CentOS.

I know this isn't much to go on, but maybe it will ring a bell and someone will recognize the problem.
Comment 1 Bart Van Assche 2012-03-31 06:29:56 UTC
Which command line options were passed to drd ? Probably at least --check-stack-var=yes ?
Comment 2 Bradley C. Kuszmaul 2012-03-31 09:26:11 UTC
No arguments.  Just 

valgrind --tool=drd ./test

If I add --check-stack-var=yes the reported errors seem to go away.  I guess I don't understand drd very well: how can checking more things result in fewer reported errors?
Comment 3 Bart Van Assche 2012-03-31 10:34:13 UTC
Good question. With --check-stack-var=yes DRD does not only enable data race detection on stack variables but also cleans stack memory references every time the stack pointer is increased (stacks grow downward on x86). So if any stack object has been shared over threads without having invoked ANNOTATE_NEW_MEMORY() after the last use of that object with --check-stack-var=no that will result in false positives being reported on new stack variables that use the same stack memory addresses later on.
Comment 4 Bradley C. Kuszmaul 2012-03-31 14:13:27 UTC
Perhaps the manual should be updated with that explanation, at which point I would consider this issue resolved.
Comment 5 Bradley C. Kuszmaul 2012-03-31 14:22:34 UTC
It looks like --check-stack-var=yes is making my program run much slower than without.  It ran all night without finishing the first second of computation.

One strange thing about this problem is that the memory that's being complained about is malloc()'d.  Could it be the case that there was pthread stack that was freed (because the pthread finished) and then the same memory got used for a malloc()'d object?   Or maybe the other way around?  

Is there a way to not complain about stack variables that are shared.
Comment 6 Bart Van Assche 2012-03-31 15:22:57 UTC
(In reply to comment #5)
> Could it be the case that there was pthread stack that was freed (because the pthread finished)
> and then the same memory got used for a malloc()'d object?

pthread stacks are already cleaned as soon as a thread finishes. This is done by the drd_stop_using_mem() call in drd_thread_finished().
Comment 7 Bart Van Assche 2012-03-31 18:03:56 UTC
(In reply to comment #5)
> Is there a way to not complain about stack variables that are shared.

If there are not too much stack variables that are shared, one way is to insert ANNOTATE_NEW_MEMORY() annotations in the scope where the stack variable is defined after the last use of such a variable. If you are using C++, the end of the destructor is a good place to insert such annotations.
Comment 8 Bart Van Assche 2012-04-01 14:41:23 UTC
(In reply to comment #4)
> Perhaps the manual should be updated with that explanation, at which point I
> would consider this issue resolved.

I've taken another approach in r12474 - the changes in that revision cause drd not to complain about stack variables that are shared over threads.
Comment 9 Bart Van Assche 2012-04-01 14:41:48 UTC
... at least with --check-stack-var=no.
Comment 10 Bradley C. Kuszmaul 2012-04-01 18:36:40 UTC
I ran r12475 with --check-stack-var=no.  My original problem is still there. 

I don't think that running with --check-stack-var=yes actually solved the problem.  Instead it just slowed down the test so much that the test timed out before the warning was emitted.

So, I've still got some problem where I malloc() some memory and then initialize that memory (and at that point no other thread knows about that memory).  Somehow the memory returned by malloc() is sometimes retaining some conflict information from it's previous life.

Even if I put an ANNOTATE_NEW_MEMORY() call after every malloc then the problem goes away.  It seems like somehow, malloc isn't treating the new memory as new in some cases.
Comment 11 Bart Van Assche 2012-04-02 08:26:28 UTC
Which malloc() implementation are you using ? glibc, tcmalloc or even another implementation ? Using the --trace-alloc=yes option might help to verify that DRD actually "sees" the malloc() calls invoked from the application.
Comment 12 Bradley C. Kuszmaul 2012-04-02 14:03:51 UTC
I'm using glibc.

When I turn on --trace-alloc I get 
 a) The memory was recently released
 b) The memory was then reallocated using malloc
 c) the initialization complains

It looks to me like there may be a race in drd_malloc_wrappers.c.  When I look at r12467 at drd_malloc_wrappers.c:135 (in freelike_block()), I see


      if (dealloc)
	 VG_(cli_free)((void*)p);
      if (mc->size > 0)
         s_stop_using_mem_callback(mc->data, mc->size);
      VG_(HT_remove)(s_malloc_list, (UWord)p);
      VG_(free)(mc);

So it looks like first the pointer is freed, then drd records that the memory is no longer in use.

I read that as a race condition, because another thread can then malloc the memory between the cli_free() and the s_stop_using_mem_callback().
Comment 13 Bart Van Assche 2012-04-02 14:38:02 UTC
(In reply to comment #12)
> So it looks like first the pointer is freed, then drd records that the
> memory is no longer in use.
> 
> I read that as a race condition, because another thread can then malloc the
> memory between the cli_free() and the s_stop_using_mem_callback().

Inside Valgrind malloc() and free() intercepts are invoked on behalf of client code and hence with the "big lock" held. The only client code invoked without the big lock held is a blocking system call. mmap() is considered by Valgrind as a non-blocking system call. So although I have swapped these two statements on the trunk, I do not expect that this change will cause a behavior change in drd.
Comment 14 Bradley C. Kuszmaul 2012-04-02 15:07:36 UTC
Yes, r12478 does not fix the problem.  Also the corresponding change to realloc produces no change.

But adding an explicit ANNOTATE_NEW_MEMORY(p, size) in my code after the malloc() does change the behavior.

It's acting "racy" in that it's slightly different on every run, and it's infrequent (a few times in a hundred thousand mallocs and frees).
Comment 15 Bart Van Assche 2012-04-02 15:35:21 UTC
(In reply to comment #14)
> But adding an explicit ANNOTATE_NEW_MEMORY(p, size) in my code after the
> malloc() does change the behavior.
> 
> It's acting "racy" in that it's slightly different on every run, and it's
> infrequent (a few times in a hundred thousand mallocs and frees).

It would help if you could make a small test program available that triggers this behavior. That would also exclude that the reported race is caused by a use-after-free in the client program.
Comment 16 Bart Van Assche 2012-04-29 08:17:51 UTC
Closing this issue because it's not clear whether it's caused by the client program or by the tool. Please reopen once more information can be provided.
Comment 17 Bart Van Assche 2012-06-10 10:44:23 UTC
Should be fixed by r12629.