Summary: | false positive while destroying mutex | |
---|---|---|---
Product: | [Developer tools] valgrind | Reporter: | Samuel Thibault <samuel.thibault>
Component: | helgrind | Assignee: | Julian Seward <jseward>
Status: | RESOLVED FIXED | |
Severity: | normal | CC: | philippe.waroquiers, pjfloyd, saurabh
Priority: | NOR | |
Version First Reported In: | 3.9.0 | |
Target Milestone: | --- | |
Platform: | unspecified | |
OS: | Linux | |
Latest Commit: | | Version Fixed In: |
Sentry Crash Report: | |
Attachments: | testcase, false positive log
Description

Samuel Thibault 2013-11-13 09:33:24 UTC

Created attachment 83543 [details]: testcase
Created attachment 83544 [details]: false positive log
The same bug was discussed on valgrind-users; see http://sourceforge.net/mailarchive/message.php?msg_id=31611446. A patch was proposed to solve it (just disabling/enabling checking around the my_memcmp call). However, it is not clear to me whether that is the correct fix. I see in the default suppression files that there are already a number of Helgrind suppression entries matching only one function, so perhaps we should instead add an additional suppression entry there.

I don't get an error on FreeBSD. I'll check whether the problem still exists on Linux. I ran this for a couple of hours without any errors. It could be related to the hardware. Reopen this if anyone else can reproduce the problem.

To provide various datapoints, I have tried valgrind --tool=helgrind ./test on:

- Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
- Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
- Intel(R) Xeon(R) CPU E5-2650L v4 @ 1.70GHz
- AMD EPYC 7352 24-Core Processor
- AMD EPYC 7502 32-Core Processor

With glibc 2.35 I got the hit

==420467== Helgrind, a thread error detector
==420467== Copyright (C) 2007-2017, and GNU GPL'd, by OpenWorks LLP et al.
==420467== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==420467== Command: ./test
==420467==
==420467== ---Thread-Announcement------------------------------------------
==420467==
==420467== Thread #1 is the program's root thread
==420467==
==420467== ---Thread-Announcement------------------------------------------
==420467==
==420467== Thread #2 was created
==420467==    at 0x49792FF: clone (clone.S:76)
==420467==    by 0x497A146: __clone_internal (clone-internal.c:83)
==420467==    by 0x48F6484: create_thread (pthread_create.c:295)
==420467==    by 0x48F6F78: pthread_create@@GLIBC_2.34 (pthread_create.c:828)
==420467==    by 0x484E5D7: ??? (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==420467==    by 0x1092DA: main (in /net/inria/home/sthibault/test)
==420467==
==420467== ----------------------------------------------------------------
==420467==
==420467== Possible data race during read of size 1 at 0x10C128 by thread #1
==420467== Locks held: none
==420467==    at 0x484B225: ??? (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==420467==    by 0x484B4CA: ??? (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==420467==    by 0x10937C: main (in /net/inria/home/sthibault/test)
==420467==
==420467== This conflicts with a previous write of size 4 by thread #2
==420467== Locks held: none
==420467==    at 0x48FB4D8: __pthread_mutex_unlock_usercnt (pthread_mutex_unlock.c:62)
==420467==    by 0x484BCE8: ??? (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==420467==    by 0x10929A: f (in /net/inria/home/sthibault/test)
==420467==    by 0x484E7D6: ??? (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==420467==    by 0x48F6849: start_thread (pthread_create.c:442)
==420467==    by 0x497930F: clone (clone.S:100)
==420467== Address 0x10c128 is 8 bytes inside data symbol "mutex"

immediately on almost all of them; only the AMD EPYC 7352 24-Core Processor took 1m to reproduce.

On an AMD Opteron(tm) Processor 6174 with glibc 2.28 I couldn't reproduce it within an hour. On an Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz, I tried with various glibcs: 2.19, 2.24, 2.28, 2.31, 2.36; with all versions it happens within half an hour, sometimes within a minute. On an arm64 with glibc 2.36 it appeared immediately. On a mips64el Cavium Octeon III V0.2 FPU V0.0 with glibc 2.36 it appeared immediately. On a ppc64el POWER8 with glibc 2.36 it appeared within a minute. On an s390x it appeared in 8 minutes. (This was with the various versions of valgrind between 3.9 and 3.19.)

I can reproduce this on a RHEL 7.6 machine. Can you try this patch please?
diff --git a/helgrind/hg_intercepts.c b/helgrind/hg_intercepts.c
index 5a83996..8c98e4e 100644
--- a/helgrind/hg_intercepts.c
+++ b/helgrind/hg_intercepts.c
@@ -879,7 +879,9 @@ static int mutex_destroy_WRK(pthread_mutex_t *mutex)
    if (mutex != NULL) {
       static const pthread_mutex_t mutex_init = PTHREAD_MUTEX_INITIALIZER;
+      VALGRIND_HG_DISABLE_CHECKING(mutex, sizeof(*mutex));
       mutex_is_init = my_memcmp(mutex, &mutex_init, sizeof(*mutex)) == 0;
+      VALGRIND_HG_ENABLE_CHECKING(mutex, sizeof(*mutex));
    } else {
       mutex_is_init = 0;
    }
@@ -1785,7 +1787,9 @@ static int pthread_cond_destroy_WRK(pthread_cond_t* cond)
    if (cond != NULL) {
       const pthread_cond_t cond_init = PTHREAD_COND_INITIALIZER;
+      VALGRIND_HG_DISABLE_CHECKING(cond, sizeof(*cond));
       cond_is_init = my_memcmp(cond, &cond_init, sizeof(*cond)) == 0;
+      VALGRIND_HG_ENABLE_CHECKING(cond, sizeof(*cond));
    } else {
       cond_is_init = 0;
    }

This is the same as what Philippe suggested in the e-mail thread, but with similar protection for condition variables.

Yes, that avoids that exact issue, but then a flurry of new ones pops up that I was not getting before:

==2511392== Possible data race during read of size 4 at 0x10C128 by thread #2
==2511392== Locks held: none
==2511392==    at 0x48F31A2: pthread_mutex_lock@@GLIBC_2.2.5 (pthread_mutex_lock.c:94)
==2511392==    by 0x48487F8: mutex_lock_WRK (hg_intercepts.c:937)
==2511392==    by 0x109272: f (in /net/inria/home/sthibault/test)
==2511392==    by 0x484B966: mythread_wrapper (hg_intercepts.c:406)
==2511392==    by 0x48EFFD3: start_thread (pthread_create.c:442)
==2511392==    by 0x496F8CF: clone (clone.S:100)
==2511392==
==2511392== This conflicts with a previous write of size 4 by thread #1
==2511392== Locks held: none
==2511392==    at 0x48F4A47: __pthread_mutex_unlock_usercnt (pthread_mutex_unlock.c:62)
==2511392==    by 0x48EF26D: __pthread_cond_wait_common (pthread_cond_wait.c:419)
==2511392==    by 0x48EF26D: pthread_cond_wait@@GLIBC_2.3.2 (pthread_cond_wait.c:618)
==2511392==    by 0x484BB03: pthread_cond_wait_WRK (hg_intercepts.c:1291)
==2511392==    by 0x10933B: main (in /net/inria/home/sthibault/test)
==2511392== Address 0x10c128 is 8 bytes inside data symbol "mutex"
==2511392==
==2511392== ----------------------------------------------------------------
==2511392==
==2511392== Possible data race during write of size 4 at 0x10C128 by thread #2
==2511392== Locks held: none
==2511392==    at 0x48F31B9: pthread_mutex_lock@@GLIBC_2.2.5 (pthread_mutex_lock.c:182)
==2511392==    by 0x48487F8: mutex_lock_WRK (hg_intercepts.c:937)
==2511392==    by 0x109272: f (in /net/inria/home/sthibault/test)
==2511392==    by 0x484B966: mythread_wrapper (hg_intercepts.c:406)
==2511392==    by 0x48EFFD3: start_thread (pthread_create.c:442)
==2511392==    by 0x496F8CF: clone (clone.S:100)
==2511392==
==2511392== This conflicts with a previous write of size 4 by thread #1
==2511392== Locks held: none
==2511392==    at 0x48F4A47: __pthread_mutex_unlock_usercnt (pthread_mutex_unlock.c:62)
==2511392==    by 0x48EF26D: __pthread_cond_wait_common (pthread_cond_wait.c:419)
==2511392==    by 0x48EF26D: pthread_cond_wait@@GLIBC_2.3.2 (pthread_cond_wait.c:618)
==2511392==    by 0x484BB03: pthread_cond_wait_WRK (hg_intercepts.c:1291)
==2511392==    by 0x10933B: main (in /net/inria/home/sthibault/test)
==2511392== Address 0x10c128 is 8 bytes inside data symbol "mutex"
==2511392==
==2511392== ----------------------------------------------------------------
==2511392==
==2511392== Lock at 0x10C120 was first observed
==2511392==    at 0x484CFE9: pthread_mutex_init (hg_intercepts.c:818)
==2511392==    by 0x1092EE: main (in /net/inria/home/sthibault/test)
==2511392== Address 0x10c120 is 0 bytes inside data symbol "mutex"
==2511392==
==2511392== Possible data race during write of size 8 at 0x10C0E8 by thread #2
==2511392== Locks held: 1, at address 0x10C120
==2511392==    at 0x48EEE29: __atomic_wide_counter_add_relaxed (atomic_wide_counter.h:57)
==2511392==    by 0x48EEE29: __condvar_add_g1_start_relaxed (pthread_cond_common.c:52)
==2511392==    by 0x48EEE29: __condvar_quiesce_and_switch_g1 (pthread_cond_common.c:294)
==2511392==    by 0x48EEE29: pthread_cond_signal@@GLIBC_2.3.2 (pthread_cond_signal.c:77)
==2511392==    by 0x4848FB8: pthread_cond_signal_WRK (hg_intercepts.c:1570)
==2511392==    by 0x10928B: f (in /net/inria/home/sthibault/test)
==2511392==    by 0x484B966: mythread_wrapper (hg_intercepts.c:406)
==2511392==    by 0x48EFFD3: start_thread (pthread_create.c:442)
==2511392==    by 0x496F8CF: clone (clone.S:100)
==2511392==
==2511392== This conflicts with a previous read of size 8 by thread #1
==2511392== Locks held: none
==2511392==    at 0x48EF39E: __atomic_wide_counter_load_relaxed (atomic_wide_counter.h:30)
==2511392==    by 0x48EF39E: __condvar_load_g1_start_relaxed (pthread_cond_common.c:46)
==2511392==    by 0x48EF39E: __pthread_cond_wait_common (pthread_cond_wait.c:486)
==2511392==    by 0x48EF39E: pthread_cond_wait@@GLIBC_2.3.2 (pthread_cond_wait.c:618)
==2511392==    by 0x484BB03: pthread_cond_wait_WRK (hg_intercepts.c:1291)
==2511392==    by 0x10933B: main (in /net/inria/home/sthibault/test)
==2511392== Address 0x10c0e8 is 8 bytes inside data symbol "cond"
==2511392==
==2511392== ----------------------------------------------------------------
==2511392==
==2511392== Possible data race during write of size 4 at 0x10C128 by thread #2
==2511392== Locks held: none
==2511392==    at 0x48F4A47: __pthread_mutex_unlock_usercnt (pthread_mutex_unlock.c:62)
==2511392==    by 0x4848DD8: mutex_unlock_WRK (hg_intercepts.c:1184)
==2511392==    by 0x10929A: f (in /net/inria/home/sthibault/test)
==2511392==    by 0x484B966: mythread_wrapper (hg_intercepts.c:406)
==2511392==    by 0x48EFFD3: start_thread (pthread_create.c:442)
==2511392==    by 0x496F8CF: clone (clone.S:100)
==2511392==
==2511392== This conflicts with a previous write of size 4 by thread #1
==2511392== Locks held: none
==2511392==    at 0x48F4A47: __pthread_mutex_unlock_usercnt (pthread_mutex_unlock.c:62)
==2511392==    by 0x48EF26D: __pthread_cond_wait_common (pthread_cond_wait.c:419)
==2511392==    by 0x48EF26D: pthread_cond_wait@@GLIBC_2.3.2 (pthread_cond_wait.c:618)
==2511392==    by 0x484BB03: pthread_cond_wait_WRK (hg_intercepts.c:1291)
==2511392==    by 0x10933B: main (in /net/inria/home/sthibault/test)
==2511392== Address 0x10c128 is 8 bytes inside data symbol "mutex"
==2511392==
==2511392== ----------------------------------------------------------------
==2511392==
==2511392== Lock at 0x10C120 was first observed
==2511392==    at 0x484CFE9: pthread_mutex_init (hg_intercepts.c:818)
==2511392==    by 0x1092EE: main (in /net/inria/home/sthibault/test)
==2511392== Address 0x10c120 is 0 bytes inside data symbol "mutex"
==2511392==
==2511392== Possible data race during read of size 8 at 0x10C0E8 by thread #1
==2511392== Locks held: none
==2511392==    at 0x48EF2C1: __atomic_wide_counter_load_relaxed (atomic_wide_counter.h:30)
==2511392==    by 0x48EF2C1: __condvar_load_g1_start_relaxed (pthread_cond_common.c:46)
==2511392==    by 0x48EF2C1: __pthread_cond_wait_common (pthread_cond_wait.c:539)
==2511392==    by 0x48EF2C1: pthread_cond_wait@@GLIBC_2.3.2 (pthread_cond_wait.c:618)
==2511392==    by 0x484BB03: pthread_cond_wait_WRK (hg_intercepts.c:1291)
==2511392==    by 0x10933B: main (in /net/inria/home/sthibault/test)
==2511392==
==2511392== This conflicts with a previous write of size 8 by thread #2
==2511392== Locks held: 1, at address 0x10C120
==2511392==    at 0x48EEE29: __atomic_wide_counter_add_relaxed (atomic_wide_counter.h:57)
==2511392==    by 0x48EEE29: __condvar_add_g1_start_relaxed (pthread_cond_common.c:52)
==2511392==    by 0x48EEE29: __condvar_quiesce_and_switch_g1 (pthread_cond_common.c:294)
==2511392==    by 0x48EEE29: pthread_cond_signal@@GLIBC_2.3.2 (pthread_cond_signal.c:77)
==2511392==    by 0x4848FB8: pthread_cond_signal_WRK (hg_intercepts.c:1570)
==2511392==    by 0x10928B: f (in /net/inria/home/sthibault/test)
==2511392==    by 0x484B966: mythread_wrapper (hg_intercepts.c:406)
==2511392==    by 0x48EFFD3: start_thread (pthread_create.c:442)
==2511392==    by 0x496F8CF: clone (clone.S:100)
==2511392== Address 0x10c0e8 is 8 bytes inside data symbol "cond"
==2511392==
==2511392== ----------------------------------------------------------------
==2511392==
==2511392== Possible data race during read of size 4 at 0x10C128 by thread #1
==2511392== Locks held: none
==2511392==    at 0x48F2472: __pthread_mutex_cond_lock (pthread_mutex_lock.c:94)
==2511392==    by 0x48EF2EF: __pthread_cond_wait_common (pthread_cond_wait.c:607)
==2511392==    by 0x48EF2EF: pthread_cond_wait@@GLIBC_2.3.2 (pthread_cond_wait.c:618)
==2511392==    by 0x484BB03: pthread_cond_wait_WRK (hg_intercepts.c:1291)
==2511392==    by 0x10933B: main (in /net/inria/home/sthibault/test)
==2511392==
==2511392== This conflicts with a previous write of size 4 by thread #2
==2511392== Locks held: none
==2511392==    at 0x48F4A47: __pthread_mutex_unlock_usercnt (pthread_mutex_unlock.c:62)
==2511392==    by 0x4848DD8: mutex_unlock_WRK (hg_intercepts.c:1184)
==2511392==    by 0x10929A: f (in /net/inria/home/sthibault/test)
==2511392==    by 0x484B966: mythread_wrapper (hg_intercepts.c:406)
==2511392==    by 0x48EFFD3: start_thread (pthread_create.c:442)
==2511392==    by 0x496F8CF: clone (clone.S:100)
==2511392== Address 0x10c128 is 8 bytes inside data symbol "mutex"
==2511392==
==2511392== ----------------------------------------------------------------
==2511392==
==2511392== Possible data race during write of size 4 at 0x10C128 by thread #1
==2511392== Locks held: none
==2511392==    at 0x48F2485: __pthread_mutex_cond_lock (pthread_mutex_lock.c:182)
==2511392==    by 0x48EF2EF: __pthread_cond_wait_common (pthread_cond_wait.c:607)
==2511392==    by 0x48EF2EF: pthread_cond_wait@@GLIBC_2.3.2 (pthread_cond_wait.c:618)
==2511392==    by 0x484BB03: pthread_cond_wait_WRK (hg_intercepts.c:1291)
==2511392==    by 0x10933B: main (in /net/inria/home/sthibault/test)
==2511392==
==2511392== This conflicts with a previous write of size 4 by thread #2
==2511392== Locks held: none
==2511392==    at 0x48F4A47: __pthread_mutex_unlock_usercnt (pthread_mutex_unlock.c:62)
==2511392==    by 0x4848DD8: mutex_unlock_WRK (hg_intercepts.c:1184)
==2511392==    by 0x10929A: f (in /net/inria/home/sthibault/test)
==2511392==    by 0x484B966: mythread_wrapper (hg_intercepts.c:406)
==2511392==    by 0x48EFFD3: start_thread (pthread_create.c:442)
==2511392==    by 0x496F8CF: clone (clone.S:100)
==2511392== Address 0x10c128 is 8 bytes inside data symbol "mutex"
==2511392==
==2511392== ----------------------------------------------------------------
==2511392==
==2511392== Possible data race during write of size 4 at 0x10C128 by thread #1
==2511392== Locks held: none
==2511392==    at 0x48F4A47: __pthread_mutex_unlock_usercnt (pthread_mutex_unlock.c:62)
==2511392==    by 0x4848DD8: mutex_unlock_WRK (hg_intercepts.c:1184)
==2511392==    by 0x109354: main (in /net/inria/home/sthibault/test)
==2511392==
==2511392== This conflicts with a previous write of size 4 by thread #2
==2511392== Locks held: none
==2511392==    at 0x48F4A47: __pthread_mutex_unlock_usercnt (pthread_mutex_unlock.c:62)
==2511392==    by 0x4848DD8: mutex_unlock_WRK (hg_intercepts.c:1184)
==2511392==    by 0x10929A: f (in /net/inria/home/sthibault/test)
==2511392==    by 0x484B966: mythread_wrapper (hg_intercepts.c:406)
==2511392==    by 0x48EFFD3: start_thread (pthread_create.c:442)
==2511392==    by 0x496F8CF: clone (clone.S:100)
==2511392== Address 0x10c128 is 8 bytes inside data symbol "mutex"
==2511392==
==2511392== ----------------------------------------------------------------
==2511392==
==2511392== Possible data race during read of size 4 at 0x10C12C by thread #1
==2511392== Locks held: none
==2511392==    at 0x48F4A52: __pthread_mutex_unlock_usercnt (pthread_mutex_unlock.c:65)
==2511392==    by 0x4848DD8: mutex_unlock_WRK (hg_intercepts.c:1184)
==2511392==    by 0x109354: main (in /net/inria/home/sthibault/test)
==2511392==
==2511392== This conflicts with a previous write of size 4 by thread #2
==2511392== Locks held: none
==2511392==    at 0x48F4A52: __pthread_mutex_unlock_usercnt (pthread_mutex_unlock.c:65)
==2511392==    by 0x4848DD8: mutex_unlock_WRK (hg_intercepts.c:1184)
==2511392==    by 0x10929A: f (in /net/inria/home/sthibault/test)
==2511392==    by 0x484B966: mythread_wrapper (hg_intercepts.c:406)
==2511392==    by 0x48EFFD3: start_thread (pthread_create.c:442)
==2511392==    by 0x496F8CF: clone (clone.S:100)
==2511392== Address 0x10c12c is 12 bytes inside data symbol "mutex"
==2511392==
==2511392== ----------------------------------------------------------------
==2511392==
==2511392== Possible data race during write of size 4 at 0x10C12C by thread #1
==2511392== Locks held: none
==2511392==    at 0x48F4A52: __pthread_mutex_unlock_usercnt (pthread_mutex_unlock.c:65)
==2511392==    by 0x4848DD8: mutex_unlock_WRK (hg_intercepts.c:1184)
==2511392==    by 0x109354: main (in /net/inria/home/sthibault/test)
==2511392==
==2511392== This conflicts with a previous write of size 4 by thread #2
==2511392== Locks held: none
==2511392==    at 0x48F4A52: __pthread_mutex_unlock_usercnt (pthread_mutex_unlock.c:65)
==2511392==    by 0x4848DD8: mutex_unlock_WRK (hg_intercepts.c:1184)
==2511392==    by 0x10929A: f (in /net/inria/home/sthibault/test)
==2511392==    by 0x484B966: mythread_wrapper (hg_intercepts.c:406)
==2511392==    by 0x48EFFD3: start_thread (pthread_create.c:442)
==2511392==    by 0x496F8CF: clone (clone.S:100)
==2511392== Address 0x10c12c is 12 bytes inside data symbol "mutex"
==2511392==

Looking at the glibc source, pthread_mutex_destroy is fairly simple. It does an atomic read of the kind and a plain read of nusers, and returns EBUSY if nusers is not 0. It then sets kind to an invalid value, -1. pthread_mutex_unlock is a lot more complicated: it also does atomic reads of the kind, decrements nusers, checks the lock, then does a CAS on the lock. If I run with --default-suppressions=no then there are huge numbers of hazards, but none related to pthread_mutex_destroy.
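The pthread_mutex_destroy logic described above can be sketched roughly as follows. This is illustrative C with a mock struct, not the actual glibc source (the real fields are __data.__kind and __data.__nusers, accessed through atomic helpers):

```c
#include <assert.h>
#include <errno.h>

/* Simplified sketch of the glibc behaviour described above: read the
 * kind (atomically in glibc), read nusers plainly, refuse with EBUSY
 * if the mutex is still in use, then poison the kind with -1. */
struct mock_mutex { int kind; unsigned int nusers; };

static int mock_mutex_destroy(struct mock_mutex *m)
{
    int kind = m->kind;      /* glibc: atomic load of __kind */
    (void)kind;
    if (m->nusers != 0)      /* plain read of __nusers       */
        return EBUSY;        /* still in use                 */
    m->kind = -1;            /* mark as destroyed            */
    return 0;
}
```

The plain read of nusers against pthread_mutex_unlock's decrement of it is exactly the kind of access pair Helgrind flags above.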
I think that is because Helgrind sees that the lock order is such that pthread_mutex_destroy is called in a serial section.

FYI, it's glibc 2.36:

- pthread_mutex_lock.c:94 is where it checks assert (mutex->__data.__owner == 0); after obtaining the lock
- pthread_mutex_lock.c:182 is where it sets mutex->__data.__owner after obtaining the lock
- pthread_mutex_unlock.c:62 is where it clears mutex->__data.__owner before releasing the lock
- pthread_mutex_unlock.c:65 is where it does --mutex->__data.__nusers; before releasing the lock

The other cond-related lines are more complex to describe, but they'll probably get fixed the same way the mutex ones get fixed. I guess the disabling of checking somehow perturbs Helgrind's history record (I guess that's part of the "This puts it in the same state as new memory allocated by this thread -- that is, basically owned exclusively by this thread." comment for the VALGRIND_HG_ENABLE_CHECKING macro).

(In reply to Samuel Thibault from comment #12)
> Yes that avoids that exact issue, but then a flurry of new ones pop up, I
> was not getting them before:
>
> ==2511392== Possible data race during read of size 4 at 0x10C128 by thread #2
> ==2511392== Locks held: none
> ==2511392==    at 0x48F31A2: pthread_mutex_lock@@GLIBC_2.2.5
> (pthread_mutex_lock.c:94)
[snip]

I would expect these to be covered by the default suppressions:

{
   helgrind-glibc2X-004
   Helgrind:Race
   obj:*/lib*/libc-2.*so*
}

and

{
   helgrind-glibc2X-005
   Helgrind:Race
   obj:*/lib*/libpthread-2.*so*
}

which are very broad and cover most of the pthread internals. I'll try to see if I can test on 2.36.

> I would expect these to be covered by the default suppressions:
> {
> helgrind-glibc2X-004
> Helgrind:Race
> obj:*/lib*/libc-2.*so*
> }
Ah, but glibc renamed libc-2.*so* to just libc.so.6, so that suppression won't take effect. Duplicating the entry with libc.so.* does indeed seem to avoid the warnings. I'll try to test on more machines and architectures.
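For illustration, the duplicated entry for the renamed library would look something like this (the entry name here is made up for the example; the actual entry added upstream may be named differently):

   {
      helgrind-glibc-libc.so
      Helgrind:Race
      obj:*/lib*/libc.so*
   }

The obj: pattern matches the new unversioned libc.so.6 naming the same way the old libc-2.*so* pattern matched versioned filenames.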
Ah, configure.ac already detects which case that is, so we need a valgrind built against the matching glibc; that's why I didn't have the proper name.

I have run the reproducer on all the various systems previously mentioned, with no issue so far (I ran it for an hour), so it seems we're good with that fix.

Now fixed:

commit 74e20007467b930fb14264519f74a45179b5f740 (HEAD -> master, origin/master, origin/HEAD)
Author: Paul Floyd <pjfloyd@wanadoo.fr>
Date:   Tue Jan 3 21:28:42 2023 +0100

    Bug 327548 - false positive while destroying mutex