Consider the following reproduction case: #include <immintrin.h> __attribute__((noinline)) int gather(const int *array, void *p, void *p2) { __m256i ret = _mm256_i32gather_epi32((const int*)array, _mm256_setzero_si256(), 4); return ret[0]; } struct bunch_of_chars { char stuff[128]; }; int main() { int* array = new int[8]{}; bunch_of_chars x, y = x; return gather(array, &x, &y); } Here, some copying of uninitialized value happens at y = x. Neither x nor y are used, so this doesn't usually trigger a warning in Valgrind (and indeed such copying of uninitalized values w/o using them is common so avoiding a false positive here is important). Then, however, we call gather() function, only only operates on zero-initialized array using a gather instruction. We only pass pointers to x and y to avoid the compiler optimizing them away completely (you could do it in other ways too, e.g., assigning them to volatile: in the real example this was derived from some parts of x and y were later used). We get a valgrind failure as follows: ==13033== Memcheck, a memory error detector ==13033== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al. ==13033== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info ==13033== Command: ./bench ==13033== ==13033== Use of uninitialised value of size 8 ==13033== at 0x400718: _mm256_i32gather_epi32 (avx2intrin.h:1598) ==13033== by 0x400718: gather(int const*, void*, void*) (main.cpp:6) ==13033== by 0x400608: main (in /home/tdowns/dev/sorted-intersections/bench) Here's the assembly for gather function: 0000000000400710 <gather(int const*, void*, void*)>: 400710: vpxor xmm1,xmm1,xmm1 400714: vpcmpeqd ymm2,ymm2,ymm2 400718: vpgatherdd ymm0,DWORD PTR [rdi+ymm1*4],ymm2 40071e: vmovq rax,xmm0 400723: vzeroupper 400726: ret 400727: nop WORD PTR [rax+rax*1+0x0] Here's an excerpt of the assembly leading up to the main function: 40057f: vmovdqa xmm0,XMMWORD PTR [rsp] 400584: vmovdqa xmm1,XMMWORD PTR [rsp+0x10] 40058a: mov rsi,rsp 40058d: lea rdx,[rsp+0x80] 400595: vmovdqa xmm2,XMMWORD PTR [rsp+0x20] 40059b: vmovdqa xmm3,XMMWORD PTR [rsp+0x30] 4005a1: mov rdi,rax 4005a4: vmovdqa xmm4,XMMWORD PTR [rsp+0x40] 4005aa: vmovdqa xmm5,XMMWORD PTR [rsp+0x50] 4005b0: vmovaps XMMWORD PTR [rsp+0x80],xmm0 4005b9: vmovdqa xmm6,XMMWORD PTR [rsp+0x60] 4005bf: vmovdqa xmm7,XMMWORD PTR [rsp+0x70] 4005c5: vmovaps XMMWORD PTR [rsp+0x90],xmm1 4005ce: vmovaps XMMWORD PTR [rsp+0xa0],xmm2 4005d7: vmovaps XMMWORD PTR [rsp+0xb0],xmm3 4005e0: vmovaps XMMWORD PTR [rsp+0xc0],xmm4 4005e9: vmovaps XMMWORD PTR [rsp+0xd0],xmm5 4005f2: vmovaps XMMWORD PTR [rsp+0xe0],xmm6 4005fb: vmovaps XMMWORD PTR [rsp+0xf0],xmm7 400604: call 400710 <gather(int const*, void*, void*)> We can see the uninitialized copy happening (all the movaps calls). Note that at this point, the xmm2 register is "tainted" since it contains uninitialized value. This is what triggers the false positive: the vgatherdd instruction uses ymm2 as input (the mask register). However, ymm2 was already set to "all ones" by the earlier 'vpcmpeqd ymm2,ymm2,ymm2' instruction, so its value is no longer tainted and has no relationship to the earlier xmm2 value. You can show that ymm2 is causing the problem by using inline assembly to insert a clearing at the start of gather: asm volatile ("vpxor %xmm2, %xmm2, %xmm2"); This has no effect on the logic, but now Valgrind is happy. Valgrind should understand vpcmpeq* with the same register for both inputs as independent of the previous value as this is common. This issue is very hard to track down because the false positive occurs at an unrelated place, potentially very far away from the original source since the xmm registers may go a long time w/o being used. SOFTWARE/OS VERSIONS Linux: Ubuntu 16.04 Valgrind: 3.11 g++: 8.1.0 compiled with: g++-8 -DNDEBUG -Wall -Wextra -O2 -g -march=haswell -Wno-unused-parameter -Wno-error=unused-variable -Wno-unknown-pragmas -std=c++14 -c -o main.o main.cpp although really it's -O2 that matters (so you get the vmovaps-based copy in main.
For info, reproduced on Ubuntu 19.04 with g++ 8.3.0 and valgrind trunk.
I think I fixed this on the 'grail' (noise-reduction) branch, but it has not yet been merged to trunk. That will happen before the next release. The commit is: commit 96de5118f5332ae145912ebe91b8fa143df74b8d Author: Julian Seward <jseward@acm.org> Date: Sat Nov 16 08:30:10 2019 +0100 Fold Iop_CmpEQ32x8(x,x) to all-1s .. .. hence treating it as a dependency-breaking idiom. Also handle the resulting IRConst_V256(0xFFFFFFFF) in the amd64 insn selector.
Fixed on the trunk, 79dd0bd6e88a65f435799d5d84165c260c9bbda7.
The following function allows detection of whether valgrind has the bug or not, when compiled with clang + AVX2 enabled: [[clang::optnone]] static bool checkValgrindBug409429Fixed() { if (RUNNING_ON_VALGRIND) { alignas(32) uint8_t m[32]; __m256i r = *reinterpret_cast<const __m256i *>(m); *reinterpret_cast<__m256i *>(m) = _mm256_cmpeq_epi8(r, r); if ((m[0] != 0xFF) || VALGRIND_COUNT_ERRORS) return false; } return true; }