Bug 294285

Summary: --partial-loads-ok does not work for 16-byte SSE loads
Product: [Developer tools] valgrind
Component: memcheck
Reporter: Patrick J. LoPresti <lopresti>
Assignee: Julian Seward <jseward>
Status: RESOLVED FIXED
Severity: normal
Priority: NOR
CC: akruppa, bugfeed, cpigat242, dejanjevtic87, mark, mips32r2
Version: 3.7 SVN
Target Milestone: ---
Platform: Unlisted Binaries
OS: Linux
Attachments:
  - main() for test
  - SSE4.2 assembly code for open-coded "strlen" generated by ICC
  - [VEX] Support V128 returns from dirty helpers (AMD64 only)
  - memcheck patch to use V128 helper for expr2vbits_Load
  - memcheck patch to use V128 helper for expr2vbits_Load (v2)
  - Better main() for test
  - Patch for 128/256 bit returns, vex side, 01 Aug 2013, x86/amd64/arm/ppc32/ppc64
  - Patch for 128/256 bit returns, valgrind side, 01 Aug 2013, x86/amd64/arm/ppc32/ppc64
  - MIPS part of the patch
  - s390 part of the patch
  - Generic mc_LOADVx implementation
  - Add support for V256 partial-loads-ok (memcheck)
  - Add support for V256 partial-loads-ok (VEX)

Description Patrick J. LoPresti 2012-02-17 05:44:25 UTC
Created attachment 68861 [details]
main() for test

Version:           3.7 SVN (using Devel) 
OS:                Linux

Summary pretty much says it all.

The Intel compiler tends to generate these sorts of aligned loads (past the end of a block) all the time.  I am attaching a test case, including a .s file containing code essentially identical to that produced by "icc -O3 -xSSE4.2".

Reproducible: Always

Steps to Reproduce:
gcc -o test test.c my_strlen.s
valgrind --partial-loads-ok=yes ./test


Actual Results:  
==7290== Invalid read of size 8
==7290==    at 0x40056F: my_strlen (in /home/ploprest/valgrind-bug/test)
==7290==    by 0x400515: main (in /home/ploprest/valgrind-bug/test)
==7290==  Address 0x4c1f048 is 7 bytes after a block of size 1 alloc'd
==7290==    at 0x4A07223: malloc (vg_replace_malloc.c:263)
==7290==    by 0x400507: main (in /home/ploprest/valgrind-bug/test)


Expected Results:  
No errors, since the data in the "invalid" region past the memory block is not being used.


Note that without "--partial-loads-ok=yes", the same error occurs, except that it is reported at byte 0.  It looks like Valgrind treats the 16-byte load as a pair of 8-byte loads.

Note that this might also provide a decent work-around for bug 264936.
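
For reference, here is a minimal C sketch of the load pattern in question (my own illustration using SSE2 intrinsics, not the attached .s file; the function name is made up).  Every 16-byte load is aligned, so it can never cross a page boundary, but the final load of a short string may extend past the end of the heap block even though no byte past the NUL terminator is ever used:

  #include <emmintrin.h>   /* SSE2 intrinsics */
  #include <stddef.h>
  #include <stdint.h>

  size_t vec_strlen(const char *s)
  {
      /* Round down to a 16-byte boundary; mask off NUL matches that
         fall before the start of the string. */
      const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)15);
      unsigned mask = _mm_movemask_epi8(
          _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p),
                         _mm_setzero_si128()));
      mask &= ~0u << (s - p);
      while (mask == 0) {
          p += 16;   /* still aligned; this load may run past the end
                        of the allocation (the access Memcheck reports) */
          mask = _mm_movemask_epi8(
              _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p),
                             _mm_setzero_si128()));
      }
      return (size_t)(p + __builtin_ctz(mask) - s);   /* GCC/Clang builtin */
  }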
Comment 1 Patrick J. LoPresti 2012-02-17 05:45:09 UTC
Created attachment 68862 [details]
SSE4.2 assembly code for open-coded "strlen" generated by ICC
Comment 2 Patrick J. LoPresti 2012-11-11 01:25:24 UTC
Created attachment 75166 [details]
[VEX] Support V128 returns from dirty helpers (AMD64 only)

This patch to host_amd64_isel.c adds support for "dirty" helper functions that return a V128 union.
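
For context, the shape this enables is roughly the following (a hypothetical declaration, not taken from the patch; the helper name is made up, while V128 is VEX's 128-bit union type from libvex_basictypes.h).  The point is that such a helper sees the whole 16-byte access at once, so partial-loads-ok logic can be applied across the full width instead of per 8-byte half:

  /* Hypothetical V128-returning dirty helper of the kind this patch
     makes expressible; illustrative only. */
  V128 helperc_example_LOADV128le ( Addr a );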
Comment 3 Patrick J. LoPresti 2012-11-11 01:28:34 UTC
Created attachment 75167 [details]
memcheck patch to use V128 helper for expr2vbits_Load

Together with the VEX patch, this patch fixes the test case attached to this bug without introducing new regressions on AMD64.

Of course, it will also break all other architectures until VEX is updated to support V128 dirty helpers on all of them.

Also it leaves intact the "four I64 loads" approach for V256 (AVX).
Comment 4 Patrick J. LoPresti 2012-11-11 02:09:42 UTC
Created attachment 75168 [details]
memcheck patch to use V128 helper for expr2vbits_Load (v2)

Previous patch had a bug.  Fixed.  Better test case coming soon, too...
Comment 5 Patrick J. LoPresti 2012-11-11 02:14:14 UTC
Created attachment 75169 [details]
Better main() for test

To compile, run

  gcc -o test-my-strlen test-my-strlen.c my_strlen.s

With existing Valgrind, this reports four "Invalid read of size 8" errors.  With the patches on this bug, it reports four "Invalid read of size 16" errors.  With the patches plus "--partial-loads-ok=yes", it reports none, which is correct :-)

(This assumes strdup() returns a 16-byte-aligned memory block, which is likely since malloc() returns storage with suitable alignment for anything.  This test case also prints out the pointers so you can see that the low nibble is zero.)
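
The attached driver itself is not inlined here, but it is presumably something along these lines (a hypothetical reconstruction, not the actual attached file):

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  size_t my_strlen(const char *s);   /* the SSE4.2 routine from my_strlen.s */

  int main(void)
  {
      /* Four strings whose final aligned 16-byte load in my_strlen
         runs past the end of the heap block. */
      const char *samples[] = { "", "a", "hello, world", "0123456789abcdefg" };
      for (int i = 0; i < 4; i++) {
          char *p = strdup(samples[i]);
          /* Print the pointer so the zero low nibble (16-byte
             alignment) is visible. */
          printf("%p -> %zu\n", (void *)p, my_strlen(p));
          free(p);
      }
      return 0;
  }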
Comment 6 akruppa 2013-02-26 16:04:41 UTC
*** This bug has been confirmed by popular vote. ***
Comment 7 akruppa 2013-02-26 16:09:29 UTC
I ran into this bug with the GMP library, which uses a 16-byte SSE memory copy.  The problem is being discussed on the GMP bugs list; see
http://gmplib.org/list-archives/gmp-bugs/2013-February/002959.html
Patrick's patch, applied to Valgrind 3.8.1, results in the expected behaviour.
Comment 8 Julian Seward 2013-07-03 16:56:24 UTC
On the whole, this isn't very difficult.  We'll have to monkey around with the
insn selectors for all the back ends, but that's mostly tedious.

The real difficulty is (as usual) writing a comprehensive, portable test case
to verify that nothing got screwed up.
Comment 9 Julian Seward 2013-08-01 15:09:52 UTC
Created attachment 81508 [details]
Patch for 128/256 bit returns, vex side, 01 Aug 2013, x86/amd64/arm/ppc32/ppc64
Comment 10 Julian Seward 2013-08-01 15:12:28 UTC
Created attachment 81509 [details]
Patch for 128/256 bit returns, valgrind side, 01 Aug 2013, x86/amd64/arm/ppc32/ppc64

Includes Patrick's original core 128-bit PLO patch for Memcheck,
and a comprehensive test case in memcheck/tests/test-plo-vec.c,
including reference 32- and 64-bit little-endian outputs
(not currently wired up)
Comment 11 Petar Jovanovic 2013-08-02 15:30:06 UTC
Created attachment 81529 [details]
MIPS part of the patch

Apply the patches from comments #9 and #10 first, then this one.
Comment 12 Julian Seward 2013-08-03 08:23:55 UTC
(In reply to comment #11)
> Created attachment 81529 [details]
> MIPS part of the patch

Petar, thanks for that.  I just had a look; it seems OK.  One thing, though:
the _toIR.c bit removes IR generation (afaics, a pass-through to the host)
for a SYNC instruction.  Is that intended to be part of the patch? 
(and, unrelatedly, is it safe, not to pass the sync through to the host?)
Comment 13 Petar Jovanovic 2013-08-04 02:15:26 UTC
(In reply to comment #12)

@Julian,
yes, removing the SYNC dirty helper is an intended part of the patch.  It may not have been thoroughly thought through: we previously ignored the SYNCI instruction, and now we ignore SYNC as well.  SYNC orders load/store accesses to shared memory (completion barriers and ordering barriers).
If it is not safe, we would really benefit from seeing a Valgrind issue in which it is needed.  All the tests pass without it, though.
Comment 14 Petar Jovanovic 2013-08-04 15:43:50 UTC
(In reply to comment #13)

Just to clarify the previous comment: we do plan to run additional tests and investigate further whether it is required, but the approach in this change is to ignore the SYNC/SYNCI instructions (as they do not appear to be needed) and then try to trigger an error with additional test examples.
Comment 15 Maran 2013-08-08 05:35:14 UTC
Created attachment 81604 [details]
s390 part of the patch

The patch includes:
1) Introduction of the RetLoc framework for s390
2) Moving the insn selection that copies the return value into the dst register out of the helper-call insn selection
3) s390 support for IRExprP__BBPTR and IRExprP__VECRET
Comment 16 Julian Seward 2013-08-08 11:00:11 UTC
Committed:
   r2739   (IR and JIT front/middle/backend changes)
   r13488  The basic Memcheck fix, from Patrick
   r13489  Comprehensive test case (not properly connected up)

Thanks Patrick, Petar, Maran.

Do we want to also fix the 256-bit return case, so we don't have to come
back to this swamp later?  Or wait for another day?  The infrastructure
to do it is all in place, so it should be easy.
Comment 17 Julian Seward 2013-08-12 10:43:04 UTC
Added proper test cases for amd64: r13491.
Comment 18 Patrick J. LoPresti 2013-08-13 01:09:14 UTC
Created attachment 81686 [details]
Generic mc_LOADVx implementation

This patch creates a generic mc_LOADVx() and mc_LOADVx_slow(), taking the number of bits as an argument.  By itself, this patch adds no functionality, but with it applied, helperc_LOADV256be, helperc_LOADV256le, helperc_LOADV512be, and so forth become one-line additions.

I have verified that the regression tests on AMD64 still pass after applying this.
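
If the generic routine lands, the promised one-line additions would presumably look something like this (hypothetical: mc_LOADVx's exact signature is not shown in this report, so the argument list here is guessed):

  /* Hypothetical one-line wrapper on top of the generic mc_LOADVx();
     the wrapper name follows the comment above, but the mc_LOADVx
     signature is assumed, not taken from the patch. */
  VG_REGPARM(2) void MC_(helperc_LOADV256le) ( /*OUT*/V256* res, Addr a )
  {
      mc_LOADVx( (ULong*)res, a, 256/*nBits*/, False/*isBigEndian*/ );
  }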
Comment 19 Julian Seward 2013-08-14 12:46:30 UTC
Seems to confuse iterating over bytes with iterating over ULongs?

-      for (j=0 ; j<2 ; ++j) {
+      for (j=0 ; j<nBytes ; ++j) {
          sm       = get_secmap_for_reading_low(a + 8*j);
          sm_off16 = SM_OFF_16(a + 8*j);

Segfaults for me on Fedora 17.
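
Presumably the intended bound is the number of 64-bit chunks rather than bytes, i.e. something like (untested sketch):

-      for (j=0 ; j<nBytes ; ++j) {
+      for (j=0 ; j<nBytes/8 ; ++j) {
          sm       = get_secmap_for_reading_low(a + 8*j);
          sm_off16 = SM_OFF_16(a + 8*j);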
Comment 20 Patrick J. LoPresti 2013-08-14 23:37:06 UTC
Created attachment 81711 [details]
Add support for V256 partial-loads-ok (memcheck)

This patch adds support for partial-loads-ok on 32-byte aligned accesses. Includes an update to the test case.
Comment 21 Patrick J. LoPresti 2013-08-14 23:40:13 UTC
Created attachment 81712 [details]
Add support for V256 partial-loads-ok (VEX)

VEX part of V256 partial-loads-ok patch. Adds support for V256 * arguments to dirty helpers. Only for AMD64 (do any other architectures have 256-bit registers?)
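
For reference, on the Memcheck side this pairs with helpers of roughly the following shape: the instrumented load passes a pointer through which the helper writes the 256-bit vbits result, instead of returning it by value.  This is a sketch; the helper name is illustrative and the exact committed signature may differ.  V256 is VEX's 256-bit union type:

  /* Sketch of a V256 out-parameter dirty helper of the kind this
     patch supports; illustrative, not copied from the patch. */
  VG_REGPARM(2)
  void helperc_example_LOADV256le ( /*OUT*/ V256* res, Addr a );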
Comment 22 Julian Seward 2013-08-16 08:35:35 UTC
(In reply to comment #20)
> Created attachment 81711 [details]
> Add support for V256 partial-loads-ok (memcheck)

Committed, r13500, r13501.
Comment 23 Julian Seward 2013-08-16 08:36:21 UTC
(In reply to comment #21)
> Created attachment 81712 [details]
> Add support for V256 partial-loads-ok (VEX)

Committed, r2743.  Thanks for the patches.
Comment 24 Patrick J. LoPresti 2023-11-16 22:56:48 UTC
Thank you!