disInstr(ppc): unhandled instruction: 0x7C2907EC
               primary 31(0x1F), secondary 2028(0x7EC)

dcbzl is a special case of dcbz that extends the zeroing to a full cache line (128 bytes instead of 32 on the PPC970).

Reference: http://sources.redhat.com/ml/binutils/2004-04/msg00783.html
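For reference, the bad word decodes mechanically as a dcbz with one extra bit set. A minimal standalone sketch (not part of any patch here) that pulls apart the X-form fields:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t insn    = 0x7C2907EC;             /* the unhandled word      */
    uint32_t primary = (insn >> 26) & 0x3F;    /* 31                      */
    uint32_t xo      = (insn >> 1)  & 0x3FF;   /* 1014 = dcbz             */
    uint32_t l_bit   = (insn >> 21) & 0x1;     /* the otherwise-reserved  */
                                               /* bit that makes it dcbzl */
    uint32_t rA      = (insn >> 16) & 0x1F;
    uint32_t rB      = (insn >> 11) & 0x1F;

    /* prints: primary=31 xo=1014 L=1 rA=9 rB=0 */
    printf("primary=%u xo=%u L=%u rA=%u rB=%u\n", primary, xo, l_bit, rA, rB);
    return 0;
}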
Created attachment 42748 [details] patch that fixes dcbzl support for VEX@1972
Created attachment 42749 [details] patch that fixes dcbzl support for VEX@1972 (accidental double submit, same as earlier patch)
Created attachment 42750 [details] test case for dcbzl
adding myself to CC list
Dave, what CPUs did you test this on?
I used one of the login nodes of one of the IBM Blue Gene/P systems here at Argonne. It claims to be a PPC970MP, which I trust after asking around:

% cat /etc/issue
Welcome to SUSE Linux Enterprise Server 10 SP2 (ppc) - Kernel \r (\l).

% uname -a
Linux login1 2.6.16.60-0.42.8-ppc64 #1 SMP Tue Dec 15 17:28:00 UTC 2009 ppc64 ppc64 ppc64 GNU/Linux

% cat /proc/cpuinfo
processor       : 0
cpu             : PPC970MP, altivec supported
clock           : 2497.500000MHz
revision        : 1.1 (pvr 0044 0101)

processor       : 1
cpu             : PPC970MP, altivec supported
clock           : 2497.500000MHz
revision        : 1.1 (pvr 0044 0101)

processor       : 2
cpu             : PPC970MP, altivec supported
clock           : 2497.500000MHz
revision        : 1.1 (pvr 0044 0101)

processor       : 3
cpu             : PPC970MP, altivec supported
clock           : 2497.500000MHz
revision        : 1.1 (pvr 0044 0101)

timebase        : 14318000
machine         : CHRP IBM,8844-AC1

% gcc -v
Using built-in specs.
Target: powerpc64-suse-linux
Configured with: ../configure --enable-threads=posix --prefix=/usr --with-local-prefix=/usr/local --infodir=/usr/share/info --mandir=/usr/share/man --libdir=/usr/lib --libexecdir=/usr/lib --enable-languages=c,c++,objc,fortran,obj-c++,java,ada --enable-checking=release --with-gxx-include-dir=/usr/include/c++/4.1.2 --enable-ssp --disable-libssp --disable-libgcj --with-slibdir=/lib --with-system-zlib --enable-shared --enable-__cxa_atexit --enable-libstdcxx-allocator=new --program-suffix= --enable-version-specific-runtime-libs --without-system-libunwind --with-cpu=default32 --enable-secureplt --with-long-double-128 --host=powerpc64-suse-linux
Thread model: posix
gcc version 4.1.2 20070115 (SUSE Linux)

gcc generated 32-bit code by default, which is how I compiled valgrind. If it's useful, I can rebuild with -m64. I don't have easy access to many other PPC or POWER machines, so I unfortunately can't offer to test elsewhere right now.
FWIW, I used this PDF manual for information about dcbzl on the 970MP: https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/DC3D43B729FDAD2C00257419006FB955

In particular, page 79 describes how large the zeroed cache block is: it's a function of bit 21 of the instruction and HID5[57:56]. The manual doesn't call it dcbzl; it's just a different behavior for the same dcbz instruction.

As I mentioned out of band from this ticket, I don't know the PPC/POWER family well enough to say what the right behavior is on non-970 processors. The only thing that turns a dcbz into a dcbzl is a reserved bit being set in the instruction. If non-970 cores ignore reserved bits, my patch is probably OK-ish. If that's not true, and the VEX philosophy is to closely mirror the behavior of the underlying CPU, including when exceptions would be raised, then my patch is too permissive. My patch also doesn't account for those HID5 bits.

On the mailing list, the person I was helping encountered this when using FFmpeg on an e300c4 core. Apparently FFmpeg issues the dcbzl instruction either way and just checks how many bytes it zeros (32 or 128). This is wrong IMO, but may be common practice. It also seems to run fine on that particular CPU, just zeroing 32 bytes. http://www.mail-archive.com/valgrind-users@lists.sourceforge.net/msg01552.html
http://gitorious.org/ffmpeg/mainline/blobs/master/libavcodec/ppc/dsputil_ppc.c#line187

This checks whether dcbz or dcbzl should be used and acts accordingly. If you think it could be improved, please do tell =)
Luca: I don't want to hijack this bug ticket, so I'll respond here once; if we want to continue, we should probably take the discussion to email. I'm only looking at the code through a peephole, without a full understanding of the build or deployment environment or of other experiments you may have run, so take my comments with a big grain of salt.

AFAICT, the cited code is implementing a bzero or memset(X,0,Y) for 768 (2*6*64) bytes. On OS X, as on most platforms, the standard library memset/bzero has been carefully optimized and even uses dcbz/dcbzl [1]. I suspect that the system memset performs nearly as well as, or better than, the cited code. However, I'm guessing that wasn't true at some point in the past, otherwise you wouldn't have replaced the clear_blocks_c function.

If the clear_blocks_dcbz* functions do still perform better, then I would probably do something like what you have there, although possibly at configure time instead. It really comes down to whether the dcbzl instruction is safe to use on non-970 platforms, and I just don't know the answer to that question. We have at least one example where it is, but the processor manuals seem to say that this is undefined behavior.

I might also install a SIGILL handler when checking dcbzl support in check_dcbzl_effect, although if you've never had any problems reported, that may just be a waste of time. Other than switching to memset/bzero, none of this will fix things under valgrind either way, since the lack of dcbzl support is a current VEX shortcoming (without the patch, anyway).

In summary: if it's fast and nobody is complaining, leave well enough alone :-)

[1] http://www.opensource.apple.com/source/xnu/xnu-1456.1.26/osfmk/ppc/commpage/memset_g5.s
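Following up on the SIGILL point: the guard I have in mind is the usual sigsetjmp pattern. A rough, self-contained sketch (illustrative only, not the FFmpeg code, and it assumes the assembler knows the dcbzl mnemonic):

#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf jb;

static void on_sigill(int sig)
{
    siglongjmp(jb, 1);
}

/* Returns how many bytes one dcbzl zeroes, or 0 if it raises SIGILL. */
static int probe_dcbzl(void)
{
    static char buf[512] __attribute__((aligned(256)));
    struct sigaction sa, old;
    int i, zeroed = 0;

    memset(buf, 0xFF, sizeof(buf));
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigill;
    sigaction(SIGILL, &sa, &old);

    if (sigsetjmp(jb, 1) == 0) {
        char *p = buf + 256;   /* 256-aligned, with 256 bytes of room */
        __asm__ __volatile__("dcbzl 0,%0" : : "r"(p) : "memory");
        for (i = 0; i < (int)sizeof(buf); i++)
            if (buf[i] == 0)
                zeroed++;
    }
    sigaction(SIGILL, &old, NULL);
    return zeroed;
}

That way a CPU that faults on the reserved bit reports "no dcbzl" instead of killing the process.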
What concerns me is this: http://developer.apple.com/hardwaredrivers/ve/g5.html

   "There is a dcbzl instruction newly introduced for the G5 that zeros a full
    128-byte cacheline. On a G4, it behaves as a dcbz, and zeros a 32 byte
    cacheline. Code that uses dcbzl needs to dynamically determine the
    cacheline size at runtime and respond accordingly."

That means we can't hardwire 128 into the vex patch.

There is a way to do this right, but it's a whole lot more hassle than Dave's current patch, possibly a couple of hours of hacking around. In VEX/pub/libvex.h there is a type VexArchInfo, which contains info about the host that Vex needs to know. One of those bits of info is ppc_cache_line_szB, which is the size cleared by dcbz, as told to us by the kernel at startup. This struct is passed down into guest_ppc_toIR.c, and so it can see that value. So we could add another field indicating what the line size for dcbzl is, assuming it is supported.

Problem is, I don't know where to get that information from, apart from trying the instruction on a buffer full of non-zero bytes and seeing how much gets zeroed. If you want to go that route, the place to add the test is VG_(machine_get_hwcaps) in coregrind/m_machine.c. If you do add that, it would be nice to test dcbz too, since I was always a bit nervous about simply relying on what the kernel says dcbz's behavior is, without verifying it.
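To make that concrete, the measurement itself could be as simple as the sketch below (measure_zeroed is a hypothetical helper, not existing valgrind code; in VG_(machine_get_hwcaps) the dcbzl case would additionally sit behind a SIGILL guard like the one sketched above):

#include <string.h>

/* Fill a buffer with non-zero bytes, execute one dcbz or dcbzl on an
   aligned address inside it, and count how many bytes were zeroed.
   On a 970 the expected answers are 32 for dcbz and 128 for dcbzl. */
static int measure_zeroed(int use_dcbzl)
{
    static char buf[1024] __attribute__((aligned(512)));
    char *p = buf + 512;              /* aligned, slack on both sides */
    int i, n = 0;

    memset(buf, 0xFF, sizeof(buf));
    if (use_dcbzl)
        __asm__ __volatile__("dcbzl 0,%0" : : "r"(p) : "memory");
    else
        __asm__ __volatile__("dcbz 0,%0" : : "r"(p) : "memory");
    for (i = 0; i < (int)sizeof(buf); i++)
        if (buf[i] == 0)
            n++;
    return n;
}

The two results would then populate the existing dcbz size field and the proposed dcbzl field in VexArchInfo.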
Created attachment 47832 [details] patch implementing run-time dcbz/dcbzl behavior detection against valgrind@11140 / VEX@1981
I've attached a new patch that implements Julian's suggestion. It worked fine against valgrind@11140/VEX@1981, but fails for 11164/1982 due to the (unrelated) "--build-id=none" error. It looks like the AC_TRY_COMPILE in r11164 should have been an AC_TRY_LINK (or the newer AC_LINK_IFELSE): AC_TRY_COMPILE doesn't run the linker, IIRC, so it can't tell whether the linker actually accepts that flag.
(In reply to comment #13) Looks good at a first glance; thanks. I'll try to look over it again in the coming week.
I happened to be poking around in VEX for something else and noticed that I probably missed initializing those new fields in VEX/priv/main_main.c, in the LibVEX_default_VexArchInfo function. I don't know if there's ever a case where this would matter, but it's easy to add and slightly improves safety.
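Concretely, I mean something like this trimmed-down sketch (the struct is cut to just the relevant fields, and the new field names follow my patch rather than necessarily matching trunk):

typedef unsigned int UInt;

/* Cut-down stand-in for the real VexArchInfo in VEX/pub/libvex.h. */
typedef struct {
    UInt hwcaps;
    UInt ppc_dcbz_szB;    /* bytes zeroed by dcbz  (0 = unknown)             */
    UInt ppc_dcbzl_szB;   /* bytes zeroed by dcbzl (0 = unknown/unsupported) */
} VexArchInfo;

/* The defaulting I'd add in VEX/priv/main_main.c. */
void LibVEX_default_VexArchInfo ( VexArchInfo* vai )
{
    vai->hwcaps        = 0;
    vai->ppc_dcbz_szB  = 0;
    vai->ppc_dcbzl_szB = 0;
}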
Committed, vex r2028, valgrind r11337. Thanks for the patch and sorry to be so slow with it.