disInstr(ppc): unhandled instruction: 0x7C2907EC
               primary 31(0x1F), secondary 2028(0x7EC)

dcbzl is a special case of dcbz that extends the zeroing to a full cache line (128 bytes instead of 32 on the PPC970).

Reference: http://sources.redhat.com/ml/binutils/2004-04/msg00783.html
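For reference, the bad word decodes mechanically as a dcbz with one extra bit set. A minimal standalone sketch (not part of any patch here) that pulls apart the X-form fields:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t insn    = 0x7C2907EC;             /* the unhandled word      */
    uint32_t primary = (insn >> 26) & 0x3F;    /* 31                      */
    uint32_t xo      = (insn >> 1)  & 0x3FF;   /* 1014 = dcbz             */
    uint32_t l_bit   = (insn >> 21) & 0x1;     /* the otherwise-reserved  */
                                               /* bit that makes it dcbzl */
    uint32_t rA      = (insn >> 16) & 0x1F;
    uint32_t rB      = (insn >> 11) & 0x1F;

    /* prints: primary=31 xo=1014 L=1 rA=9 rB=0 */
    printf("primary=%u xo=%u L=%u rA=%u rB=%u\n", primary, xo, l_bit, rA, rB);
    return 0;
}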
Created attachment 42748 [details] patch that fixes dcbzl support for VEX@1972
Created attachment 42749 [details] patch that fixes dcbzl support for VEX@1972 (accidental double submit, same as earlier patch)
Created attachment 42750 [details] test case for dcbzl
adding myself to CC list
Dave, what CPUs did you test this on?
I used one of the login nodes of one of the IBM Blue Gene/P systems here at Argonne. It claims to be a PPC970MP, which I trust after asking around:

% cat /etc/issue
Welcome to SUSE Linux Enterprise Server 10 SP2 (ppc) - Kernel \r (\l).

% uname -a
Linux login1 2.6.16.60-0.42.8-ppc64 #1 SMP Tue Dec 15 17:28:00 UTC 2009 ppc64 ppc64 ppc64 GNU/Linux

% cat /proc/cpuinfo
processor       : 0
cpu             : PPC970MP, altivec supported
clock           : 2497.500000MHz
revision        : 1.1 (pvr 0044 0101)

processor       : 1
cpu             : PPC970MP, altivec supported
clock           : 2497.500000MHz
revision        : 1.1 (pvr 0044 0101)

processor       : 2
cpu             : PPC970MP, altivec supported
clock           : 2497.500000MHz
revision        : 1.1 (pvr 0044 0101)

processor       : 3
cpu             : PPC970MP, altivec supported
clock           : 2497.500000MHz
revision        : 1.1 (pvr 0044 0101)

timebase        : 14318000
machine         : CHRP IBM,8844-AC1

% gcc -v
Using built-in specs.
Target: powerpc64-suse-linux
Configured with: ../configure --enable-threads=posix --prefix=/usr --with-local-prefix=/usr/local --infodir=/usr/share/info --mandir=/usr/share/man --libdir=/usr/lib --libexecdir=/usr/lib --enable-languages=c,c++,objc,fortran,obj-c++,java,ada --enable-checking=release --with-gxx-include-dir=/usr/include/c++/4.1.2 --enable-ssp --disable-libssp --disable-libgcj --with-slibdir=/lib --with-system-zlib --enable-shared --enable-__cxa_atexit --enable-libstdcxx-allocator=new --program-suffix= --enable-version-specific-runtime-libs --without-system-libunwind --with-cpu=default32 --enable-secureplt --with-long-double-128 --host=powerpc64-suse-linux
Thread model: posix
gcc version 4.1.2 20070115 (SUSE Linux)

gcc generated 32-bit code by default, which is how I compiled valgrind. If it's useful, I can rebuild with -m64. I don't have easy access to many other PPC or POWER machines, so I unfortunately can't offer to test elsewhere right now.
FWIW, I used this PDF manual for information about dcbzl on the 970MP: https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/DC3D43B729FDAD2C00257419006FB955

In particular, page 79 describes how large the zeroed cache block is: it's a function of bit 21 of the instruction and HID5[57:56]. The manual doesn't call it dcbzl; it's just a different behavior for the same dcbz instruction.

As I mentioned out of band from this ticket, I don't know the PPC/POWER family well enough to say what the right behavior is on non-970 processors. The only thing that turns a dcbz into a dcbzl is a reserved bit being set in the instruction. If non-970 cores ignore reserved bits, my patch is probably OK-ish. If that's not true, and the VEX philosophy is to closely mirror the behavior of the underlying CPU, including when exceptions would be raised, then my patch is too permissive. My patch also doesn't account for those HID5 bits.

On the mailing list, the person I was helping encountered this when using FFmpeg on an e300c4 core. Apparently FFmpeg issues the dcbzl instruction either way and just checks how many bytes it zeros (32 or 128). This is wrong IMO, but may be common practice. It also seems to run fine on that particular CPU, just zeroing 32 bytes. http://www.mail-archive.com/valgrind-users@lists.sourceforge.net/msg01552.html
http://gitorious.org/ffmpeg/mainline/blobs/master/libavcodec/ppc/dsputil_ppc.c#line187

This checks whether dcbz or dcbzl should be used and acts accordingly. If you think it could be improved, please do tell =)
Luca: I don't want to hijack this bug ticket, so I'll respond here once; if we want to continue, we should probably take the discussion to email. I'm only looking at the code through a peephole, without a full understanding of the build or deployment environment or of other experiments you may have run, so take my comments with a big grain of salt.

AFAICT, the cited code is implementing a bzero or memset(X,0,Y) for 768 (2*6*64) bytes. On OS X, as on most platforms, the standard library memset/bzero has been carefully optimized and even uses dcbz/dcbzl [1]. I suspect that the system memset performs nearly as well as, or better than, the cited code. However, I'm guessing that wasn't true at some point in the past, otherwise you wouldn't have replaced the clear_blocks_c function.

If the clear_blocks_dcbz* functions do still perform better, then I would probably do something like what you have there, although possibly at configure time instead. It really comes down to whether the dcbzl instruction is safe to use on non-970 platforms, and I just don't know the answer to that question. We have at least one example where it is, but the processor manuals seem to say that this is undefined behavior.

I might also install a SIGILL handler when checking dcbzl support in check_dcbzl_effect, although if you've never had any problems reported, that may just be a waste of time. Other than switching to memset/bzero, none of this will fix things under valgrind either way, since the lack of dcbzl support is a current VEX shortcoming (without the patch, anyway).

In summary: if it's fast and nobody is complaining, leave well enough alone :-)

[1] http://www.opensource.apple.com/source/xnu/xnu-1456.1.26/osfmk/ppc/commpage/memset_g5.s
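Following up on the SIGILL point: the guard I have in mind is the usual sigsetjmp pattern. A rough, self-contained sketch (illustrative only, not the FFmpeg code, and it assumes the assembler knows the dcbzl mnemonic):

#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf jb;

static void on_sigill(int sig)
{
    siglongjmp(jb, 1);
}

/* Returns how many bytes one dcbzl zeroes, or 0 if it raises SIGILL. */
static int probe_dcbzl(void)
{
    static char buf[512] __attribute__((aligned(256)));
    struct sigaction sa, old;
    int i, zeroed = 0;

    memset(buf, 0xFF, sizeof(buf));
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigill;
    sigaction(SIGILL, &sa, &old);

    if (sigsetjmp(jb, 1) == 0) {
        char *p = buf + 256;   /* 256-aligned, with 256 bytes of room */
        __asm__ __volatile__("dcbzl 0,%0" : : "r"(p) : "memory");
        for (i = 0; i < (int)sizeof(buf); i++)
            if (buf[i] == 0)
                zeroed++;
    }
    sigaction(SIGILL, &old, NULL);
    return zeroed;
}

That way a CPU that faults on the reserved bit reports "no dcbzl" instead of killing the process.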
What concerns me is this: http://developer.apple.com/hardwaredrivers/ve/g5.html

   "There is a dcbzl instruction newly introduced for the G5 that zeros a full
    128-byte cacheline. On a G4, it behaves as a dcbz, and zeros a 32 byte
    cacheline. Code that uses dcbzl needs to dynamically determine the
    cacheline size at runtime and respond accordingly."

That means we can't hardwire 128 into the vex patch.

There is a way to do this right, but it's a whole lot more hassle than Dave's current patch, possibly a couple of hours of hacking around. In VEX/pub/libvex.h there is a type VexArchInfo, which contains info about the host that Vex needs to know. One of those bits of info is ppc_cache_line_szB, which is the size cleared by dcbz, as told to us by the kernel at startup. This struct is passed down into guest_ppc_toIR.c, and so it can see that value. So we could add another field indicating what the line size for dcbzl is, assuming it is supported.

Problem is, I don't know where to get that information from, apart from trying the instruction on a buffer full of non-zero bytes and seeing how much gets zeroed. If you want to go that route, the place to add the test is VG_(machine_get_hwcaps) in coregrind/m_machine.c. If you do add that, it would be nice to test dcbz too, since I was always a bit nervous about simply relying on what the kernel says dcbz's behavior is, without verifying it.
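To make that concrete, the measurement itself could be as simple as the sketch below (measure_zeroed is a hypothetical helper, not existing valgrind code; in VG_(machine_get_hwcaps) the dcbzl case would additionally sit behind a SIGILL guard like the one sketched above):

#include <string.h>

/* Fill a buffer with non-zero bytes, execute one dcbz or dcbzl on an
   aligned address inside it, and count how many bytes were zeroed.
   On a 970 the expected answers are 32 for dcbz and 128 for dcbzl. */
static int measure_zeroed(int use_dcbzl)
{
    static char buf[1024] __attribute__((aligned(512)));
    char *p = buf + 512;              /* aligned, slack on both sides */
    int i, n = 0;

    memset(buf, 0xFF, sizeof(buf));
    if (use_dcbzl)
        __asm__ __volatile__("dcbzl 0,%0" : : "r"(p) : "memory");
    else
        __asm__ __volatile__("dcbz 0,%0" : : "r"(p) : "memory");
    for (i = 0; i < (int)sizeof(buf); i++)
        if (buf[i] == 0)
            n++;
    return n;
}

The two results would then populate the existing dcbz size field and the proposed dcbzl field in VexArchInfo.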
Created attachment 47832 [details] patch implementing run-time dcbz/dcbzl behavior detection against valgrind@11140 / VEX@1981
I've attached a new patch that implements Julian's suggestion. It worked fine against valgrind@11140/VEX@1981, but fails for 11164/1982 due to the (unrelated) "--build-id=none" error. It looks like the AC_TRY_COMPILE in r11164 should have been an AC_TRY_LINK (or the newer AC_LINK_IFELSE): AC_TRY_COMPILE doesn't run the linker, IIRC, so it can't tell whether the linker actually accepts that flag.
(In reply to comment #13) Looks good at a first glance; thanks. I'll try to look over it again in the coming week.
I happened to be poking around in VEX for something else and noticed that I probably missed initializing those new fields in VEX/priv/main_main.c, in the LibVEX_default_VexArchInfo function. I don't know if there's ever a case where this would matter, but it's easy to add and slightly improves safety.
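Concretely, I mean something like this trimmed-down sketch (the struct is cut to just the relevant fields, and the new field names follow my patch rather than necessarily matching trunk):

typedef unsigned int UInt;

/* Cut-down stand-in for the real VexArchInfo in VEX/pub/libvex.h. */
typedef struct {
    UInt hwcaps;
    UInt ppc_dcbz_szB;    /* bytes zeroed by dcbz  (0 = unknown)             */
    UInt ppc_dcbzl_szB;   /* bytes zeroed by dcbzl (0 = unknown/unsupported) */
} VexArchInfo;

/* The defaulting I'd add in VEX/priv/main_main.c. */
void LibVEX_default_VexArchInfo ( VexArchInfo* vai )
{
    vai->hwcaps        = 0;
    vai->ppc_dcbz_szB  = 0;
    vai->ppc_dcbzl_szB = 0;
}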
Committed, vex r2028, valgrind r11337. Thanks for the patch and sorry to be so slow with it.