Version: unspecified OS: Linux Hello, here is an initial patch-set to add support for zSeries (aka IBM mainframe) to valgrind. The patch is against r11140 VEX r1981 and consists of three parts - patch for configury and make machinery - VEX patch - everything else In this initial version we are targeting the following tools: memcheck, massif, lackey, and none. The set of supported instructions is fairly complete with a few of the complex ones still missing. Binary floating point is supported. This is work in progress and not ready for primetime. But we wanted to announce it to get the review started. If you want to play with the patch please consult the file README.s390 for hardware requirements and limitations. Comments are welcome. Reproducible: Didn't try
Created attachment 48535 [details] Patch for configury and make
Created attachment 48536 [details] Patch for VEX
Created attachment 48537 [details] Patch for everything else
Created attachment 48600 [details] Potential fix for the last test case failure in none After some fixing this is the current status for the test suite for none. I only get a failure in faultstatus. faultstatus fails in 3 of the 4 cases, but also without valgrind: - one problem is an s390 kernel bug (the si_code for SIGBUS is broken) - the hardware does not deliver the page offset during a page fault - the ftruncate does not work because we open the temp file without write access. So the hardware can deliver two faults: SIGSEGV because of write access and SIGBUS because of no file backing. Turns out that on s390 we get SIGBUS for test2 instead of SIGSEV. This is a potential fix for 2 of the 3 failures.
Created attachment 49510 [details] Version2: patch to implement VEX for s390x
Created attachment 49511 [details] Version2: build system patch
Created attachment 49512 [details] Version2: coregrind and tools
Created attachment 49514 [details] Version2: Testsuite patches This is the last patch of the newest snapshot for valgrind on s390x. Several bugfixes and enhancements - make drd and helgrind work better by fixing noredir and the EX trampoline - make the EX trampoline robust against signal handler containing EX - new instructions (cvb,cvd,exrl) - a start to move our test cases into the regression suite - implement the s390x parts of other testcases - fix signal handling during system call - implement gdb attach - implement core dump - fix system calls with 6 parameters - implement hw capabilities by testing instructions - implement mkLazy3: I32 x I128 x I128 -> I128 - fix ptrace - improve the dispatcher code for performance The main todo is to fix MVC. MVC has two modes: 1. destructive overlap (looks like a bytewise left to right copy. Thats what we currently have) - used for memset 2. no destructive overlap (access concurrent for other CPUs on double word (64bit) boundary - used for memcopy and memory2memory assigments The current version implements 1, since it will also work for 2. But this will cause tests to see 1 byte read/writes instead of 2,4 or 8 byte reads if gcc used the mvc. The testsuite: massif: 0 errors lackey: 0 errors cachegrind: 0 errors none: 1 error left (known kernel problem - wrong si code for SIGBUS) memcheck: 5 errors left (mostly mvc related) helgrind: 6 errors left (also mostly mvc related) drd: 18 errors (not yet checked why) no support for exp-bbv,exp-ptrcheck and callgrind yet. As a funny side note: valgrind triggered a gcc error in gcc 4.6. Reported and fixed: (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45017)
Created attachment 49547 [details] experimental patch to fix mvc Here is a patch that implements mvc differently. With this patch the performance is worse, but the test suite looks good, only callgrind and exp-ptrcheck are still not working. == 439 tests, 38 stderr failures, 7 stdout failures, 0 post failures == callgrind/tests/clreq (stderr) callgrind/tests/notpower2-hwpref (stderr) callgrind/tests/notpower2-use (stderr) callgrind/tests/notpower2-wb (stderr) callgrind/tests/notpower2 (stderr) callgrind/tests/simwork1 (stdout) callgrind/tests/simwork1 (stderr) callgrind/tests/simwork2 (stdout) callgrind/tests/simwork2 (stderr) callgrind/tests/simwork3 (stdout) callgrind/tests/simwork3 (stderr) callgrind/tests/threads (stderr) drd/tests/tc04_free_lock (stderr) drd/tests/tc09_bad_unlock (stderr) drd/tests/tc23_bogus_condwait (stderr) exp-ptrcheck/tests/bad_percentify (stdout) exp-ptrcheck/tests/bad_percentify (stderr) exp-ptrcheck/tests/base (stderr) exp-ptrcheck/tests/ccc (stderr) exp-ptrcheck/tests/fp (stderr) exp-ptrcheck/tests/globalerr (stderr) exp-ptrcheck/tests/hackedbz2 (stdout) exp-ptrcheck/tests/hackedbz2 (stderr) exp-ptrcheck/tests/hp_bounds (stderr) exp-ptrcheck/tests/hp_dangle (stderr) exp-ptrcheck/tests/hsg (stdout) exp-ptrcheck/tests/hsg (stderr) exp-ptrcheck/tests/justify (stderr) exp-ptrcheck/tests/partial_bad (stderr) exp-ptrcheck/tests/partial_good (stderr) exp-ptrcheck/tests/preen_invars (stdout) exp-ptrcheck/tests/preen_invars (stderr) exp-ptrcheck/tests/pth_create (stderr) exp-ptrcheck/tests/pth_specific (stderr) exp-ptrcheck/tests/realloc (stderr) exp-ptrcheck/tests/stackerr (stderr) exp-ptrcheck/tests/strcpy (stderr) exp-ptrcheck/tests/supp (stderr) exp-ptrcheck/tests/tricky (stderr) exp-ptrcheck/tests/unaligned (stderr) exp-ptrcheck/tests/zero (stderr) helgrind/tests/tc06_two_races_xml (stderr) helgrind/tests/tc22_exit_w_lock (stderr) helgrind/tests/tc23_bogus_condwait (stderr) none/tests/faultstatus (stderr) I will check if I can improve the performance. Afterwards we can refresh all patch against latest upstream and slowly start the discussion what else is necessary for upstream integration. Christian
Created attachment 51454 [details] Latest patch against 11347/2032 part1/2 (new files) For reference, here is the latest version of the s390x port. We made good progress in reducing common code changes. Most of the changes are now within elseif VG*s390 or in new files. We also did a lot of performance improvements thanks due to spechelper and other things. This patch also has an EXecute implementation that uses self checking code instead of a trampoline. Given that we dont yet support callgrind and exp-ptrcheck, the test suite looks pretty good on SLES11SP1: == 447 tests, 43 stderr failures, 10 stdout failures, 0 post failures == callgrind/tests/clreq (stderr) callgrind/tests/notpower2-hwpref (stderr) callgrind/tests/notpower2-use (stderr) callgrind/tests/notpower2-wb (stderr) callgrind/tests/notpower2 (stderr) callgrind/tests/simwork-both (stdout) callgrind/tests/simwork-both (stderr) callgrind/tests/simwork-branch (stdout) callgrind/tests/simwork-branch (stderr) callgrind/tests/simwork-cache (stdout) callgrind/tests/simwork-cache (stderr) callgrind/tests/simwork1 (stdout) callgrind/tests/simwork1 (stderr) callgrind/tests/simwork2 (stdout) callgrind/tests/simwork2 (stderr) callgrind/tests/simwork3 (stdout) callgrind/tests/simwork3 (stderr) callgrind/tests/threads-use (stderr) callgrind/tests/threads (stderr) drd/tests/tc04_free_lock (stderr) drd/tests/tc09_bad_unlock (stderr) drd/tests/tc23_bogus_condwait (stderr) exp-ptrcheck/tests/bad_percentify (stdout) exp-ptrcheck/tests/bad_percentify (stderr) exp-ptrcheck/tests/base (stderr) exp-ptrcheck/tests/ccc (stderr) exp-ptrcheck/tests/fp (stderr) exp-ptrcheck/tests/globalerr (stderr) exp-ptrcheck/tests/hackedbz2 (stdout) exp-ptrcheck/tests/hackedbz2 (stderr) exp-ptrcheck/tests/hp_bounds (stderr) exp-ptrcheck/tests/hp_dangle (stderr) exp-ptrcheck/tests/hsg (stdout) exp-ptrcheck/tests/hsg (stderr) exp-ptrcheck/tests/justify (stderr) exp-ptrcheck/tests/partial_bad (stderr) exp-ptrcheck/tests/partial_good (stderr) exp-ptrcheck/tests/preen_invars (stdout) exp-ptrcheck/tests/preen_invars (stderr) exp-ptrcheck/tests/pth_create (stderr) exp-ptrcheck/tests/pth_specific (stderr) exp-ptrcheck/tests/realloc (stderr) exp-ptrcheck/tests/stackerr (stderr) exp-ptrcheck/tests/strcpy (stderr) exp-ptrcheck/tests/supp (stderr) exp-ptrcheck/tests/tricky (stderr) exp-ptrcheck/tests/unaligned (stderr) exp-ptrcheck/tests/zero (stderr) helgrind/tests/tc06_two_races_xml (stderr) helgrind/tests/tc09_bad_unlock (stderr) helgrind/tests/tc23_bogus_condwait (stderr) memcheck/tests/badfree3 (stderr) none/tests/faultstatus (stderr) Thats 8 errors without callgrind/exp-ptrcheck. Performance got also much better: -- Running tests in perf ---------------------------------------------- bigcode1 src :0.29s no: 7.2s (24.9x, -----) me:13.5s (46.6x, -----) bigcode2 src :0.30s no:19.2s (64.1x, -----) me:38.0s (126.6x, -----) bz2 src :1.06s no: 8.3s ( 7.8x, -----) me:25.6s (24.1x, -----) fbench src :0.74s no: 4.2s ( 5.6x, -----) me: 9.2s (12.5x, -----) ffbench src :0.61s no: 2.5s ( 4.1x, -----) me: 6.4s (10.4x, -----) heap src :0.41s no: 2.9s ( 7.0x, -----) me:16.2s (39.6x, -----) sarp src :0.04s no: 0.7s (17.5x, -----) me: 5.8s (144.0x, -----) tinycc src :0.34s no: 5.9s (17.3x, -----) me:25.6s (75.1x, -----) -- Finished tests in perf ---------------------------------------------- What needs to be done for integration? - There might be 2 or 3 places where we can further reduce the common code changes - Integrate several of our own tests into the regression test suite Any more ideas?
Created attachment 51455 [details] Latest patch against 11347/2032 part2/2 (changes) 2nd part.
Created attachment 52024 [details] updated patch to r11383/r2046 new files. (1/2) this is part1 of an updated patch. on a z10 with sles11: -- Running tests in perf ---------------------------------------------- bigcode1 src :0.29s no: 7.1s (24.5x, -----) me:13.2s (45.7x, -----) bigcode2 src :0.31s no:18.9s (61.0x, -----) me:36.9s (119.2x, -----) bz2 src :1.06s no: 6.8s ( 6.4x, -----) me:22.8s (21.5x, -----) fbench src :0.74s no: 3.0s ( 4.0x, -----) me: 7.9s (10.6x, -----) ffbench src :0.61s no: 1.8s ( 3.0x, -----) me: 5.8s ( 9.5x, -----) heap src :0.41s no: 2.6s ( 6.3x, -----) me:16.3s (39.8x, -----) sarp src :0.04s no: 0.7s (16.8x, -----) me: 5.6s (139.8x, -----) tinycc src :0.35s no: 5.5s (15.8x, -----) me:25.1s (71.7x, -----) -- Finished tests in perf ---------------------------------------------- == 8 programs, 16 timings ================= == 426 tests, 21 stderr failures, 6 stdout failures, 0 post failures == callgrind/tests/clreq (stderr) callgrind/tests/notpower2-hwpref (stderr) callgrind/tests/notpower2-use (stderr) callgrind/tests/notpower2-wb (stderr) callgrind/tests/notpower2 (stderr) callgrind/tests/simwork-both (stdout) callgrind/tests/simwork-both (stderr) callgrind/tests/simwork-branch (stdout) callgrind/tests/simwork-branch (stderr) callgrind/tests/simwork-cache (stdout) callgrind/tests/simwork-cache (stderr) callgrind/tests/simwork1 (stdout) callgrind/tests/simwork1 (stderr) callgrind/tests/simwork2 (stdout) callgrind/tests/simwork2 (stderr) callgrind/tests/simwork3 (stdout) callgrind/tests/simwork3 (stderr) callgrind/tests/threads-use (stderr) callgrind/tests/threads (stderr) drd/tests/tc04_free_lock (stderr) drd/tests/tc09_bad_unlock (stderr) drd/tests/tc23_bogus_condwait (stderr) helgrind/tests/tc06_two_races_xml (stderr) helgrind/tests/tc09_bad_unlock (stderr) helgrind/tests/tc23_bogus_condwait (stderr) memcheck/tests/badfree3 (stderr) none/tests/faultstatus (stderr)
Created attachment 52025 [details] updated patch to r11383/r2046 changed files (2/2) 2nd part
Created attachment 52141 [details] Latest patch against mainline 11391 VEX 2057 new files (1/2) This is the latest patch set against mainline 11391 VEX 2057, since the old patch did no longer apply due to the fixes from 248893, 246888,247875. This also fixes all issues that I have seen on x86: - compile warnings due to 32 vs. 64bit - fix arch_test to not run s390x test cases on other platforms Please note that you have to manually do a "chmod u+x none/tests/s390x/filter_stderr" since patch/diff does not save the file attributes.
Created attachment 52142 [details] Latest patch against mainline 11391 VEX 2057 changed files (2/2) 2nd part.
Created attachment 52457 [details] update to 11433/2065 new files (1/2) update to latest upstream. - Contains s390x fix for https://bugs.kde.org/show_bug.cgi?id=243270 - exp-dhat seems to work - firefox runs fine with memcheck and is actually reasonable usable with --tool=none - also contains the fix from https://bugs.kde.org/show_bug.cgi?id=253206 to make tool=none run without error As always, you have to manually do a "chmod u+x none/tests/s390x/filter_stderr" since patch/diff does not save the file attributes.
Created attachment 52458 [details] update to 11433/2065 changes (2/2) part two 11433/2065
Testing with a newer compiler actually revealed a small typo: This makes valgrind compile with gcc 4.5 again: Index: include/valgrind.h =================================================================== --- include/valgrind.h (revision 866) +++ include/valgrind.h (working copy) @@ -4327,7 +4327,7 @@ ".cfi_remember_state\n\t" \ ".cfi_def_cfa r11, 0\n\t" # define VALGRIND_CFI_EPILOGUE \ - "movq 11, 7\n\t" \ + "lgr 11, 7\n\t" \ ".cfi_restore_state\n\t" #else # define __FRAME_POINTER
Created attachment 52534 [details] patch 1/2 against 11447/2066 new files fixes a bug in the new cfi annotations for the function wrappers: We have to save the argvec pointer because gcc might use r11 which we overwrite. Compiles and works now fine on older and newer gcc. This patch applies without offset/hunk agaist latest upstream as well as against the current 3.6 branch.
Created attachment 52535 [details] patch 2/2 against 11447/2066 changes files
Created attachment 52679 [details] patch 2/2 changed files somehow the changed patch got mixed up with the new files one. Update this one.,
Created attachment 52718 [details] Patch 1/2 against 3.6: new files Updated patch against 3.6, containing the new files. - refreshed patches - removed some test case changes that are not required for the port and do have or will have their own bugzilla - Florian added partial support for the z196 - compare and swap insns now yield if comparison fails - more fixes for the functions wrappers - you have to "chmod u+x none/tests/s390x/filter_stderr" These patches also apply against current head (there is an offset 113 for configure.in) Test status: Tested on x86 (32bit Debian): no regressions Test results on a sles11sp1 s390x: == 426 tests, 19 stderr failures, 6 stdout failures, 0 post failures == callgrind/tests/clreq (stderr) callgrind/tests/notpower2-hwpref (stderr) callgrind/tests/notpower2-use (stderr) callgrind/tests/notpower2-wb (stderr) callgrind/tests/notpower2 (stderr) callgrind/tests/simwork-both (stdout) callgrind/tests/simwork-both (stderr) callgrind/tests/simwork-branch (stdout) callgrind/tests/simwork-branch (stderr) callgrind/tests/simwork-cache (stdout) callgrind/tests/simwork-cache (stderr) callgrind/tests/simwork1 (stdout) callgrind/tests/simwork1 (stderr) callgrind/tests/simwork2 (stdout) callgrind/tests/simwork2 (stderr) callgrind/tests/simwork3 (stdout) callgrind/tests/simwork3 (stderr) callgrind/tests/threads-use (stderr) callgrind/tests/threads (stderr) drd/tests/tc04_free_lock (stderr) drd/tests/tc09_bad_unlock (stderr) drd/tests/tc23_bogus_condwait (stderr) helgrind/tests/tc06_two_races_xml (stderr) helgrind/tests/tc23_bogus_condwait (stderr) none/tests/faultstatus (stderr)
Created attachment 52719 [details] Patch 2/2 against 3.6: changes 2nd part. On a fedora like system the test suite even looks better, since callgrind works here: == 410 tests, 8 stderr failures, 0 stdout failures, 0 post failures == drd/tests/annotate_spinlock (stderr) drd/tests/tc04_free_lock (stderr) drd/tests/tc09_bad_unlock (stderr) drd/tests/tc23_bogus_condwait (stderr) helgrind/tests/tc06_two_races_xml (stderr) helgrind/tests/tc23_bogus_condwait (stderr) memcheck/tests/partiallydefinedeq (stderr) none/tests/faultstatus (stderr)
As another reminder. The patches work against the 3.6 branch in the svn repository. If the patches are applied against the tarball, you have to rerun the autotools stuff for a refreshed configure aclocal autoheader automake -a autoconf
Created attachment 53280 [details] Updated Patches for s390x 1/2 (created files) This is a minor update of the s390x patches. The patch set will apply on valgrind svn 11477/2069 without any hunk/offset but also on valgrind-3.6 (there is an offset 113 in configure.in). There are only minor changes (all non-critical) to the last patch set: - make lackey F128 aware - reduce VEX GENOFFSET definitions to what is really needed - whitespace and comment fixes - move implicit knowledge of dwarf return address handling from debuginfo.c into readdwarf.c - add some testcase fixes for s390x - fix massif stack handling with lonjump (see https://bugs.kde.org/show_bug.cgi?id=256043)
Created attachment 53281 [details] Updated Patches for s390x 2/2 (changed files) 2nd part.
Created attachment 53325 [details] valgrind-s390x-build.patch Unfortunately those patches break building on all other architectures. The problem is that the whole guest_s390_cc.c file is guarded with VGA_s390x, and functions from it are referenced from other files which aren't guarded that way. This patch lets me at least build it on x86_64-linux. Of course ideally there would be a replacement for #ifndef VGA_s390x for the stuff currently performed by inline asm only.
That is strange. I have no problems building valgrind on x86 (32bit Debian) with these patches. Will double check.
I got myself an x86_64 system - I can reproduce the problem - and your fix. Still dont understand why it does not happen on 32bit. Some of the codition code stuff would become less efficient in C and x86 also has several helpers with inline asms. Julian, Jakub, given that this patch also works on arm and ppc, would it be ok to use this patch for the time being?
(In reply to comment #29) > I got myself an x86_64 system - I can reproduce the problem - and your fix. > Still dont understand why it does not happen on 32bit. > It's an optimization thing.. When compiling with -m32 GCC will figure out that the assertion vassert(0 == sizeof(VexGuestS390XState) % VexGuestS390XStateAlignment); does not hold and therefore the process ends here. That in turn renders the assignment disInstrFn = disInstr_S390; useless and it gets optimized away thereby removing the only reference to disInstr_S390. The undefined symbols that are reported on x86_64 are only reachable if disInstr_S390 is reachable.
Created attachment 56228 [details] Patch against 11502/2081 1of2 (new files) Here is an updated patch set against 11502/2081. The patch set also applies to the 3.6 branch (with some offsets here and there). It also includes the build fix from Jakub and 2 small fixes: - improve the comment about page faults on s390 which do not deliver the offset within the page - fix the test data class instruction condition code(also add a testcase)
Created attachment 56229 [details] Patch against 11502/2081 2of2 (changes) 2nd part
Created attachment 56230 [details] what has changed since last patch set For review purposes: This patch contains all changes since the last patch.
(In reply to comment #32) > Created an attachment (id=56229) [details] > Patch against 11502/2081 2of2 (changes) ================================================================ Comments for c32-changes.diff (attachment ID = 56229) Generally the quality is high and this looks like a pretty clean patch. Only one larger concern, and a bunch of smaller points. I did not yet look at the new files patch. Larger concerns --------------- libvex_ir.h: new additions to IRRoundingMode: hrrm, this isn't good. What's the deal? Why are these necessary? Smaller comments ---------------- syswrap-main.c I didn't understand the change to the comment about the use of registers in syscalls. (line 60). What's SYSNO in the NUM column? Why isn't it a register name? m_debuginfo/priv_storage.h For the typedef of DiCfSI, can we have a comment pls similar to those preceding the definitions, (eg) along the lines of existing comment "On ARM it's pretty much the same, except .." m_signals.c, extend_stack_if_appropriate() the change from fault >= (esp - VG_STACK_REDZONE_SZB) to fault >= VG_PGROUNDDN(esp - VG_STACK_REDZONE_SZB) I'd prefer the non-s390x case to not be changed. This logic is a bit tricky to get right. coregrind/m_syscall.c #define _svc_clobber ... inline this -- it's only used once m_initimg/initimg-linux.c + vg_assert(0 == sizeof(VexGuestS390XState) % VexGuestS390XStateAlignment); Is this necessary? None of the other linux targets have that. Plus pre_run_checks in m_scheduler/scheduler.c should check everything anyway. (jseward TODO: LibVEX_Translate: insn_bytes: how do we know this is not overrun? (on any target?) VEX/priv/main_main.c: + /* Make sure the size of the guest state is small enough to be used + as an offset in a B12 amode. */ + vassert(sizeof(VexGuestS390XState) < (1 << 12)); Why is this a useful check? We could be asked to generate an offset of up to 3 * sizeof(VexGuestS390XState) from the baseblock pointer. Plus anyway, surely it is the job of the instruction selector to generate in-range offsets for such amodes. VEX/pub/libvex_ir.h rename F64HLto128 to F64HLtoF128 ditto F128to64, F128HIto64 no need to repeat comment on top of Iop_CmpF32/128, just add CmpF32 and CmpF128 to CmpF64 (jseward TODO: We should get rid of IRCmpF64Result and rename it simply to IRCmpFResult, and not bother with IRCmpF32/128Result) libvex.h: +/* Special value representing all available s390x hwcaps */ +#define VEX_HWCAPS_S390X_ALL (VEX_HWCAPS_S390X_LDISP | \ + VEX_HWCAPS_S390X_EIMM | \ + VEX_HWCAPS_S390X_GIE | \ + VEX_HWCAPS_S390X_DFP) Is this necessary? configure.in + VGCONF_PLATFORM_SEC_CAPS="" Fix inconsistent indentation (tab vs space problem?) (jseward TODO: re-check the mc_translate stuff.) mc_instrument.c: what type is F128 shadowed by? I64 or I128? If the latter, then F64HLto128 and inverses might be doable better (and this is relevant to all platforms, not just s390x). How are atomic memory operations handled? I saw no new IR constructions related to atomic memory ops, so I suppose the existing LL/SC or CAS constructions are enough? Why does memcheck/tests/supp_unknown.supp have a new entry? include/pub_tool_machine.h +# define VG_MAX_INSTR_SZB 6 is that really right? Seems a bit low. include/valgrind.h __SPECIAL_INSTRUCTION_PREAMBLE is that really unique enough? What does an lr instruction do? If lr x,y is just an integer register move, I'd say it is not unique enough: a really stupid compiler having a really bad day might actually generate that code. comment fix: "Similar craziness as x86" shouldn't that be "as amd64" ? =================================================================
(In reply to comment #34) > Thanks for the review. We'll fix the minor issues related to indentation, comments, and VEX floating point operator name changes in the next patch. > Larger concerns > --------------- > > libvex_ir.h: > new additions to IRRoundingMode: hrrm, this isn't good. > What's the deal? Why are these necessary? > The "convert to fixed" instruction distinguishes between "round to nearest with ties away from 0" and "round to nearest with ties to even". We've added Irrm_NEAREST_AWAY to support the former. However, we will investigate whether GCC actually uses that mode. If not, we can probably drop this rounding mode, as it's obscure enough. The Irrm_CURRENT is in support of "round according to the current setting of the rounding mode in the FPC register". First, I tried to understand how this was done in the PPC port but did not quite follow. Then I thought: If we had Irrm_CURRENT, then we could implement a simple scheme where the FPC register is saved and restored whenever we switch between valgrind proper and client code. So the FPC value should always be correct for the client and we would not have to worry about tracking the actual rounding mode setting. That is what is currently implemented. But we're of course open for suggestions. > > Smaller comments > ---------------- > > syswrap-main.c > I didn't understand the change to the comment about the use of > registers in syscalls. (line 60). What's SYSNO in the NUM column? > Why isn't it a register name? > Not sure. It should be. > m_initimg/initimg-linux.c > + vg_assert(0 == sizeof(VexGuestS390XState) % VexGuestS390XStateAlignment); > Is this necessary? None of the other linux targets have that. > Plus pre_run_checks in m_scheduler/scheduler.c should check everything > anyway. > This is a leftover from long time ago (before r8804) where there was no requirement that the size of the guest state ought to be a multiple of 16. It is redundant now and we can get rid of it. > VEX/priv/main_main.c: > + /* Make sure the size of the guest state is small enough to be used > + as an offset in a B12 amode. */ > + vassert(sizeof(VexGuestS390XState) < (1 << 12)); > Why is this a useful check? We could be asked to generate an offset > of up to 3 * sizeof(VexGuestS390XState) from the baseblock pointer. Good point. I'll fix that. > libvex.h: > +/* Special value representing all available s390x hwcaps */ > +#define VEX_HWCAPS_S390X_ALL (VEX_HWCAPS_S390X_LDISP | \ > + VEX_HWCAPS_S390X_EIMM | \ > + VEX_HWCAPS_S390X_GIE | \ > + VEX_HWCAPS_S390X_DFP) > Is this necessary? > This is in support of the hwcaps sanity check at the beginning of the toplevel insn selector. If we need to add a new hwcap in the future we only need to adapt VEX_HWCAPS_S390X_ALL and its definition is right there. If we did not have it we would have to remember adapting the sanity check in the selector which is harder. > mc_instrument.c: what type is F128 shadowed by? I64 or I128? > If the latter, then F64HLto128 and inverses might be doable better > (and this is relevant to all platforms, not just s390x). > I128. We'll look into it. > How are atomic memory operations handled? I saw no new IR > constructions related to atomic memory ops, so I suppose the > existing LL/SC or CAS constructions are enough? > The CAS constructions have been sufficient. > Why does memcheck/tests/supp_unknown.supp have a new entry? > Because we report the correct line (17 in "main") where the error occurs. See also the file supp_unknown.stderr.exp-s390x there. So we need to adapt the suppression pattern for the testcase to pass. > include/pub_tool_machine.h > +# define VG_MAX_INSTR_SZB 6 > is that really right? Seems a bit low. > Yes, it's correct. Instructions are fixed length; 2, 4, or 6 bytes long. > include/valgrind.h > __SPECIAL_INSTRUCTION_PREAMBLE > is that really unique enough? What does an lr instruction do? > If lr x,y is just an integer register move, I'd say it is not > unique enough: a really stupid compiler having a really bad day > might actually generate that code. > > comment fix: "Similar craziness as x86" > shouldn't that be "as amd64" ? > > =================================================================
> syswrap-main.c > I didn't understand the change to the comment about the use of > registers in syscalls. (line 60). What's SYSNO in the NUM column? > Why isn't it a register name? is it uderstandable with this patch on top? --- valgrind-ibm.orig/coregrind/m_syswrap/syswrap-main.c +++ valgrind-ibm/coregrind/m_syswrap/syswrap-main.c @@ -67,7 +67,13 @@ ppc32 r0 r3 r4 r5 r6 r7 r8 n/a n/a r3+CR0.SO (== ARG1) ppc64 r0 r3 r4 r5 r6 r7 r8 n/a n/a r3+CR0.SO (== ARG1) arm r7 r0 r1 r2 r3 r4 r5 n/a n/a r0 (== ARG1) - s390x SYSNO r2 r3 r4 r5 r6 r7 n/a n/a r2 (== ARG1) + On s390x the svc instruction is used for system calls. The system call + number is encoded in the instruction (8 bit immediate field). Since Linux + 2.6 it is also allowed to use svc 0 with the system call number in r1. + This was introduced for system calls >255, but works for all. It is + also possible to see the svc 0 together with an EXecute instruction, that + fills in the immediate field. + s390x r1/SVC r2 r3 r4 r5 r6 r7 n/a n/a r2 (== ARG1) AIX: ppc32 r2 r3 r4 r5 r6 r7 r8 r9 r10 r3(res),r4(err) ppc64 r2 r3 r4 r5 r6 r7 r8 r9 r10 r3(res),r4(err)
Some more feedback on the review (thanks for doing it) > m_debuginfo/priv_storage.h > For the typedef of DiCfSI, can we have a comment pls similar to those > preceding the definitions, (eg) along the lines of existing comment > "On ARM it's pretty much the same, except .." Will do > m_signals.c, extend_stack_if_appropriate() > the change from > fault >= (esp - VG_STACK_REDZONE_SZB) > to > fault >= VG_PGROUNDDN(esp - VG_STACK_REDZONE_SZB) > I'd prefer the non-s390x case to not be changed. This logic > is a bit tricky to get right. Ok. What about fault >= fault_mask(esp - VG_STACK_REDZONE_SZB) with fault_mask(x) = x & ~0xfff on s390x and fault_mask(x) = x otherwise > coregrind/m_syscall.c > #define _svc_clobber ... > inline this -- it's only used once will do include/valgrind.h > __SPECIAL_INSTRUCTION_PREAMBLE > is that really unique enough? What does an lr instruction do? > If lr x,y is just an integer register move, I'd say it is not > unique enough: a really stupid compiler having a really bad day > might actually generate that code. Yes, these are 4 integer loads from 4 different registers into itself. Given that by ABI definition r15 is the stack pointer (and therefore not available for the standard register allocator in gcc) I think it is reasonably safe to assume that we will never see this sequence in real life. Shall we change it anyway? > comment fix: "Similar craziness as x86" > shouldn't that be "as amd64" ? will do
>> libvex_ir.h: >> new additions to IRRoundingMode: hrrm, this isn't good. >> What's the deal? Why are these necessary? >> > > The "convert to fixed" instruction distinguishes between "round to > nearest with ties away from 0" and "round to nearest with ties to even". > We've added Irrm_NEAREST_AWAY to support the former. However, we will > investigate whether GCC actually uses that mode. If not, we can > probably drop this rounding mode, as it's obscure enough. gcc never sets a rounding mode, glibc does. But fesetround only supports FE_TONEAREST, FE_UPWARD, FE_DOWNWARD, and FE_TOWARDZERO so we should never see "NEAREST_AWAY". So lets drop this.
(In reply to comment #35) > > libvex_ir.h: > > new additions to IRRoundingMode: hrrm, this isn't good. > > What's the deal? Why are these necessary? > > > > The "convert to fixed" instruction distinguishes between "round to > nearest with ties away from 0" and "round to nearest with ties to even". > We've added Irrm_NEAREST_AWAY to support the former. However, we will > investigate whether GCC actually uses that mode. If not, we can > probably drop this rounding mode, as it's obscure enough. OK. > The Irrm_CURRENT is in support of "round according to the current setting > of the rounding mode in the FPC register". First, I tried to understand how > this was done in the PPC port but did not quite follow. Then I thought: If > we had Irrm_CURRENT, then we could implement a simple scheme where the FPC > register is saved and restored whenever we switch between valgrind proper > and client code. So the FPC value should always be correct for the client > and we would not have to worry about tracking the actual rounding > mode setting. The problem with this is (I think) that it means the behaviour of FP instructions cannot be stated exactly in IR any more. You have introduced "under the table" semantics, in which the rounding to be used for (eg) Iop_AddF64 is not specified by the first argument to the primop itself (since you set it to Irrm_CURRENT), but instead depends on how previous guest instructions set the rounding mode. So the IR translation for an FP add becomes context-dependent. In the existing ports I used two schemes: * for x86,amd64,arm-vfp: The rounding mode is simply ignored when translating FP instructions, and instead Irrm_NEAREST is used. This means that if have code which sets the FPU to (eg) round to +infinity, and then do FP adds, subtracts, etc, it will produce different results natively vs on Valgrind. This is generally not a big deal. The only place where V respects the rounding mode is in FP<->Int conversions, since these simply wouldn't work at all if they ignored the rounding mode. * on PPC/POWER, this approximation caused the AIX libc to not work. So the PPC front end really does encode the current rounding mode in all IR primops that require it. eg: IRExpr* rm = get_IR_roundingmode(); ... case 0x15: // fadd (Floating Add (Double-Precision), PPC32 p400) if (frC_addr != 0) return False; DIP("fadd%s fr%u,fr%u,fr%u\n", flag_rC ? ".":"", frD_addr, frA_addr, frB_addr); assign( frD, triop(Iop_AddF64, rm, mkexpr(frA), mkexpr(frB)) ); break; and the back end, on seeing an Iop_AddF64 or similar, sets the host's rounding mode accordingly, before generating the fadd itself. Now, that's insanely inefficient, particularly considering that the rounding mode has to be converted (in the front end, in IR) from the PPC format to the IR format (see get_IR_roundingmode in guest_ppc_toIR.c) and then back to the PPC format in the insn selector (see set_FPU_rounding_mode, roundModeIRtoPPC in host_ppc_isel.c) So host_ppc_isel.c optimises this: when it sees a setting of the host rounding mode, it checks whether the rounding mode has already been set to the same value in this superblock, and if so it omits the setting. (see ISelEnv::previous_rm) In most cases the effect is to reduce the overhead to one rounding-mode setting per superblock. Given that FP superblocks tend to be quite long and have many FP insns in them, this is quite effective in reducing the overhead. So I suggest you choose one of the above 2 schemes. You could start with the x86 cheap-hack scheme and upgrade to the PPC scheme if it causes problems, or just use the PPC scheme directly. > > libvex.h: > > +/* Special value representing all available s390x hwcaps */ > > +#define VEX_HWCAPS_S390X_ALL (VEX_HWCAPS_S390X_LDISP | \ > > + VEX_HWCAPS_S390X_EIMM | \ > > + VEX_HWCAPS_S390X_GIE | \ > > + VEX_HWCAPS_S390X_DFP) > > Is this necessary? > > > > This is in support of the hwcaps sanity check at the beginning > of the toplevel insn selector. If we need to add a new hwcap in the future > we only need to adapt VEX_HWCAPS_S390X_ALL and its definition is right > there. If we did not have it we would have to remember adapting > the sanity check in the selector which is harder. Fine. > > mc_instrument.c: what type is F128 shadowed by? I64 or I128? > > If the latter, then F64HLto128 and inverses might be doable better > > (and this is relevant to all platforms, not just s390x). > > > > I128. We'll look into it. Are you sure about that? But anyway, assuming yes, then you should shadow them by Iop_64HLto128, Iop_128to64, Iop_128HIto64 since these are cheaper than the PCasts (PCast is expensive) and more precise -- you're just steering bits around here, so the shadow operations can just steer V bits in the same way.
(In reply to comment #37) > > m_signals.c, extend_stack_if_appropriate() > > the change from > > fault >= (esp - VG_STACK_REDZONE_SZB) > > to > > fault >= VG_PGROUNDDN(esp - VG_STACK_REDZONE_SZB) > > I'd prefer the non-s390x case to not be changed. This logic > > is a bit tricky to get right. > > Ok. What about > fault >= fault_mask(esp - VG_STACK_REDZONE_SZB) > > with fault_mask(x) = x & ~0xfff on s390x > and fault_mask(x) = x otherwise Yes, that would be good. Pls do. > > __SPECIAL_INSTRUCTION_PREAMBLE > > is that really unique enough? What does an lr instruction do? > > If lr x,y is just an integer register move, I'd say it is not > > unique enough: a really stupid compiler having a really bad day > > might actually generate that code. > > Yes, these are 4 integer loads from 4 different registers into itself. Given > that by ABI definition r15 is the stack pointer (and therefore not available > for the standard register allocator in gcc) I think it is reasonably safe to > assume that we will never see this sequence in real life. Shall we change it > anyway? Well, it depends how paranoid you are, I suppose. All the other targets use sequences of rotate instructions which seem to me to be even less likely to be generated by a human or compiler. It's your decision. If there is ever a snafu in which the s390x front end misinterprets an intended set of such moves as a special insn, whilst deep in some 25 million LOC application you're running, you get to spend days tracking down and fixing the bug, not me :-)
(In reply to comment #36) > > syswrap-main.c > > I didn't understand the change to the comment about the use of > > registers in syscalls. (line 60). What's SYSNO in the NUM column? > > Why isn't it a register name? > > is it uderstandable with this patch on top? Yes.
(In reply to comment #31) > Created an attachment (id=56228) [details] > Patch against 11502/2081 1of2 (new files) ================================================================= Comments for c31-new-files.diff (attachment ID = 56228) Again, not bad stuff. I have three main concerns and a number of smaller comments. Larger concerns --------------- 1. For the VEX stuff, I would much prefer the file structure to match the file structure (and, therefore, the assignment of functions to files) for the existing ports, even if you think that it generates some overly large files (it does). With your patch there are a whole bunch of new files and I don't know what stuff is where. 2. The handling of FP rounding modes. As already discussed in comments 35 and 39 above. 3. Unless I missed something, the instruction set tests seem mostly missing. There are some special case ones, but nothing like the comprehensiveness of those for the other targets. Eg, see none/tests/arm/v6intThumb.c, none/tests/arm/vfp.c Without something that comprehensive, particularly for boring integer instructions, it's hard to convince oneself that the implementation is really correct. This might come back to bite you later, in that it can be literally impossible to figure out why a large program is failing. See https://bugs.kde.org/show_bug.cgi?id=245925#c3 for a cautionary tale. Smaller comments ---------------- sigframe-s390x-linux.c just a check -- are you sure you need to handle both rt_sigframe and the older plain sigframe formats? AIUI the need for two different formats mostly exists for older linux ports (eg x86) and does not exist for newer ports (eg, amd64 -- see sigframe-amd64-linux.c) dispatch-s390x-linux.S run_innerloop__dispatch_unprofiled: couldn't you save an instruction here? lgr %r7, %r2 /* next guest addr */ srlg %r7,%r7,1 --> srlg %r7,%r2,1 (We discussed this already, so this is <kind of irrelevant> More seriously, you need to redo the way the FPC is handled. ... need to consider more .. (do FP primops take rounding mode that is actually considered? they should; do like PPC) run_innerloop_exit: you really need to check that the FPC on exit from generated code is as expected. so when fixed up, need to check it's the right value at exit of the dispatcher (if ppc does that) </kind of irrelevant> vex stuff would prefer a file structure that matches all the other ports (sewardj to check: vex must be still be buildable standalone with this patch); (cd VEX && make clean && make libvex.a) guest_s390_toIR.c + /* fixs390: we should probably pass the resteer-function and the callback + data. It's not needed for correctness but improves performance. */ You should allow resteering (following of unconditional branches and unconditional calls). It can improve performance quite a bit. This is something you can do with a followup patch, though -- not important right now. rename s390_irgen_noredir --> s390_irgen_call_noredir (the whole purpose of this is to generate IR for a call instruction which can't be redirected). check -- are there vex own versions for: +#define likely(x) __builtin_expect(!!(x), 1) +#define unlikely(x) __builtin_expect(!!(x), 0) host_s390_insn.c +#include <stdarg.h> urr, is this necessary? (sewardj TODO: check everything for new #includes) host_s390_disasm.[ch] What's this for? The host-side doesn't need any disassembly facilities on any other platforms. +/* fixs390: I'm not exactly sure whether it is allowed that cc_dep2 is refered + to twice in the expression we build up here. Elsewhere we try to avoid + that (see the bazillions mkU64(0) in irgen.c). On the other hand... + guest_x86_helper.c around line 1144 does this, too. */ It's OK because cc_dep2 should be an IRAtom (either an IRTemp or IRConst) and so referring to it more than once does not duplicate work -- specifically, it does not duplicate an IR expression tree (because IRAtoms can't be trees). You could assert that if you like ("isIRAtom()") remove unused code in directReload_S390 (or make it work) -- probably is not important for s390, only has any noticable effect on x86, since that has very few integer registers. everywhere: 80 char max line length please. In many places in the front end it is a lot more that 80 chars wide. +s390_irgen_FLOGR: "Note, that the Iop_Shl64 and + Iop_Shr64 semantics will preserve the value-to-be-shifted if the + shift-amount equals or is larger than the width of value-to-be-shifted" Wrong reason: The IR definition says that the behavior of these primops is undefined in out of range shift cases -- regardless of what your back end might do with them. ISelEnv: + - A Bool to tell us if the host is 32 or 64bit. + This is set at the start and does not change. Necessary? I thought you only supported 64-bit guest and host in this patch. +#define VexGuestS390XStateAlignment 16 This can disappear now, yes? pre_run_checks checks this for all targets.
Thanks for doing a deep review. > 1. > For the VEX stuff, I would much prefer the file structure to match the > file structure (and, therefore, the assignment of functions to files) > for the existing ports, even if you think that it generates some > overly large files (it does). With your patch there are a whole bunch > of new files and I don't know what stuff is where. Ok, so we will merge most of the guest_s390 and host_s390_ files. > The handling of FP rounding modes. As already discussed in comments > 35 and 39 above. Although this would be a step backwards in terms of full compliance, I am tempted to do the x86 way and simply do default rounding to nearest if the instruction normally would use the value from FPC. Most applications do not set the rounding mode. We can improve that later if it is really necessary for some tools. > Unless I missed something, the instruction set tests seem mostly > missing. There are some special case ones, but nothing like the > comprehensiveness of those for the other targets. Eg, see > none/tests/arm/v6intThumb.c, none/tests/arm/vfp.c Without something > that comprehensive, particularly for boring integer instructions, it's > hard to convince oneself that the implementation is really correct. > This might come back to bite you later, in that it can be literally > impossible to figure out why a large program is failing. See > https://bugs.kde.org/show_bug.cgi?id=245925#c3 > for a cautionary tale. Yes, for several reasons (e.g. licence, historical reasons) we have lots of our testing outside the regression test suite (e.g compiler test framework). Still, we have several testcases that could be moved to the regression suite relatively easy - will do. > sigframe-s390x-linux.c > just a check -- are you sure you need to handle both rt_sigframe > and the older plain sigframe formats? AIUI the need for two > different formats mostly exists for older linux ports > (eg x86) and does not exist for newer ports (eg, amd64 -- > see sigframe-amd64-linux.c) Yes, s390/s390x was ported around 2000 -> lots of legacy cruft and we need both. > dispatch-s390x-linux.S > run_innerloop__dispatch_unprofiled: > couldn't you save an instruction here? > lgr %r7, %r2 /* next guest addr */ > srlg %r7,%r7,1 > --> > srlg %r7,%r2,1 Nice catch. Fixed in our repository. thanks > (We discussed this already, so this is > <kind of irrelevant> > More seriously, you need to redo the way the FPC is handled. > ... need to consider more .. (do FP primops take rounding > mode that is actually considered? they should; do like PPC) > run_innerloop_exit: you really need to check that the FPC > on exit from generated code is as expected. > so when fixed up, need to check it's the right value at > exit of the dispatcher (if ppc does that) > </kind of irrelevant> see above, I want to start with the x86 way, but this was mostly done by Florian, so he has more insight. > vex stuff > would prefer a file structure that matches all the other ports see above. Will merge the files into a guest_s390_defs.h guest_s390_helpers.c guest_s390_toIR.c and host_s390_defs.c host_s390_defs.h host_s390_isel.c > (sewardj to check: vex must be still be buildable standalone > with this patch); (cd VEX && make clean && make libvex.a) > > guest_s390_toIR.c > + /* fixs390: we should probably pass the resteer-function and the callback > + data. It's not needed for correctness but improves performance. */ > You should allow resteering (following of unconditional branches > and unconditional calls). It can improve performance quite a bit. > This is something you can do with a followup patch, though -- not > important right now. Yes, we will do it later. > rename s390_irgen_noredir --> s390_irgen_call_noredir > (the whole purpose of this is to generate IR for a call instruction > which can't be redirected). done. > check -- are there vex own versions for: > +#define likely(x) __builtin_expect(!!(x), 1) > +#define unlikely(x) __builtin_expect(!!(x), 0) > > host_s390_insn.c > +#include <stdarg.h> > urr, is this necessary? For va_arg. We could use __buildin_va_arg to get rid of this include. This is used for the instruction printing below. > > (sewardj TODO: check everything for new #includes) > > host_s390_disasm.[ch] > What's this for? The host-side doesn't need any disassembly > facilities on any other platforms. Its mostly pretty printing for the trace/profile flags. Maybe s390_disasm is a misnomer, what about print_instruction? > +/* fixs390: I'm not exactly sure whether it is allowed that cc_dep2 is refered > + to twice in the expression we build up here. Elsewhere we try to > avoid > + that (see the bazillions mkU64(0) in irgen.c). On the other > hand... > + guest_x86_helper.c around line 1144 does this, too. */ > It's OK because cc_dep2 should be an IRAtom (either an IRTemp or > IRConst) and so referring to it more than once does not duplicate work > -- specifically, it does not duplicate an IR expression tree (because > IRAtoms can't be trees). You could assert that if you like ("isIRAtom()") Comment removed. > > remove unused code in directReload_S390 (or make it work) -- > probably is not important for s390, only has any noticable > effect on x86, since that has very few integer registers. Its mostly working, but somehow creates a failure in our compiler test suite and origin-bz5. Will remove until debugged. > > everywhere: > 80 char max line length please. In many places in > the front end it is a lot more that 80 chars wide. Ok, will fix. > > +s390_irgen_FLOGR: "Note, that the Iop_Shl64 and > + Iop_Shr64 semantics will preserve the value-to-be-shifted if the > + shift-amount equals or is larger than the width of > value-to-be-shifted" > Wrong reason: The IR definition says that the behavior of these primops is > undefined in out of range shift cases -- regardless of what your > back end might do with them. Will have a look. > ISelEnv: > + - A Bool to tell us if the host is 32 or 64bit. > + This is set at the start and does not change. > Necessary? I thought you only supported 64-bit guest and host > in this patch. This is a leftover. Fixed. > +#define VexGuestS390XStateAlignment 16 > This can disappear now, yes? pre_run_checks checks this for > all targets. Yes. Done.
(In reply to comment #43) > > > (We discussed this already, so this is > > <kind of irrelevant> > > More seriously, you need to redo the way the FPC is handled. > > ... need to consider more .. (do FP primops take rounding > > mode that is actually considered? they should; do like PPC) > > run_innerloop_exit: you really need to check that the FPC > > on exit from generated code is as expected. > > so when fixed up, need to check it's the right value at > > exit of the dispatcher (if ppc does that) > > </kind of irrelevant> > see above, I want to start with the x86 way, but this was mostly > done by Florian, so he has more insight. > I would much prefer to implement to PPC scheme, even if it is a bit more work. That way we don't have to revisit it, or have to discuss whether to revisit. > > > check -- are there vex own versions for: > > +#define likely(x) __builtin_expect(!!(x), 1) > > +#define unlikely(x) __builtin_expect(!!(x), 0) > > Not that I could find them, but it would definitely be nice if VEX proper provided them. > > host_s390_insn.c > > +#include <stdarg.h> > > urr, is this necessary? > > For va_arg. We could use __buildin_va_arg to get rid of this include. > This is used for the instruction printing below. > We could use the __builtin stuff but probably would have to check for GCC versions that support it reliably... Perhaps using stdarg.h is simpler, as it does not introduce an unwanted dependency on libc. Unles of course you consider the inclusion of the file itself unwanted. > > host_s390_disasm.[ch] > > What's this for? The host-side doesn't need any disassembly > > facilities on any other platforms. > > Its mostly pretty printing for the trace/profile flags. > Maybe s390_disasm is a misnomer, what about print_instruction? > Not really a misnomer (see also wikipedia :) In addition to disassembling during IR generation we also can disassemble the emitted code. That is not done on other platforms but I found it quite convenient for debugging purposes. I could have called the file guest_s390_disasm.[ch] as well. It's neither specific to guest nor host. > > > > +s390_irgen_FLOGR: "Note, that the Iop_Shl64 and > > + Iop_Shr64 semantics will preserve the value-to-be-shifted if the > > + shift-amount equals or is larger than the width of > > value-to-be-shifted" > > Wrong reason: The IR definition says that the behavior of these primops is > > undefined in out of range shift cases -- regardless of what your > > back end might do with them. > Thanks for the clarification. I did not spot it in libvex_ir.h so I deduced the semantics from ir_defs/ir_opt.c
Created attachment 57150 [details] FYI: review feedback fir changes patch (In reply to comment #34) Here is a patch that should address most issues in the first patch. The only open thing is the rounding mode. One thing though: There was a naming issue: F128toF64 already exists (rounding down from 128 bit to 64bit) so I renamed F128to64 to F128LOtoF64 to better match the F128HItoF64. OK?
(In reply to comment #42) > > +s390_irgen_FLOGR: "Note, that the Iop_Shl64 and > + Iop_Shr64 semantics will preserve the value-to-be-shifted if the > + shift-amount equals or is larger than the width of > value-to-be-shifted" > Wrong reason: The IR definition says that the behavior of these primops is > undefined in out of range shift cases -- regardless of what your > back end might do with them. > OK, here is an updated version of s390_irgrn_FLOGR that avoids the undefined behaviour. HChar * s390_irgen_FLOGR(UChar r1, UChar r2) { IRTemp input = newTemp(Ity_I64); IRTemp not_zero = newTemp(Ity_I64); IRTemp tmpnum = newTemp(Ity_I64); IRTemp num = newTemp(Ity_I64); IRTemp shift_amount = newTemp(Ity_I8); /* We use the "count leading zeroes" operator because the number of leading zeroes is identical with the bit position of the first '1' bit. However, that operator does not work when the input value is zero. Therefore, we set the LSB of the input value to 1 and use Clz64 on the modified value. If input == 0, then the result is 64. Otherwise, the result of Clz64 is what we want. */ assign(input, get_gpr_dw0(r2)); assign(not_zero, binop(Iop_Or64, mkexpr(input), mkU64(1))); assign(tmpnum, unop(Iop_Clz64, mkexpr(not_zero))); /* num = (input == 0) ? 64 : tmpnum */ assign(num, mkite(binop(Iop_CmpEQ64, mkexpr(input), mkU64(0)), /* == 0 */ mkU64(64), /* != 0 */ mkexpr(tmpnum))); put_gpr_dw0(r1, mkexpr(num)); /* Set the leftmost '1' bit of the input value to zero. The general scheme is to first shift the input value by NUM + 1 bits to the left which causes the leftmost '1' bit to disappear. Then we shift logically to the right by NUM + 1 bits. Because the semantics of Iop_Shl64 and Iop_Shr64 are undefined if the shift-amount is greater than or equal to the width of the value-to-be-shifted, we need to special case NUM + 1 >= 64. This is equivalent to INPUT != 0 && INPUT != 1. For both such INPUT values the result will be 0. */ assign(shift_amount, unop(Iop_64to8, binop(Iop_Add64, mkexpr(num), mkU64(1)))); put_gpr_dw0(r1 + 1, mkite(binop(Iop_CmpLE64U, mkexpr(input), mkU64(1)), /* == 0 || == 1*/ mkU64(0), /* otherwise */ binop(Iop_Shr64, binop(Iop_Shl64, mkexpr(input), mkexpr(shift_amount)), mkexpr(shift_amount)))); /* Compare the original value as an unsigned integer with 0. */ s390_cc_thunk_put2(S390_CC_OP_UNSIGNED_COMPARE, input, mktemp(Ity_I64, mkU64(0)), False); return "flogr"; }
(In reply to comment #45) > Created an attachment (id=57150) [details] > FYI: review feedback fir changes patch Looks OK to me.
(In reply to comment #44) > [.. rounding for FP arithmetic ..] > I would much prefer to implement to PPC scheme, even if it is a bit more work. > That way we don't have to revisit it, or have to discuss whether to > revisit. As you like. The important thing is to make sure you have a good test program to verify it's actually working properly. > > > check -- are there vex own versions for: > > > +#define likely(x) __builtin_expect(!!(x), 1) > > > +#define unlikely(x) __builtin_expect(!!(x), 0) > > > > > Not that I could find them, but it would definitely be nice if VEX proper > provided them. Somewhere in the Valgrind tree are defns for LIKELY and UNLIKELY. We should copy those into vex. > Not really a misnomer (see also wikipedia :) > In addition to disassembling during IR generation we also can disassemble the > emitted code. That is not done on other platforms but I found it quite > convenient for debugging purposes. I could have called the file > guest_s390_disasm.[ch] as well. It's neither specific to guest nor > host. IMO this is of marginal value. For all the other targets, the front ends independently print insns (via the DIP macro) and the back ends also independently print insns pre-assembly, via ppX86Instr etc. The only time where I wanted to disassemble the generated code is when debugging emitX86Instr etc, since those fns were often a source of bugs, but even then those were quickly fixed and then worked reliably after that. So I'd prefer not to have this facility, unless you feel really strongly about it.
(In reply to comment #46) > OK, here is an updated version of s390_irgrn_FLOGR that avoids the undefined > behaviour. Yeah .. the other front ends have similarly horrible kludges to solve the same problem (deal with out of range shifts without presenting out of range shifts to the IR primops). The only important thing here is to ensure that, for the IR you're generating, if the shift amount is an immediate (as is most often the case), then ir_opt can fold out all the Clz and Mux0X (ite) gunk and leave a single shift primop in place. (if you see what I mean). So you don't lose performance in the common case.
(In reply to comment #49) > > Yeah .. the other front ends have similarly horrible kludges to solve > the same problem (deal with out of range shifts without presenting out > of range shifts to the IR primops). The only important thing here is > to ensure that, for the IR you're generating, if the shift amount is > an immediate (as is most often the case), then ir_opt can fold out all > the Clz and Mux0X (ite) gunk and leave a single shift primop in place. > (if you see what I mean). So you don't lose performance in the common > case. I will add a testcase, to make sure we get the desired optimizations.
(In reply to comment #48) > > In addition to disassembling during IR generation we also can disassemble the > > emitted code. That is not done on other platforms but I found it quite > > convenient for debugging purposes. I could have called the file > > guest_s390_disasm.[ch] as well. It's neither specific to guest nor > > host. > > IMO this is of marginal value. For all the other targets, the front > ends independently print insns (via the DIP macro) and the back ends > also independently print insns pre-assembly, via ppX86Instr etc. The > only time where I wanted to disassemble the generated code is when > debugging emitX86Instr etc, since those fns were often a source of > bugs, but even then those were quickly fixed and then worked reliably > after that. > > So I'd prefer not to have this facility, unless you feel really > strongly about it. I found the feature really helpful and would prefer to keep it. It is not contributing a lot of code or adding note-worthy runtime.
(In reply to comment #50) > (In reply to comment #49) > > > > Yeah .. the other front ends have similarly horrible kludges to solve > > the same problem (deal with out of range shifts without presenting out > > of range shifts to the IR primops). The only important thing here is > > to ensure that, for the IR you're generating, if the shift amount is > > an immediate (as is most often the case), then ir_opt can fold out all > > the Clz and Mux0X (ite) gunk and leave a single shift primop in place. > > (if you see what I mean). So you don't lose performance in the common > > case. > > I will add a testcase, to make sure we get the desired optimizations. I noticed that ir_opt does not fold Iop_Clz32 / 64. Here is a patch. Would you mind applying it? Index: ir_opt.c =================================================================== --- ir_opt.c (revision 2104) +++ ir_opt.c (working copy) @@ -957,6 +957,30 @@ } +static UInt fold_Clz64(ULong value) +{ + UInt i; + + for (i = 0; i < 64; ++i) { + if (value & ((ULong)1 << (63 - i))) return i; + } + + return 0; +} + + +static UInt fold_Clz32(UInt value) +{ + UInt i; + + for (i = 0; i < 32; ++i) { + if (value & ((UInt)1 << (31 - i))) return i; + } + + return 0; +} + + static IRExpr* fold_Expr ( IRExpr* e ) { Int shift; @@ -1154,6 +1178,16 @@ break; } + case Iop_Clz32: + e2 = IRExpr_Const(IRConst_U32( + fold_Clz32(e->Iex.Unop.arg->Iex.Const.con->Ico.U32))); + break; + + case Iop_Clz64: + e2 = IRExpr_Const(IRConst_U64( + fold_Clz64(e->Iex.Unop.arg->Iex.Const.con->Ico.U64))); + break; + default: goto unhandled; }
(In reply to comment #42) > > Comments for c31-new-files.diff (attachment ID = 56228) > > Again, not bad stuff. I have three main concerns and a number of > smaller comments. > Here is a patch against r11566 / VEX r2104 which addresses your comments. Specifically we fixed all white space and formatting issues, combined the files in VEX to match what is done for the other architectures and added more testcases. The Irrm_CURRENT rounding mode has been eliminated. Instead we're using Irrm_NEAREST (as is done on x86, amd, etc). Eventually, we'd like to implement what is done in the ppc port, but not in this patch. We've tested successfully with auxprogs/gsl16test. Included are also some minor bug fixes in s390 specific code (mostly in testcases) which were exposed by running on an older Z machine with older GCCs. The patch compiles and links OK. Running the testsuite requires bugs 264800 and 265762 to be fixed. Both are trivial and patches are attached to them. I tried to create a dependency in bugzilla, but could not figure it out. Probably I don't have the necessary privs to do that. When you apply the patch, note that none/tests/s390x/filter_stderr needs execute permissions. If there is anything else that you want addressed let us know.
Created attachment 57456 [details] Combined patch addressing review comments
Julian, for reference, here is the make regtest result on SLES11 (with a small callgrind fix) == 450 tests, 7 stderr failures, 0 stdout failures, 0 post failures == memcheck/tests/badfree3 (stderr) none/tests/faultstatus (stderr) helgrind/tests/tc06_two_races_xml (stderr) helgrind/tests/tc23_bogus_condwait (stderr) drd/tests/tc04_free_lock (stderr) drd/tests/tc09_bad_unlock (stderr) drd/tests/tc23_bogus_condwait (stderr) Please consider to apply the patch.
(In reply to comment #52) > I noticed that ir_opt does not fold Iop_Clz32 / 64. Here is a patch. Committed, with a check against zero -- since Clz32/64(0) has no defined semantics, we can't fold in that case. vex r2106.
(In reply to comment #54) > Combined patch addressing review comments Committed: vex r2105, valgrind r11604, 11605, 11606. Thank you for the high quality patches, and sorry for the slow handling. I see there are followup bugs 264800 and 265762. Any others?
(In reply to comment #57) > (In reply to comment #54) > > Combined patch addressing review comments > > Committed: vex r2105, valgrind r11604, 11605, 11606. Thank you for > the high quality patches, and sorry for the slow handling. > > I see there are followup bugs 264800 and 265762. Any others? If you could apply 247223 that would be sweet. We've got about 60 warnings in our build about regparm. I'm not aware of immediate followup patches. I've some things in the queue but they won't materialize until end of the week or so.
bug #253206 would also be good.
I think this "bug" is comprehensively fixed now.