Summary: | memcheck refuses to run 64-bit executables on openSUSE 10.3 ppc | ||
---|---|---|---|
Product: | [Developer tools] valgrind | Reporter: | Bart Van Assche <bart.vanassche+kde> |
Component: | memcheck | Assignee: | Julian Seward <jseward> |
Status: | REOPENED --- | ||
Severity: | normal | CC: | griffon26, jreiser, njn |
Priority: | NOR | ||
Version: | unspecified | ||
Target Milestone: | --- | ||
Platform: | openSUSE | ||
OS: | Linux | ||
Latest Commit: | Version Fixed In: | ||
Sentry Crash Report: | |||
Attachments: | implement search-by-value for special redirections |
Description
Bart Van Assche
2009-01-30 19:44:03 UTC
I'm not sure we can do anything better than this. So I'm tentatively marking it as WONTFIX. You might be overestimating the difficulty to get memcheck running on openSUSE ppc. The patch below is sufficient to get memcheck running on openSUSE ppc (which I assume is not acceptable for inclusion in Valgrind). This shows that the test whether strlen has been redirected needs a closer look at. Index: coregrind/m_redir.c =================================================================== --- coregrind/m_redir.c (revision 10380) +++ coregrind/m_redir.c (working copy) @@ -910,30 +910,8 @@ } # elif defined(VGP_ppc64_linux) - { - static const HChar croakage[] - = "Possible fix: install glibc's debuginfo package on this machine."; + /* nothing so far */ - /* If we're using memcheck, use these intercepts right from - the start, otherwise ld.so makes a lot of noise. */ - if (0==VG_(strcmp)("Memcheck", VG_(details).name)) { - - /* this is mandatory - can't sanely continue without it */ - add_hardwired_spec( - "ld64.so.1", "strlen", - (Addr)VG_(fnptr_to_fnentry)( &VG_(ppc64_linux_REDIR_FOR_strlen) ), - croakage - ); - - add_hardwired_spec( - "ld64.so.1", "index", - (Addr)VG_(fnptr_to_fnentry)( &VG_(ppc64_linux_REDIR_FOR_strchr) ), - NULL /* not mandatory - so why bother at all? */ - /* glibc-2.5 (FC6, ppc64) seems fine without it */ - ); - } - } - # elif defined(VGP_ppc32_aix5) /* nothing so far */ Julian, any comments? The requirements for which redirects need to available in ld.so at startup time, are a bit conservative. The problem is that Memcheck may work ok on one distro with fewer redirects (eg, Fedora 10, ppc64) but still report huge numbers of false errors on a slightly different platform (eg, OpenSUSE 11.0, ppc64). A background difficulty is that the relative un-availability of virtual machines for ppc makes verification difficult. For similar scenarios on x86 or amd64, VMware etc make it easy to test proposed fixes on a variety of distros, but not so with ppc. 32-bit ppc support (in system mode) now sort-of works on QEMU, but 64-bit support is still essentially unusable, afaict. Asking users to install glibc debuginfo packages is not a viable solution either. They are not always available, and even when they are, there can be problems with online update: the main glibc packages are subject to online update in the normal way, but not the debuginfo packages. Hence either one has to not do online updates, or uninstall glibc-devel. This happened to me at least on openSUSE 10.3 on ppc32. This bug has annoyed me for a long time, because it basically means Valgrind doesn't work reliabily on ppc Linux. The best long term fix would be to make Memcheck's value-tracking machinery be able to handle without false positives, the optimised strlen etc that ppc Linux is using. Then the whole problem would go away. However, identifying the root problem is difficult. I spent an afternoon trying to figure out what was going on, but failed to fix it, and moved on to other things. Reopened this issue such that it becomes again visible in the list of open bugs. (In reply to comment #4) The problem of re-directing strlen in a stripped ld64.so.1 on ppc64 can be solved with some machine-specific determination. I solved the same problem for Wine on x86 about a year ago, and submitted patches to valgrind-developers. The instruction stream for strlen is stylized enough that it can be recognized efficiently just-in-time: just before reporting an error. It takes about 2 comparisons to rule out impostors, up to 3 more comparisons to determine near-certainty, and a memcmp of the entire length to verify that the error is strlen. If it is strlen, then perform the re-direction instead of continuing with reporting the error. If not (and 99% of the time the first two comparisons will say so) then just report the error. If someone would please post the offending code (the disassembly of the actual ppc64 code which implements strlen in the stripped ld64.so.1) then I am willing to provide specific guidance. I do have an old PowerPC Apple Macintosh, but it is only a 32-bit Mac G4 mini, not a 64-bit G5 tower. Not sure this helps, but below you can find the strlen disassembly for Fedora 7 ppc. $ cat /etc/issue.net Fedora release 7 (Moonshine) Kernel \r on an \m $ rpm -qf /lib/libc.so.6 glibc-2.6-3 $ gdb /lib/libc.so.6 GNU gdb Red Hat Linux (6.6-8.fc7rh) Copyright (C) 2006 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "ppc-redhat-linux-gnu"... (no debugging symbols found) Using host libthread_db library "/lib/libthread_db.so.1". (gdb) disas strlen Dump of assembler code for function strlen: 0x0fea6ce8 <strlen+0>: rlwinm r4,r3,0,0,29 0x0fea6cec <strlen+4>: lis r7,32639 0x0fea6cf0 <strlen+8>: rlwinm r5,r3,3,27,28 0x0fea6cf4 <strlen+12>: lwz r8,0(r4) 0x0fea6cf8 <strlen+16>: li r9,-1 0x0fea6cfc <strlen+20>: addi r7,r7,32639 0x0fea6d00 <strlen+24>: srw r9,r9,r5 0x0fea6d04 <strlen+28>: and r0,r7,r8 0x0fea6d08 <strlen+32>: or r10,r7,r8 0x0fea6d0c <strlen+36>: add r0,r0,r7 0x0fea6d10 <strlen+40>: nor r0,r10,r0 0x0fea6d14 <strlen+44>: and. r8,r0,r9 0x0fea6d18 <strlen+48>: mtcrf 1,r3 0x0fea6d1c <strlen+52>: bne- 0xfea6d8c <strlen+164> 0x0fea6d20 <strlen+56>: lis r6,-257 0x0fea6d24 <strlen+60>: addi r6,r6,-257 0x0fea6d28 <strlen+64>: bgt- cr7,0xfea6d44 <strlen+92> 0x0fea6d2c <strlen+68>: lwzu r8,4(r4) 0x0fea6d30 <strlen+72>: and r0,r7,r8 0x0fea6d34 <strlen+76>: or r10,r7,r8 0x0fea6d38 <strlen+80>: add r0,r0,r7 0x0fea6d3c <strlen+84>: nor. r8,r10,r0 0x0fea6d40 <strlen+88>: bne- 0xfea6d8c <strlen+164> 0x0fea6d44 <strlen+92>: lwz r8,4(r4) 0x0fea6d48 <strlen+96>: lwzu r9,8(r4) 0x0fea6d4c <strlen+100>: add r0,r6,r8 0x0fea6d50 <strlen+104>: nor r10,r7,r8 0x0fea6d54 <strlen+108>: and. r0,r0,r10 0x0fea6d58 <strlen+112>: add r11,r6,r9 0x0fea6d5c <strlen+116>: nor r12,r7,r9 0x0fea6d60 <strlen+120>: bne- 0xfea6d7c <strlen+148> 0x0fea6d64 <strlen+124>: and. r0,r11,r12 0x0fea6d68 <strlen+128>: beq+ 0xfea6d44 <strlen+92> 0x0fea6d6c <strlen+132>: and r0,r7,r9 0x0fea6d70 <strlen+136>: add r0,r0,r7 0x0fea6d74 <strlen+140>: andc r8,r12,r0 0x0fea6d78 <strlen+144>: b 0xfea6d8c <strlen+164> 0x0fea6d7c <strlen+148>: and r0,r7,r8 0x0fea6d80 <strlen+152>: addi r4,r4,-4 0x0fea6d84 <strlen+156>: add r0,r0,r7 0x0fea6d88 <strlen+160>: andc r8,r10,r0 0x0fea6d8c <strlen+164>: cntlzw r11,r8 0x0fea6d90 <strlen+168>: subf r0,r3,r4 0x0fea6d94 <strlen+172>: rlwinm r11,r11,29,3,31 0x0fea6d98 <strlen+176>: add r3,r0,r11 0x0fea6d9c <strlen+180>: blr End of assembler dump. Thank you for posting the disassembly, Bart. In the special case for the beginning of memcheck on ppc64, if strlen is not found in ld64.so.1 by symbol, then search for it by value. Linearly scan the .text of ld64.so for the known first instruction (rlwinm r4,r3,0,0,29). When found then memcmp the remainder of the length of the code for strlen. Obviously a compiler could pick different temporary registers, the code could change from version to version, etc. But that's just a maintenance headache in contrast to conceding defeat. In the slightly more general case of "recognize strlen just before reporting an error", there are only 4 instructions which fetch data from memory: 0x0fea6cf4 <strlen+12>: lwz r8,0(r4) 0x0fea6d2c <strlen+68>: lwzu r8,4(r4) 0x0fea6d44 <strlen+92>: lwz r8,4(r4) 0x0fea6d48 <strlen+96>: lwzu r9,8(r4) so those are the only possible values for PC at a suspected error. Compare the instruction word at the fault PC. If a match then check for the magic constant 0x7f7f (32639) at strlen+4 and strlen+20, and/or the opcode mtcrf at strlen+48, and/or the opcode cntlzw at strlen+164. If still a match, then memcmp the entire .text of strlen. If strlen has been identified, then do like the patches: (8/14) vgsvn+wine-chkstk-nosym.patch recover from frequent memcheck complaints (13/14) vgsvn+wine-replace-strmem.patch _strlen etc. which I posted last year on 07/09/2008 to [Valgrind-developers]. Sounds like an interesting approach. Unfortunately this approach is too advanced for me to implement a solution myself. Just for completeness, here is the disassembly of the openSUSE 10.3 ppc strlen implementation: $ cat /etc/issue.net Welcome to openSUSE 10.3 (PPC) - Kernel %r (%t). $ rpm -qf /lib/libc.so.6 glibc-2.6.1-18 $ gdb /lib/libc.so.6 GNU gdb 6.6.50.20070726-cvs Copyright (C) 2007 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "ppc-suse-linux"... Using host libthread_db library "/lib/libthread_db.so.1". (gdb) disas strlen Dump of assembler code for function strlen: 0x0008107c <strlen+0>: rlwinm r4,r3,0,0,29 0x00081080 <strlen+4>: lis r7,32639 0x00081084 <strlen+8>: rlwinm r5,r3,3,27,28 0x00081088 <strlen+12>: lwz r8,0(r4) 0x0008108c <strlen+16>: li r9,-1 0x00081090 <strlen+20>: addi r7,r7,32639 0x00081094 <strlen+24>: srw r9,r9,r5 0x00081098 <strlen+28>: and r0,r7,r8 0x0008109c <strlen+32>: or r10,r7,r8 0x000810a0 <strlen+36>: add r0,r0,r7 0x000810a4 <strlen+40>: nor r0,r10,r0 0x000810a8 <strlen+44>: and. r8,r0,r9 0x000810ac <strlen+48>: mtcrf 1,r3 0x000810b0 <strlen+52>: bne- 0x81120 <strlen+164> 0x000810b4 <strlen+56>: lis r6,-257 0x000810b8 <strlen+60>: addi r6,r6,-257 0x000810bc <strlen+64>: bgt- cr7,0x810d8 <strlen+92> 0x000810c0 <strlen+68>: lwzu r8,4(r4) 0x000810c4 <strlen+72>: and r0,r7,r8 0x000810c8 <strlen+76>: or r10,r7,r8 0x000810cc <strlen+80>: add r0,r0,r7 0x000810d0 <strlen+84>: nor. r8,r10,r0 0x000810d4 <strlen+88>: bne- 0x81120 <strlen+164> 0x000810d8 <strlen+92>: lwz r8,4(r4) 0x000810dc <strlen+96>: lwzu r9,8(r4) 0x000810e0 <strlen+100>: add r0,r6,r8 0x000810e4 <strlen+104>: nor r10,r7,r8 0x000810e8 <strlen+108>: and. r0,r0,r10 0x000810ec <strlen+112>: add r11,r6,r9 0x000810f0 <strlen+116>: nor r12,r7,r9 0x000810f4 <strlen+120>: bne- 0x81110 <strlen+148> 0x000810f8 <strlen+124>: and. r0,r11,r12 0x000810fc <strlen+128>: beq+ 0x810d8 <strlen+92> 0x00081100 <strlen+132>: and r0,r7,r9 0x00081104 <strlen+136>: add r0,r0,r7 0x00081108 <strlen+140>: andc r8,r12,r0 0x0008110c <strlen+144>: b 0x81120 <strlen+164> 0x00081110 <strlen+148>: and r0,r7,r8 0x00081114 <strlen+152>: addi r4,r4,-4 0x00081118 <strlen+156>: add r0,r0,r7 0x0008111c <strlen+160>: andc r8,r10,r0 0x00081120 <strlen+164>: cntlzw r11,r8 0x00081124 <strlen+168>: subf r0,r3,r4 0x00081128 <strlen+172>: rlwinm r11,r11,29,3,31 0x0008112c <strlen+176>: add r3,r0,r11 0x00081130 <strlen+180>: blr End of assembler dump. The code is glibc/sysdeps/powerpc/powerpc{32,64}/strlen.S [hand-written assembly language] which has been unchanged for about five years (2004-10-06). The 64-bit version is the same as the 32-bit version except for the obvious change in word size, and the addition of one data cache prefetch for the beginning of the string, and a possible call to mcount. If anyone truly cares, then they should offer remote access via ssh+private_key to such a system, in order to facilitate implementation. Created attachment 36315 [details]
implement search-by-value for special redirections
This patch implements "search .text for known instruction sequence" when searching by name fails for special redirections. After disabling the search by name (for testing purposes only), then the patched code finds 'strlen' and 'index' by value in ld64.so.1 on the ppc64 system (which is running Fedora 9) that is reached via "qsub -I" from the front-end at cell-user.cc.gatech.edu. The patched code then calls maybe_add_active() with the same struct-by-value that would have resulted from the find-it-by-name search, had the find-it-by-name search not been disabled for testing.
So, would somebody who is running ppc64 with a stripped ld64.so.1 please try running memcheck on a 64-bit executable after applying the patch?
I will give the patch a try. Thanks for developing this patch. The summary generated by "make regtest" (trunk r10856) on openSUSE 10.3 PPC can be found below. I have verified that the differences with the output on Fedora 6 PPC are not related to calls to strlen(). == 446 tests, 47 stderr failures, 12 stdout failures, 1 post failure == memcheck/tests/badjump (stderr) memcheck/tests/badjump2 (stderr) memcheck/tests/deep_templates (stdout) memcheck/tests/leak-cases-full (stderr) memcheck/tests/leak-cases-summary (stderr) memcheck/tests/leak-cycle (stderr) memcheck/tests/origin5-bz2 (stderr) memcheck/tests/partiallydefinedeq (stderr) memcheck/tests/supp_unknown (stderr) memcheck/tests/varinfo1 (stderr) memcheck/tests/varinfo2 (stderr) memcheck/tests/varinfo3 (stderr) memcheck/tests/varinfo4 (stderr) memcheck/tests/varinfo5 (stderr) memcheck/tests/varinfo6 (stderr) memcheck/tests/wrap8 (stderr) massif/tests/deep-D (post) none/tests/empty-exe (stderr) none/tests/linux/mremap (stderr) none/tests/ppc32/jm-fp (stdout) none/tests/ppc32/jm-int (stdout) none/tests/ppc32/jm-vmx (stdout) none/tests/ppc32/round (stdout) none/tests/ppc32/test_gx (stdout) none/tests/ppc64/jm-fp (stdout) none/tests/ppc64/jm-int (stdout) none/tests/ppc64/jm-vmx (stdout) none/tests/ppc64/round (stdout) none/tests/shell_valid2 (stderr) none/tests/shell_valid3 (stderr) none/tests/shell_zerolength (stderr) helgrind/tests/hg05_race2 (stderr) helgrind/tests/tc06_two_races_xml (stderr) helgrind/tests/tc08_hbl2 (stdout) helgrind/tests/tc22_exit_w_lock (stderr) helgrind/tests/tc23_bogus_condwait (stderr) drd/tests/tc08_hbl2 (stdout) drd/tests/tc23_bogus_condwait (stderr) exp-ptrcheck/tests/bad_percentify (stderr) exp-ptrcheck/tests/base (stderr) exp-ptrcheck/tests/ccc (stderr) exp-ptrcheck/tests/fp (stderr) exp-ptrcheck/tests/globalerr (stderr) exp-ptrcheck/tests/hackedbz2 (stderr) exp-ptrcheck/tests/hp_bounds (stderr) exp-ptrcheck/tests/hp_dangle (stderr) exp-ptrcheck/tests/hsg (stderr) exp-ptrcheck/tests/justify (stderr) exp-ptrcheck/tests/partial_bad (stderr) exp-ptrcheck/tests/partial_good (stderr) exp-ptrcheck/tests/preen_invars (stderr) exp-ptrcheck/tests/pth_create (stderr) exp-ptrcheck/tests/pth_specific (stderr) exp-ptrcheck/tests/realloc (stderr) exp-ptrcheck/tests/stackerr (stderr) exp-ptrcheck/tests/strcpy (stderr) exp-ptrcheck/tests/supp (stderr) exp-ptrcheck/tests/tricky (stderr) exp-ptrcheck/tests/unaligned (stderr) exp-ptrcheck/tests/zero (stderr) Overall I must say I am not in favour of pattern matching machine code -- I don't see it as a maintainable long term solution. It might be useful for emergency fixes, but in this case it's really working around the fact that the distro maintainers threw away information (the symbol table entry point for ld.so strlen) which is important for Memcheck, and which they could very easily have not thrown away. IMO we'd be better off trying to educate/persuade distro maintainers to leave visible in ld.so the symbols Memcheck needs to see. It's better for them to take on board the concept that they can't just make any changes they like to ld.so/libc.so and still expect Valgrind to work properly. I don't think it's an unreasonable request, to make strlen in ld.so externally visible. That said .. improving Memcheck's ability to deal with such codes unmodified is no bad thing. The kinds of tricks we see in the optimised ld.so/libc.so strlen et al functions are gradually getting generated inline by compilers, making it unavoidable to deal with them directly. Already we are in the situation where Memcheck is more or less unusable with version 11.1 of the Intel compiler suite for precisely this reason. (icc 11.0 and before appear to be ok). (In reply to comment #4) > The best long term > fix would be to make Memcheck's value-tracking machinery be able to > handle without false positives, the optimised strlen etc that ppc Linux > is using. Then the whole problem would go away. However, identifying > the root problem is difficult. It seems to me that if memcheck replaced the arithmetic operations 'add' and 'sub' with their *boolean* equivalents (i.e., all Carry operations written explicitly using only AND, OR, XOR, NOT, and raw input bits) then the problem ought to be solved. However, that might be several times slower. What about an intermediate step: use Carry-Save adder. (a + b) === ((a ^ b) + ((a & b) << 1)). Does this expose the Carry bits enough so that memcheck's tracking of undefinedness can understand the magic constants 0x7f7f7f... and 0x808080... and their effects under the opcodes that are used by strlen, etc.? The reason why memcheck has trouble with strlen is a big fat bug in the tracking of undefined bits for ADD and SUB. ----- vg.c #include <stdlib.h> #include <stdio.h> main() { int *p = malloc(4); int *q = malloc(4); *p &= 0x0000ffff; *q &= 0x0000ffff; if ((*p + *q) < 0) { printf("impossible\n"); } return 0; } ----- The Valid bits for the operands to the addition are: *p 11111111111111110000000000000000 (top 16 init, bottom 16 uninit) *q 11111111111111110000000000000000 (top 16 init, bottom 16 uninit) sum 11111111111111100000000000000000 (top 15 init, bottom 17 uninit) and the value of the sum is known to be sum 000000000000000xxxxxxxxxxxxxxxxx (top 15 bits zero, bottom 17 uninit) So the test never is true (the top bit of the sum is zero), yet memcheck 3.5.0 complains [x86_64 Fedora 11]: ==5412== Conditional jump or move depends on uninitialised value(s) ==5412== at 0x400557: main (vg.c:10) The printf() is not executed. |