Summary: | Incorrect scheme for detecting NEON capabilities of host CPU | ||
---|---|---|---|
Product: | [Developer tools] valgrind | Reporter: | Peter Maydell <peter.maydell> |
Component: | vex | Assignee: | Julian Seward <jseward> |
Status: | VERIFIED WAITINGFORINFO | ||
Severity: | normal | ||
Priority: | NOR | ||
Version: | 3.6 SVN | ||
Target Milestone: | --- | ||
Platform: | Compiled Sources | ||
OS: | Linux | ||
Attachments: |
patch which enables support for BX PC
valgrind -v output for secondary failure |
Created attachment 51188 [details]
valgrind -v output for secondary failure
> With the attached patch (which just enables the stubbed out /*atc*/
> handling of this case)

"atc" stands for "awaiting test case". It means the handler got written but no instance of the instruction has so far been seen.

> valgrind proceeds past this point and works at least some of the
> time. However it then intermittently fails (perhaps one run in three) with
>
> ==18028== Process terminating with default action of signal 4 (SIGILL)
> ==18028== Illegal opcode at address 0x6285008C
> ==18028== at 0x40112F4: __sigsetjmp (setjmp.S:59)
> ==18028== by 0x400B040: _dl_catch_error (dl-error.c:175)
> ==18028== by 0x4009E66: _dl_map_object_deps (dl-deps.c:249)
> ==18028== Jump to the invalid address stated on the next line
> ==18028== at 0x3E4: ???
> ==18028== Address 0x3e4 is not stack'd, malloc'd or (recently) free'd

That's (confusingly) a fault in the JIT generated code, not in the front end. An easy way to debug is to rerun with --wait-for-gdb=yes, which puts V in a minute-ish long spin loop. In that time, attach gdb to it from another shell and let it continue. You might also have to continue past a few (expected) segfaults. Eventually you should get to the SIGILL.

I wonder if this is fallout from mis-handling BX PC, but I can't see how. It would be useful to see how the simulator got to this place. Re-run with --trace-flags=10000000 --trace-notbelow=99999. Once you see what SB number it's failing on, change the 99999 to that number (or one or two just below) so you can see what insns the front end is decoding.

The secondary problem turns out to be unrelated to BX PC.
The problematic instruction is an fstmiad in __sigsetjmp:

    (arm) 0x40112F4: fstmiad r12!, {d8-d15}

    ------ IMark(0x40112F4, 4) ------
    t0 = GET:I32(48)
    t1 = Add32(t0,0x40:I32)
    t2 = t0
    STle(Add32(t2,0x0:I32)) = GET:F64(184)
    STle(Add32(t2,0x8:I32)) = GET:F64(192)
    STle(Add32(t2,0x10:I32)) = GET:F64(200)
    STle(Add32(t2,0x18:I32)) = GET:F64(208)
    STle(Add32(t2,0x20:I32)) = GET:F64(216)
    STle(Add32(t2,0x28:I32)) = GET:F64(224)
    STle(Add32(t2,0x30:I32)) = GET:F64(232)
    STle(Add32(t2,0x38:I32)) = GET:F64(240)
    PUT(48) = t1

That gets translated into a code sequence which includes

    vld1.32 {d8} [r9]    8F 87 29 F4

VLD1 is a Neon instruction, and this system doesn't have Neon, only VFP, so we SIGILL when we try to execute it. Valgrind is incorrectly deciding that we do have neon; on startup it says "Arch and hwcaps: ARM, ARMv7-vfp-neon".

I'm not sure why machine_get_hwcaps() is diagnosing the system as having Neon: if I single step through it in gdb then we get a SIGILL on the 'vorr q2,q2,q2' diagnostic insn it is using, and the NEON bit is not set in hwcaps. However if I let gdb run through the function rather than stepping then we do not get a SIGILL, and the NEON bit is set...

(The runs where valgrind works also diagnose the machine as having neon, so it's not that the diagnosis is giving variable results; it's consistently wrong when not being singlestepped, it's just that the wrong diagnosis doesn't always cause a crash later.)

> I'm not sure why machine_get_hwcaps() is diagnosing the system as having Neon
"Does this neon instruction execute?" is apparently not a valid way to make this check -- the kernel may have had support compiled out or disabled because of hardware issues. I'm told you need to check for HWCAP_NEON in /proc/self/auxv.
I hacked machine_get_hwcaps() to force it to say 'no neon', and the intermittent failures have gone away.
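For illustration, the auxv-based check suggested in comment #4 could look roughly like the following standalone C sketch. This is not the code that was committed to Valgrind: all `sk_` names are hypothetical, the bit positions are the 32-bit ARM Linux values for HWCAP_VFP and HWCAP_NEON, and the sketch uses stdio for brevity, whereas Valgrind's core avoids the C library and would presumably need its own file-reading helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Auxiliary-vector keys from the Linux ELF ABI, redefined locally
   (hypothetical names) so the sketch compiles on any host. */
enum { SK_AT_NULL = 0, SK_AT_HWCAP = 16 };

/* 32-bit ARM Linux hwcap bits; other architectures assign these
   bit positions differently. */
#define SK_HWCAP_VFP  (1UL << 6)
#define SK_HWCAP_NEON (1UL << 12)

/* One auxv entry: a (type, value) pair of native words. */
typedef struct { unsigned long a_type; unsigned long a_val; } sk_auxv_t;

/* Scan up to n entries for 'type'. Returns 1 and stores the value on
   success, 0 if AT_NULL or the end of the buffer is reached first. */
int sk_auxv_lookup(const sk_auxv_t *v, size_t n,
                   unsigned long type, unsigned long *out)
{
    assert(out != NULL);
    for (size_t i = 0; i < n && v[i].a_type != SK_AT_NULL; i++) {
        if (v[i].a_type == type) {
            *out = v[i].a_val;
            return 1;
        }
    }
    return 0;
}

/* Read AT_HWCAP for the current process from /proc/self/auxv.
   Returns 1 on success, 0 if the file or the entry is unavailable. */
int sk_read_hwcap(unsigned long *hwcap)
{
    sk_auxv_t buf[64];
    FILE *f = fopen("/proc/self/auxv", "rb");
    if (f == NULL)
        return 0;
    size_t n = fread(buf, sizeof buf[0], sizeof buf / sizeof buf[0], f);
    fclose(f);
    return sk_auxv_lookup(buf, n, SK_AT_HWCAP, hwcap);
}

/* Capability tests on an ARM hwcap word. */
int sk_has_vfp(unsigned long hwcap)  { return (hwcap & SK_HWCAP_VFP) != 0; }
int sk_has_neon(unsigned long hwcap) { return (hwcap & SK_HWCAP_NEON) != 0; }
```

A caller would invoke `sk_read_hwcap(&hw)` once at startup and treat a failed read as "no Neon", so that the kernel's advertised capabilities, rather than whether a probe instruction happens to execute, decide which instructions the backend may emit.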
(In reply to comment #4)
> "Does this neon instruction execute?" is apparently not a valid way to make
> this check

Yes, I'd wondered exactly that. Thanks for chasing it. Does that go only for detecting NEON support, or will we have to check all the features using /proc/self/auxv?

> I hacked machine_get_hwcaps() to force it to say 'no neon', and the
> intermittent failures have gone away.

Good. So at least the backend instruction selection logic is working correctly w.r.t. hw capabilities.

(In reply to comment #5)
> Does that go
> only for detecting NEON support, or will we have to check all the
> features using /proc/self/auxv?

You need to do that for at least Neon and VFP support. I don't think it's as critical for "are we v5/v6/v7?" but I imagine that if you're reading /proc/self/auxv for AT_HWCAP it's as simple to look at AT_PLATFORM to determine v5/v6/v7 as it is to do it by testing for faulting instructions.

> patch which enables support for BX PC
Committed as r2027. Thanks.
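On the AT_PLATFORM suggestion made earlier in this exchange: the AT_PLATFORM auxv entry holds a pointer to a short string naming the platform. The sketch below assumes that on ARM this string has the form "v7l" (a 'v', the architecture major version, then a suffix); that format and the helper name are assumptions for illustration, not something confirmed in this report.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: extract the architecture major version from an
   AT_PLATFORM-style string such as "v7l". Returns -1 if the string
   does not start with 'v' followed by at least one digit. */
int sk_arm_arch_from_platform(const char *platform)
{
    assert(platform != NULL);  /* caller must pass a valid string */

    if (platform[0] != 'v' || !isdigit((unsigned char)platform[1]))
        return -1;

    int arch = 0;
    for (const char *p = platform + 1; isdigit((unsigned char)*p); p++)
        arch = arch * 10 + (*p - '0');
    return arch;
}
```

With glibc 2.16 or later, the string can be obtained via `getauxval(AT_PLATFORM)` from `<sys/auxv.h>`; code that cannot rely on the C library would instead take the pointer from the AT_PLATFORM entry it finds while walking the auxiliary vector.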
Changed the title to reflect the more serious bug.

Fixed, r11347/r2032.

r11347 only checks the auxv for Neon; I think you also need to do this for VFP. Reopening. (What's the right way to reopen a bug?)

But would prefer to see a real failure as a result of not correctly detecting VFP before fixing.

I think for a test case you could try building a kernel with no VFP support and running it on an A8 or similar. But I suspect that it's basically that at the moment you're relying on something that happens to work but which isn't guaranteed to do so. (We just don't happen to currently have any hardware with broken VFP the way we do Neon.)

Sorry, I didn't mean to change the status fields there.
Created attachment 51187 [details]
patch which enables support for BX PC

Hi. I've compiled valgrind from svn on Ubuntu maverick for ARM, in order to test the Thumb support that has recently landed in the svn trunk. This is a pure svn checkout with svn r11315, VEX svn r2025, built with "./configure CFLAGS='-marm -fno-stack-protector' && make". I'm building on a pegatron board (freescale MX51 based): uname -a says:

Linux linaro-m-10141 2.6.31-008-ER1-lange51 #1 Fri Apr 9 14:06:09 UTC 2010 armv7l GNU/Linux

In Maverick everything is built with Thumb2 by default. Unfortunately valgrind doesn't decode an instruction in the dynamic linker's startup sequence so valgrinding anything fails:

[linaro-m-dev] ubuntu@linaro-m-10141:~/valgrind-svn/trunk$ ./vg-in-place -v /bin/ls
==17674== Memcheck, a memory error detector
==17674== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==17674== Using Valgrind-3.6.0.SVN and LibVEX; rerun with -h for copyright info
==17674== Command: /bin/ls
==17674==
--17674-- Valgrind options:
--17674-- -v
--17674-- Contents of /proc/version:
--17674-- Linux version 2.6.31-008-ER1-lange51 (ubuntu@babbage-davem-1) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu4) ) #1 Fri Apr 9 14:06:09 UTC 2010
--17674-- Arch and hwcaps: ARM, ARMv7-vfp-neon
--17674-- Page sizes: currently 4096, max supported 4096
--17674-- Valgrind library directory: /home/ubuntu/valgrind-svn/trunk/./.in_place
--17674-- Reading syms from /bin/ls (0x8000)
--17674-- Considering /bin/ls ..
--17674-- .. CRC mismatch (computed 017f562f wanted 1a5b4806)
--17674-- object doesn't have a symbol table
--17674-- Reading syms from /lib/ld-2.12.1.so (0x4000000)
--17674-- Considering /lib/ld-2.12.1.so ..
--17674-- .. CRC mismatch (computed 25bac168 wanted b996edb1)
--17674-- Considering /usr/lib/debug/lib/ld-2.12.1.so ..
--17674-- .. CRC is valid
--17674-- Reading syms from /home/ubuntu/valgrind-svn/trunk/memcheck/memcheck-arm-linux (0x38000000)
--17674-- object doesn't have a dynamic symbol table
--17674-- Reading suppressions file: /home/ubuntu/valgrind-svn/trunk/./.in_place/default.supp
--17674-- REDIR: 0x4012180 (memcpy) redirected to 0x38043530 (???)
--17674-- REDIR: 0x4011610 (strlen) redirected to 0x38043504 (???)
disInstr(thumb): unhandled instruction: 0x4778 0x46C0
==17674== valgrind: Unrecognised instruction at address 0x40007a5.
==17674== Your program just tried to execute an instruction that Valgrind
==17674== did not recognise. There are two possible reasons for this.
==17674== 1. Your program has a bug and erroneously jumped to a non-code
==17674== location. If you are running Memcheck and you just saw a
==17674== warning about a bad jump, it's probably your program's fault.
==17674== 2. The instruction is legitimate but Valgrind doesn't handle it,
==17674== i.e. it's Valgrind's fault. If you think this is the case or
==17674== you are not sure, please let us know and we'll try to fix it.
==17674== Either way, Valgrind will now raise a SIGILL signal which will
==17674== probably kill your program.
==17674==
==17674== Process terminating with default action of signal 4 (SIGILL)
==17674== Illegal opcode at address 0x40007A5
==17674== at 0x40007A5: ??? (in /lib/ld-2.12.1.so)
==17674==
==17674== HEAP SUMMARY:
==17674== in use at exit: 0 bytes in 0 blocks
==17674== total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==17674==
==17674== All heap blocks were freed -- no leaks are possible
==17674==
==17674== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==17674== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Illegal instruction

With the attached patch (which just enables the stubbed out /*atc*/ handling of this case) valgrind proceeds past this point and works at least some of the time. However it then intermittently fails (perhaps one run in three) with

==18028== Process terminating with default action of signal 4 (SIGILL)
==18028== Illegal opcode at address 0x6285008C
==18028== at 0x40112F4: __sigsetjmp (setjmp.S:59)
==18028== by 0x400B040: _dl_catch_error (dl-error.c:175)
==18028== by 0x4009E66: _dl_map_object_deps (dl-deps.c:249)
==18028== Jump to the invalid address stated on the next line
==18028== at 0x3E4: ???
==18028== Address 0x3e4 is not stack'd, malloc'd or (recently) free'd

(I'll attach the full -v output). I haven't yet investigated this secondary failure.