Summary: | valgrind/memcheck segfaults under certain kernel versions (amd64) but not others. | ||
---|---|---|---|
Product: | [Developer tools] valgrind | Reporter: | zephyrus00jp <ishikawa> |
Component: | memcheck | Assignee: | Julian Seward <jseward> |
Status: | REPORTED --- | ||
Severity: | normal | CC: | jreiser |
Priority: | NOR | ||
Version: | unspecified | ||
Target Milestone: | --- | ||
Platform: | Other | ||
OS: | Linux | ||
Latest Commit: | Version Fixed In: | ||
Sentry Crash Report: | |||
Attachments: |
debug log under 4.7.0.1 (valgrind crashes)
Log for 4.9.x kernel test runs Log for 3.19.5 (sucessful runs for comparison) log of segfault failure under 4.9.x series kernel (debian). with --trace-syscalls=yes A log to show valgrind with --vex-iropt-register-updates=all still segfaults (under 4.7.0.1) A log of valgrind experiencing SIGSEGV under 4.9.x kernel: with full vlgrind options uname -a output, and other requested logs. |
Description
zephyrus00jp
2017-02-28 08:22:23 UTC
Created attachment 104263 [details]
Log for 4.9.x kernel test runs
Log of test runs under 4.9.x kernel.
Created attachment 104264 [details]
Log for 3.19.5 (sucessful runs for comparison)
I needed to remove the largest log which contains a much longer strace dump
during a success run to avoid upload limit.
I will explain the files in more detail later on.
Description of log directory archive. [1] Description of log files under 4.7.0.1 There are a group of files. (N being 2,3,4,5) - logN-main.txt, - logN-gdb-dddd.txt, - logN-gdb-dddd'.txt - logN-stra.out.gz - logN-crash.txt I ran ~/bin/run-valgrind-test.sh which contains the following command strace -f -o /tmp/stra.out valgrind --verbose --trace-signals=yes --show-mismatched-frees=no --vgdb=yes --vgdb-error=0 --trace-children=yes ~/objdir-tb3/dist/bin/thunderbird-bin strace is used to capture what system calls valgrind was issueing. I used -vgdb=yes, and --vgdb-error=0 to attach to the thunderbird binary that is monitied by thunderbird binary. It turns out that thunderbird binary runs an external program (by fork()) about the time it segfaulted and so I added --tracechildren=yes. (However, I found that the crash occurs on the original process). Anyway, logN-main.txt recorded the log output from the above command line. One of the logN-dddd.txt and logN-dddd'.txt contains the first instance of gdb that is attached to the thunderbird process under valgrind/memcheck. I set a few breakpoints at main and at fork, and let it run. I dumped the mmap layout by printing /proc/PID/maps (PID being the process id of the thunderbird process). logN-stra.out.gz contains the recoding from strace. logN-crash.txt contains the last part of the strace output (logN-stra.out.gz) and a few comments regarding the particular test run. Under the particular version of 4.7.0.1 (which is Debian-supplied), valgrind invokes another process by fork(), and when it happens, valgrind suggests that I attach another gdb to the sibling process. (--vgdb=yes --vgdb=error), and I do so and run the sibling process under gdb. However, the sibling does not seem to be the culprit. But the original parent process or rather the original valgrind that runs the original thunderbird binary seemed to crash. (But note the comment about gdb showing three SIGSEGV and the first two (2) are for stack fault and handled by valgrind, and only the third one being the fatal SIGSEGV later in this post. (In the discussion of 4-9-x-dir/log5-main.txt ) Anyway, you can see that valgrind experiences SIGSEGV eventually. I did not include the case where I stepped through the parent valgrind+thunderbird execution by stepping using "c" for continue and "fun" for function level trace. In one case, the crash did not happen and thunderbird ran to completion without experiencing the crash. So the issue is timing-related and race condition exists somewhere in the code (on the valgrind side as well.) [2] log under 4.9.x kernel test runs There are files (N=1,2,3,4) - logN-main.txt - logN-error.txt - logN-stra.out.dddd.gz - logN-gdb-ddd.txt logN-main.txt contains the log/print from running the following command line: strace -ff -o /tmp/stra.out valgrind --verbose --trace-signals=yes --show-mismatched-frees=no --vgdb=yes --vgdb-error=0 --trace-children=yes /NREF-COMM-CENTRAL/objdir-tb3/dist/bin/thunderbird-bin (Sorry, this is on a different machine with different file layout. this is on 4 core machine.) logN-gdb-ddd.txt: logN-gdb-ddd.txt contains the log from the attaching gdb. Since the SIGSEGV occurs way before the child process is forked(), there is only one log from the instance of gdb that is attached to the thunderbird binary monitored under valgrind/memcheck. As in the case of kernel 4.7.0.1, I set breakpoint on main, fork and in a few cases, dlopen, but under kernel 4.7.0.1, valgrind crashes way before fork is called and thus I only dumped /proc/PID/maps at the breakpoint placed at the entry of main(). logN-stra.out.dddd.gz: this contains the strace output. logN-error.txt: (I am afraid I emptied log2-error.txt by mistake....) This contains the last part of logN-stra.out.ddd.gz and a comment. For example, log1-error.txt contains this. --- begin quote valgrind crashed even before I had chance to place a breakpoint, i.e., before --vgdb-error=0 took effect!? gettid() = 2851 mmap(0x803041000, 16384, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x803041000 getpid() = 2851 write(1027, "--2851-- REDIR: 0x52e68b0 (libst"..., 115) = 115 --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xffeffa24c} --- +++ killed by SIGSEGV +++ --- end quote So the situation is highly timing dependent / racy. log5-main.txt: this is very interesting. I ran valgrind under gdb. And then issued to gdb run --verbose --trace-signals=yes --show-mismatched-frees=no --vgdb=yes --vgdb-error=0 --trace-children=yes /NREF-COMM-CENTRAL/objdir-tb3/dist/bin/thunderbird-bin With this log, you can see that valgrind DOES experience SIGSEGV, but it uses SIGSEGV internally to catch stack over(?)/under(?) flow and adjust the stack and continue. (I came to recognize this after I saw the signal tracking log lines from valgrind.) Previously, as soon as I saw SIGSEGV, I gave up. (And the stack trace is that of signal handler for SIGSEGV (???), and it is not printed as in the usual case???. That prompted my giving up also.) In this log file, log5-main.txt, for the first two SIGSEGV cases, I dumped stack and register (info register), and then continued with "c" for cont. To my surprise, valgrind happily continued. Only at the third SIGSEGV, valgrind crashes. This may give us some clue on the nature of the crash? [3] 3.19.5 log from successful runs. This was captured under 3.19.5 kernel. This shows the success runs and what messages ought to be printed by successful valgrind run. There are files (N=1,2,3) logN-main.txt: co I ran the following command. strace -ff -o /tmp/stra.out valgrind --verbose --trace-signals=yes --show-mismatched-frees=no --vgdb=yes --vgdb-error=0 --trace-children=yes /NREF-COMM-CENTRAL/objdir-tb3/dist/bin/thunderbird-bin Under the kernel 3.19.5 which I used, valgrind does not crash, and thus the initial run of log1-main.txt had to be interrupted midway because the strace output became too large: it had to be removed from the upload, too. But for the two subsequent runs, log2-main.txt and log3-main.txt, I exited thunderbird runs after the initial profile selection menu comes up: I simply exited from there. By that time, though, the sibling process to probe the graphics adaptor was invoked and exited successfully. logN-gdb-ddd.txt, logN-gdb-ddd'.txt: I attach gdb to the initial thunderbird process and the subsequent sibling process that is forked from the original process. logN-stra.out.ddd.gz: the strace output for the initial valgrind process. There are only two of them: 3-19-5-dir/log2-stra.out.10447.gz 3-19-5-dir/log3-stra.out.10973.gz For the first test run, it became too large since I followed the execution until initial windows came up. I had to drop it before uploading. The immediate change you will notice is the slight change of mmap. Somehow the initial process's stack is NOT at the end, but somewhere earlier. The successful case shows many signal handling dump. I wondery why we do not see them in the failed cases. That may be related to the root cause(s) of the crash. Hope this helps someone in figuring out the true nature of the bug or in suggesting some tips for further debugging on my end. TIA Created attachment 104424 [details]
log of segfault failure under 4.9.x series kernel (debian). with --trace-syscalls=yes
This is under 4.9.x.
The log file by running a shell script that contains the following command
with
bash -vx the_script_file 2>&1 | tee /tmp/t-log
strace -ff valgrind --verbose --trace-syscalls=yes --trace-signals=yes --show-mismatched-frees=no --trace-children=yes /NREF-COMM-CENTRAL/objdir-tb3/dist/bin/thunderbird-bin
"--trace-syscalls" ought to show the syscalls executed by the traced program, namely, thunderbird-binary.
I see the message about such syscalls written by write(1027, ...) by valgrind.
Because of the way the log was captured, the trace of |write| by strace
is immediately followed by the string written by |write|.
In any case, the following last portion of the log suggests to me that
it is indeed valgrind that is causing the mysterious crash. Isn't it?
--- begin quote
...
write(1027, "SYSCALL[3810,1](202) ... [async]"..., 49SYSCALL[3810,1](202) ... [async] --> Success(0x0)) = 49
write(1027, " ", 1 ) = 1
write(1027, "\n", 1
) = 1
mmap(0x802fb0000, 196608, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x802fb0000
munmap(0x802f90000, 131072) = 0
gettid() = 3810
write(1029, "Q", 1) = 1
gettid() = 3810
read(1028, "Q", 1) = 1
gettid() = 3810
rt_sigprocmask(SIG_SETMASK, ~[], ~[KILL STOP], 8) = 0
rt_sigtimedwait(~[], 0x802f8de20, {tv_sec=0, tv_nsec=0}, 8) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigprocmask(SIG_SETMASK, ~[KILL STOP], NULL, 8) = 0
gettid() = 3810
mmap(0x802d3d000, 16384, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x802d3d000
getpid() = 3810
write(1027, "--3810-- REDIR: 0x52e68b0 (libst"..., 115--3810-- REDIR: 0x52e68b0 (libstdc++.so.6:operator delete(void*)) redirected to 0x4c2d19c (operator delete(void*))
) = 115
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xffeffa22c} ---
+++ killed by SIGSEGV +++
/home/ishikawa/bin/run-valgrind-thunderbird.sh: line 13: 3808 Segmentation fault strace -ff valgrind --verbose --trace-syscalls=yes --trace-signals=yes --show-mismatched-frees=no --trace-children=yes /NREF-COMM-CENTRAL/objdir-tb3/dist/bin/thunderbird-bin
--- end quote
The address reported is slightly changed: I think it is due to the slight
change in the source code..
final error in the 4.9.x series:
The following is the old address reported in sigsegv.
gettid() = 3174
mmap(0x803041000, 16384, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x803041000
getpid() = 3174
write(1027, "--3174-- REDIR: 0x52e68b0 (libst"..., 115) = 115
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xffeffa24c} ---
+++ killed by SIGSEGV +++
The address in SIGSEGV is interesting:
From the old log:
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xffeffa24c} ---
From the current log:
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xffeffa22c} ---
It is BELOW the last (anonymous) map (reported in the old log)
--- begin
From 4.9.x series kernel: Failure case
...
806203000-806334000 rwxp 00000000 00:00 0
806af9000-806ce2000 rwxp 00000000 00:00 0 <-- the si_addr is after this.
ffeffe000-fff001000 rw-p 00000000 00:00 0 <---the si_addr is before this.
7ffd03470000-7ffd03492000 rw-p 00000000 00:00 0 [stack]
7ffd034ba000-7ffd034bc000 r--p 00000000 00:00 0 [vvar]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
(gdb) cont
--- end
Hope this helps.
Apparently this user has been in a very similar situation before: https://bugs.kde.org/show_bug.cgi?id=345414#c3 which is only 6 months ago (September, 2016). I wonder if the user has followed Julian's advice from bug #345414 Comment 3: ===== In your mozconfig, you need these ac_add_options --disable-jemalloc ac_add_options --enable-valgrind and when you run Valgrind, you need this --vex-iropt-register-updates=allregs-at-mem-access ===== Created attachment 104426 [details] A log to show valgrind with --vex-iropt-register-updates=all still segfaults (under 4.7.0.1) Sorry I was not specific. I DID follow Julien's advice in the last several months and still no luck. The following options for creating Thunderbird are in my mozconfig file (on two PCs where I test. > ac_add_options --disable-jemalloc > ac_add_options --enable-valgrind [And I don't see unhandled syscall: 317 before segfault in recent logs as you can verify. At least, I didn't recall seeing this for the last few months.] Re the following option: > --vex-iropt-register-updates=allregs-at-mem-access This has been added to the valgrind option when I ran |make mozmill| test suite under one of the test machines, and still no luck. Even with this and a few other options Julian suggested, the combination valgrind+thunderbird runs under 3.19.5 and segfaults under 4.7.0.1 and 4.9.x series. Actually, this option does not make a difference as far as I can tell :-( > https://bugs.kde.org/show_bug.cgi?id=345414#c3 Yes, that thread was reported by me almost two years ago. Back then 4.y series kernel was not available for Debian (it was only in testing repository). But it is now. And I want to use the later kernel versions for obvious reasons. I have to emphasize that the bug still stands with the suggested option. I am attaching the segfault case when valgrind is run with the following parameter under kernel 4.7.0.1 (Debian's distribution). Note the addition of --vex-iropt-register-updates=allregs-at-mem-access. (It does not make a difference. valgrind+thunderbird still segfaults. Sorry I was not specific enough about this in my original report. I did not want to clutter the bug report with the options that do not seem to have effect.) strace -ff valgrind --verbose --trace-syscalls=yes --trrace-signals=yes --show-mismatched-frees=no --trace-children=yes --vex-iropt-register-updates=allregs-at-mem-access ~ishikawa/objdir-tb3/dist/bin/thunderbird-bbin On this machine with this kernel, original valgrind+thunderbird segfaults AFTER a child process spawned by thunderbird finishes. On another machine with 4.9.x kernel, valgrind+thunderbird segfaults way before the child process fork/exec happens. It is really frustrating to see the combination of valgrind+thunderbird work only under certain kernel revisions (in my case, 3.19.5) as noted in https://bugs.kde.org/show_bug.cgi?id=345414#c6 With a slim hope of success, I tried to use the old kernel config for 3.19.5 to create 4.9.z kernel (using make oldconfig ), but valgrind+thunderbird still segfaults under the resulting kernel. (That was on a different PC.) It would be great to find out TO WHERE (if it is meaningful) the stray pointer reported in SIGSEGV points. I think the routine to report the symbols that get mapped to anonymous maps area as viewed by valgrind can be very useful for this. I have a few pet theories for possible issues: 1. Given that when the fatal SIGSEGV is received, the stack trace seems to messed up somehow (they seem to be too low address in comparison to other values), I suspect that it may be that valgrind is experiencing a segfault in the code to set up signal handlers including SIGSEGV. (There were some races in linux kernel regarding some signal issues before after a fork(). Maybe valgrind code inherits similar problems. But do note that under 4.9.x, the valgrind+thunderbird combination still crashes BEFORE fork() is reached. So there may be multpile issues here. 2. I am not sure how valgrind handles this, but, given the different mmap layout, I wonder if the malloc routine in valgrind may assign an area as a return value of malloc() which is at the end of sbrk()'ed area. If so, what happens if multiple-bytes access by x86_64 for strcmp, etc. goes beyond the sbrk() value during operation. That is what happens, that eager access to speed up operation by reading extra bytes (8 or 16 octes) and if the tail part of the octet array falls beyond the user's valid vm address are. Does it get caught as SIGSEGV? Or is such an access checked byte by byte before such an illegal access is attemped? If the reason for the segfault is not one of the above, I am at my wit's end. TIA To avoid the uncertainty about how mozilla thunderbird is compiled for valgrind testing, I will create a binary (for valgrind testing) on mozilla's tryserver. (You an see that what patches are applied via the web interface.) I figure it would be easier this way after I thought about detailing the manual procedure to compile comm-central thunderbird under linux for valgrind, etc. There are some patches, etc. that have to be applied using MERCURIUM hg command, and other steps. Having a visible binary which has undergone standard testing is much better. Unfortunately, the build is currently busted on the tryserver due to a very recent change in the code that is part of mozilla firefox which thunderbird shares. I hope it will be fixed in a few days. Then I can create a binary on the tryserver which can be used for testing this issue under linux. TIA Created attachment 104470 [details]
A log of valgrind experiencing SIGSEGV under 4.9.x kernel: with full vlgrind options
This log is obtained by running
valgrind under a wrapper called run-valgrind (of my creation).
This wrapper runs valgrind with suitable parameters to monitor thunderbird binary as follows.
valgrind --trace-children=yes --fair-sched=yes --smc-check=all-non-file --gen-suppressions=all --vex-iropt-register-updates=allregs-at-mem-access --child-silent-after-fork=yes --trace-children-skip=/usr/bin/hg,/bin/rm,*/bin/certutil,*/bin/pk12util,*/bin/ssltunnel,*/bin/uname,*/bin/which,*/bin/ps,*/bin/grep,*/bin/java --num-transtab-sectors=24 --tool=memcheck --freelist-vol=500000000 --redzone-size=128 --px-default=allregs-at-mem-access --px-file-backed=unwindregs-at-mem-access --malloc-fill=0xA5 --free-fill=0xC3 --num-callers=50 --suppressions=$HOME/Dropbox/myown.sup --show-mismatched-frees=no --show-possibly-lost=no --read-inline-info=yes --trace-syscalls=yes /NREF-COMM-CENTRAL/objdir-tb3/dist/bin/thunderbird-bin
The process IDs that appear are:
pid 28538: the binary wrapper run-valgrind
pid 28539: from the wrapper valgrind is executed by calling system() library call.
[pid 28539] execve("/bin/sh", ["sh", "-c", "valgrind --trace-children=yes --"...], [/* 65 vars */]) = 0
pid 28540: from the spawned shell valgrind is run
[pid 28540] execve("/usr/local/bin/valgrind", ["valgrind", "--trace-children=yes", "--fair-sched=yes", "--smc-check=all-non-file", "--gen-suppressions=all", "--vex-iropt-register-updates=all"..., "--child-silent-after-fork=yes", "--trace-children-skip=/usr/bin/h"..., "--num-transtab-sectors=24", "--tool=memcheck", "--freelist-vol=500000000", "--redzone-size=128", "--px-default=allregs-at-mem-acce"..., "--px-file-backed=unwindregs-at-m"..., "--malloc-fill=0xA5", "--free-fill=0xC3", "--num-callers=50", "--suppressions=/home/ishikawa/Dr"..., "--show-mismatched-frees=no", "--show-possibly-lost=no", "--read-inline-info=yes", "--trace-syscalls=yes", "/NREF-COMM-CENTRAL/objdir-tb3/di"...], [/* 65 vars */] <unfinished ...>
Near the end of the log, we see the following two lines:
[pid 28540] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xffeffa1fc} ---
[pid 28540] +++ killed by SIGSEGV +++
So it is pid 28540 that gets killed by SIGSEGV: pid 28540 is for /usr/local/bin/valgrind thus invoked.
There are two PCs which I cannot easily access all the time.
Sorry for slightly different file layout on two PCs.
TIA
PS: I am trying to create a valgrind-enabled thunderbird binary on mozilla's compilation farm. But I have hit some unexpected difficulties for now.
(The particular CPU farm does NOT have valgrind headers. Ugh.)
I am trying to have it installed. Stay tuned.
The series of patches AND parameters that I use to create comm-central thunderbird is captured in a few shell scripts, but they are rather complex and need customization: I am not quire sure if I can simply post the shell script and people to modify it to their environment. I may not be available for consultation always. So I thought creating a binary on a shared computer is much better.
It looks to me like some part of the problem arises when memcheck is working on the driver for the video graphics card. This suggests a cause for non-determinism, and also a reason for different behavior on different Linux kernels. At various times over the last few years, different parts of the driver have moved between kernel space and user space. So one strategy to avoid SIGSEGV might be to choose a video driver that is as simple as possible; probably this is "VGA framebuffer" (which does exist, but I don't know its actual name.) Another source of non-determinism is the use of threads. I usually see two threads. One of them gets the SIGSEGV, then the other terminates "normally". I ran this group of sessions on : Linux deb81p64 4.9.0-1-amd64 #1 SMP Debian 4.9.6-3 (2017-01-28) x86_64 GNU/Linux 01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK208 [GeForce GT 710B] [10de:128b] (rev a1) libdrm-nouveau2:amd64 2.4.74-1 Gnome 1:3.20+3 desktop environment Thunderbird is icedove-45.7.1 re-built from source in the usual Debian way. THUNDERBIRD_BIN=icedove-45.7.1/obj-thunderbird/dist/bin/thunderbird-bin After building valgrind from current SVN, I modified vg-in-place so that the last command is strace -i -e signal=SIGSEGV -e trace=file,memory "$vgbasedir/coregrind/valgrind" --run-libc-freeres=no --trace-flags=10000000 --trace-notbelow=22081 --trace-syscalls=yes $THUNDERBIRD_BIN >foo 2>&1 and then experimented with --trace-notbelow until I got close to just before the killing SIGSEGV. [The number of basic blocks varied from run to run, which I attribute to non-determinism.] The last two basic blocks are below. You can see the SIGSEGV in the middle of the last block. I saw the SIGSEGV on every run, usually in about 20 seconds of real time on Intel Core 2 Duo @ 3GHz. ===== GuestBytes 1B5BF56B 22 48 8D 3D 1E 33 FF FF 48 89 3C D3 48 63 90 74 02 00 00 85 D2 78 0B 00A1B31B VexExpansionRatio 22 363 165 :10 ==== SB 23786 (evchecks 13409019) [tid 1] 0x1b5bf581 UNKNOWN_FUNCTION /usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so+0xa9581 ------------------------ Front end ------------------------ 0x1B5BF581: leaq -52824(%rip), %rcx ------ IMark(0x1B5BF581, 7, 0) ------ t0 = Add64(0x1B5BF588:I64,0xFFFFFFFFFFFF31A8:I64) PUT(24) = t0 PUT(184) = 0x1B5BF588:I64 0x1B5BF588: movq %rcx,(%rbx,%rdx,8) ------ IMark(0x1B5BF588, 4, 0) ------ t1 = Add64(GET:I64(40),Shl64(GET:I64(32),0x3:I8)) STle(t1) = GET:I64(24) PUT(184) = 0x1B5BF58C:I64 0x1B5BF58C: movslq 640(%rax),%rdx ------ IMark(0x1B5BF58C, 7, 0) ------ t2 = Add64(GET:I64(16),0x280:I64) PUT(32) = 32Sto64(LDle:I32(t2)) PUT(184) = 0x1B5BF593:I64 0x1B5BF593: testl %edx,%edx ------ IMark(0x1B5BF593, 2, 0) ------ t5 = 64to32(GET:I64(32)) t4 = 64to32(GET:I64(32)) t3 = And32(t5,t4) PUT(144) = 0x13:I64 PUT(152) = 32Uto64(t3) PUT(160) = 0x0:I64 PUT(184) = 0x1B5BF595:I64 0x1B5BF595: js-8 0x1B5BF5A2 ------ IMark(0x1B5BF595, 2, 0) ------ if (64to1(amd64g_calculate_condition[mcx=0x13]{0x3817bec0}(0x8:I64,GET:I64(144),GET:I64(152),GET:I64(160),GET:I64(168)):I64)) { PUT(184) = 0x1B5BF5A2:I64; exit-Boring } PUT(184) = 0x1B5BF597:I64 PUT(184) = GET:I64(184); exit-Boring GuestBytes 1B5BF581 22 48 8D 0D A8 31 FF FF 48 89 0C D3 48 63 90 80 02 00 00 85 D2 78 0B 03FEC91B VexExpansionRatio 22 363 165 :10 ==== SB 23787 (evchecks 13409020) [tid 1] 0x1b5bf597 UNKNOWN_FUNCTION /usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so+0xa9597 ------------------------ Front end ------------------------ 0x1B5BF597: leaq -53198(%rip), %rsi ------ IMark(0x1B5BF597, 7, 0) ------ t0 = Add64(0x1B5BF59E:I64,0xFFFFFFFFFFFF3032:I64) PUT(64) = t0 PUT(184) = 0x1B5BF59E:I64 0x1B5BF59E: movq %rsi,(%rbx,%rdx,8) ------ IMark(0x1B5BF59E, 4, 0) ------ t1 = Add64(GET:I64(40),Shl64(GET:I64(32),0x3:I8)) STle(t1) = GET:I64(64) PUT(184) = 0x1B5BF5A2:I64 0x1B5BF5A2: movslq 636(%rax),%rdx ------ IMark(0x1B5BF5A2, 7, 0) ------ t2[????????????????] +++ killed by SIGSEGV +++ = Add64(GET:I64(16),0x27C:I64) PUT(32) = 32Sto64(LDle:I32(t2)) PUT(184) = 0x1B5BF5A9:I64 0x1B5BF5A9: testl %edx,%edx ------ IMark(0x1B5BF5A9, 2, 0) ------ t5 = 64to32(Segmentation fault GET:I64(32)) t4 = 64to32(GET:I64(32)) t3 = And32(t5,t4) PUT(144) = 0x13:I64 PUT(152) = 32Uto64(t3) PUT(160) = 0x0:I64 PUT(184) = 0x1B5BF5AB:I64 0x1B5BF5AB: js-8 0x1B5BF5B8 ===== $ gdb /usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so (gdb) x/12i 0xa9597 0xa9597: lea -0xcfce(%rip),%rsi # 0x9c5d0 0xa959e: mov %rsi,(%rbx,%rdx,8) 0xa95a2: movslq 0x27c(%rax),%rdx 0xa95a9: test %edx,%edx 0xa95ab: js 0xa95b8 0xa95ad: lea -0xd144(%rip),%rdi # 0x9c470 0xa95b4: mov %rdi,(%rbx,%rdx,8) 0xa95b8: movslq 0x284(%rax),%rdx 0xa95bf: test %edx,%edx 0xa95c1: js 0xa95ce 0xa95c3: lea -0xd2ba(%rip),%rcx # 0x9c310 0xa95ca: mov %rcx,(%rbx,%rdx,8) (gdb) x/12i 0xa9597-0x20 0xa9577: movslq 0x274(%rax),%edx 0xa957d: test %edx,%edx 0xa957f: js 0xa958c 0xa9581: lea -0xce58(%rip),%rcx # 0x9c730 0xa9588: mov %rcx,(%rbx,%rdx,8) 0xa958c: movslq 0x280(%rax),%rdx 0xa9593: test %edx,%edx 0xa9595: js 0xa95a2 0xa9597: lea -0xcfce(%rip),%rsi # 0x9c5d0 0xa959e: mov %rsi,(%rbx,%rdx,8) 0xa95a2: movslq 0x27c(%rax),%rdx 0xa95a9: test %edx,%edx ===== (In reply to John Reiser from comment #9) > It looks to me like some part of the problem arises when memcheck is working > on the driver for the video graphics card. This suggests a cause for > non-determinism, and also a reason for different behavior on different Linux > kernels. At various times over the last few years, different parts of the > driver have moved between kernel space and user space. So one strategy to > avoid SIGSEGV might be to choose a video driver that is as simple as > possible; probably this is "VGA framebuffer" (which does exist, but I don't > know its actual name.) Thank you for taking your time to look into the issue. I believe that SIGSEGV issue is now being reproduced in a non-deterministic manner on your PC. > > Another source of non-determinism is the use of threads. I usually see two > threads. One of them gets the SIGSEGV, then the other terminates "normally". OK, I think I will add the fair-schedulling option to valgrind to see if it makes any difference. > I ran this group of sessions on : > Linux deb81p64 4.9.0-1-amd64 #1 SMP Debian 4.9.6-3 (2017-01-28) x86_64 > GNU/Linux > 01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK208 [GeForce > GT 710B] [10de:128b] (rev a1) > libdrm-nouveau2:amd64 2.4.74-1 > Gnome 1:3.20+3 desktop environment > Thunderbird is icedove-45.7.1 re-built from source in the usual Debian way. > THUNDERBIRD_BIN=icedove-45.7.1/obj-thunderbird/dist/bin/thunderbird-bin > > After building valgrind from current SVN, I modified vg-in-place so that the > last command is > strace -i -e signal=SIGSEGV -e trace=file,memory > "$vgbasedir/coregrind/valgrind" --run-libc-freeres=no --trace-flags=10000000 > --trace-notbelow=22081 --trace-syscalls=yes $THUNDERBIRD_BIN >foo 2>&1 > > and then experimented with --trace-notbelow until I got close to just before > the killing SIGSEGV. [The number of basic blocks varied from run to run, > which I attribute to non-determinism.] The last two basic blocks are below. > You can see the SIGSEGV in the middle of the last block. I have never done the detailed debugging of valgrind/memcheck at this level. Do you think that the SIGSEGV somehow occurs in the valgrind code anyhow? (I have a feeling that it may be due to a broken emulation of a very complex instruction that *may* involve certain context-level information.: the current emulation may not be quite well protected/complete in terms of atomicity or something like that. Pure guess. But otherwise, I cannot explain valgrind failure to report nice memory error(s) on its own.) > > I saw the SIGSEGV on every run, usually in about 20 seconds of real time on > Intel Core 2 Duo @ 3GHz. > Actually, I once tried to allocate only a single core to see if the problem symptom changed, but no luck. Still the same SIGSEGV. > ===== > > GuestBytes 1B5BF56B 22 48 8D 3D 1E 33 FF FF 48 89 3C D3 48 63 90 74 02 00 > 00 85 D2 78 0B 00A1B31B > > VexExpansionRatio 22 363 165 :10 > > ==== SB 23786 (evchecks 13409019) [tid 1] 0x1b5bf581 UNKNOWN_FUNCTION > /usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so+0xa9581 > > ------------------------ Front end ------------------------ > > 0x1B5BF581: leaq -52824(%rip), %rcx > > ------ IMark(0x1B5BF581, 7, 0) ------ > t0 = Add64(0x1B5BF588:I64,0xFFFFFFFFFFFF31A8:I64) > PUT(24) = t0 > PUT(184) = 0x1B5BF588:I64 > > 0x1B5BF588: movq %rcx,(%rbx,%rdx,8) > > ------ IMark(0x1B5BF588, 4, 0) ------ > t1 = Add64(GET:I64(40),Shl64(GET:I64(32),0x3:I8)) > STle(t1) = GET:I64(24) > PUT(184) = 0x1B5BF58C:I64 > > 0x1B5BF58C: movslq 640(%rax),%rdx > > ------ IMark(0x1B5BF58C, 7, 0) ------ > t2 = Add64(GET:I64(16),0x280:I64) > PUT(32) = 32Sto64(LDle:I32(t2)) > PUT(184) = 0x1B5BF593:I64 > > 0x1B5BF593: testl %edx,%edx > > ------ IMark(0x1B5BF593, 2, 0) ------ > t5 = 64to32(GET:I64(32)) > t4 = 64to32(GET:I64(32)) > t3 = And32(t5,t4) > PUT(144) = 0x13:I64 > PUT(152) = 32Uto64(t3) > PUT(160) = 0x0:I64 > PUT(184) = 0x1B5BF595:I64 > > 0x1B5BF595: js-8 0x1B5BF5A2 > > ------ IMark(0x1B5BF595, 2, 0) ------ > if > (64to1(amd64g_calculate_condition[mcx=0x13]{0x3817bec0}(0x8:I64,GET:I64(144), > GET:I64(152),GET:I64(160),GET:I64(168)):I64)) { PUT(184) = 0x1B5BF5A2:I64; > exit-Boring } > PUT(184) = 0x1B5BF597:I64 > PUT(184) = GET:I64(184); exit-Boring > > GuestBytes 1B5BF581 22 48 8D 0D A8 31 FF FF 48 89 0C D3 48 63 90 80 02 00 > 00 85 D2 78 0B 03FEC91B > > VexExpansionRatio 22 363 165 :10 > > ==== SB 23787 (evchecks 13409020) [tid 1] 0x1b5bf597 UNKNOWN_FUNCTION > /usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so+0xa9597 > > ------------------------ Front end ------------------------ > > 0x1B5BF597: leaq -53198(%rip), %rsi > > ------ IMark(0x1B5BF597, 7, 0) ------ > t0 = Add64(0x1B5BF59E:I64,0xFFFFFFFFFFFF3032:I64) > PUT(64) = t0 > PUT(184) = 0x1B5BF59E:I64 > > 0x1B5BF59E: movq %rsi,(%rbx,%rdx,8) > > ------ IMark(0x1B5BF59E, 4, 0) ------ > t1 = Add64(GET:I64(40),Shl64(GET:I64(32),0x3:I8)) > STle(t1) = GET:I64(64) > PUT(184) = 0x1B5BF5A2:I64 > > 0x1B5BF5A2: movslq 636(%rax),%rdx > > ------ IMark(0x1B5BF5A2, 7, 0) ------ > t2[????????????????] +++ killed by SIGSEGV +++ > = Add64(GET:I64(16),0x27C:I64) > PUT(32) = 32Sto64(LDle:I32(t2)) > PUT(184) = 0x1B5BF5A9:I64 > > 0x1B5BF5A9: testl %edx,%edx > > ------ IMark(0x1B5BF5A9, 2, 0) ------ > t5 = 64to32(Segmentation fault > GET:I64(32)) > t4 = 64to32(GET:I64(32)) > t3 = And32(t5,t4) > PUT(144) = 0x13:I64 > PUT(152) = 32Uto64(t3) > PUT(160) = 0x0:I64 > PUT(184) = 0x1B5BF5AB:I64 > > 0x1B5BF5AB: js-8 0x1B5BF5B8 > > ===== > > $ gdb /usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so > > (gdb) x/12i 0xa9597 > 0xa9597: lea -0xcfce(%rip),%rsi # 0x9c5d0 > 0xa959e: mov %rsi,(%rbx,%rdx,8) > 0xa95a2: movslq 0x27c(%rax),%rdx > 0xa95a9: test %edx,%edx > 0xa95ab: js 0xa95b8 > 0xa95ad: lea -0xd144(%rip),%rdi # 0x9c470 > 0xa95b4: mov %rdi,(%rbx,%rdx,8) > 0xa95b8: movslq 0x284(%rax),%rdx > 0xa95bf: test %edx,%edx > 0xa95c1: js 0xa95ce > 0xa95c3: lea -0xd2ba(%rip),%rcx # 0x9c310 > 0xa95ca: mov %rcx,(%rbx,%rdx,8) > > (gdb) x/12i 0xa9597-0x20 > 0xa9577: movslq 0x274(%rax),%edx > 0xa957d: test %edx,%edx > 0xa957f: js 0xa958c > 0xa9581: lea -0xce58(%rip),%rcx # 0x9c730 > 0xa9588: mov %rcx,(%rbx,%rdx,8) > 0xa958c: movslq 0x280(%rax),%rdx > 0xa9593: test %edx,%edx > 0xa9595: js 0xa95a2 > 0xa9597: lea -0xcfce(%rip),%rsi # 0x9c5d0 > 0xa959e: mov %rsi,(%rbx,%rdx,8) > 0xa95a2: movslq 0x27c(%rax),%rdx > 0xa95a9: test %edx,%edx > > ===== I am not familiar with this debugging info before, but basically valgrind expands the instructions by including some memory checks and execute the extended instruction stream (which it calls basic block)? I will tinker with the options to see if I can gain any insight on my end. TIA I thought I posted a followup early this October, but funny, it did not seem to have made it to the bugzilla. OK, here it goes. I still cannot run Thunderbird mail client under valgrind as normal user. I get very mysterious segmentation error of valgrind. *HOWEVER*, to my surprise, when I ran valgrind as a *LOGGED IN SUPERUSER* (|su| does not work. I have to log in as superuser either from login screen or sshing to the local PC box as root.), I can run thunderbird under valgrind. successfully (!) I wonder why. So now, I have been testing thunderbird under valgrind by "ssh"ing to my local PC box after enabling superuser login of ssh daemon (which is usually off for security reasons). Anyone has any idea why this is so? From the look of log trace, the mysterious segmentation error (as normal user) that kills valgrind occurs at the time the stack seems to be extended. (I looked at signal handling and it looks that way.) At that point, valgrind as normal user is killed, but as properly logged in superuser, the stack gets enlarged properly and the thunderbird proceeds fine under valgrind. I mention that Debian GNU/Linux kernel 3.19.5 had no problem running thunderbird under valgrind. However I have no idea how that particular kernel version with the particular kernel config had any looser restriction regarding user process trying to extend stack when it overflows. I thought stack extension is automatic under Debian as long as user sets the stack maximum size reasonably large. I set ulimit to the maximum (no limit). So my processshould be able to extend the stack as normal user under OS management, was what I thought. Does anyone have idea how to enable the reliable stack growth as ordinary user under Debian? Or maybe this is valgrind-specific stack exntesion issue under certain kernel config (including security mechanism?). I have been testing thunderbird under valgrind as superuser for a couple of weeks after long time, but use of ssh and the permission of files created as superuser during the execution of valgrind+thunderbird have made it rather awkward in the debug cycle and I really wish to use valgrind + thunderbird as normal user. Any hints/tips will be appreciated. valgrind is very useful. I already have found two clear-cut uninitialized memory access bugs that have crept in the code over the years. I suspect more to come ... TIA Please run "uname -a" then copy+paste the entire output, and label with the circumstances (3 cases: normal user, logged-in root, su from normal user to root). It's not obvious which kernels are being used today. Created attachment 123636 [details]
uname -a output, and other requested logs.
Thank you for the comment.
I stand corrected. I think valgrind works in the context of |su|
superuser.
It is the thunderbird binary that refuses to run in |su| context.
(I forgot about it and thought it was valgrind which refused to
operate properly in |su| context.)
Requested info is attached.
*1: uname -a output
*2. As a normal user
*3: as super user |su|.
*4 As properly logged in superuser (via ssh)
|