Bug 377006

Summary:	valgrind/memcheck segfaults under certain kernel versions (amd64) but not others.
Product:	[Developer tools] valgrind	Reporter:	zephyrus00jp <ishikawa>
Component:	memcheck	Assignee:	Julian Seward <jseward>
Status:	REPORTED ---
Severity:	normal	CC:	jreiser
Priority:	NOR
Version:	unspecified
Target Milestone:	---
Platform:	Other
OS:	Linux
Latest Commit:		Version Fixed In:
Sentry Crash Report:
Attachments:	debug log under 4.7.0.1 (valgrind crashes) Log for 4.9.x kernel test runs Log for 3.19.5 (sucessful runs for comparison) log of segfault failure under 4.9.x series kernel (debian). with --trace-syscalls=yes A log to show valgrind with --vex-iropt-register-updates=all still segfaults (under 4.7.0.1) A log of valgrind experiencing SIGSEGV under 4.9.x kernel: with full vlgrind options uname -a output, and other requested logs.

Description zephyrus00jp 2017-02-28 08:22:23 UTC

Created attachment 104262 [details]
debug log under 4.7.0.1 (valgrind crashes)

System Debian GNU/Linux: amd64

I am trying to run mozilla's thunderbird which I compile locally under valgrind.
It works rather well.
However, over the past couple of years, I found that valgrind + thunderbird does not work very well under certain Debian-supplied (and my locally created) linux kernel versions.

I think there is a reason for this.
But until now, I have no clue as to what kernel configure options have effect
on this.

I posted the buggy situation in the following mail post: you can see the thread therein.

https://sourceforge.net/p/valgrind/mailman/message/35667483/

I have managed to glean 5 runs under 
- 4.7.0.1 (two core VM image). Failure cases.
- 4.9.x (4.9.6, manually created. four core VM image). Failure cases.
and a couple of runs for
- 3.19.5 (four core VM image). Success cases for comparison.

After five runs, I thought I would seek experts opinion in which direction I should spend time in pursuing the issue.

I am going to upload a few attachments now.
One is for 4.7.0.1 test runs.
The second one is for  4.9.x test runs.
The last  one is for 3.19.5 test runs for comparison.

Here is a quick observation (my wild guess based on what I observed).
I think people in the know can glean more information to refute my guess or
suggest how I can pursue the debugging still further.

You may want to read the post to the above URL first.
Here is the URL again:
https://sourceforge.net/p/valgrind/mailman/message/35667483/

--- comment as of now.
Please remember that valgrind + thunderbird runs just fine under
more or less vanilla 3.19.5 kernel.
But valgrind failed to run under certain dDebian-supplied kernels and
4.9.x series I created.
Valgrind segfaults.

I obtained some logs from failure cases under 4.7.0.1 and 4.9.x,
and a successful cases from 3.19.5 for comparison.

There could be an issue of mmaps layout change and
signal handler setup issue.

Debug story:

I set breakpoint on mozilla thunderbird
to figure out if there is any particular behavior of mozilla
thunderbird that triggers valgrind segmentation error on certaion
linux kernel versions.

For successful cases under 3.19.5 and failed cases under 4.7.0.1,
I could set breakpoint on fork() [that is used to call an external
program to check for graphics adaptor capability.] and
at that point I could dump /proc/$pid/maps in addition to the dump
at breakpoint placed on main().
Under 4.9.x series kernel, valgrind segfaults way before this |fork|
and so I only obtained /proc/$pid/maps only at the breakpoint at
main().

The different dumpings of maps revealed a change near the stack of
valgrind under different kenrel vesion..

I show the excerpts near the end of maps listing.

From 4.9.x series kernel: Failure case
  ...
806203000-806334000 rwxp 00000000 00:00 0 
806af9000-806ce2000 rwxp 00000000 00:00 0 
ffeffe000-fff001000 rw-p 00000000 00:00 0 
7ffd03470000-7ffd03492000 rw-p 00000000 00:00 0                          [stack]
7ffd034ba000-7ffd034bc000 r--p 00000000 00:00 0                          [vvar]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
(gdb) cont



From 4.7.0.1: failure case
     ...
805b29000-805c29000 rwxp 00000000 00:00 0 
8063ee000-8067d7000 rwxp 00000000 00:00 0 
ffeffe000-fff001000 rw-p 00000000 00:00 0 
7ffcbe4d1000-7ffcbe4f3000 rw-p 00000000 00:00 0                          [stack]
7ffcbe5ee000-7ffcbe5f0000 r--p 00000000 00:00 0                          [vvar]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

From: 3.19.5: success case:

Note that the stack of the first process that runs is at an earlier
map [stack:9283]
...
802001000-802cdc000 rwxp 00000000 00:00 0 
802d8c000-802ea8000 rwxp 00000000 00:00 0 
802ea8000-802eaa000 ---p 00000000 00:00 0 
802eaa000-802faa000 rwxp 00000000 00:00 0                                [stack:9283]
802faa000-802fac000 ---p 00000000 00:00 0 
802fac000-802fad000 rw-s 00000000 08:18 482                              /tmp/vgdb-pipe-shared-mem-vgdb-9283-by-ishikawa-on-???
802fad000-802fcd000 rwxp 00000000 00:00 0 
803081000-8034c4000 rwxp 00000000 00:00 0 
80356a000-805649000 rwxp 00000000 00:00 0 
805749000-805849000 rwxp 00000000 00:00 0 
805c0e000-805d0e000 rwxp 00000000 00:00 0 
805e0e000-806019000 rwxp 00000000 00:00 0 
806203000-806334000 rwxp 00000000 00:00 0 
806af9000-806ce2000 rwxp 00000000 00:00 0 
ffeffe000-fff001000 rw-p 00000000 00:00 0 
7fffe6088000-7fffe60aa000 rw-p 00000000 00:00 0 
7fffe616b000-7fffe616d000 r--p 00000000 00:00 0                          [vvar]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
(gdb) cont


The above shows the maps layout change I noticed.

Timing/race issue: Tough.

I have learned while debugging this way under linux 4.7.0.1 kernel
that the issue is timing-dependent and is racey (!).  The problem did
not occur when I stepped in the gdb using "s" and "fun". thunderbird
ran fine under valgrind then. Ugh...  (valgrind + thunderbird
segfaults if I let it run at full speed without attaching gdb to the
interpreted thunderbird.)

So there is a very small chance to narrow the area of search of
particular behavior of the thunderbird.  I doubt if we could narrow it
to a single line of code of mozilla thunderbird.

My current thinking: signal handler and mmap layout change?

Rather I think it is the handling of signals such as virtual timer
event that seems to be used often inside C-C TB that seems to mess
with valgrind under certain kernels.

The reasoning: while I see many signal handling dump in the successful
execution of valgrind + mozilla thunderbird under 3.19.5 kernel,
I don't see them in the failure cases (segfault) under 4.7.0.1 and 4.9
series kernels.
I suspect that the lack of signal handling dump in failed cases
suggests that there is a timing window where signal handler is not
quite well set up (inside valgrind?) when a memory error is detected.
(I still am trying to grapple with the situation where
valgrind fails to properly catches sigsegv condition.
Is the signal handler not properly set up in those cases?

OTOH, valgrind seems to be capable of enlarging stack based on memory
violation. OTOH, the final sigsegv that failed valgrind is not handled.

[I took the following snippets out from 4-9-x series testing's log2.]

Successful Stack fault handling
1st case:
--3174-- SIGSEGV: si_code=1 faultaddr=0xffeffd6a8 tid=1 ESP=0xffeffd6a0 seg=0xffe7b0000-0xffeffdfff
--3174--        -> extended stack base to 0xffeffd000

2nd case:
--3174-- SIGSEGV: si_code=1 faultaddr=0xffeffc678 tid=1 ESP=0xffeffc670 seg=0xffe7b0000-0xffeffcfff
--3174--        -> extended stack base to 0xffeffc000
--3174-- REDIR: 0x5b79750 (libc.so.6:bcmp) redirected to 0x4a26742 (_vgnU_ifunc_wrapper)

final error in the 4.9.x series: 

gettid()                                = 3174
mmap(0x803041000, 16384, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x803041000
getpid()                                = 3174
write(1027, "--3174-- REDIR: 0x52e68b0 (libst"..., 115) = 115
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xffeffa24c} ---
+++ killed by SIGSEGV +++


Wild guess: Is it possible that due to mmap layout change(s),
the stack fault is reported as

--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xffeffa24c} ---

and valgrind did not notice that it was running out of stack or heap
or whatever which valgrind could hanlde either by extending it or printing
awarning and stopping, etc.? Maybe since it did not recognize the SIGSEGV
as such and simply threw out the error without catching it? (See
random observation [2]).

I am saying this since si_addr=0xffeffa24c reported is rathe near the
stack address reported in the stackfault above.

Anyway, I appreciate if I can get some feedback.

Random observation:

[1]. The error address in SIGSEGV seems to be repeatable under 4.9.x: this
may be due to the fact that the error occurs even before the control
reaches the breakpoint at |fork| call. It occurs before any
interaction.  OTOH, under 4.7.0.1, the error seems to occur around
|fork| call where I set a breakpoint. There *IS* a timing issue
involved. And the address reported is slightly different across the
log files. (It relates to the address on the stack then?)

[2]. Valgrind uses SIGSEGV for handling stack overflow.

Running valgrind under gdb gave me some clues on this.

I ran valgrind under gdb once under 4.9.x serie kernel.
(See log5-main.txt log file.)

gdb reports SIGSEGV three times when valgrind crashes.
Actually the first two SIGSEGV crashes are the ones that valgrind uses
internally for checking stack fault. valgrind adjusts stack
accordingly and chugs along happily.
Previously, I did not notice this and when I saw SIGSEGV reported by
GDB, I gave up.
But this time, I wised up and "c"ontinue and voila!
The first two times, valgrind continued.

Only at the third SIGSEGV, valgrind crashed.

[3] Testing after a reboot.

I cheated a bit. I did not reboot 4.9.x kernel every time I tested
valgrind+thunderbird combination.
Under linux, each process should not be disturbed in such a
manner as to making debugging user programs unrepeatable just because
a few independent processes ran before it.

Well, that is the principle.

I know that there have been cases where linux kernel, especially
fork()-related handling left a few dormat bugs for a long time. That
is why I had a short-lived excitement when I noticed that the segfault
occurred near fork call under 4.7.0.1. I thought I hit upon a linux
kernel bug or similar bug in valgrind. [I definitely think signal
handler setup/release have something to do due to this observation,
too.]

The fact is that there is not a single thing I can do to run
X-based application under Debian WITHOUT RUNNING A COUPLE OF THOUSAND
PROCESSES before I get a chance to login from the login manager and
under X desktop. I found that the PID started over 2000 when I tried
rebooting and began testing, and I am not even sure if it is a 
rounded PID or not (module 2^16).
So "running test afresh" is almost impossible under today's linux if one
wants to test an X-based application.

TIA

Comment 1 zephyrus00jp 2017-02-28 08:26:54 UTC

Created attachment 104263 [details]
Log for 4.9.x kernel test runs

Log of test runs under 4.9.x kernel.

Comment 2 zephyrus00jp 2017-02-28 08:28:46 UTC

Created attachment 104264 [details]
Log for 3.19.5 (sucessful runs for comparison)

I needed to remove the largest log which contains a much longer strace dump
during a success run to avoid upload limit.

I will explain the files in more detail later on.

Comment 3 zephyrus00jp 2017-02-28 10:09:00 UTC

Description of log directory archive.

[1] Description of log files under 4.7.0.1

There are a group of files. (N being 2,3,4,5)
- logN-main.txt,
- logN-gdb-dddd.txt,
- logN-gdb-dddd'.txt
- logN-stra.out.gz
- logN-crash.txt

I ran ~/bin/run-valgrind-test.sh which contains the following command

strace -f -o /tmp/stra.out valgrind --verbose --trace-signals=yes --show-mismatched-frees=no --vgdb=yes --vgdb-error=0 --trace-children=yes ~/objdir-tb3/dist/bin/thunderbird-bin

strace is used to capture what system calls valgrind was issueing.

I used -vgdb=yes, and --vgdb-error=0 to attach to the thunderbird binary that
is monitied by thunderbird binary.
It turns out that thunderbird binary runs an external program (by fork()) about the time it segfaulted and so I added --tracechildren=yes.
(However, I found that the crash occurs on the original process).

Anyway, logN-main.txt recorded the log output from the above command line.
One of the logN-dddd.txt and logN-dddd'.txt contains the first instance of gdb that is attached to the thunderbird process under valgrind/memcheck.
I set a few breakpoints at main and at fork, and let it run.
I dumped the mmap layout by printing /proc/PID/maps (PID being the process id of the thunderbird process).

logN-stra.out.gz contains the recoding from strace.
logN-crash.txt contains the last part of the strace output (logN-stra.out.gz) and a few comments regarding the particular test run.

Under the particular version of 4.7.0.1 (which is Debian-supplied), valgrind invokes another process by fork(), and when it happens, valgrind suggests that I attach another gdb to the sibling process. (--vgdb=yes --vgdb=error), and I do so and run the sibling process under gdb.
However, the sibling does not seem to be the culprit.
But the original parent process or rather the original valgrind that runs the
original thunderbird binary seemed to crash. (But note the comment about
gdb showing three SIGSEGV and the first two (2) are for stack fault and handled by valgrind, and only the third one being the fatal SIGSEGV later in this post. (In the discussion of 4-9-x-dir/log5-main.txt )

Anyway, you can see that valgrind experiences SIGSEGV eventually.

I did not include the case where I stepped through the parent valgrind+thunderbird execution by stepping using "c" for continue and "fun" for function level trace. In one case, the crash did not happen and thunderbird ran to completion without experiencing the crash. So the issue is timing-related and race condition exists somewhere in the code (on the valgrind side as well.)

[2] log under 4.9.x kernel test runs

There are files (N=1,2,3,4)
 - logN-main.txt
 - logN-error.txt
 - logN-stra.out.dddd.gz
 - logN-gdb-ddd.txt

logN-main.txt contains the log/print from running the following command line:
strace -ff -o /tmp/stra.out valgrind --verbose --trace-signals=yes --show-mismatched-frees=no --vgdb=yes --vgdb-error=0 --trace-children=yes /NREF-COMM-CENTRAL/objdir-tb3/dist/bin/thunderbird-bin

(Sorry, this is on a different machine with different file layout. this is on 4 core machine.)


logN-gdb-ddd.txt: logN-gdb-ddd.txt contains the log from the attaching gdb.
Since the SIGSEGV occurs way before the child process is forked(), there is only one log from the instance of gdb that is attached to the thunderbird binary monitored under valgrind/memcheck.

As in the case of kernel 4.7.0.1, I set breakpoint on main, fork and in a few cases, dlopen, but under kernel 4.7.0.1, valgrind crashes way before
fork is called and thus I only dumped /proc/PID/maps at the breakpoint placed at the entry of main().

logN-stra.out.dddd.gz: this contains the strace output.

logN-error.txt: 
(I am afraid I emptied log2-error.txt by mistake....)

This contains the last part of logN-stra.out.ddd.gz and a comment.
For example, log1-error.txt contains this.
--- begin quote
valgrind crashed even before I had chance to place a breakpoint, i.e.,
before --vgdb-error=0 took effect!?

gettid()                                = 2851
mmap(0x803041000, 16384, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x803041000
getpid()                                = 2851
write(1027, "--2851-- REDIR: 0x52e68b0 (libst"..., 115) = 115
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xffeffa24c} ---
+++ killed by SIGSEGV +++
--- end quote

So the situation is highly timing dependent / racy.

log5-main.txt: this is very interesting.
I ran valgrind under gdb.
And then issued to gdb
run --verbose --trace-signals=yes --show-mismatched-frees=no --vgdb=yes --vgdb-error=0 --trace-children=yes /NREF-COMM-CENTRAL/objdir-tb3/dist/bin/thunderbird-bin

With this log, you can see that valgrind DOES experience SIGSEGV, but
it uses SIGSEGV internally to catch stack over(?)/under(?) flow and adjust the stack and continue.
(I came to recognize this after I saw the signal tracking log lines from valgrind.)
Previously, as soon as I saw SIGSEGV, I gave up. (And the stack trace is that of signal handler for SIGSEGV (???), and it is not printed as in the usual case???. That prompted my giving up also.)
In this log file, log5-main.txt, 
for the first two SIGSEGV cases, I dumped stack and register (info register), and then continued with "c" for cont.
To my surprise, valgrind happily continued.
Only at the third SIGSEGV, valgrind crashes.

This may give us some clue on the nature of the crash?

[3] 3.19.5 log from successful runs.

This was captured under 3.19.5 kernel.
This shows the success runs and what messages ought to be printed by successful valgrind run.


There are files (N=1,2,3)
logN-main.txt: co
I ran the following command.
strace -ff -o /tmp/stra.out valgrind --verbose --trace-signals=yes --show-mismatched-frees=no --vgdb=yes --vgdb-error=0 --trace-children=yes /NREF-COMM-CENTRAL/objdir-tb3/dist/bin/thunderbird-bin

Under the kernel 3.19.5 which I used, valgrind does not crash,
and thus the initial run of log1-main.txt
had to be interrupted midway because the strace output became too large: it had to be removed from
the upload, too.

But for the two subsequent runs, log2-main.txt and log3-main.txt, I exited thunderbird runs
after the initial profile selection menu comes up: I simply exited from there.
By that time, though, the sibling process to probe the graphics adaptor was invoked and exited successfully.

logN-gdb-ddd.txt, logN-gdb-ddd'.txt:
I attach gdb to the initial thunderbird process and the subsequent sibling process that is forked from the original process.

logN-stra.out.ddd.gz: the strace output for the initial valgrind process.
There are only two of them:
      3-19-5-dir/log2-stra.out.10447.gz
      3-19-5-dir/log3-stra.out.10973.gz

For the first test run, it became too large since I followed the
execution until initial windows came up. I had to drop it before uploading.

The immediate change you will notice is the slight change of mmap.

Somehow the initial process's stack is NOT at the end, but somewhere earlier.

The successful case shows many signal handling dump.
I wondery why we do not see them in the failed cases.
That may be related to the root cause(s) of the crash.


Hope this helps someone in figuring out the true nature of the bug
or in suggesting some tips for further debugging on my end.

TIA

Comment 4 zephyrus00jp 2017-03-07 02:12:13 UTC

Created attachment 104424 [details]
log of segfault failure under 4.9.x series kernel (debian). with --trace-syscalls=yes


This is under 4.9.x.

The log file by running a shell script that contains the following command
with
  bash -vx the_script_file 2>&1 | tee /tmp/t-log

strace -ff valgrind --verbose --trace-syscalls=yes --trace-signals=yes --show-mismatched-frees=no --trace-children=yes /NREF-COMM-CENTRAL/objdir-tb3/dist/bin/thunderbird-bin

"--trace-syscalls" ought to show the syscalls executed by the traced program, namely, thunderbird-binary.
I see the message about such syscalls written by write(1027, ...) by valgrind.

Because of the way the log was captured, the trace of |write| by strace
is immediately followed by the string written by |write|.


In any case, the following last portion of the log suggests to me that
it is indeed valgrind that is causing the mysterious crash. Isn't it?

--- begin quote
...
write(1027, "SYSCALL[3810,1](202) ... [async]"..., 49SYSCALL[3810,1](202) ... [async] --> Success(0x0)) = 49
write(1027, " ", 1 )                     = 1
write(1027, "\n", 1
)                    = 1
mmap(0x802fb0000, 196608, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x802fb0000
munmap(0x802f90000, 131072)             = 0
gettid()                                = 3810
write(1029, "Q", 1)                     = 1
gettid()                                = 3810
read(1028, "Q", 1)                      = 1
gettid()                                = 3810
rt_sigprocmask(SIG_SETMASK, ~[], ~[KILL STOP], 8) = 0
rt_sigtimedwait(~[], 0x802f8de20, {tv_sec=0, tv_nsec=0}, 8) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigprocmask(SIG_SETMASK, ~[KILL STOP], NULL, 8) = 0
gettid()                                = 3810
mmap(0x802d3d000, 16384, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x802d3d000
getpid()                                = 3810
write(1027, "--3810-- REDIR: 0x52e68b0 (libst"..., 115--3810-- REDIR: 0x52e68b0 (libstdc++.so.6:operator delete(void*)) redirected to 0x4c2d19c (operator delete(void*))
) = 115
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xffeffa22c} ---
+++ killed by SIGSEGV +++
/home/ishikawa/bin/run-valgrind-thunderbird.sh: line 13:  3808 Segmentation fault      strace -ff valgrind --verbose --trace-syscalls=yes --trace-signals=yes --show-mismatched-frees=no --trace-children=yes /NREF-COMM-CENTRAL/objdir-tb3/dist/bin/thunderbird-bin

--- end quote 

The address reported is slightly changed: I think it is due to the slight 
change in the source code..
final error in the 4.9.x series: 

The following is the old address reported in sigsegv.

gettid()                                = 3174
mmap(0x803041000, 16384, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x803041000
getpid()                                = 3174
write(1027, "--3174-- REDIR: 0x52e68b0 (libst"..., 115) = 115
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xffeffa24c} ---
+++ killed by SIGSEGV +++


The address in SIGSEGV is interesting:
From the old log:
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xffeffa24c} ---

From the current log:
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xffeffa22c} ---


It is BELOW the last (anonymous) map (reported in the old log)
--- begin
From 4.9.x series kernel: Failure case
  ...
806203000-806334000 rwxp 00000000 00:00 0 
806af9000-806ce2000 rwxp 00000000 00:00 0   <-- the si_addr is after this.
ffeffe000-fff001000 rw-p 00000000 00:00 0   <---the si_addr is before this. 
7ffd03470000-7ffd03492000 rw-p 00000000 00:00 0                          [stack]
7ffd034ba000-7ffd034bc000 r--p 00000000 00:00 0                          [vvar]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
(gdb) cont
--- end

Hope this helps.

Comment 5 John Reiser 2017-03-07 05:00:54 UTC

Apparently this user has been in a very similar situation before:
     https://bugs.kde.org/show_bug.cgi?id=345414#c3
which is only 6 months ago (September, 2016).

I wonder if the user has followed Julian's advice from bug #345414 Comment 3:
=====
In your mozconfig, you need these

  ac_add_options --disable-jemalloc
  ac_add_options --enable-valgrind

and when you run Valgrind, you need this

  --vex-iropt-register-updates=allregs-at-mem-access
=====

Comment 6 zephyrus00jp 2017-03-07 07:04:21 UTC

Created attachment 104426 [details]
A log to show valgrind with --vex-iropt-register-updates=all still segfaults (under 4.7.0.1)

Sorry I was not specific.

I DID follow Julien's advice in the last several months and still no luck.

The following options for creating Thunderbird are in my mozconfig
file (on two PCs where I test.  

>  ac_add_options --disable-jemalloc
>  ac_add_options --enable-valgrind

[And I don't see unhandled syscall: 317 before segfault
in recent logs as you can verify. At least, I didn't recall seeing this for the last few months.]

Re the following option:

>  --vex-iropt-register-updates=allregs-at-mem-access

This has been added to the valgrind option when I ran 
|make mozmill| test suite under one of the test machines, and
still no luck. Even with this and a few other options Julian suggested, the combination valgrind+thunderbird runs under 3.19.5 and
segfaults under 4.7.0.1 and 4.9.x series.
Actually, this option does not make a difference as far as I can tell :-(

> https://bugs.kde.org/show_bug.cgi?id=345414#c3

Yes, that thread was reported by me almost two years ago.

Back then 4.y series kernel was not available for Debian (it was only
in testing repository). But it is now. And I want to use the later
kernel versions for obvious reasons.

I have to emphasize that the bug still stands with the suggested option.

I am attaching the segfault case when valgrind is run with the following parameter under kernel 4.7.0.1 (Debian's distribution). Note the addition of
--vex-iropt-register-updates=allregs-at-mem-access.
(It does not make a difference. valgrind+thunderbird still
segfaults. Sorry I was not specific enough about this in my original
report. I did not want to clutter the bug report with the options that
do not seem to have effect.)

strace -ff valgrind --verbose --trace-syscalls=yes --trrace-signals=yes --show-mismatched-frees=no --trace-children=yes --vex-iropt-register-updates=allregs-at-mem-access ~ishikawa/objdir-tb3/dist/bin/thunderbird-bbin 

On this machine with this kernel, original valgrind+thunderbird segfaults AFTER a child process spawned by thunderbird finishes.
On another machine with 4.9.x kernel, valgrind+thunderbird segfaults way before the child process fork/exec happens.

It is really frustrating to see the combination of valgrind+thunderbird work only under certain kernel revisions (in my case, 3.19.5) as noted in https://bugs.kde.org/show_bug.cgi?id=345414#c6

With a slim hope of success, I tried to use the old kernel config for 3.19.5 to
create 4.9.z kernel (using make oldconfig ), but valgrind+thunderbird still segfaults under the resulting kernel. (That was on a different PC.)

It would be great to find out TO WHERE (if it is meaningful) the stray pointer reported in SIGSEGV points.

I think the routine to report the symbols that get mapped to anonymous
maps area as viewed by valgrind can be very useful for this.

I have a few pet theories for possible issues:

1. Given that when the fatal SIGSEGV is received, the stack trace
seems to messed up somehow (they seem to be too low address in
comparison to other values), I suspect that it may be that valgrind is
experiencing a segfault in the code to set up signal handlers
including SIGSEGV.  (There were some races in linux kernel regarding
some signal issues before after a fork(). Maybe valgrind code inherits similar
problems. But do note that under 4.9.x, the valgrind+thunderbird combination  still crashes BEFORE fork() is reached. So there may be multpile issues here.

2. I am not sure how valgrind handles this, but, given the different mmap layout, I wonder if the malloc routine in valgrind may assign an area as a return value of malloc() which is at the end of sbrk()'ed area. If so, what happens if multiple-bytes access by x86_64 for strcmp, etc. goes beyond the sbrk() value during operation. That is what happens, that eager access to speed up operation by reading extra bytes (8 or 16 octes) and if the tail part of the octet array falls beyond the user's valid vm address are. Does it get caught as SIGSEGV? Or is such an access checked byte by byte before such an illegal access is attemped?

If the reason for the segfault is not one of the above,
I am at my wit's end.

TIA

Comment 7 zephyrus00jp 2017-03-08 06:05:51 UTC

To avoid the uncertainty about how mozilla thunderbird is compiled for valgrind testing, I will create a binary (for valgrind testing) on mozilla's tryserver.
(You an see that what patches are applied via the web interface.)

I figure it would be easier this way after I thought about detailing the
manual procedure to compile comm-central thunderbird under linux for valgrind, etc. There are some patches, etc. that have to be applied using MERCURIUM hg command, and other steps.

Having a visible binary which has undergone standard testing is much better.


Unfortunately, the build is currently busted on the tryserver due to 
a very recent change in the code that is part of mozilla firefox which thunderbird shares.
I hope it will be fixed in a few days. Then I can create a binary on the tryserver which can be used for testing this issue under linux.

TIA

Comment 8 zephyrus00jp 2017-03-09 13:13:11 UTC

Created attachment 104470 [details]
A log of valgrind experiencing SIGSEGV under 4.9.x kernel: with full vlgrind options

This log is obtained by running
valgrind under a wrapper called run-valgrind (of my creation).
This wrapper runs valgrind with suitable parameters to monitor thunderbird binary as follows.

valgrind --trace-children=yes --fair-sched=yes --smc-check=all-non-file --gen-suppressions=all --vex-iropt-register-updates=allregs-at-mem-access --child-silent-after-fork=yes --trace-children-skip=/usr/bin/hg,/bin/rm,*/bin/certutil,*/bin/pk12util,*/bin/ssltunnel,*/bin/uname,*/bin/which,*/bin/ps,*/bin/grep,*/bin/java --num-transtab-sectors=24 --tool=memcheck --freelist-vol=500000000 --redzone-size=128 --px-default=allregs-at-mem-access --px-file-backed=unwindregs-at-mem-access --malloc-fill=0xA5 --free-fill=0xC3 --num-callers=50 --suppressions=$HOME/Dropbox/myown.sup --show-mismatched-frees=no --show-possibly-lost=no --read-inline-info=yes --trace-syscalls=yes	 /NREF-COMM-CENTRAL/objdir-tb3/dist/bin/thunderbird-bin


The process IDs that appear are:
pid 28538: the binary wrapper run-valgrind

pid 28539: from the wrapper valgrind is executed by calling system() library call.
[pid 28539] execve("/bin/sh", ["sh", "-c", "valgrind --trace-children=yes --"...], [/* 65 vars */]) = 0

pid 28540: from the spawned shell valgrind is run
[pid 28540] execve("/usr/local/bin/valgrind", ["valgrind", "--trace-children=yes", "--fair-sched=yes", "--smc-check=all-non-file", "--gen-suppressions=all", "--vex-iropt-register-updates=all"..., "--child-silent-after-fork=yes", "--trace-children-skip=/usr/bin/h"..., "--num-transtab-sectors=24", "--tool=memcheck", "--freelist-vol=500000000", "--redzone-size=128", "--px-default=allregs-at-mem-acce"..., "--px-file-backed=unwindregs-at-m"..., "--malloc-fill=0xA5", "--free-fill=0xC3", "--num-callers=50", "--suppressions=/home/ishikawa/Dr"..., "--show-mismatched-frees=no", "--show-possibly-lost=no", "--read-inline-info=yes", "--trace-syscalls=yes", "/NREF-COMM-CENTRAL/objdir-tb3/di"...], [/* 65 vars */] <unfinished ...>

Near the end of the log, we see the following two lines:

[pid 28540] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xffeffa1fc} ---
[pid 28540] +++ killed by SIGSEGV +++

So it is pid 28540 that gets killed by SIGSEGV: pid 28540 is for /usr/local/bin/valgrind thus invoked.

There are two PCs which I cannot easily access all the time.
Sorry for slightly different file layout on two PCs.

TIA

PS: I am trying to create a valgrind-enabled thunderbird binary on mozilla's compilation farm. But I have hit some unexpected difficulties for now.
(The particular CPU farm does NOT have valgrind headers. Ugh.)
I am trying to have it installed. Stay tuned.

The series of patches AND parameters that I use to create comm-central thunderbird is captured in a few shell scripts, but they are rather complex and need customization: I am not quire sure if I can simply post the shell script and people to modify it to their environment. I may not be available for consultation always. So I thought creating a binary on a shared computer is much better.

Comment 9 John Reiser 2017-03-13 00:36:22 UTC

It looks to me like some part of the problem arises when memcheck is working on the driver for the video graphics card.  This suggests a cause for non-determinism, and also a reason for different behavior on different Linux kernels.  At various times over the last few years, different parts of the driver have moved between kernel space and user space.  So one strategy to avoid SIGSEGV might be to choose a video driver that is as simple as possible; probably this is "VGA framebuffer" (which does exist, but I don't know its actual name.)

Another source of non-determinism is the use of threads.  I usually see two threads.  One of them gets the SIGSEGV, then the other terminates "normally".

I ran this group of sessions on :
Linux deb81p64 4.9.0-1-amd64 #1 SMP Debian 4.9.6-3 (2017-01-28) x86_64 GNU/Linux
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK208 [GeForce GT 710B] [10de:128b] (rev a1)
libdrm-nouveau2:amd64  2.4.74-1
Gnome 1:3.20+3 desktop environment
Thunderbird is icedove-45.7.1 re-built from source in the usual Debian way.
THUNDERBIRD_BIN=icedove-45.7.1/obj-thunderbird/dist/bin/thunderbird-bin

After building valgrind from current SVN, I modified vg-in-place so that the last command is
     strace -i -e signal=SIGSEGV -e trace=file,memory "$vgbasedir/coregrind/valgrind" --run-libc-freeres=no --trace-flags=10000000 --trace-notbelow=22081 --trace-syscalls=yes $THUNDERBIRD_BIN  >foo 2>&1

and then experimented with --trace-notbelow until I got close to just before the killing SIGSEGV.  [The number of basic blocks varied from run to run, which I attribute to non-determinism.]  The last two basic blocks are below.  You can see the SIGSEGV in the middle of the last block.

I saw the SIGSEGV on every run, usually in about 20 seconds of real time on Intel Core 2 Duo @ 3GHz.

=====

GuestBytes 1B5BF56B 22  48 8D 3D 1E 33 FF FF 48 89 3C D3 48 63 90 74 02 00 00 85 D2 78 0B  00A1B31B

VexExpansionRatio 22 363   165 :10

==== SB 23786 (evchecks 13409019) [tid 1] 0x1b5bf581 UNKNOWN_FUNCTION /usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so+0xa9581

------------------------ Front end ------------------------

        0x1B5BF581:  leaq -52824(%rip), %rcx

              ------ IMark(0x1B5BF581, 7, 0) ------
              t0 = Add64(0x1B5BF588:I64,0xFFFFFFFFFFFF31A8:I64)
              PUT(24) = t0
              PUT(184) = 0x1B5BF588:I64

        0x1B5BF588:  movq %rcx,(%rbx,%rdx,8)

              ------ IMark(0x1B5BF588, 4, 0) ------
              t1 = Add64(GET:I64(40),Shl64(GET:I64(32),0x3:I8))
              STle(t1) = GET:I64(24)
              PUT(184) = 0x1B5BF58C:I64

        0x1B5BF58C:  movslq 640(%rax),%rdx

              ------ IMark(0x1B5BF58C, 7, 0) ------
              t2 = Add64(GET:I64(16),0x280:I64)
              PUT(32) = 32Sto64(LDle:I32(t2))
              PUT(184) = 0x1B5BF593:I64

        0x1B5BF593:  testl %edx,%edx

              ------ IMark(0x1B5BF593, 2, 0) ------
              t5 = 64to32(GET:I64(32))
              t4 = 64to32(GET:I64(32))
              t3 = And32(t5,t4)
              PUT(144) = 0x13:I64
              PUT(152) = 32Uto64(t3)
              PUT(160) = 0x0:I64
              PUT(184) = 0x1B5BF595:I64

        0x1B5BF595:  js-8 0x1B5BF5A2

              ------ IMark(0x1B5BF595, 2, 0) ------
              if (64to1(amd64g_calculate_condition[mcx=0x13]{0x3817bec0}(0x8:I64,GET:I64(144),GET:I64(152),GET:I64(160),GET:I64(168)):I64)) { PUT(184) = 0x1B5BF5A2:I64; exit-Boring }
              PUT(184) = 0x1B5BF597:I64
              PUT(184) = GET:I64(184); exit-Boring

GuestBytes 1B5BF581 22  48 8D 0D A8 31 FF FF 48 89 0C D3 48 63 90 80 02 00 00 85 D2 78 0B  03FEC91B

VexExpansionRatio 22 363   165 :10

==== SB 23787 (evchecks 13409020) [tid 1] 0x1b5bf597 UNKNOWN_FUNCTION /usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so+0xa9597

------------------------ Front end ------------------------

        0x1B5BF597:  leaq -53198(%rip), %rsi

              ------ IMark(0x1B5BF597, 7, 0) ------
              t0 = Add64(0x1B5BF59E:I64,0xFFFFFFFFFFFF3032:I64)
              PUT(64) = t0
              PUT(184) = 0x1B5BF59E:I64

        0x1B5BF59E:  movq %rsi,(%rbx,%rdx,8)

              ------ IMark(0x1B5BF59E, 4, 0) ------
              t1 = Add64(GET:I64(40),Shl64(GET:I64(32),0x3:I8))
              STle(t1) = GET:I64(64)
              PUT(184) = 0x1B5BF5A2:I64

        0x1B5BF5A2:  movslq 636(%rax),%rdx

              ------ IMark(0x1B5BF5A2, 7, 0) ------
              t2[????????????????] +++ killed by SIGSEGV +++
 = Add64(GET:I64(16),0x27C:I64)
              PUT(32) = 32Sto64(LDle:I32(t2))
              PUT(184) = 0x1B5BF5A9:I64

        0x1B5BF5A9:  testl %edx,%edx

              ------ IMark(0x1B5BF5A9, 2, 0) ------
              t5 = 64to32(Segmentation fault
GET:I64(32))
              t4 = 64to32(GET:I64(32))
              t3 = And32(t5,t4)
              PUT(144) = 0x13:I64
              PUT(152) = 32Uto64(t3)
              PUT(160) = 0x0:I64
              PUT(184) = 0x1B5BF5AB:I64

        0x1B5BF5AB:  js-8 0x1B5BF5B8

=====

$ gdb /usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so

(gdb) x/12i 0xa9597
   0xa9597:	lea    -0xcfce(%rip),%rsi        # 0x9c5d0
   0xa959e:	mov    %rsi,(%rbx,%rdx,8)
   0xa95a2:	movslq 0x27c(%rax),%rdx
   0xa95a9:	test   %edx,%edx
   0xa95ab:	js     0xa95b8
   0xa95ad:	lea    -0xd144(%rip),%rdi        # 0x9c470
   0xa95b4:	mov    %rdi,(%rbx,%rdx,8)
   0xa95b8:	movslq 0x284(%rax),%rdx
   0xa95bf:	test   %edx,%edx
   0xa95c1:	js     0xa95ce
   0xa95c3:	lea    -0xd2ba(%rip),%rcx        # 0x9c310
   0xa95ca:	mov    %rcx,(%rbx,%rdx,8)

(gdb) x/12i 0xa9597-0x20
   0xa9577:	movslq 0x274(%rax),%edx
   0xa957d:	test   %edx,%edx
   0xa957f:	js     0xa958c
   0xa9581:	lea    -0xce58(%rip),%rcx        # 0x9c730
   0xa9588:	mov    %rcx,(%rbx,%rdx,8)
   0xa958c:	movslq 0x280(%rax),%rdx
   0xa9593:	test   %edx,%edx
   0xa9595:	js     0xa95a2
   0xa9597:	lea    -0xcfce(%rip),%rsi        # 0x9c5d0
   0xa959e:	mov    %rsi,(%rbx,%rdx,8)
   0xa95a2:	movslq 0x27c(%rax),%rdx
   0xa95a9:	test   %edx,%edx

=====

Comment 10 zephyrus00jp 2017-03-14 07:08:31 UTC

(In reply to John Reiser from comment #9)
> It looks to me like some part of the problem arises when memcheck is working
> on the driver for the video graphics card.  This suggests a cause for
> non-determinism, and also a reason for different behavior on different Linux
> kernels.  At various times over the last few years, different parts of the
> driver have moved between kernel space and user space.  So one strategy to
> avoid SIGSEGV might be to choose a video driver that is as simple as
> possible; probably this is "VGA framebuffer" (which does exist, but I don't
> know its actual name.)

Thank you for taking your time to look into the issue.
I believe that SIGSEGV issue is now being reproduced in a non-deterministic manner on your PC.

> 
> Another source of non-determinism is the use of threads.  I usually see two
> threads.  One of them gets the SIGSEGV, then the other terminates "normally".

OK, I think I will add the fair-schedulling option to valgrind to see if it 
makes any difference.

> I ran this group of sessions on :
> Linux deb81p64 4.9.0-1-amd64 #1 SMP Debian 4.9.6-3 (2017-01-28) x86_64
> GNU/Linux
> 01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK208 [GeForce
> GT 710B] [10de:128b] (rev a1)
> libdrm-nouveau2:amd64  2.4.74-1
> Gnome 1:3.20+3 desktop environment
> Thunderbird is icedove-45.7.1 re-built from source in the usual Debian way.
> THUNDERBIRD_BIN=icedove-45.7.1/obj-thunderbird/dist/bin/thunderbird-bin
> 
> After building valgrind from current SVN, I modified vg-in-place so that the
> last command is
>      strace -i -e signal=SIGSEGV -e trace=file,memory
> "$vgbasedir/coregrind/valgrind" --run-libc-freeres=no --trace-flags=10000000
> --trace-notbelow=22081 --trace-syscalls=yes $THUNDERBIRD_BIN  >foo 2>&1
> 
> and then experimented with --trace-notbelow until I got close to just before
> the killing SIGSEGV.  [The number of basic blocks varied from run to run,
> which I attribute to non-determinism.]  The last two basic blocks are below.
> You can see the SIGSEGV in the middle of the last block.

I have never done the detailed debugging of valgrind/memcheck at this level.

Do you think that the SIGSEGV somehow occurs in the valgrind code
anyhow?
(I have a feeling that it may be due to a broken emulation of a very complex instruction that *may* involve certain context-level information.: the current emulation may not be 
quite well protected/complete  in terms of atomicity or something like that. Pure guess.
But otherwise, I cannot explain valgrind failure to report nice memory error(s) on its own.)

> 
> I saw the SIGSEGV on every run, usually in about 20 seconds of real time on
> Intel Core 2 Duo @ 3GHz.
> 
Actually, I once tried to allocate only a single core to see if the problem symptom changed, but no luck. Still the same SIGSEGV.

> =====
> 
> GuestBytes 1B5BF56B 22  48 8D 3D 1E 33 FF FF 48 89 3C D3 48 63 90 74 02 00
> 00 85 D2 78 0B  00A1B31B
> 
> VexExpansionRatio 22 363   165 :10
> 
> ==== SB 23786 (evchecks 13409019) [tid 1] 0x1b5bf581 UNKNOWN_FUNCTION
> /usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so+0xa9581
> 
> ------------------------ Front end ------------------------
> 
>         0x1B5BF581:  leaq -52824(%rip), %rcx
> 
>               ------ IMark(0x1B5BF581, 7, 0) ------
>               t0 = Add64(0x1B5BF588:I64,0xFFFFFFFFFFFF31A8:I64)
>               PUT(24) = t0
>               PUT(184) = 0x1B5BF588:I64
> 
>         0x1B5BF588:  movq %rcx,(%rbx,%rdx,8)
> 
>               ------ IMark(0x1B5BF588, 4, 0) ------
>               t1 = Add64(GET:I64(40),Shl64(GET:I64(32),0x3:I8))
>               STle(t1) = GET:I64(24)
>               PUT(184) = 0x1B5BF58C:I64
> 
>         0x1B5BF58C:  movslq 640(%rax),%rdx
> 
>               ------ IMark(0x1B5BF58C, 7, 0) ------
>               t2 = Add64(GET:I64(16),0x280:I64)
>               PUT(32) = 32Sto64(LDle:I32(t2))
>               PUT(184) = 0x1B5BF593:I64
> 
>         0x1B5BF593:  testl %edx,%edx
> 
>               ------ IMark(0x1B5BF593, 2, 0) ------
>               t5 = 64to32(GET:I64(32))
>               t4 = 64to32(GET:I64(32))
>               t3 = And32(t5,t4)
>               PUT(144) = 0x13:I64
>               PUT(152) = 32Uto64(t3)
>               PUT(160) = 0x0:I64
>               PUT(184) = 0x1B5BF595:I64
> 
>         0x1B5BF595:  js-8 0x1B5BF5A2
> 
>               ------ IMark(0x1B5BF595, 2, 0) ------
>               if
> (64to1(amd64g_calculate_condition[mcx=0x13]{0x3817bec0}(0x8:I64,GET:I64(144),
> GET:I64(152),GET:I64(160),GET:I64(168)):I64)) { PUT(184) = 0x1B5BF5A2:I64;
> exit-Boring }
>               PUT(184) = 0x1B5BF597:I64
>               PUT(184) = GET:I64(184); exit-Boring
> 
> GuestBytes 1B5BF581 22  48 8D 0D A8 31 FF FF 48 89 0C D3 48 63 90 80 02 00
> 00 85 D2 78 0B  03FEC91B
> 
> VexExpansionRatio 22 363   165 :10
> 
> ==== SB 23787 (evchecks 13409020) [tid 1] 0x1b5bf597 UNKNOWN_FUNCTION
> /usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so+0xa9597
> 
> ------------------------ Front end ------------------------
> 
>         0x1B5BF597:  leaq -53198(%rip), %rsi
> 
>               ------ IMark(0x1B5BF597, 7, 0) ------
>               t0 = Add64(0x1B5BF59E:I64,0xFFFFFFFFFFFF3032:I64)
>               PUT(64) = t0
>               PUT(184) = 0x1B5BF59E:I64
> 
>         0x1B5BF59E:  movq %rsi,(%rbx,%rdx,8)
> 
>               ------ IMark(0x1B5BF59E, 4, 0) ------
>               t1 = Add64(GET:I64(40),Shl64(GET:I64(32),0x3:I8))
>               STle(t1) = GET:I64(64)
>               PUT(184) = 0x1B5BF5A2:I64
> 
>         0x1B5BF5A2:  movslq 636(%rax),%rdx
> 
>               ------ IMark(0x1B5BF5A2, 7, 0) ------
>               t2[????????????????] +++ killed by SIGSEGV +++
>  = Add64(GET:I64(16),0x27C:I64)
>               PUT(32) = 32Sto64(LDle:I32(t2))
>               PUT(184) = 0x1B5BF5A9:I64
> 
>         0x1B5BF5A9:  testl %edx,%edx
> 
>               ------ IMark(0x1B5BF5A9, 2, 0) ------
>               t5 = 64to32(Segmentation fault
> GET:I64(32))
>               t4 = 64to32(GET:I64(32))
>               t3 = And32(t5,t4)
>               PUT(144) = 0x13:I64
>               PUT(152) = 32Uto64(t3)
>               PUT(160) = 0x0:I64
>               PUT(184) = 0x1B5BF5AB:I64
> 
>         0x1B5BF5AB:  js-8 0x1B5BF5B8
> 
> =====
> 
> $ gdb /usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so
> 
> (gdb) x/12i 0xa9597
>    0xa9597:	lea    -0xcfce(%rip),%rsi        # 0x9c5d0
>    0xa959e:	mov    %rsi,(%rbx,%rdx,8)
>    0xa95a2:	movslq 0x27c(%rax),%rdx
>    0xa95a9:	test   %edx,%edx
>    0xa95ab:	js     0xa95b8
>    0xa95ad:	lea    -0xd144(%rip),%rdi        # 0x9c470
>    0xa95b4:	mov    %rdi,(%rbx,%rdx,8)
>    0xa95b8:	movslq 0x284(%rax),%rdx
>    0xa95bf:	test   %edx,%edx
>    0xa95c1:	js     0xa95ce
>    0xa95c3:	lea    -0xd2ba(%rip),%rcx        # 0x9c310
>    0xa95ca:	mov    %rcx,(%rbx,%rdx,8)
> 
> (gdb) x/12i 0xa9597-0x20
>    0xa9577:	movslq 0x274(%rax),%edx
>    0xa957d:	test   %edx,%edx
>    0xa957f:	js     0xa958c
>    0xa9581:	lea    -0xce58(%rip),%rcx        # 0x9c730
>    0xa9588:	mov    %rcx,(%rbx,%rdx,8)
>    0xa958c:	movslq 0x280(%rax),%rdx
>    0xa9593:	test   %edx,%edx
>    0xa9595:	js     0xa95a2
>    0xa9597:	lea    -0xcfce(%rip),%rsi        # 0x9c5d0
>    0xa959e:	mov    %rsi,(%rbx,%rdx,8)
>    0xa95a2:	movslq 0x27c(%rax),%rdx
>    0xa95a9:	test   %edx,%edx
> 
> =====

I am not familiar with this debugging info before, but basically
valgrind expands the instructions by including some memory checks
and execute the extended instruction stream (which it calls basic block)?

I will tinker with the options to see if I can gain any insight on my end.

TIA

Comment 11 zephyrus00jp 2019-10-31 23:06:05 UTC

I thought I posted a followup early this October, but funny, it did not seem to have made it to the bugzilla.

OK, here it goes.

I still cannot run Thunderbird mail client under valgrind as normal user.
I get very mysterious segmentation error of valgrind.

*HOWEVER*, to my surprise, when I ran valgrind as a *LOGGED IN SUPERUSER* (|su| does not work. I have to log in as superuser either from login screen or sshing to the local PC box as root.), I can run thunderbird under valgrind. successfully (!)

I wonder why.

So now, I have been testing thunderbird under valgrind by "ssh"ing to my local PC box after enabling superuser login of ssh daemon (which is usually off for security reasons).

Anyone has any idea why this is so?

From the look of log trace, the mysterious segmentation error (as normal user) that kills valgrind occurs at the time the stack seems to be extended.
(I looked at signal handling and it looks that way.)

At that point, valgrind as normal user is killed, but
as properly logged in superuser, the stack gets enlarged properly and
the thunderbird proceeds fine under valgrind.

I mention that Debian GNU/Linux kernel 3.19.5 had no problem running thunderbird under valgrind.
However I have no idea how that particular kernel version with the particular kernel config had any looser restriction regarding user process trying to extend stack when it overflows.

I thought stack extension is automatic under Debian as long as user sets the stack maximum size reasonably large.
I set ulimit to the maximum (no limit).
So my processshould be able to extend the stack as normal user under OS management, was what I thought.

Does anyone have idea how to enable the reliable stack growth as ordinary user under Debian?
Or maybe this is valgrind-specific stack exntesion issue under certain kernel config (including security mechanism?).

I have been testing thunderbird under valgrind as superuser for a couple of weeks after long time, but
use of ssh and the permission of files created as superuser during the execution of valgrind+thunderbird have made it rather awkward in the debug cycle and I really wish to use valgrind + thunderbird as normal user.

Any hints/tips will be appreciated.

valgrind is very useful. I already have found two clear-cut uninitialized memory access bugs that have crept in the code over the years.
I suspect more to come ...

TIA

Comment 12 John Reiser 2019-10-31 23:11:56 UTC

Please run "uname -a" then copy+paste the entire output, and label with the circumstances (3 cases: normal user, logged-in root, su from normal user to root).
It's not obvious which kernels are being used today.

Comment 13 zephyrus00jp 2019-10-31 23:52:02 UTC

Created attachment 123636 [details]
uname -a output, and other requested logs.

Thank you for the comment.

I stand corrected. I think valgrind works in the context of |su|
superuser.

It is the thunderbird binary that refuses to run in |su| context.
(I forgot about it and thought it was valgrind which refused to
operate properly in |su| context.)

Requested info is attached.

*1: uname -a output 
*2. As a normal user
*3: as super user |su|.
*4  As properly logged in superuser (via ssh)