Bug 405295 - valgrind 3.14.0 dies due to mysterious DWARF information? (output from rust used by Mozilla TB.)
Status: CONFIRMED
Alias: None
Product: valgrind
Classification: Unclassified
Component: memcheck
Version: 3.14.0
Platform: Other Linux
Importance: NOR critical
Target Milestone: ---
Assignee: Julian Seward
Reported: 2019-03-10 09:14 UTC by zephyrus00jp
Modified: 2021-10-29 17:50 UTC

Attachments
Log from the failed valgrind run of mozilla thunderbird (segmentation error somewhere) (25.14 KB, text/plain)
2019-03-25 17:37 UTC, zephyrus00jp

Description zephyrus00jp 2019-03-10 09:14:19 UTC
SUMMARY
I tried to run the latest Mozilla Thunderbird under valgrind.
These days Mozilla code includes binaries produced by the Rust compiler.
valgrind 3.14.0 under Debian GNU/Linux (64-bit) barfed on a Rust
library file.

STEPS TO REPRODUCE
1. Run thunderbird under valgrind as below.
YMMV. I show the command that was used to invoke TB under valgrind using my local directory layout.

2. valgrind dies with segfault after printing the following message:

parse DIE(readdwarf3.c:3123): confused by:
 <0><25e98>: Abbrev Number: 1 (DW_TAG_compile_unit)
     DW_AT_producer    : (indirect string, offset: 0x16635): clang LLVM (rustc version 1.33.0 (2aa4c46cf 2019-02-28))
     DW_AT_language    : 28
     DW_AT_name        : (indirect string, offset: 0x1666e): toolkit/library/rust/shared/lib.rs
     DW_AT_stmt_list   : 61687424
     DW_AT_comp_dir    : (indirect string, offset: 0x16691): /NREF-COMM-CENTRAL/mozilla
     DW_AT_???         : 1
     DW_AT_low_pc      : 0x0
     DW_AT_ranges      : 48624688
parse_type_DIE:
--3863-- WARNING: Serious error when reading debug info
--3863-- When reading debug info from /KERNEL-SRC/moz-obj-dir/objdir-tb3/toolkit/library/libxul.so:
--3863-- confused by the above DIE
Segmentation fault


OBSERVED RESULT

Segmentation error.

EXPECTED RESULT

Normal operation.

SOFTWARE/OS VERSIONS
Debian GNU/Linux amd64 version.
Linux ip030 4.19.0-1-amd64 #1 SMP Debian 4.19.12-1 (2018-12-22) x86_64 GNU/Linux


ADDITIONAL INFORMATION

Full command output on the console where valgrind is invoked.
Using profile dir: /tmp/mozmillprofile
run-valgrind (masquerading as thunderbird binary)
final command line is:
valgrind --trace-children=yes --fair-sched=yes --smc-check=all-non-file --gen-suppressions=all --vex-iropt-register-updates=allregs-at-mem-access --track-origins=yes --child-silent-after-fork=yes --trace-children-skip=/usr/bin/hg,/bin/rm,*/bin/certutil,*/bin/pk12util,*/bin/ssltunnel,*/bin/uname,*/bin/which,*/bin/ps,*/bin/grep,*/bin/java --num-transtab-sectors=24 --tool=memcheck --freelist-vol=500000000 --redzone-size=128 --px-default=allregs-at-mem-access --px-file-backed=unwindregs-at-mem-access --read-var-info=yes --malloc-fill=0xA5 --free-fill=0xC3 --num-callers=50 --suppressions=$HOME/Dropbox/myown.sup --show-mismatched-frees=no --show-possibly-lost=no --read-inline-info=yes  /KERNEL-SRC/moz-obj-dir/objdir-tb3/dist/bin/thunderbird-bin -jsbridge 24242 -foreground -profile /tmp/mozmillprofile

==3863== Memcheck, a memory error detector
==3863== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==3863== Using Valgrind-3.14.0 and LibVEX; rerun with -h for copyright info
==3863== Command: /KERNEL-SRC/moz-obj-dir/objdir-tb3/dist/bin/thunderbird-bin -jsbridge 24242 -foreground -profile /tmp/mozmillprofile
==3863==

parse DIE(readdwarf3.c:3123): confused by:
 <0><25e98>: Abbrev Number: 1 (DW_TAG_compile_unit)
     DW_AT_producer    : (indirect string, offset: 0x16635): clang LLVM (rustc version 1.33.0 (2aa4c46cf 2019-02-28))
     DW_AT_language    : 28
     DW_AT_name        : (indirect string, offset: 0x1666e): toolkit/library/rust/shared/lib.rs
     DW_AT_stmt_list   : 61687424
     DW_AT_comp_dir    : (indirect string, offset: 0x16691): /NREF-COMM-CENTRAL/mozilla
     DW_AT_???         : 1
     DW_AT_low_pc      : 0x0
     DW_AT_ranges      : 48624688
parse_type_DIE:
--3863-- WARNING: Serious error when reading debug info
--3863-- When reading debug info from /KERNEL-SRC/moz-obj-dir/objdir-tb3/toolkit/library/libxul.so:
--3863-- confused by the above DIE
Segmentation fault
^CTraceback (most recent call last):
  File "runtestlist.py", line 142, in <module>
    line = proc.stdout.readline()
KeyboardInterrupt
make: *** [/NREF-COMM-CENTRAL/mozilla/comm/mail/testsuite-targets.mk:31: mozmill] Interrupt

ishikawa@ip030:/tmp$ uname -a
Linux ip030 4.19.0-1-amd64 #1 SMP Debian 4.19.12-1 (2018-12-22) x86_64 GNU/Linux
ishikawa@ip030:/tmp$ 

Thank you in advance for your attention.

PS: I know it could be due to rust producing incorrect DWARF output, but I have no idea. I wanted to check with you first.
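For reference, the "DW_AT_language : 28" field in the DIE dump above can be decoded with a small lookup. This is only a minimal sketch in Python: the table lists a handful of DW_LANG_* codes from the DWARF 5 standard, and decode_dw_lang is a hypothetical helper, not part of valgrind or any library. It only confirms that the compile unit is labeled as Rust; it does not identify which attribute confused valgrind.

```python
import re

# Subset of DWARF language codes (DW_LANG_*), values as in the DWARF 5 standard.
DW_LANG = {
    0x0001: "DW_LANG_C89",
    0x0002: "DW_LANG_C",
    0x0004: "DW_LANG_C_plus_plus",
    0x000C: "DW_LANG_C99",
    0x001A: "DW_LANG_C_plus_plus_11",
    0x001C: "DW_LANG_Rust",
    0x001D: "DW_LANG_C11",
}


def decode_dw_lang(line):
    """Hypothetical helper: decode a 'DW_AT_language : N' line from a DIE dump."""
    match = re.search(r"DW_AT_language\s*:\s*(\d+)", line)
    if match is None:
        return None  # not a DW_AT_language line
    code = int(match.group(1))
    return DW_LANG.get(code, f"unknown (0x{code:04x})")


# Line copied from the DIE dump in this report.
print(decode_dw_lang("     DW_AT_language    : 28"))
```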
Comment 1 Julian Seward 2019-03-10 09:29:49 UTC
Hi zephyrus00jp.  Thanks for working on TB.  I use it all the time.

Try removing --read-var-info=yes from the flags.  I suspect that will
help.  There's not much loss since Memcheck hardly uses that information
anyway.

Also, for recent Gecko builds, you might want to consider using the
V trunk instead, since it has a couple of bug fixes, that remove
false positives, relative to 3.14.0.  You can get it with:
git clone git://sourceware.org/git/valgrind.git
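If the valgrind command line is assembled by a wrapper script (as the reporter's "run-valgrind" appears to do), dropping the option can be done mechanically. A minimal sketch in Python, assuming the arguments are available as a list; drop_flag is a hypothetical helper and the argument list is a shortened version of the one in the description:

```python
def drop_flag(argv, name):
    """Return a copy of argv without option `name`, given as `name` or `name=value`."""
    return [arg for arg in argv if arg != name and not arg.startswith(name + "=")]


# Shortened version of the command line shown in the description.
args = [
    "valgrind",
    "--tool=memcheck",
    "--read-var-info=yes",
    "--track-origins=yes",
    "--read-inline-info=yes",
]

# Only --read-var-info is removed; --read-inline-info stays untouched.
print(drop_flag(args, "--read-var-info"))
```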
Comment 2 zephyrus00jp 2019-03-10 09:50:28 UTC
(In reply to Julian Seward from comment #1)
> Hi zephyrus00jp.  Thanks for working on TB.  I use it all the time.
> 
> Try removing --read-var-info=yes from the flags.  I suspect that will
> help.  There's not much loss since Memcheck hardly uses that information
> anyway.
> 
> Also, for recent Gecko builds, you might want to consider using the
> V trunk instead, since it has a couple of bug fixes, that remove
> false positives, relative to 3.14.0.  You can get it with:
> git clone git://sourceware.org/git/valgrind.git

I removed "--read-var-info=yes" and the processing got further.
However, I then get a mysterious segmentation fault.
Something like this happened a few years ago with the stock Debian kernel, and I had to build various kernel versions to see which one worked.
Anyway, I am not even sure what is segfaulting at this moment.
I will investigate.

I will also try the latest git version.

Thank you again for your great package!
Comment 3 zephyrus00jp 2019-03-25 17:37:47 UTC
Created attachment 119027
Log from the failed valgrind run of mozilla thunderbird (segmentation error somewhere)
Comment 4 zephyrus00jp 2019-03-25 17:38:16 UTC
Dear Julian,

I am still trying to analyze the segmentation fault which I observe
when I try to run Thunderbird under valgrind.

System is Debian GNU/Linux.
Linux ip030 4.19.0-1-amd64 #1 SMP Debian 4.19.12-1 (2018-12-22) x86_64 GNU/Linux

I am now using the 3.15.0 git version and still get a segmentation fault somewhere.

In order to debug a little further, I added the valgrind options
--vgdb=yes
--vgdb-error=0

With these, I could talk to vgdb from a remote gdb session.

Attached is the verbose log that leads up to the segmentation fault, and
a short log from the remote gdb session.

As far as the remote gdb session goes, it sees the remote end disappear after the segmentation fault.


A couple of points I noticed from the valgrind log.

1. There are a few cases of extending stack base in the verbose log.
   Can it be that I am running out of stack space???

  e.g.:
  --15408-- sync signal handler: signal=11, si_code=1, EIP=0x40088b1, eip=0x10049ff393, from kernel
  --15408-- SIGSEGV: si_code=1 faultaddr=0x1ffeffddf4 tid=1 ESP=0x1ffeffddf0 seg=0x1ffe672000-0x1ffeffdfff
  --15408--        -> extended stack base to 0x1ffeffd000


  --15408-- sync signal handler: signal=11, si_code=1, EIP=0x4009722, eip=0x1004a3ee8c, from kernel
  --15408-- SIGSEGV: si_code=1 faultaddr=0x1ffeffcff8 tid=1 ESP=0x1ffeffcff8 seg=0x1ffe672000-0x1ffeffcfff
  --15408--        -> extended stack base to 0x1ffeffc000


  --15408-- sync signal handler: signal=11, si_code=1, EIP=0x400a334, eip=0x1004a3eb28, from kernel
  --15408-- SIGSEGV: si_code=1 faultaddr=0x1ffeffbff8 tid=1 ESP=0x1ffeffbff8 seg=0x1ffe672000-0x1ffeffbfff
  --15408--        -> extended stack base to 0x1ffeffb000



2. There is a warning about conflicting redirection(?)

  ==15408== WARNING: new redirection conflicts with existing -- ignoring it
  --15408--     old: 0x0401e280 (strlen              ) R-> (0000.0) 0x580c9e32 vgPlain_amd64_linux_REDIR_FOR_strlen
  --15408--     new: 0x0401e280 (strlen              ) R-> (2007.0) 0x04836170 strlen


3. The last signal dumped is 13. This is SIGPIPE, which suggests that a pipe was broken. I am not sure what caused the signal (did the other end die? and if so, which end of which pipe?). The |make mozmill| test suite definitely uses interprocess communication, so the pipe may belong to the Python interpreter or something else used to build the mozmill test framework.
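Regarding point 1, the three SIGSEGV excerpts can be checked arithmetically. A minimal sketch in Python, assuming 4 KiB pages (PAGE_SIZE is an assumption, and page_floor and summarize are hypothetical helpers): in each case the fault address is at or just above ESP, and rounding it down to a page boundary gives exactly the "extended stack base" value that valgrind printed. That pattern looks like ordinary on-demand stack growth, one page at a time, rather than stack exhaustion, although this is an interpretation of the log and not a statement about valgrind internals.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on typical x86_64 Linux

SIGSEGV_RE = re.compile(r"faultaddr=(0x[0-9a-f]+).*?ESP=(0x[0-9a-f]+)")


def page_floor(addr):
    """Round an address down to the start of its page."""
    return addr & ~(PAGE_SIZE - 1)


def summarize(line):
    """Return (fault address, ESP, page-aligned fault address) for a SIGSEGV log line."""
    match = SIGSEGV_RE.search(line)
    if match is None:
        return None
    fault, esp = (int(value, 16) for value in match.groups())
    return fault, esp, page_floor(fault)


# Lines copied from the verbose log excerpts in this comment.
log = [
    "--15408-- SIGSEGV: si_code=1 faultaddr=0x1ffeffddf4 tid=1 ESP=0x1ffeffddf0 seg=0x1ffe672000-0x1ffeffdfff",
    "--15408-- SIGSEGV: si_code=1 faultaddr=0x1ffeffcff8 tid=1 ESP=0x1ffeffcff8 seg=0x1ffe672000-0x1ffeffcfff",
    "--15408-- SIGSEGV: si_code=1 faultaddr=0x1ffeffbff8 tid=1 ESP=0x1ffeffbff8 seg=0x1ffe672000-0x1ffeffbfff",
]

for line in log:
    fault, esp, base = summarize(line)
    print(f"fault={fault:#x} esp={esp:#x} fault-esp={fault - esp} page base={base:#x}")
```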

Anyway attached is the log.

Any pointer will be appreciated.  

However, I have a feeling this could
be the result of a strange kernel configuration that makes valgrind fail for
mysterious reasons. In that case, there is not much we can do.
The last time I investigated such a strange failure, under the Linux 3.x series,
one kernel revision failed while another one ran valgrind fine.

So I have tried to build a few different kernel versions of the 4.x
series, but recent kernel bloat has made it difficult for me to
install several versions side by side, mostly due to the size of the
dynamically loaded driver modules. The subdirectories under
/usr/lib/modules that hold these modules for each
kernel revision have become so large that
my root partition basically filled up. I would have to manually cull unused
modules from the default configuration files. I checked what Debian
does and found that it distributes a kernel with many drivers
built in. That way /boot is used heavily (and /boot is a separate partition on my PC), but
there is not much space pressure under /usr/lib/modules (which is usually part of the root (/) partition).

Oh well.

Best Regards,
Chiaki
Comment 5 Julian Seward 2019-04-01 12:04:10 UTC
(In reply to zephyrus00jp from comment #3)
> Created attachment 119027 [details]
> Log from the failed valgrind run of mozilla thunderbird (segmentation error
> somewhere)

> --15408-- REDIR: 0x4d12640 (libc.so.6:__rawmemchr_avx2) redirected to 0x483a7b0 (rawmemchr)
> --15408-- sys_sigaction: sigNo 13, new 0x1ffeffcd50, old 0x0, new flags 0x4000000
> Segmentation fault

Can you get a stack trace at the segfault, by attaching GDB to the V process before it faults?
Comment 6 zephyrus00jp 2019-04-02 05:37:28 UTC
I will try to obtain the stack trace.

It is just that I am not even sure exactly which process is segfaulting, but given that valgrind ought to be in control of all the subprocesses, it should be valgrind.

I will try my best. The last time I tried, a few years ago, I could not get a meaningful stack dump.

Fingers crossed.

TIA
Comment 7 zephyrus00jp 2019-04-09 08:01:15 UTC
This is what I found.

(A side note: under 
4.19.0-1-amd64 #1 SMP Debian 4.19.12-1 (2018-12-22) x86_64 GNU/Linux,
I could run very old 32-bit TB 22.0a1 (2013-03-20)
under valgrind-3.15.0.RC1.)

However, under the same OS, I could not run 2.9.1 (64-bit) (the official release, not the one I built locally).

The segfault seems to occur in dynamically generated code (or in a dynamically loaded shared library? I am not sure).

gdb valgrind

(gdb) run --smc-check=all-non-file --fair-sched=yes --redzone-size=128 --vex-iropt-register-updates=allregs-at-mem-access --trace-children=yes ~ishikawa/thunderbird/thunderbird
Starting program: /usr/local/bin/valgrind --smc-check=all-non-file --fair-sched=yes --redzone-size=128 --vex-iropt-register-updates=allregs-at-mem-access --trace-children=yes ~ishikawa/thunderbird/thunderbird
process 30378 is executing new program: /usr/local/lib/valgrind/memcheck-amd64-linux
==30378== Memcheck, a memory error detector
==30378== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==30378== Using Valgrind-3.15.0.RC1 and LibVEX; rerun with -h for copyright info
==30378== Command: /home/ishikawa/thunderbird/thunderbird
==30378==

Program received signal SIGSEGV, Segmentation fault.

(gdb) where
#0  0x00000010039fa56c in ?? ()
#1  0x0000001002eadf30 in ?? ()
#2  0x0000001002008380 in ?? ()
#3  0x0000001002eadf18 in ?? ()
#4  0x0000001002eadf30 in ?? ()
#5  0x0000001002eadf40 in ?? ()
#6  0x0000000000000000 in ?? ()

(gdb) info files
Symbols from "/usr/local/lib/valgrind/memcheck-amd64-linux".
Native process:
    Using the running image of child process 30378.
    While running this, GDB does not access memory from...
Local exec file:
    `/usr/local/lib/valgrind/memcheck-amd64-linux', file type elf64-x86-64.
    Entry point: 0x580bac60
    0x0000000058000158 - 0x000000005800017c is .note.gnu.build-id
    0x0000000058000180 - 0x00000000581e9d9a is .text
    0x00000000581e9da0 - 0x000000005825831a is .rodata
    0x0000000058258320 - 0x0000000058286048 is .eh_frame
    0x0000000058287860 - 0x0000000058289f60 is .data.rel.ro.local
    0x0000000058289f60 - 0x0000000058289f90 is .data.rel.ro
    0x0000000058289f90 - 0x0000000058289fe8 is .got
    0x0000000058289fe8 - 0x000000005828a000 is .got.plt
    0x000000005828a000 - 0x000000005828c420 is .data
    0x000000005828c440 - 0x0000000059c8fab9 is .bss

So obviously, the PC location in the stacktrace is not within the code of valgrind.
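The same check can be done mechanically. A minimal sketch in Python: the section table is copied from the "info files" output above, and section_of is a hypothetical helper. None of the backtrace addresses falls inside the static image of memcheck-amd64-linux, which would be consistent with the guess that they belong to memory mapped at run time (for example generated code), though the table alone cannot confirm that.

```python
# (start, end, name) taken from the "info files" output; end is exclusive.
SECTIONS = [
    (0x58000158, 0x5800017C, ".note.gnu.build-id"),
    (0x58000180, 0x581E9D9A, ".text"),
    (0x581E9DA0, 0x5825831A, ".rodata"),
    (0x58258320, 0x58286048, ".eh_frame"),
    (0x58287860, 0x58289F60, ".data.rel.ro.local"),
    (0x58289F60, 0x58289F90, ".data.rel.ro"),
    (0x58289F90, 0x58289FE8, ".got"),
    (0x58289FE8, 0x5828A000, ".got.plt"),
    (0x5828A000, 0x5828C420, ".data"),
    (0x5828C440, 0x59C8FAB9, ".bss"),
]


def section_of(addr):
    """Return the name of the section containing addr, or None if it is outside the image."""
    for start, end, name in SECTIONS:
        if start <= addr < end:
            return name
    return None


# Frames #0 to #2 of the backtrace, plus the ELF entry point as a sanity check.
for addr in (0x10039FA56C, 0x1002EADF30, 0x1002008380, 0x580BAC60):
    print(hex(addr), section_of(addr))
```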


(gdb) info reg
rax            0x1002da7bfe        68767349758
rbx            0x0                 0
rcx            0xffffaaaa          4294945450
rdx            0x1002da4000        68767334400
rsi            0x59298d60          1495895392
rdi            0x1ffeffeff8        137422172152
rbp            0x1002008390        0x1002008390
rsp            0x1002eade00        0x1002eade00
r8             0x180d2             98514
r9             0x10058a5710        68812429072
r10            0x4029190           67277200
r11            0x58010f90          1476464528
r12            0x1002eadf40        68768423744
r13            0x1002eadf30        68768423728
r14            0x1002eadf18        68768423704
r15            0x1ffeffeff8        137422172152
rip            0x10039fa56c        0x10039fa56c
eflags         0x10246             [ PF ZF IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
(gdb)

It seems to me that the crash occurs in code (dynamically generated in heap?).

It is possible that the heap or rather the stack frame got mangled by the time this segmentation error occurs.

(gdb)  disassemble 0x10039fa56c,0x10039fa580
Dump of assembler code from 0x10039fa56c to 0x10039fa580:
=> 0x00000010039fa56c:	mov    %r10,(%r15)
   0x00000010039fa56f:	movq   $0x40088a4,0xb8(%rbp)
   0x00000010039fa57a:	sub    $0x8,%r15
   0x00000010039fa57e:	movq   $0x0,0x3d0(%rbp)
End of assembler dump.
(gdb) bt

The value of r15 seems to be near the end of the sbrk'ed address range (judging from previous runs
where I checked the system calls with strace), so we may be
accessing an unmapped memory area from the dynamically generated code?

The backtrace looks a bit strange:
only the topmost PC seems to point at a valid instruction. (Well, it is possible that this
is in a signal handler and thus the backtrace may not be quite correct.)

(gdb) bt
#0  0x00000010039fa56c in ?? ()
#1  0x0000001002eadf30 in ?? ()
#2  0x0000001002008380 in ?? ()
#3  0x0000001002eadf18 in ?? ()
#4  0x0000001002eadf30 in ?? ()
#5  0x0000001002eadf40 in ?? ()
#6  0x0000000000000000 in ?? ()
(gdb) disassemble 0x1002eadf30,0x1002eadf40
Dump of assembler code from 0x1002eadf30 to 0x1002eadf40:
   0x0000001002eadf30:    add    %al,(%rax)
   0x0000001002eadf32:    add    %al,(%rax)
   0x0000001002eadf34:    add    %al,(%rax)
   0x0000001002eadf36:    add    %al,(%rax)
   0x0000001002eadf38:    add    %al,(%rax)
   0x0000001002eadf3a:    add    %al,(%rax)
   0x0000001002eadf3c:    add    %al,(%rax)
   0x0000001002eadf3e:    add    %al,(%rax)
End of assembler dump.
(gdb) disassemble 0x1002008380,0x10020083a0
Dump of assembler code from 0x1002008380 to 0x10020083a0:
   0x0000001002008380:    add    %eax,(%rax)
   0x0000001002008382:    add    %al,(%rax)
   0x0000001002008384:    add    (%rax),%al
   0x0000001002008386:    add    %al,(%rax)
   0x0000001002008388:    add    %al,(%rax)
   0x000000100200838a:    add    %al,(%rax)
   0x000000100200838c:    add    %al,(%rax)
   0x000000100200838e:    add    %al,(%rax)
   0x0000001002008390:    sbb    $0x5809a2,%eax
   0x0000001002008395:    add    %al,(%rax)
   0x0000001002008397:    add    %dl,%cl
   0x0000001002008399:    addb   $0x0,(%rcx)
   0x000000100200839c:    add    %al,(%rax)
   0x000000100200839e:    add    %al,(%rax)
End of assembler dump.
(gdb) disassemble 0x1002eadf18,0x1002eadf28
Dump of assembler code from 0x1002eadf18 to 0x1002eadf28:
   0x0000001002eadf18:    rolb   %cl,0x1(%rax)
   0x0000001002eadf1e:    add    %al,(%rax)
   0x0000001002eadf20:    add    %al,(%rax)
   0x0000001002eadf22:    add    %al,(%rax)
   0x0000001002eadf24:    add    %al,(%rax)
   0x0000001002eadf26:    add    %al,(%rax)
End of assembler dump.
(gdb) quit
A debugging session is active.

    Inferior 1 [process 30378] will be killed.

Quit anyway? (y or n) y
mailtest@debian-vbox-ci:~$

===

I noticed one thing. Sorry, I did not pass the proper valgrind options in the run below, but it did not seem to change the result.

I found that I can continue past the first two SIGSEGVs. This suggests that the first couple of SIGSEGVs are probably handled properly by valgrind's signal handler, which allocates more memory by means of mmap, etc.

However, after the third one, I seem to get stuck at the same position.

This is the interaction from that point on:

mailtest@debian-vbox-ci:~$ gdb valgrind
GNU gdb (Debian 8.2.1-2) 8.2.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from valgrind...done.
(gdb) run ~ishikawa/thunderbird/thunderbird
Starting program: /usr/local/bin/valgrind ~ishikawa/thunderbird/thunderbird
process 30468 is executing new program: /usr/local/lib/valgrind/memcheck-amd64-linux
==30468== Memcheck, a memory error detector
==30468== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==30468== Using Valgrind-3.15.0.RC1 and LibVEX; rerun with -h for copyright info
==30468== Command: /home/ishikawa/thunderbird/thunderbird
==30468==

Program received signal SIGSEGV, Segmentation fault.
0x00000010039f8064 in ?? ()
(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x0000001003a6014d in ?? ()
(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x0000001003a6014d in ?? ()
(gdb) where
#0  0x0000001003a6014d in ?? ()
#1  0x0000001002eadf30 in ?? ()
#2  0x0000001002008320 in ?? ()
#3  0x0000001002eadf18 in ?? ()
#4  0x0000001002eadf30 in ?? ()
#5  0x0000001002eadf40 in ?? ()
#6  0x00000000592babc0 in ?? ()
#7  0x0000001002eb1000 in ?? ()
#8  0x000000000001124a in ?? ()
#9  0x0000000000001ef5 in ?? ()
#10 0x0000000000000000 in ?? ()
(gdb) disasm 0x1003a6014d,0x1003a60160
Undefined command: "disasm".  Try "help".
(gdb) disass 0x1003a6014d,0x1003a60160
Dump of assembler code from 0x1003a6014d to 0x1003a60160:
=> 0x0000001003a6014d:    mov    %r10,(%rbx)
   0x0000001003a60150:    movq   $0x40153d3,0xb8(%rbp)
   0x0000001003a6015b:    lea    0x8(%rbx),%r14
   0x0000001003a6015f:    mov    %r14,%rdi
End of assembler dump.
(gdb) info reg
rax            0x1002da73f0        68767347696
rbx            0x1ffeffcfc0        137422163904
rcx            0xffffaaaa          4294945450
rdx            0x1002da4000        68767334400
rsi            0x59298d60          1495895392
rdi            0x1ffeffcfc0        137422163904
rbp            0x1002008330        0x1002008330
rsp            0x1002eade00        0x1002eade00
r8             0x1002da4000        68767334400
r9             0x1a57              6743
r10            0x0                 0
r11            0x58010f90          1476464528
r12            0x1002eadf40        68768423744
r13            0x1002eadf30        68768423728
r14            0x0                 0
r15            0x1ffeffd340        137422164800
rip            0x1003a6014d        0x1003a6014d
eflags         0x10246             [ PF ZF IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
(gdb)


Any advice for further debugging is appreciated.

TIA for your attention.

PS: OK, maybe I am doing it all incorrectly and should invoke the vgdb feature of VALGRIND. But even then I obtained PC address not within the binary in question... But I could be wrong.
Comment 8 zephyrus00jp 2019-04-09 10:10:04 UTC
(In reply to zephyrus00jp from comment #7)
> This is what I found.
> 
> (A side note: under 
> 4.19.0-1-amd64 #1 SMP Debian 4.19.12-1 (2018-12-22) x86_64 GNU/Linux,
> I could run very old 32-bit TB 22.0a1 (2013-03-20)
> under valgrind-3.15.0.RC1.)
> 
> However, under the same OS, I could not run 2.9.1 (64-bit) (the official
> release, not the one I built locally).
>

2.9.1 is a typo of 52.9.1 (the current release I use on my office PC).
Comment 9 zephyrus00jp 2019-09-04 22:46:17 UTC
A very mysterious thing happened.

I could not run valgrind against a locally built Thunderbird mail client binary for quite some time under Debian.

A few years ago, this happened with the stock Debian GNU/Linux kernel 3.1x.y series, and I found that an older kernel, 3.9 or something like that, helped.
Yes, it was kernel dependent. Ugh...
I kept that kernel around for quite some time, but other development tools and a 4K display needed a newer kernel, so I ditched it.

I could not run valgrind+thunderbird combination for a couple of years now.

Today, I tried to see if increasing the user-space stack area might help.
ulimit -a showed my stack size was about 13 MB, so I increased it to about 16 MB with
ulimit -s 16000.
Still no luck. valgrind died due to a mysterious segmentation fault that is not
even reported back to the gdb remote session when I used the vgdb feature of valgrind. (The mysterious nature of this crash, reported only as "Segmentation Error" when I try to run Thunderbird under valgrind, has puzzled me to no end since it started to appear a few years ago.)

Anyway, I tried to raise the ulimit for stack size further, but the command refused. It seemed that I could not set it larger than 16000,
though I could reset it to something smaller like 15000.
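A note on units may help here. In bash, "ulimit -s" takes its argument in KiB, so 16000 corresponds to roughly 15.6 MiB. Also, when neither -S nor -H is given, bash sets both the soft and the hard limit, and an unprivileged process cannot raise a hard limit again, which would be consistent with being able to go down to 15000 but not above 16000. A minimal sketch in Python for inspecting the limits (kib_to_mib and describe are hypothetical helpers; the resource module is Unix-only):

```python
import resource


def kib_to_mib(kib):
    """Convert a size in KiB (the unit `ulimit -s` uses in bash) to MiB."""
    return kib / 1024


def describe(limit_bytes):
    """Format an rlimit value, which getrlimit() reports in bytes for RLIMIT_STACK."""
    if limit_bytes == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit_bytes // 1024} KiB"


# Soft limit is what the process gets; hard limit is the ceiling for the soft limit.
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
print("stack soft limit:", describe(soft))
print("stack hard limit:", describe(hard))

# What `ulimit -s 16000` asks for, expressed in MiB.
print("16000 KiB =", kib_to_mib(16000), "MiB")
```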

So I tried to invoke "su" and run thunderbird+valgrind. Then the thunderbird binary refused to run, stating something to the effect of "TB does not run as superuser in normal user session", presumably for security reasons.

SO I QUIT my ordinary login, and then LOGGED IN AS THE ROOT USER.

S U R P R I S E ( ! ! ! )

thunderbird runs under valgrind when I ran it as logged in superuser (root) !?

So it seems to me that
- "Segmentation Error" is caused by some kind of security mechanism (superuser vs. ordinary user restriction)?

Anyway, I am about to run the whole |make mozmill| test suite of Thunderbird under valgrind again. Yes, it is a bit risky to run this test as root, but
I have saved my patches and other files into a different Linux image, so I think I can recover even if there is some serious bug in Thunderbird...

The security mechanisms Debian may have are
- APP ARMOR
- SELinux
- etc.

But when I looked under /var/log to see if there were any relevant messages about a security mechanism kicking in to abort valgrind+thunderbird, I found nothing so far.

In short: Thunderbird under valgrind runs when I am logged in as superuser, but not as a normal user, under the latest Debian GNU/Linux kernel.
This is the output from uname on the PC.
Linux ip030 4.19.0-5-amd64 #1 SMP Debian 4.19.37-6 (2019-07-18) x86_64 GNU/Linux

(Actually it runs inside VirtualBox.)

I already see a few disturbing memory error messages in the first couple of minutes of the run. I hope I can find a slew of unreported errors.

So this is not the most helpful report, but at least now I know how to run Thunderbird under valgrind...

TIA
Comment 10 zephyrus00jp 2019-09-04 23:05:27 UTC
Now I realize that the point where the ordinary user receives the mysterious segmentation error message (not even caught by valgrind, and the remote gdb session to valgrind's vgdb is terminated) is the position marked with **** in the log snippet below, which comes from near the beginning of the valgrind+thunderbird log.
So maybe extending the stack as an ordinary user is not allowed under the stock Debian GNU/Linux kernel (???)

Log snippet
  
 ... lots of messages regarding reading dynamic shared libraries and their symbols.
--30306-- Reading syms from /usr/lib/x86_64-linux-gnu/libICE.so.6.3.0
--30306--    object doesn't have a symbol table
--30306-- REDIR: 0x4d582d0 (libc.so.6:__rawmemchr_avx2) redirected to 0x483a8d0 (rawmemchr)
--30306-- sys_sigaction: sigNo 13, new 0x1ffeffd6b0, old 0x0, new flags 0x4000000
**** I don't see the following line and subsequent lines as an ordinary user.  I see "Segmentation Error" here. 
--30306-- sync signal handler: signal=11, si_code=1, EIP=0x4c5280a, eip=0x1004dbf881, from kernel
--30306-- SIGSEGV: si_code=1 faultaddr=0x1ffeffad40 tid=1 ESP=0x1ffeffac40 seg=0x1ffe801000-0x1ffeffbfff
--30306--        -> extended stack base to 0x1ffeffa000
--30306-- REDIR: 0x4d5bdb0 (libc.so.6:__strchrnul_avx2) redirected to 0x483a8a0 (strchrnul)
[30306, Unnamed thread 4e45880] WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /NREF-COMM-CENTRAL/mozilla/xpcom/base/nsTraceRefcnt.cpp, line 198
     ... normal valgrind messages ....

TIA
Comment 11 zephyrus00jp 2019-09-05 02:06:18 UTC
My previous comment about valgrind + Thunderbird running fine as superuser under Debian GNU/Linux may have been a bit premature:
the run certainly executed some tests, but after printing the log below, valgrind seems to have gotten stuck. There was no CPU activity...

-DOCSHELL 0x42ea2c20 == 14 [pid = 4068] [id = {e07f68d5-9c05-4905-a767-511217216710}] [url = chrome://messenger/content/AccountManager.xul]
--4068-- memcheck GC: 81117 nodes, 30669 survivors (37.8%)
[4068, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x8000FFFF: file /NREF-COMM-CENTRAL/mozilla/comm/mailnews/base/util/nsMsgDBFolder.cpp, line 2952
NS_NewBufferedOutputStream: outputStream (= std::move(aOutputputStream)) =0x5ffb6c98
--4068-- memcheck GC: 81117 nodes, 31071 survivors (38.3%)


The pseudo desktop (Xephyr) shows a greyed-out screen all over. So I am not sure whether this is a bug in valgrind+thunderbird, or some issue related to the timing of the Xephyr server (X11 protocol packet exchange; there could be some implicit timing assumption on the client side, i.e. TB, that is normally satisfied without an explicit lock, etc.)
valgrind+TB got stuck while executing the test:
TEST-START | /NREF-COMM-CENTRAL/mozilla/comm/mail/test/mozmill/account/test-archive-options.js | test_open_archive_options
Comment 12 zephyrus00jp 2021-03-16 09:44:34 UTC
Great news.

After so many months, actually a few years, of mysterious valgrind crashes under 64-bit Debian GNU/Linux whenever I tried to run the Thunderbird (TB) client, today I tested the operation with this combination.

ishikawa@ip030:/home/ishikawa$ valgrind --version
valgrind-3.16.1
ishikawa@ip030:/home/ishikawa$ uname -a
Linux ip030 5.10.0-4-amd64 #1 SMP Debian 5.10.19-1 (2021-03-02) x86_64 GNU/Linux
ishikawa@ip030:/home/ishikawa$ 


A surprise. I could run TB under valgrind without the mysterious crashes so far.
One thing is for sure: where I would previously have gotten the
segmentation error after the signal catch (presumably for extending the stack), I now see a strange pipe error. Maybe valgrind previously died as an ordinary user with an error related to this message, but I am not sure.

Anyway, this is real progress, and I will report any further issues with the use of valgrind and TB.

But I have to tell you that the memory pressure with the new valgrind and TB combination is much more pronounced.
I allocate 16GB of memory to the Debian GNU/Linux image in a VirtualBox that runs under Windows 10 (the host has 32GB).
But now it is causing a lot of paging/swapping, and sometimes the whole VirtualBox does not respond for a while.

Still, being able to run valgrind + TB is a great improvement.

I have a feeling valgrind 3.16 had a few internal improvements, and
the Linux 5.10 kernel released by Debian must have a feature that is valgrind-friendly when valgrind needs to deal with a huge binary like TB.

Thank you again for releasing this great program to a wide community of users.

Happy Hacking!
Comment 13 zephyrus00jp 2021-03-16 09:57:37 UTC
Clarification:

This is the mysterious error I see today (valgrind would have crashed before here.)

...
14:38.51 GECKO(31703) --31707-- SIGSEGV: si_code=1 faultaddr=0x1ffeff6f90 tid=1 ESP=0x1ffeff6f90 seg=0x1ffe001000-0x1ffeff6fff
14:38.51 GECKO(31703) --31707--        -> extended stack base to 0x1ffeff6000
14:38.51 GECKO(31703) --31707-- sync signal handler: signal=11, si_code=1, EIP=0x483873e, eip=0x100cd76014, from kernel
14:38.51 GECKO(31703) --31707-- SIGSEGV: si_code=1 faultaddr=0x1ffeff5ff0 tid=1 ESP=0x1ffeff5fe0 seg=0x1ffe001000-0x1ffeff5fff
14:38.51 GECKO(31703) --31707--        -> extended stack base to 0x1ffeff5000
14:41.15 GECKO(31703) --31707-- WARNING: Serious error when reading debug info
14:41.15 GECKO(31703) --31707-- When reading debug info from /memfd:mozilla-ipc (deleted):
14:41.15 GECKO(31703) --31707-- failed to stat64/stat this file
14:41.19 GECKO(31703) --31707-- WARNING: Serious error when reading debug info
14:41.19 GECKO(31703) --31707-- When reading debug info from /memfd:mozilla-ipc (deleted):
14:41.19 GECKO(31703) --31707-- failed to stat64/stat this file
14:41.62 GECKO(31703) ==31707== Warning: set address range perms: large range [0x132d88c000, 0x172d88d000) (noaccess)
17:23.24 GECKO(31703) --31707-- REDIR: 0x4930490 (libstdc++.so.6:operator delete(void*)) redirected to 0x4839e40 (operator delete(void*))
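As a side note on the "large range" warning in the log above, the size of the range can be computed directly from the two addresses. A minimal sketch in Python (range_size is a hypothetical helper): the range comes to 16 GiB plus one 4 KiB page. Since it is marked "noaccess", it may well be reserved address space rather than memory actually in use, but the log alone does not confirm that.

```python
import re


def range_size(line):
    """Return the size in bytes of a half-open '[start, end)' hex range found in a log line."""
    match = re.search(r"\[(0x[0-9a-f]+),\s*(0x[0-9a-f]+)\)", line)
    if match is None:
        return None
    start, end = (int(value, 16) for value in match.groups())
    return end - start


# Line copied from the log excerpt above.
line = "==31707== Warning: set address range perms: large range [0x132d88c000, 0x172d88d000) (noaccess)"

size = range_size(line)
gib, remainder = divmod(size, 2**30)
print(f"{size} bytes = {gib} GiB + {remainder} bytes")
```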

Before, mysteriously, valgrind + TB worked in a superuser account.
However, as I mentioned earlier, until several months ago valgrind crashed with a mysterious segfault if I tried to run it with TB as an ordinary user, as in comment 5. I gave up using valgrind for a while and tried to use ASAN as often as possible, but ASAN can't detect use of uninitialized values.

After so many months of not using valgrind, I tried valgrind as an ordinary user today, and it is running without the crash so far.
Not needing to log in as superuser and mess up file permissions all over is quite nice.
So I am very happy that the valgrind + TB combination can run in an ordinary user account
under Debian GNU/Linux again.
But like I said, I have no idea what the problem was.
That "Serious error" condition I see today
> 14:41.15 GECKO(31703) --31707-- WARNING: Serious error when reading debug info
> 14:41.15 GECKO(31703) --31707-- When reading debug info from /memfd:mozilla-ipc (deleted):

may have caused the older version of valgrind to crash.
Of course, Debian's kernel may have changed for the better.

After all, I could run small binaries under valgrind as an ordinary user without any issue under Debian GNU/Linux all the while :-(