Bug 501713 - Thread behavior is not expected when I run thunderbird client under valgrind.
Status: RESOLVED NOT A BUG
Alias: None
Product: valgrind
Classification: Developer tools
Component: memcheck (other bugs)
Version First Reported In: 3.25 GIT
Platform: PCLinuxOS Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Julian Seward
Reported: 2025-03-18 22:51 UTC by zephyrus00jp
Modified: 2025-10-24 14:54 UTC (History)
Description zephyrus00jp 2025-03-18 22:51:41 UTC
SUMMARY

I ran the thunderbird mail client under valgrind.
While running the xpcshell-test test suite against comm-central Thunderbird,
I got a "Conditional jump or move depends on uninitialised value(s)" error in one of the files.

STEPS TO REPRODUCE
1. The original bug is filed in mozilla's bugzilla: https://bugzilla.mozilla.org/show_bug.cgi?id=1952749

2. I ran comm-central thunderbird (compiled locally on my PC) under valgrind with the
   parameters listed below.

Please note that I am NOT using mozilla's versatile |mach| command
that can be used to invoke valgrind in a simple manner. It has a
shell quoting problem and cannot correctly pass the complex valgrind
options which I use.

So, I opted to rename the original xpcshell binary to xpcshell-bin,
and installed a binary that calls xpcshell-bin under valgrind with
appropriate options.
With this setup, I ran thunderbird's xpcshell-test test suite, and
the suite is executed by thunderbird running under valgrind.

Here are the options that I pass to valgrind:

run-valgrind-xpcshell invoked ...
sizeof(prepargs)=136
argc=26
finalargs[	 0] = valgrind
finalargs[	 1] = --track-origins=yes
finalargs[	 2] = --trace-children=yes
finalargs[	 3] = --trace-children-skip=/usr/bin/lsb_release,/usr/bin/hg,/bin/rm,*/bin/certutil,*/bin/pk12util,*/bin/ssltunnel,*/bin/uname,*/bin/which,*/bin/ps,*/bin/grep,*/bin/java,*/fix-stacks,*/firefox/firefox,*/bin/firefox-esr,*/bin/python,*/bin/python3,*/bin/python2,*/bin/python2.7,*/bin/lsb_release,*/bin/bash,*/bin/nodejs,*/bin/node,*/bin/xpcshell,python3,/bin/sh
finalargs[	 4] = --vex-iropt-register-updates=allregs-at-mem-access
finalargs[	 5] = --smc-check=all-non-file
finalargs[	 6] = --gen-suppressions=all
finalargs[	 7] = --show-mismatched-frees=no
finalargs[	 8] = --fair-sched=yes
finalargs[	 9] = --num-callers=50
finalargs[ 10] = --suppressions=/NEW-SSD/NREF-COMM-CENTRAL/mozilla/build/valgrind/cross-architecture.sup
finalargs[ 11] = --suppressions=/NEW-SSD/moz-obj-dir/objdir-tb3/_valgrind/i386-pc-linux-gnu.sup
finalargs[ 12] = --suppressions=/home/ishikawa/Dropbox/myown.sup
finalargs[ 13] = --show-possibly-lost=no
finalargs[ 14] = --malloc-fill=0xA5
finalargs[ 15] = --free-fill=0xC3
finalargs[ 16] = /NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/xpcshell-bin
finalargs[ 17] = -g
finalargs[ 18] = /NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin
finalargs[ 19] = -a
finalargs[ 20] = /NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin
finalargs[ 21] = -m
finalargs[ 22] = -e
finalargs[ 23] = const _HEAD_JS_PATH = "/NEW-SSD/NREF-COMM-CENTRAL/mozilla/testing/xpcshell/head.js";
finalargs[ 24] = -e
finalargs[ 25] = const _MOZINFO_JS_PATH = "/NEW-SSD/moz-obj-dir/objdir-tb3/temp/xpc-profile-g5u3x9tf/mozinfo.json";
finalargs[ 26] = -e
finalargs[ 27] = const _PREFS_FILE = "/NEW-SSD/moz-obj-dir/objdir-tb3/temp/user.js";
finalargs[ 28] = -e
finalargs[ 29] = const _TESTING_MODULES_DIR = "/NEW-SSD/moz-obj-dir/objdir-tb3/_tests/modules/";
finalargs[ 30] = -f
finalargs[ 31] = /NEW-SSD/NREF-COMM-CENTRAL/mozilla/testing/xpcshell/head.js
finalargs[ 32] = -e
finalargs[ 33] = const _HEAD_FILES = ["/NEW-SSD/moz-obj-dir/objdir-tb3/_tests/xpcshell/comm/mailnews/imap/test/unit/head_imap_maildir.js"];
finalargs[ 34] = -e
finalargs[ 35] = const _JSDEBUGGER_PORT = 0;
finalargs[ 36] = -e
finalargs[ 37] = const _TEST_FILE = ["/NEW-SSD/moz-obj-dir/objdir-tb3/_tests/xpcshell/comm/mailnews/imap/test/unit/test_localToImapFilter.js"];
finalargs[ 38] = -e
finalargs[ 39] = const _TEST_NAME = "xpcshell-maildir.ini:comm/mailnews/imap/test/unit/test_localToImapFilter.js";
finalargs[ 40] = -e
finalargs[ 41] = _execute_test(); quit(0);
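
For reference, a wrapper of this kind could be sketched roughly as follows. This is only a minimal illustration and not the actual run-valgrind-xpcshell source: the function names BuildFinalArgs and ExecArgs are hypothetical, and the sketch assumes the wrapper simply prepends the valgrind command line and swaps argv[0] for the renamed binary.

```cpp
#include <cassert>
#include <string>
#include <vector>
#include <unistd.h>

// Hypothetical helper: build the final command line by prepending the
// valgrind prefix and replacing argv[0] with the renamed real binary.
std::vector<std::string> BuildFinalArgs(
    const std::vector<std::string>& valgrindPrefix,
    const std::string& realBinary,
    const std::vector<std::string>& originalArgv) {
  assert(!originalArgv.empty());  // argv[0] is always expected to be present
  std::vector<std::string> out(valgrindPrefix);
  out.push_back(realBinary);
  // Keep every original argument except argv[0].
  out.insert(out.end(), originalArgv.begin() + 1, originalArgv.end());
  return out;
}

// Hypothetical helper: replace the current process with the assembled
// command. execvp() only returns on failure, in which case -1 is returned.
int ExecArgs(const std::vector<std::string>& args) {
  std::vector<char*> raw;
  for (const std::string& a : args) {
    raw.push_back(const_cast<char*>(a.c_str()));
  }
  raw.push_back(nullptr);  // execvp() needs a null-terminated array
  execvp(raw[0], raw.data());
  return -1;
}
```

Because each argument is passed to execvp() as its own array element, no shell is involved and arguments containing spaces or quotes (such as the -e expressions above) need no extra quoting.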


Please note that I am using fair scheduling (--fair-sched=yes), which has previously run thunderbird (and presumably Firefox) without thread scheduling issues. This time, however, it may not be working as expected.
 

OBSERVED RESULT

EXPECTED RESULT

SOFTWARE/OS VERSIONS
uname -a
Linux ip030 6.12.12-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.12.12-1 (2025-02-02) x86_64 GNU/Linux

ADDITIONAL INFORMATION

To summarize the issue discussed in mozilla's bugzilla,
thunderbird creates a lambda/function that is then passed to a different thread using SyncRunnable::DispatchToThread(), and
usually the different thread executes that function in the context of THAT thread, and then returns the function value if any.
The calling thread is blocked, as the name SyncRunnable::DispatchToThread() suggests (so it seems).

However, somehow under valgrind, this blocking of the thread calling SyncRunnable::DispatchToThread() does not occur, and it does not wait for the completion of the function invocation that is to be done on a different thread.
Thunderbird merrily proceeds without waiting, so the variable(s) that are assigned values in the function/lambda have not
been assigned a value yet, and I see "Conditional jump or move depends on uninitialised value(s)".
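
The expected blocking behavior can be illustrated with a small standalone sketch. It does not use mozilla's SyncRunnable; DispatchSync and ComputeOnOtherThread are hypothetical names, and the worker thread is created per call for simplicity, whereas the real code dispatches to an existing event target.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a synchronous dispatch: run `task` on another
// thread and block the caller until the task has finished.
void DispatchSync(const std::function<void()>& task) {
  assert(task);  // an empty std::function cannot be run
  std::mutex m;
  std::condition_variable cv;
  bool done = false;

  std::thread worker([&] {
    task();
    {
      std::lock_guard<std::mutex> lock(m);
      done = true;
    }
    cv.notify_one();
  });

  {
    // The predicate form of wait() re-checks `done`, so a spurious wakeup
    // cannot let the caller proceed early.
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [&] { return done; });
  }
  worker.join();
}

int ComputeOnOtherThread() {
  int rv;  // no initializer here, mirroring the pattern described above
  DispatchSync([&rv] { rv = 42; });
  return rv;  // defined only because the caller waited for the lambda
}
```

If the wait were skipped, or the task were never run, the final return would read an indeterminate value, which is the situation memcheck reports.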

I have investigated this far.
I am puzzled since I have not seen this happen before. I have run thunderbird under valgrind on and off for the last 7-8 years, or maybe longer. But for the last 10 months or so I did not, because of an issue, possibly a timing issue, inside the GUI library. (See https://bugzilla.mozilla.org/show_bug.cgi?id=1880148 )
I say the problem was with thunderbird because the issue discussed there happened with gdb, strace, and valgrind. Basically, the slowdown caused by these tools makes the context menu system misbehave. It was NOT a valgrind issue.

I can't rule out that this particular issue crept in during the last 10 months.
BUT there are a few older pieces of code with the same pattern (create a lambda/function, pass it to a different thread for execution in the context of that thread, and wait for it).
Moreover, there ARE tons of such usages in the firefox code base.
So I think the code pattern is correct, and it is only now that valgrind somehow mishandles the
thread switch completely.

Does anything in the newer version (3.24, 3.25GIT) ring a bell?

I wonder if I should change "--fair-sched=yes", but I see no reason why that would solve the current issue.

I see the same issue under 3.24 and 3.25 GIT.

Thank you again for offering this great  tool to the programming community.
Comment 1 zephyrus00jp 2025-03-18 23:16:33 UTC
You can find the uses of SyncRunnable::DispatchToThread() in the comm-central thunderbird tree.

https://searchfox.org/comm-central/search?q=SyncRunnable%3A%3ADispatchToThread%28&path=&case=false&regexp=false

Within the code used only by Thunderbird:
Textual Occurrences (43 lines across 31 files)
	mailnews/base/src/nsMsgContentPolicy.cpp
212	mozilla::SyncRunnable::DispatchToThread(   <--- modified in 2025
	mailnews/base/src/nsNewMailnewsURI.cpp
61	mozilla::SyncRunnable::DispatchToThread(  <--- modified in 2023
73	mozilla::SyncRunnable::DispatchToThread(  <--- ditto
90	mozilla::SyncRunnable::DispatchToThread(  <--- ditto
160	mozilla::SyncRunnable::DispatchToThread(  <--- modified in 2019
	mailnews/imap/src/nsImapProtocol.cpp
1292	mozilla::SyncRunnable::DispatchToThread(  <--- modified in 2023 <--- THIS IS WHERE THE PROBLEM was reported.
1317	mozilla::SyncRunnable::DispatchToThread(  <-- modified in 2023
1345	mozilla::SyncRunnable::DispatchToThread(  <--- modified in 2023

The rest of the listing is in mozilla-central, which is used by Firefox and shared by thunderbird.
In order to learn the modification date of the code, we need to access the M-C tree:
https://searchfox.org/mozilla-central/search?q=SyncRunnable%3A%3ADispatchToThread%28&path=&case=false&regexp=false
Many were created in the 2010s, in 2020, etc., and thus if Firefox developers had run firefox under valgrind, they might have reported the issue already.

So maybe newer versions of valgrind have issue(s) [maybe it does not handle thread sync related primitives correctly? Maybe Helgrind does and memcheck forgets to handle some primitives?].
Or thunderbird does something wrong which firefox gets right.
Just a thought.
Comment 2 Paul Floyd 2025-03-19 06:48:04 UTC
(In reply to zephyrus00jp from comment #1)

> So maybe newer versions of valgrind have issue(s) [maybe it does not handle
> thread sync related primitives correctly? Maybe Helgrind does and memcheck
> forgets to handle some primitives?].
> Or thunderbird does something wrong which firefox gets right.

I don't see anything concrete here that indicates a bug in Valgrind. Memcheck has detected a conditional read error. I strongly suggest that you take memcheck's word that there is an error and don't start making random guesses about other causes.

It is possible that the change in scheduling that you get when running under Valgrind is revealing an underlying bug in the guest code. That's still a guest issue. The Valgrind core has to do some hacky things so that newly spawned threads also run under Valgrind. Memcheck doesn't do anything else with thread primitives. DRD and Helgrind intercept pthread functions so that they can validate and record the thread state. The intercepts still call the intercepted pthread functions.

In order to see where the error is try using vgdb. You will need 2 terminals, one with valgrind and the other with gdb. When you hit the error you can use the memcheck monitor commands to see which part of the 'if' expression is uninitialized.
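
As a complement to the monitor commands, memcheck's client request macro VALGRIND_CHECK_MEM_IS_DEFINED (from valgrind/memcheck.h) can be called from the guest code to ask whether a variable is defined at a specific point. The following is a minimal sketch, assuming the Valgrind headers may or may not be installed on the build machine; the wrapper name IsDefinedUnderMemcheck is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

#if __has_include(<valgrind/memcheck.h>)
#include <valgrind/memcheck.h>
#else
// Fallback when the Valgrind headers are absent: behave as "all defined".
#define VALGRIND_CHECK_MEM_IS_DEFINED(addr, len) ((void)(addr), (void)(len), 0)
#endif

// Hypothetical wrapper: returns true when memcheck considers every byte in
// [addr, addr + len) addressable and defined. When the program is not
// running under Valgrind the client request evaluates to 0, so this
// returns true.
bool IsDefinedUnderMemcheck(const void* addr, std::size_t len) {
  assert(addr != nullptr);
  return VALGRIND_CHECK_MEM_IS_DEFINED(addr, len) == 0;
}
```

Calling such a check on a suspect variable right before it is used makes memcheck print an error at that exact point if any byte is undefined, which can help narrow down which operand triggers the report.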
Comment 3 zephyrus00jp 2025-03-19 07:51:56 UTC
(In reply to Paul Floyd from comment #2)
> (In reply to zephyrus00jp from comment #1)
> 
> > So maybe newer versions of valgrind have issue(s) [maybe it does not handle
> > thread sync related primitives correctly? Maybe Helgrind does and memcheck
> > forgets to handle some primitives?].
> > Or thunderbird does something wrong which firefox gets right.
> 
> I don't see anything concrete here that indicates a bug in Valgrind.
> Memcheck has detected a conditional read error. I strongly suggest that you
> take memcheck's word that there is an error and don't start making random
> guesses about other causes.
> 
> It is possible that the change in scheduling that you get when running under
> Valgrind is revealing an underlying bug in the guest code. That's still a
> guest issue. The Valgrind core has to do some hacky things so that newly
> spawned threads also run under Valgrind. Memcheck doesn't do anything else
> with thread primitives. DRD and Helgrind intercept pthread functions so that
> they can validate and record the thread state. The intercepts still call the
> intercepted pthread functions.
> 
> In order to see where the error is try using vgdb. You will need 2
> terminals, one with valgrind and the other with gdb. When you hit the error
> you can use the memcheck monitor commands to see which part of the 'if'
> expression is uninitialized.

The uninitialized value is |rv|.
https://searchfox.org/comm-central/source/mailnews/imap/src/nsImapProtocol.cpp#1281

```
/**
 * Dispatch socket thread to determine if connection is alive.
 */
nsresult nsImapProtocol::IsTransportAlive(bool* alive) {
  nsresult rv;  // <--- THIS is declared without an initialization.
  auto GetIsAlive = [transport = nsCOMPtr{m_transport}, &rv, alive]() mutable {
    rv = transport->IsAlive(alive);  // <--- |rv| is supposed to get set in this lambda.
  };
  nsCOMPtr<nsIEventTarget> socketThread(
      do_GetService(NS_SOCKETTRANSPORTSERVICE_CONTRACTID));
  if (socketThread) {
    // <--- The calling thread does not stop here to wait for the
    //      lambda (GetIsAlive), and proceeds.
    mozilla::SyncRunnable::DispatchToThread(
        socketThread,
        NS_NewRunnableFunction("nsImapProtocol::IsTransportAlive", GetIsAlive));
  } else {
    rv = NS_ERROR_NOT_AVAILABLE;
  }
  return rv;  // <--- Thus, |rv| is returned uninitialized since GetIsAlive has not been executed.
}
```
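
A defensive variation of this pattern can be sketched in standalone C++. This is not a fix proposed in this report and it does not use the mozilla APIs: Status, DispatchSyncOrFail and IsAliveChecked are hypothetical names, and whether the real dispatch call can fail or report failure in this way is an assumption made only for the illustration. The idea is to give the result a defined value up front and to check whether the dispatch itself succeeded.

```cpp
#include <cassert>
#include <functional>
#include <thread>

enum class Status { Ok, NotAvailable, Failure };

// Hypothetical dispatcher: runs the task on another thread and waits for
// it, or refuses to run it at all when the target does not accept work.
Status DispatchSyncOrFail(bool targetAcceptsWork,
                          const std::function<void()>& task) {
  if (!targetAcceptsWork) {
    return Status::Failure;  // the task is never executed
  }
  std::thread worker(task);
  worker.join();  // block the caller until the task has completed
  return Status::Ok;
}

Status IsAliveChecked(bool targetAcceptsWork, bool* alive) {
  assert(alive != nullptr);
  Status rv = Status::NotAvailable;  // defined even if the lambda never runs
  Status dispatched = DispatchSyncOrFail(targetAcceptsWork, [&rv, alive] {
    *alive = true;  // stands in for the real liveness query
    rv = Status::Ok;
  });
  if (dispatched != Status::Ok) {
    return dispatched;  // report the dispatch failure instead of reading rv
  }
  return rv;
}
```

With this shape, a dispatch that never runs the lambda yields a deterministic error code rather than an indeterminate value, so the symptom would change from an uninitialised-value report to a visible failure.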

The above is what I found.
Comment 4 Paul Floyd 2025-03-19 12:48:27 UTC
So not an issue with memcheck.
Comment 5 zephyrus00jp 2025-03-24 04:32:08 UTC
(In reply to Paul Floyd from comment #4)
> So not an issue with memcheck.

I am afraid that I have to disagree here.
There is an issue with memcheck here, which I have not seen before.

memcheck changes the behavior of the program here.
In normal circumstances, |rv| is set properly in the lambda/function that is executed in another thread.
(The caller is supposed to wait for that lambda/function invocation.)
Here, I found that the control flow no longer does what it is supposed to do during the execution of thunderbird: the caller should wait until the lambda/function dispatched via DispatchToThread() has completed, but it does not.

I don't know the details, but this is the first case I have seen where memcheck changed the behavior of thunderbird in a way that resulted in a logic error, which memcheck then reported.

As I noted in comment 3 (which may not have been clear), under normal circumstances, |rv| is set correctly before being returned.

HOWEVER, UNDER VALGRIND/MEMCHECK

>        NS_NewRunnableFunction("nsImapProtocol::IsTransportAlive", GetIsAlive));  <--- calling thread does not stop here to wait for
                                                              lambda (GetIsAlive), and proceeds.

but IN A NORMAL RUN WITHOUT VALGRIND, the calling thread STOPS HERE TO WAIT FOR the lambda (GetIsAlive).
Sorry, it was not clear.

But since memcheck somehow changes the control flow despite the thread context switch and
the wait introduced by the monitor/lock, the value returned is uninitialized.
See the synchronization code at
https://searchfox.org/mozilla-central/source/xpcom/threads/SyncRunnable.h#71

This thread code was written in 2014 and I assume it has worked well for Firefox and Thunderbird for more than a decade.
However, the particular instance of the DispatchToThread() call was introduced in 2023. So this particular line of code may have uncovered an issue between memcheck and the primitives used by the thunderbird mail client. However, please note that there are tons of similar call patterns in Firefox, created a dozen or more years ago.
See the calls to DispatchToThread() in the mozilla code base. There are several dozen such places.
https://searchfox.org/mozilla-central/search?q=DispatchToThread&path=&case=false&regexp=false

I assume they have worked under valgrind because there were people who ran firefox under valgrind to
find errors.

So I wonder what has changed in the last couple of years, when I could not run thunderbird under valgrind due to the change of framework for its test suite (from mozmill to mochitest).
The compiler's generated code? I am using GCC-14 for now.
Binary utilities?
Both valgrind 3.24.0 and valgrind 3.25.0.GIT (which I compiled locally) showed the symptom.
valgrind 3.23 was too old to run today's thunderbird code since there are some syscalls which valgrind 3.23 did not handle.

I initially thought of inserting my own sync primitive, but then I realized that thunderbird/firefox has already implemented one.
https://searchfox.org/mozilla-central/source/xpcom/threads/SyncRunnable.h#71

So I am puzzled as to what I can do to make valgrind and thunderbird run together while keeping the thread synchronization behavior observed in a normal run.
Comment 6 zephyrus00jp 2025-03-24 04:48:27 UTC
BTW, please see the use of valgrind to check for bugs in Firefox.
https://blog.mozilla.org/jseward/2015/02/11/mochitests-are-now-valgrind-clean/
(Admittedly, it is old and was written in 2015. But it was my understanding that some people have run firefox under valgrind to check for memory errors since then.)
Also, I checked the Thunderbird code with an ASAN run.
But ASAN can't detect uses of uninitialized memory, which is why I ran thunderbird under valgrind.

The problem with thunderbird is that its developer community is much smaller than that of firefox, and that is why there was not much input from the thunderbird developer community on this particular issue.
I am following the documented usage of valgrind with mozilla code: https://firefox-source-docs.mozilla.org/contributing/debugging/debugging_firefox_with_valgrind.html
But there have been issues with running thunderbird under valgrind because not many people have run this combination.

I think I need to fix this issue one way or the other to make running the mochitest suite of thunderbird under valgrind meaningful. Otherwise, I cannot trust the
result of the test run under valgrind.

Obviously, the failure to initialize a return code, which may then have a random value, would invalidate the test run. Hmm...
As of now, the ASAN run does not show any glaring errors, but just recently the coverity static analyzer reported various uses of uninitialized fields, so I am very uncomfortable with the current comm-central thunderbird tree as far as uninitialized memory is concerned.

TIA
Comment 7 Paul Floyd 2025-03-24 06:58:25 UTC
Does the code run cleanly with Helgrind, DRD and TSAN?

This still looks like a thread issue to me and not a memcheck issue.
Comment 8 zephyrus00jp 2025-03-24 11:06:07 UTC
(In reply to Paul Floyd from comment #7)
> Does the code run cleanly with Helgrind, DRD and TSAN?
> 
> This still looks like a thread issue to me and not a memcheck issue.

Thank you for your comment.
This indeed looks like a thread issue, but the circumstantial evidence suggests it has worked well for firefox and thunderbird for quite a while.

Given the difficulty of running thunderbird under valgrind when we execute the test suites, I will opt for running a TSAN build of thunderbird.

Let me try TSAN first. Then Helgrind, or DRD.
I will report the result.

TIA
Comment 9 zephyrus00jp 2025-05-21 05:39:54 UTC
(In reply to zephyrus00jp from comment #8)
...
> 
> Let me try TSAN first. Then Helgrind, or DRD.
> I will report the result.
> 
> TIA

I am working on this issue.
I am checking thunderbird under TSAN and have fixed a few race issues.
However, I have now hit race issues in WebRenderer, specifically in the OpenGL library, and,
to my surprise, in the garbage collection subsystem. The latter should probably be whitelisted since it is not reported in
the bugzilla of mozilla firefox or thunderbird.
I am trying to whitelist the graphics system's race, but so far I have not been able to do that.

So the progress is very slow.
Comment 10 Mark Wielaard 2025-10-17 12:41:16 UTC
Any update on reproducing the issue?
I agree with Paul that this doesn't seem to be a memcheck issue but probably some threading issue.
Comment 11 Mark Wielaard 2025-10-24 14:54:00 UTC
(In reply to Mark Wielaard from comment #10)
> Any update on reproducing the issue?
> I agree with Paul that this doesn't seem to be a memcheck issue but probably some
> threading issue.

Let's close this for now. Feel free to reopen if you are able to reproduce the issue.