Bug 380269 - [PATCH] No multithreading in macOS Sierra (10.12)
Summary: [PATCH] No multithreading in macOS Sierra (10.12)
Status: CONFIRMED
Alias: None
Product: valgrind
Classification: Developer tools
Component: general (show other bugs)
Version: 3.13 SVN
Platform: macOS (DMG) macOS
: NOR normal
Target Milestone: ---
Assignee: Rhys Kidd
URL:
Keywords:
Depends on:
Blocks: 365327
  Show dependency treegraph
 
Reported: 2017-05-28 08:50 UTC by Louis Brunner
Modified: 2023-09-21 09:02 UTC (History)
7 users (show)

See Also:
Latest Commit:
Version Fixed In:


Attachments
Minimal test case (303 bytes, text/plain)
2017-11-27 15:49 UTC, Alexandru Croitor
Details
Fix crash when using multithreading on Mac OS X >= 10.12 (2.60 KB, patch)
2017-12-10 22:30 UTC, Louis Brunner
Details
Partial wqthread fix (6.61 KB, patch)
2017-12-27 19:35 UTC, Louis Brunner
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Louis Brunner 2017-05-28 08:50:43 UTC
On macOS Sierra (10.12), if pthread_create is called while using valgrind the program stops because of a SIGSEGV.

Snippet of the error:
==22190== Thread 2:
==22190== Invalid read of size 4
==22190==    at 0x10050790D: _pthread_body (in /usr/lib/system/libsystem_pthread.dylib)
==22190==    by 0x1005078FA: _pthread_start (in /usr/lib/system/libsystem_pthread.dylib)
==22190==    by 0x100507100: thread_start (in /usr/lib/system/libsystem_pthread.dylib)
==22190==  Address 0x18 is not stack'd, malloc'd or (recently) free'd

After some investigation, this bug comes when starting the new thread just before the routine supplied to pthread_create is called. From my limited understanding of assembly/macOS internals/valgrind, the thread tries to access its thread-local storage (TLS) and fails because gs (the register that seems to be allowed to it on macOS) is 0.

This means that valgrind interpret the problematic instruction
mov %gs:0x18,%ecx
as
lea [0+0x18],%ecx

While debbuging with gdb and lldb, with and without valgrind, I noted that gs is always 0. So I don't know if it's technically possible but macOS Sierra seems to hide the value of gs from the user, so when the binary is run without valgrind the instruction loads the correct address even though gs is 0. When running with valgrind, gs is taken directly from the register and is always 0 which triggers the bug.

The only solution I can see is guessing the address of the TLS and set it as gs' value before going into pthread (in pthread_hijack), e.g.
diff --git a/coregrind/m_syswrap/syswrap-amd64-darwin.c b/coregrind/m_syswrap/syswrap-amd64-darwin.c
index c827bab..3c926cf 100644
--- a/coregrind/m_syswrap/syswrap-amd64-darwin.c
+++ b/coregrind/m_syswrap/syswrap-amd64-darwin.c
@@ -364,6 +364,10 @@ void pthread_hijack(Addr self, Addr kport, Addr func, Addr func_arg,
    vex->guest_R9  = flags;
    vex->guest_RSP = sp;
 
+#if DARWIN_VERS >= DARWIN_10_12
+   vex->guest_GS_CONST = self;
+#endif
+
    // Record thread's stack and Mach port and pthread struct
    tst->os_state.pthread = self;
    tst->os_state.lwpid = kport;

However, I wasn't able to guess a correct value for gs, and with the above fix the program goes on but fails at the next pthread call, example with the above fix applied:
==23490== Thread 2:
==23490== Invalid read of size 4
==23490==    at 0x100508538: _pthread_testcancel (in /usr/lib/system/libsystem_pthread.dylib)
==23490==    by 0x1002E0AD4: nanosleep (in /usr/lib/system/libsystem_c.dylib)
==23490==    by 0x1002E09D2: sleep (in /usr/lib/system/libsystem_c.dylib)
==23490==    by 0x100000E02: slavethread (pth_term_signal.c:27)
==23490==    by 0x1005079AE: _pthread_body (in /usr/lib/system/libsystem_pthread.dylib)
==23490==    by 0x1005078FA: _pthread_start (in /usr/lib/system/libsystem_pthread.dylib)
==23490==    by 0x100507100: thread_start (in /usr/lib/system/libsystem_pthread.dylib)
==23490==  Address 0x54485258 is not stack'd, malloc'd or (recently) free'd

Note that at this point, gdb shows gs value as 0x4ee4000 instead of 0x700004ee4000, so maybe the value is getting truncated at some point.

In previous versions of macOS, valgrind would set guest_GS_CONST in the thread_fast_set_cthread_self (syswrap-darwin.c:9133) trap. However this function is not called before going into the new thread, leaving an invalid value for gs.
Comment 1 Alexandru Croitor 2017-11-27 15:49:00 UTC
I encountered, and reproduced the same issue I believe, with a slightly different stack trace, after applying the provided patch in https://bugs.kde.org/show_bug.cgi?id=383723 . Attaching the minimal example to this issue as well.
Comment 2 Alexandru Croitor 2017-11-27 15:49:14 UTC
Created attachment 109082 [details]
Minimal test case
Comment 3 Alexandru Croitor 2017-11-27 15:49:52 UTC
Stack trace is

==75877== Invalid read of size 4
==75877==    at 0x1014B62B1: _pthread_wqthread (in /usr/lib/system/libsystem_pthread.dylib)
==75877==    by 0x1014B607C: start_wqthread (in /usr/lib/system/libsystem_pthread.dylib)
==75877==  Address 0x18 is not stack'd, malloc'd or (recently) free'd
==75877==
==75877==
==75877== Process terminating with default action of signal 11 (SIGSEGV)
==75877==  Access not within mapped region at address 0x18
==75877==    at 0x1014B62B1: _pthread_wqthread (in /usr/lib/system/libsystem_pthread.dylib)
==75877==    by 0x1014B607C: start_wqthread (in /usr/lib/system/libsystem_pthread.dylib)
Comment 4 Louis Brunner 2017-12-10 22:30:13 UTC
Created attachment 109303 [details]
Fix crash when using multithreading on Mac OS X >= 10.12

I have finally managed to fix the issue that made pthread SIGSEGV when starting a new thread.

The issue was related to the TSD. In _pthread_body, an inline function called __pthread_add_thread is called, which locks a mutex using a os_unfair_lock_lock function from libplatform. These functions use the TSD while calling _os_lock_owner_get_self to retrieve the current the current MAC Thread using _os_tsd_get_direct(__TSD_MACH_THREAD_SELF). Looking at this last function will reveal a `__asm__("mov %%gs:%1, %0" : "=r" (ret) : "m" (*(void **)(slot * sizeof(void *))));`, knowing that __TSD_MACH_THREAD_SELF is 3, that outputs the `mov %gs:0x18,%ecx` that was causing problems. 

Now, I don't know why GS_CONST is suddenly 0 instead of the TSD address, but I found that pthread keeps the latter in its pthread_t structure and that it conveniently gives us the offset in an internal structure passed to bsdthread_register by _pthread_bsdthread_init (4th argument). In syswrap-darwin.c, we already store a few arguments passed to bsdthread_register, so I added a new one, which is the offset of the TSD member in the pthread_t structure. While starting the new thread in pthread_hijack, we can then set the value of GS_CONST to self (current pthread_t address) + pthread_tsd_offset.

The program then executes normally when using pthread and multithreading. Note that a lot of programs (including yours Alexandru) still crash or malfunction (mismanagement of signals, ...) once they start running. However these issues seem unrelated to this particular pthread problem.
Comment 5 Rhys Kidd 2017-12-11 05:01:03 UTC
Louis,
Thanks for your research on this and the provided patch.

Given this is a change that impacts fairly low-level system and Valgrind data structures, can you please run the full Valgrind test suite with and without this proposed patch to ensure there are no regressions?

http://valgrind.org/docs/manual/dist.readme-developers.html has full instructions, with the simple command to run this 'make regtest'.
Comment 6 Louis Brunner 2017-12-11 23:09:45 UTC
Rhys,

No problem, here are the results before the patch (but with https://bugs.kde.org/show_bug.cgi?id=385279 applied):
== 655 tests, 335 stderr failures, 85 stdout failures, 8 stderrB failures, 8 stdoutB failures, 31 post failures ==

And after:
== 655 tests, 322 stderr failures, 74 stdout failures, 8 stderrB failures, 8 stdoutB failures, 31 post failures ==

Note that some tests block the test runner in both case (mcwatchpoints) and some tests block after the patch (pselect_alarm and pth_term_signal). This is due to the tests previously crashing because of pthread, and now some issues with signals is preventing them to complete.

Here is the difference between the failures:
$ diff tests_pre.txt tests_post.txt
49,50d48
< memcheck/tests/err_disable3              (stderr)
< memcheck/tests/err_disable4              (stderr)
99d96
< memcheck/tests/threadname                (stderr)
150,151d146
< callgrind/tests/threads-use              (stderr)
< callgrind/tests/threads                  (stderr)
210,216d204
< none/tests/pselect_alarm                 (stderr)
< none/tests/pth_2sig                      (stderr)
< none/tests/pth_atfork1                   (stdout)
< none/tests/pth_atfork1                   (stderr)
< none/tests/pth_blockedsig                (stdout)
< none/tests/pth_blockedsig                (stderr)
< none/tests/pth_cancel1                   (stdout)
219,221d206
< none/tests/pth_cvsimple                  (stdout)
< none/tests/pth_cvsimple                  (stderr)
< none/tests/pth_exit                      (stderr)
224,225d208
< none/tests/pth_stackalign                (stdout)
< none/tests/pth_stackalign                (stderr)
229d211
< none/tests/sigsusp                       (stderr)
234d215
< none/tests/threaded-fork                 (stdout)
330d310
< helgrind/tests/pth_destroy_cond          (stdout)
342d321
< helgrind/tests/tc07_hbl1                 (stdout)
344d322
< helgrind/tests/tc08_hbl2                 (stdout)
348d325
< helgrind/tests/tc11_XCHG                 (stdout)
358d334
< helgrind/tests/tc21_pthonce              (stdout)
Comment 7 Louis Brunner 2017-12-27 19:35:55 UTC
Created attachment 109548 [details]
Partial wqthread fix

As reported by Alexandru in https://bugs.kde.org/show_bug.cgi?id=383723 and this thread, as well as by FX in https://bugs.kde.org/show_bug.cgi?id=385279, my previous patch doesn't include any fix for wqthread.

Here is a new patch that can be applied on top of my previous and Alexandru's kevent_qos patches (which are needed for wqthread programs). It includes the following elements (which I didn't know how to breakdown between the different bug reports):
 - GS_CONST fix (as for regular threads)
 - Zero offset for tst->os_state.pthread (as it was the case for 10.6)
 - workq_ops(THREAD_KEVENT_RETURN) now behaves like workq_ops(THREAD_RETURN)
 - kevent_id (syscall:375) and thread_get_special_reply_port (mach:50) implementation (the kevent_id one is copied from Alexandru's kevent_qos)

Note that after all these changes, programs using wqthreads still don't work. Unfortunately, I am running into race conditions with the BigLock: either one thread waiting for the lock to be released or another locking it twice. All these changes are wrapped in #if DARWIN_VERS >= DARWIN_10_13, apart from the THREAD_KEVENT_RETURN change. Last but not least, the patch doesn't worsen the regtest results:

== 655 tests, 321 stderr failures, 74 stdout failures, 8 stderrB failures, 8 stdoutB failures, 31 post failures ==
Comment 8 Rhys Kidd 2018-08-20 03:32:20 UTC
A short update on this bug report.

I've been spending time on Valgrind's macOS threading support, as it's causing a number of bug reports and needs some substantial work.

Additionally, I've started running macOS-based CI runs (see my github branches). Whilst it isn't ready to ship, it's making testing across numerous macOS versions much quicker.

Louis: I am seeing test suite hangs on memcheck/tests/threadname and memcheck/tests/thread_alloca with variants of your patches. Did you ever experience these?
Comment 9 Louis Brunner 2018-08-20 21:02:39 UTC
Rhys: Doesn't ring a bell. I mainly had issues with mcwatchpoints, pselect_alarm and pth_term_signal. Is there anything I can do to help out?