I can reproduce on the N900 and the HTC Touch Pro2, that after a few crypto operations, the crypto thread of kmail-mobile seems to hang. Symptoms: No other crypto operations completes in Kontact Touch Mail. Can still operate and switch to other email otherwise. Can do crypto operations on the command line. No dirmngr and no gpgsm running. gpg-connect-agent gets prompt from the gpg-agent. kontact-mobile getting all available CPU seen on top. How to reproduce: E.g send yourself three emails: Signed only, signed and encrypted, encrypted only. Switch between all those emails back and forth. After doing this up a few times (usually three times is enough), the symptoms occur. N900 Package: kdepim-mobile Version: 4:4.6~.20110211.1251.git9d81e94-1maemo1.1219908 Package: libqt4-experimental-gui Version: 4.7.0~git20101111-0maemo1 Config-Version: 4.7.0~git20101111-0maemo1 HTC Touch pro2: Version build 20110217-1613 (I believe)
Oh, on the N900 it does not use dirmngr which in this version has other defects: cat .gnupg/gpgsm.conf disable-crl-checks prefer-system-dirmngr
Werner wrote: It would be useful to see a gpgme log. Attaching gdb and trying to break into the running and CPU hogging application several times only resulted in: ^C Program received signal SIGINT, Interrupt. 0x42a77798 in poll () from /lib/libc.so.6 0x42a77798 <poll+104>: cmn r0, #4096 ; 0x1000 (gdb) bt #0 0x42a77798 in poll () from /lib/libc.so.6 #1 0x44f5a0a8 in g_poll () from /usr/lib/libglib-2.0.so.0 #2 0x44f4cd80 in ?? () from /usr/lib/libglib-2.0.so.0 #3 0x44f4cd80 in ?? () from /usr/lib/libglib-2.0.so.0 Backtrace stopped: previous frame identical to this frame (corrupt stack?) (gdb) info threads * 1 LWP 3384 0x42a77798 in poll () from /lib/libc.so.6 (gdb) c Continuing.
Created attachment 57855 [details] gpgme 5 log of testrun which gets stuck Got stuck on the first gpgme_op_decrypt_verify
Created attachment 57857 [details] a level 9 gpgsm log showing the hang on N900, search for "MARK" For reference here is a level 9 gpgme debug log (as Werner asked for). How to create it: On the command line of the N900: export GPGME_DEBUG=9:/home/user/20110311-gpgme-4.txt kmail-mobile on a second shell you can enter several markers into the log file so you can later add overall information about what you have done echo "MARK:" >>20110311-gpgme-4.txt
(10:46:15) werner: What I can see from the logs is that Kmail runs 9 VERIFY operations on the same data: MARK: Now trying to decrypt and verify it: GPGME 2011-03-11 09:42:15 <0x4a397480> chan_30 -> DECRYPT GPGME 2011-03-11 09:42:17 <0x4ab97480> chan_31 -> VERIFY GPGME 2011-03-11 09:42:18 <0x4ab97480> chan_31 -> VERIFY GPGME 2011-03-11 09:42:19 <0x4a397480> chan_31 -> VERIFY GPGME 2011-03-11 09:42:21 <0x4a397480> chan_31 -> VERIFY GPGME 2011-03-11 09:42:22 <0x4ab97480> chan_31 -> VERIFY GPGME 2011-03-11 09:42:22 <0x4a397480> chan_31 -> VERIFY GPGME 2011-03-11 09:42:23 <0x4ab97480> chan_31 -> VERIFY GPGME 2011-03-11 09:42:23 <0x4a397480> chan_31 -> VERIFY GPGME 2011-03-11 09:42:24 <0x4ab97480> chan_31 -> VERIFY MARK: decryption seems to have worked, but verifying is hung, (10:46:49) werner: This is a bit fishy. I can't see any hangs in gpgme.
Running it with --nofork in gdb seems to look like a thread is started several times, though no changes in the interface, when trying to decrypt and verify: [New LWP 4628] [LWP 4628 exited] [New LWP 4632] [LWP 4632 exited] QTextBrowser: No document for file:/// [New LWP 4636] [LWP 4636 exited] QTextBrowser: No document for file:/// [New LWP 4639] [LWP 4639 exited] [New LWP 4642] [LWP 4642 exited] QTextBrowser: No document for file:/// [New LWP 4645] [LWP 4645 exited] [New LWP 4648] [LWP 4648 exited] [New LWP 4651] QTextBrowser: No document for file:/// [LWP 4651 exited] [New LWP 4654] [LWP 4654 exited] QTextBrowser: No document for file:/// [New LWP 4657] [LWP 4657 exited] [New LWP 4660] [LWP 4660 exited] QTextBrowser: No document for file:/// [New LWP 4663] [LWP 4663 exited] [New LWP 4666] [LWP 4666 exited] QTextBrowser: No document for file:/// [New LWP 4669] [LWP 4669 exited] [New LWP 4672] [LWP 4672 exited] QTextBrowser: No document for file:/// [New LWP 4675] [LWP 4675 exited] [New LWP 4678] [LWP 4678 exited] QTextBrowser: No document for file:/// [New LWP 4681] [LWP 4681 exited] [New LWP 4684] [LWP 4684 exited] QTextBrowser: No document for file:///
Meanwhile some defects in libkleo got fixed that should avoid the several threads for one verification problem. It still hangs, reports Tobias, so we need a GPGME_DEBUG 9 log and lsof -p <processid> outputs.
Created attachment 58087 [details] gpgme debug 9 logfile Log file of a version that includes Marc's patch set and the connect-before-start fix
Tobias: Werner needs more information about #8. He needs to know what was done when to analyse the situation. Maybe you can do the log again and use the "MARK" method I describe in #4 to enter text before an operation you do, until you quit the application. Please also add lsof -p outputs to each of those (or at least some). See my level 9 debug file as an example.
Created attachment 58117 [details] new gpgme logfile with MARK entries
Created attachment 58118 [details] output of lsof at the point of the hang
Some more infos from Werner: (09:47:08) tokoe: werner: what exactly are you looking for in the lsof output? (09:48:21) werner: Processes used by kleo and all gpg and gpgsm processes. I need to look at the established pipes. (09:48:57) werner: What is the limit on open file descriptors on the n900 (per process)? How many are open for kleo? (09:50:08) werner: If the log is correct, gpgme stops somwhere between [ fork, fork, exec, waitpid [ (09:50:17) werner: Not ethat we use a double fork approach. (09:58:23) werner: if so an strace -fp <pid_of_kmail> might be useful too. strace on the N900 works, see http://wiki.maemo.org/Documentation/devtools/maemo5/strace and go to installation instruction to add the right repository
Created attachment 58133 [details] new gpgme logfile with MARK entries
Created attachment 58134 [details] output of lsof at the point of the hang
Created attachment 58135 [details] strace -fp log of kmail-mobile during the hanging session
Created attachment 58141 [details] new gpgme logfile with MARK entries
Created attachment 58142 [details] output of lsof at the point of the hang
Created attachment 58143 [details] strace -f -F -p log of kmail-mobile during the hanging session
Please test latest WinCE package (17.03.), the issue seems to be fixed there.
Note that we MUST fix it for the N900 as well.
Tobias: You've reported that symptoms go away on the n900 if you add debugging print statements. If this is consistently so, you could remove/add statements until you find a small set, that when added will remove the symptoms. Could be a bad solution, but better then nothing.
Testing kdepim-ce-package_2011-03-17-04-14 (keeping the existing configuration) I had a bit of other difficulties until I could send a signed emails: (In my first run, it claimed it could not find my certificates, though they were there in the only identity, on the subsequent second sending attempt it hung. After a reboot no s/mime certificates in my identiy. And after stopping the frontend and running kleo to check everything, it then worked.) I my test, I switched back and forth between a few emails that are signed, encrypted and encrypted and signed. I openened about emails out of this group for about 25 times. Conclusing: This version on CE behaves better.
Created attachment 58211 [details] strace -f -e trace=process (version 4.6) log
We need a newer libgpgme from git.
still reproducible with libgpgme11 1.3.1+20110328.1638.gitb001a8d-0maemo1.e637d92 and kdepim 4:4.6~.20110316.0703.gitb6cb367-1maemo1.1224491
Created attachment 58453 [details] strace -f -e trace=process (version 4.6) log from new libgpgme packages I have no clue why, but the 'echo "MARK: foobar" >> strace.log' didn't work anymore, so the log only contains a marker for the point where the app hangs. I hope it gives some hints nevertheless
Marcus flashed his N900 and installed the latest version on it. He was able to replicate the problem just once but not on the next day. It is possible that the noticed problem was a different one. Anyway, he can't replicate it. OTOH the bug still is replicatable on other N900s. We might need to ship such an N900 to here; pure ssh access without X makes debugging not very easy.
Looking at log files show that all gpgme crypto operations are finished at the time of the hang. The crypto thread then hangs in poll with 12 fds and and an infinite timeout. Confirmed via strace and gdb. We are now looking at the upper layers (gpgme++, libkleo) and the event loop code (qt and glib).
We found out that when interrupting a GPGME operation (probably due to switching to a different mail), a gpgme_cancel_async is issued as expected. However, unexpectedly, the thread targeted by the cancel is also presumably terminated (at the very least it does not continue running in the normal way). Normally, it should see the cancel flag and that clean up and terminate eventually. here it looks like it is forcefully terminated with a pthread_cancel. However, we could not yet find out why this would be the case. Marc suggested putting an explicit join in the destructor. Maybe putting a breakpoint on the QThread destructor would point out who is responsible. This problem should also potentially affect any other platform, but it is a race condition and everybody races a bit differently.
I did another test which confirms that the thread really terminates while inside _gpgme_io_spawn. The address of the TCB which is what pthread_self () seems to return is immediately reused. However using SYS_gettid instead of pthread_self clearly shows that a a new thread is used.
After adding some additional debug statements to libgpgme I saw that the first fork() call in _gpgme_io_spawn blocks and nothing else in this thread is executed. Attaching gdb to kmail-mobile process afterwards shows that: ------------------------------------------------------------- Thread 4 (Thread 0x48768480 (LWP 3561)): #0 0x4340b928 in pthread_cond_wait@@GLIBC_2.4 () from /lib/libpthread.so.0 #1 0x42a93b78 in pthread_cond_wait@@GLIBC_2.4 () from /lib/libc.so.6 #2 0x42ca5c44 in ?? () from /usr/lib/libQtScript.so.4 #3 0x42ca5c44 in ?? () from /usr/lib/libQtScript.so.4 Backtrace stopped: previous frame identical to this frame (corrupt stack?) Thread 3 (Thread 0x4aafc480 (LWP 3582)): #0 0x42a810a0 in select () from /lib/libc.so.6 #1 0x42a94080 in __libc_enable_asynccancel () from /lib/libc.so.6 #2 0x425d7ad0 in ?? () from /usr/lib/libQtCore.so.4 #3 0x425d7ad0 in ?? () from /usr/lib/libQtCore.so.4 Backtrace stopped: previous frame identical to this frame (corrupt stack?) Thread 2 (Thread 0x496ab480 (LWP 3586)): #0 0x42a32fe0 in ptmalloc_lock_all () from /lib/libc.so.6 #1 0x42a56a50 in fork () from /lib/libc.so.6 #2 0x435dfdac in _gpgme_io_spawn (path=0xdce4e0 "/usr/bin/gpgsm", argv=0x496aaa68, flags=2, fd_list=0x49701758, atfork=0x435ff88c, atforkvalue=0x496aa974, r_pid=0x496aa994) at posix-io.c:392 #3 0x435d762c in my_spawn (ctx=0x497038e0, r_pid=0x496aa994, name=0xdce4e0 "/usr/bin/gpgsm", argv=0x496aaa68, fd_in=32, fd_out=31, fd_child_list=0x496aa9f4, atfork=0x435ff88c, atforkvalue=0x496aa974, flags=0) at assuan-support.c:171 #4 0x435fa2a8 in ?? () from /usr/lib/libassuan.so.0 #5 0x435fa2a8 in ?? () from /usr/lib/libassuan.so.0 Backtrace stopped: previous frame identical to this frame (corrupt stack?) Thread 1 (Thread 0x44fe3400 (LWP 3560)): #0 0x42a7e798 in poll () from /lib/libc.so.6 #1 0x4503f0a8 in g_poll () from /usr/lib/libglib-2.0.so.0 #2 0x45031d80 in ?? () from /usr/lib/libglib-2.0.so.0 #3 0x45031d80 in ?? () from /usr/lib/libglib-2.0.so.0 Backtrace stopped: previous frame identical to this frame (corrupt stack?) ------------------------------------------------------------- The backtrace from thread 2 looks identical to the one in https://bugs.maemo.org/show_bug.cgi?id=11533 and the described behavior is the same we are facing here. Therefor I'd say this is a bug in Maemo and not in Kontact/gpgme and I don't see a way for a 100% workaround here.
Additional note: Disabling all debug output (by clicking 'disable all debug output' in kdebugdialog) seems to prevent the bug to occur. This workaround needs some more testing though.
Good thinking. The reaction by Nokia in the bug report is aweful, though. Are there any applications using malloc and fork in concurrent threads? W-T-F
To conclude the issue, we believe this is an N900 platform issue with the current PR1.3 release. There is no easy workaround and more involved workarounds would endanger the functions for other platforms. https://bugs.maemo.org/show_bug.cgi?id=11533