Bug 268138 - Crypto operations like verify and decrypt hang in the mail frontend after a few operations, cpu intensive (on N900 only)
Summary: Crypto operations like verify and decrypt hang in the mail frontend after a f...
Status: RESOLVED UPSTREAM
Alias: None
Product: KMail Mobile
Classification: Applications
Component: general (show other bugs)
Version: unspecified
Platform: Maemo 5 Linux
: NOR major
Target Milestone: ---
Assignee: Tobias Koenig
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-03-10 16:28 UTC by Bernhard E. Reiter
Modified: 2011-06-15 11:29 UTC (History)
3 users (show)

See Also:
Latest Commit:
Version Fixed In:


Attachments
gpgme 5 log of testrun which gets stuck (3.79 KB, text/plain)
2011-03-11 09:16 UTC, Bernhard E. Reiter
Details
a level 9 gpgsm log showing the hang on N900, search for "MARK" (86.15 KB, application/x-bzip2)
2011-03-11 09:51 UTC, Bernhard E. Reiter
Details
gpgme debug 9 logfile (199.98 KB, application/octet-stream)
2011-03-16 12:39 UTC, Tobias Koenig
Details
new gpgme logfile with MARK entries (830.55 KB, text/plain)
2011-03-17 10:24 UTC, Tobias Koenig
Details
output of lsof at the point of the hang (673.46 KB, text/plain)
2011-03-17 10:24 UTC, Tobias Koenig
Details
new gpgme logfile with MARK entries (408.47 KB, text/plain)
2011-03-18 09:42 UTC, Tobias Koenig
Details
output of lsof at the point of the hang (463.24 KB, text/x-log)
2011-03-18 09:42 UTC, Tobias Koenig
Details
strace -fp log of kmail-mobile during the hanging session (68.72 KB, text/plain)
2011-03-18 09:43 UTC, Tobias Koenig
Details
new gpgme logfile with MARK entries (665.76 KB, text/plain)
2011-03-18 12:19 UTC, Tobias Koenig
Details
output of lsof at the point of the hang (463.29 KB, text/plain)
2011-03-18 12:20 UTC, Tobias Koenig
Details
strace -f -F -p log of kmail-mobile during the hanging session (337.67 KB, text/plain)
2011-03-18 12:20 UTC, Tobias Koenig
Details
strace -f -e trace=process (version 4.6) log (32.00 KB, text/plain)
2011-03-21 13:02 UTC, Tobias Koenig
Details
strace -f -e trace=process (version 4.6) log from new libgpgme packages (13.92 KB, text/plain)
2011-03-30 09:07 UTC, Tobias Koenig
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Bernhard E. Reiter 2011-03-10 16:28:51 UTC
I can reproduce on the N900 and the HTC Touch Pro2,
that after a few crypto operations, the crypto thread of kmail-mobile
seems to hang. 

Symptoms:
No other crypto operations completes in Kontact Touch Mail.
Can still operate and switch to other email otherwise.
Can do crypto operations on the command line. 
No dirmngr and no gpgsm running.
gpg-connect-agent gets prompt from the gpg-agent.
kontact-mobile getting all available CPU seen on top. 

How to reproduce:
E.g send yourself three emails: Signed only, signed and encrypted, encrypted only.
Switch between all those emails back and forth.
After doing this up a few times (usually three times is enough),
the symptoms occur.

N900
Package: kdepim-mobile
Version: 4:4.6~.20110211.1251.git9d81e94-1maemo1.1219908
Package: libqt4-experimental-gui
Version: 4.7.0~git20101111-0maemo1
Config-Version: 4.7.0~git20101111-0maemo1

HTC Touch pro2:
Version build 20110217-1613 (I believe)
Comment 1 Bernhard E. Reiter 2011-03-10 16:30:03 UTC
Oh, on the N900 it does not use dirmngr which in this version has other defects:
cat .gnupg/gpgsm.conf
disable-crl-checks
prefer-system-dirmngr
Comment 2 Bernhard E. Reiter 2011-03-10 17:44:56 UTC
Werner wrote: It would be useful to see a gpgme log.

Attaching gdb and trying to break into the running and CPU hogging application
several times only resulted in:
^C
Program received signal SIGINT, Interrupt.
0x42a77798 in poll () from /lib/libc.so.6
0x42a77798 <poll+104>:  cmn     r0, #4096       ; 0x1000
(gdb) bt
#0  0x42a77798 in poll () from /lib/libc.so.6
#1  0x44f5a0a8 in g_poll () from /usr/lib/libglib-2.0.so.0
#2  0x44f4cd80 in ?? () from /usr/lib/libglib-2.0.so.0
#3  0x44f4cd80 in ?? () from /usr/lib/libglib-2.0.so.0
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) info threads
* 1 LWP 3384  0x42a77798 in poll () from /lib/libc.so.6
(gdb) c
Continuing.
Comment 3 Bernhard E. Reiter 2011-03-11 09:16:21 UTC
Created attachment 57855 [details]
gpgme 5 log of testrun which gets stuck

Got stuck on the first gpgme_op_decrypt_verify
Comment 4 Bernhard E. Reiter 2011-03-11 09:51:01 UTC
Created attachment 57857 [details]
a level 9 gpgsm log showing the hang on N900, search for "MARK"

For reference here is a level 9 gpgme debug log (as Werner asked for).
How to create it:
On the command line of the N900:

  export GPGME_DEBUG=9:/home/user/20110311-gpgme-4.txt
  kmail-mobile

on a second shell you can enter several markers into the log file 
so you can later add overall information about what you have done
  echo "MARK:"  >>20110311-gpgme-4.txt
Comment 5 Bernhard E. Reiter 2011-03-11 10:49:57 UTC
(10:46:15)  werner:  What I can see from the logs is that Kmail runs 9 VERIFY operations on the same data:
MARK: Now trying to decrypt and verify it:
GPGME 2011-03-11 09:42:15 <0x4a397480>  chan_30 -> DECRYPT
GPGME 2011-03-11 09:42:17 <0x4ab97480>  chan_31 -> VERIFY
GPGME 2011-03-11 09:42:18 <0x4ab97480>  chan_31 -> VERIFY
GPGME 2011-03-11 09:42:19 <0x4a397480>  chan_31 -> VERIFY
GPGME 2011-03-11 09:42:21 <0x4a397480>  chan_31 -> VERIFY
GPGME 2011-03-11 09:42:22 <0x4ab97480>  chan_31 -> VERIFY
GPGME 2011-03-11 09:42:22 <0x4a397480>  chan_31 -> VERIFY
GPGME 2011-03-11 09:42:23 <0x4ab97480>  chan_31 -> VERIFY
GPGME 2011-03-11 09:42:23 <0x4a397480>  chan_31 -> VERIFY
GPGME 2011-03-11 09:42:24 <0x4ab97480>  chan_31 -> VERIFY
MARK: decryption seems to have worked, but verifying is hung, 
  
(10:46:49)  werner:  This is a bit fishy.  I can't see any hangs in gpgme.
Comment 6 Bernhard E. Reiter 2011-03-11 11:15:34 UTC
Running it with --nofork in gdb seems to look like a thread is started
several times, though no changes in the interface, when trying to decrypt and 
verify:

[New LWP 4628]
[LWP 4628 exited]
[New LWP 4632]
[LWP 4632 exited]
QTextBrowser: No document for file:///
[New LWP 4636]
[LWP 4636 exited]
QTextBrowser: No document for file:///
[New LWP 4639]
[LWP 4639 exited]
[New LWP 4642]
[LWP 4642 exited]
QTextBrowser: No document for file:///
[New LWP 4645]
[LWP 4645 exited]
[New LWP 4648]
[LWP 4648 exited]
[New LWP 4651]
QTextBrowser: No document for file:///
[LWP 4651 exited]
[New LWP 4654]
[LWP 4654 exited]
QTextBrowser: No document for file:///
[New LWP 4657]
[LWP 4657 exited]
[New LWP 4660]
[LWP 4660 exited]
QTextBrowser: No document for file:///
[New LWP 4663]
[LWP 4663 exited]
[New LWP 4666]
[LWP 4666 exited]
QTextBrowser: No document for file:///
[New LWP 4669]
[LWP 4669 exited]
[New LWP 4672]
[LWP 4672 exited]
QTextBrowser: No document for file:///
[New LWP 4675]
[LWP 4675 exited]
[New LWP 4678]
[LWP 4678 exited]
QTextBrowser: No document for file:///
[New LWP 4681]
[LWP 4681 exited]
[New LWP 4684]
[LWP 4684 exited]
QTextBrowser: No document for file:///
Comment 7 Bernhard E. Reiter 2011-03-16 11:36:23 UTC
Meanwhile some defects in libkleo got fixed that should avoid
the several threads for one verification problem.
It still hangs, reports Tobias, so we need a GPGME_DEBUG 9 log
and lsof -p <processid> outputs.
Comment 8 Tobias Koenig 2011-03-16 12:39:37 UTC
Created attachment 58087 [details]
gpgme debug 9 logfile

Log file of a version that includes Marc's patch set and the connect-before-start fix
Comment 9 Bernhard E. Reiter 2011-03-17 09:41:00 UTC
Tobias:
Werner needs more information about #8. He needs to know what was done
when to analyse the situation. Maybe you can do the log again and use the "MARK" method I describe in #4 to enter text before an operation you do,
until you quit the application. Please also add lsof -p outputs to each of those (or at least some). 
See my level 9 debug file as an example.
Comment 10 Tobias Koenig 2011-03-17 10:24:11 UTC
Created attachment 58117 [details]
new gpgme logfile with MARK entries
Comment 11 Tobias Koenig 2011-03-17 10:24:57 UTC
Created attachment 58118 [details]
output of lsof at the point of the hang
Comment 12 Bernhard E. Reiter 2011-03-17 10:25:46 UTC
Some more infos from Werner:
(09:47:08)  tokoe:  werner: what exactly are you looking for in the lsof output? 
(09:48:21)  werner:  Processes used by kleo and all gpg and gpgsm processes.  I need to look at the established pipes. 
(09:48:57)  werner:  What is the limit on open file descriptors on the n900 (per process)?  How many are open for kleo? 
(09:50:08)  werner:  If the log is correct, gpgme stops somwhere between [ fork, fork, exec, waitpid [  
(09:50:17)  werner:  Not ethat we use a double fork approach.

(09:58:23)  werner:  if so an strace -fp <pid_of_kmail> might be useful too.

strace on the N900 works, see http://wiki.maemo.org/Documentation/devtools/maemo5/strace and go to installation instruction to add the right repository
Comment 13 Tobias Koenig 2011-03-18 09:42:07 UTC
Created attachment 58133 [details]
new gpgme logfile with MARK entries
Comment 14 Tobias Koenig 2011-03-18 09:42:50 UTC
Created attachment 58134 [details]
output of lsof at the point of the hang
Comment 15 Tobias Koenig 2011-03-18 09:43:41 UTC
Created attachment 58135 [details]
strace -fp log of kmail-mobile during the hanging session
Comment 16 Tobias Koenig 2011-03-18 12:19:19 UTC
Created attachment 58141 [details]
new gpgme logfile with MARK entries
Comment 17 Tobias Koenig 2011-03-18 12:20:02 UTC
Created attachment 58142 [details]
output of lsof at the point of the hang
Comment 18 Tobias Koenig 2011-03-18 12:20:46 UTC
Created attachment 58143 [details]
strace -f -F -p log of kmail-mobile during the hanging session
Comment 19 Tobias Koenig 2011-03-18 16:19:45 UTC
Please test latest WinCE package (17.03.), the issue seems to be fixed there.
Comment 20 Bernhard E. Reiter 2011-03-21 08:59:45 UTC
Note that we MUST fix it for the N900 as well.
Comment 21 Bernhard E. Reiter 2011-03-21 09:23:07 UTC
Tobias: You've reported that symptoms go away on the n900 if you add debugging
print statements. If this is consistently so, you could remove/add statements
until you find a small set, that when added will remove the symptoms. Could be a bad solution, but better then nothing.
Comment 22 Bernhard E. Reiter 2011-03-21 11:19:20 UTC
Testing kdepim-ce-package_2011-03-17-04-14 (keeping the existing configuration)
I had a bit of other difficulties until I could send a signed emails:
(In my first run, it claimed it could not find my certificates, though
they were there in the only identity, on the subsequent second sending
attempt it hung. After a reboot no s/mime certificates in my identiy. And after stopping the frontend and running kleo to check everything, it then worked.)

I my test, I switched back and forth between a few emails that are signed,
encrypted and encrypted and signed. I openened about emails out of this group for about 25 times. 
Conclusing: This version on CE behaves better.
Comment 23 Tobias Koenig 2011-03-21 13:02:03 UTC
Created attachment 58211 [details]
strace -f -e trace=process (version 4.6) log
Comment 24 Bernhard E. Reiter 2011-03-28 16:53:04 UTC
We need a newer libgpgme from git.
Comment 25 Bjoern Ricks 2011-03-29 11:50:02 UTC
still reproducible with libgpgme11 1.3.1+20110328.1638.gitb001a8d-0maemo1.e637d92 and kdepim 4:4.6~.20110316.0703.gitb6cb367-1maemo1.1224491
Comment 26 Tobias Koenig 2011-03-30 09:07:37 UTC
Created attachment 58453 [details]
strace -f -e trace=process (version 4.6) log from new libgpgme packages

I have no clue why, but the 'echo "MARK: foobar" >> strace.log' didn't work anymore, so the log only contains a marker for the point where the app hangs.
I hope it gives some hints nevertheless
Comment 27 Werner Koch 2011-04-11 15:38:55 UTC
Marcus flashed his N900 and installed the latest version on it.  He
was able to replicate the problem just once but not on the next day.
It is possible that the noticed problem was a different one.
Anyway, he can't replicate it.  OTOH the bug still is replicatable on
other N900s.  We might need to ship such an N900 to here; pure ssh
access without X makes debugging not very easy.
Comment 28 Werner Koch 2011-04-11 15:43:22 UTC
Looking at log files show that all gpgme crypto operations are
finished at the time of the hang.  The crypto thread then hangs in
poll with 12 fds and and an infinite timeout.  Confirmed via strace
and gdb.

We are now looking at the upper layers (gpgme++, libkleo) and the
event loop code (qt and glib).
Comment 29 Marcus Brinkmann 2011-04-14 20:27:31 UTC
We found out that when interrupting a GPGME operation (probably due to switching to a different mail), a gpgme_cancel_async is issued as expected.  However, unexpectedly, the thread targeted by the cancel is also presumably terminated (at the very least it does not continue running in the normal way).  Normally, it should see the cancel flag and that clean up and terminate eventually.  here it looks like it is forcefully terminated with a pthread_cancel.

However, we could not yet find out why this would be the case.  Marc suggested putting an explicit join in the destructor.  Maybe putting a breakpoint on the QThread destructor would point out who is responsible.

This problem should also potentially affect any other platform, but it is a race condition and everybody races a bit differently.
Comment 30 Werner Koch 2011-04-15 14:45:26 UTC
I did another test which confirms that the thread really terminates
while inside _gpgme_io_spawn.  The address of the TCB which is what
pthread_self () seems to return is immediately reused.  However using
SYS_gettid instead of pthread_self clearly shows that a a new thread
is used.
Comment 31 Tobias Koenig 2011-06-06 14:12:20 UTC
After adding some additional debug statements to libgpgme I saw that the first fork() call in _gpgme_io_spawn blocks and nothing else in this thread is executed.
Attaching gdb to kmail-mobile process afterwards shows that:

-------------------------------------------------------------
Thread 4 (Thread 0x48768480 (LWP 3561)):
#0  0x4340b928 in pthread_cond_wait@@GLIBC_2.4 () from /lib/libpthread.so.0
#1  0x42a93b78 in pthread_cond_wait@@GLIBC_2.4 () from /lib/libc.so.6
#2  0x42ca5c44 in ?? () from /usr/lib/libQtScript.so.4
#3  0x42ca5c44 in ?? () from /usr/lib/libQtScript.so.4
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 3 (Thread 0x4aafc480 (LWP 3582)):
#0  0x42a810a0 in select () from /lib/libc.so.6
#1  0x42a94080 in __libc_enable_asynccancel () from /lib/libc.so.6
#2  0x425d7ad0 in ?? () from /usr/lib/libQtCore.so.4
#3  0x425d7ad0 in ?? () from /usr/lib/libQtCore.so.4
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 2 (Thread 0x496ab480 (LWP 3586)):
#0  0x42a32fe0 in ptmalloc_lock_all () from /lib/libc.so.6
#1  0x42a56a50 in fork () from /lib/libc.so.6
#2  0x435dfdac in _gpgme_io_spawn (path=0xdce4e0 "/usr/bin/gpgsm", argv=0x496aaa68, flags=2, fd_list=0x49701758, atfork=0x435ff88c, atforkvalue=0x496aa974, r_pid=0x496aa994)
    at posix-io.c:392
#3  0x435d762c in my_spawn (ctx=0x497038e0, r_pid=0x496aa994, name=0xdce4e0 "/usr/bin/gpgsm", argv=0x496aaa68, fd_in=32, fd_out=31, fd_child_list=0x496aa9f4, 
    atfork=0x435ff88c, atforkvalue=0x496aa974, flags=0) at assuan-support.c:171
#4  0x435fa2a8 in ?? () from /usr/lib/libassuan.so.0
#5  0x435fa2a8 in ?? () from /usr/lib/libassuan.so.0
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 1 (Thread 0x44fe3400 (LWP 3560)):
#0  0x42a7e798 in poll () from /lib/libc.so.6
#1  0x4503f0a8 in g_poll () from /usr/lib/libglib-2.0.so.0
#2  0x45031d80 in ?? () from /usr/lib/libglib-2.0.so.0
#3  0x45031d80 in ?? () from /usr/lib/libglib-2.0.so.0
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
-------------------------------------------------------------

The backtrace from thread 2 looks identical to the one in https://bugs.maemo.org/show_bug.cgi?id=11533 and the described behavior is the same we are facing here. Therefor I'd say this is a bug in Maemo and not in Kontact/gpgme and I don't see a way for a 100% workaround here.
Comment 32 Tobias Koenig 2011-06-06 14:16:13 UTC
Additional note: Disabling all debug output (by clicking 'disable all debug output' in kdebugdialog) seems to prevent the bug to occur. This workaround needs some more testing though.
Comment 33 Marcus Brinkmann 2011-06-06 21:40:59 UTC
Good thinking.  The reaction by Nokia in the bug report is aweful, though.  Are there any applications using malloc and fork in concurrent threads?  W-T-F
Comment 34 Bernhard E. Reiter 2011-06-15 11:27:55 UTC
To conclude the issue, we believe this is an N900 platform issue with the current PR1.3 release. There is no easy workaround
and more involved workarounds would endanger the functions for other platforms.
 
https://bugs.maemo.org/show_bug.cgi?id=11533