Bug 248058

Summary: Message preview pane character encoding issue (utf-8, unicode)
Product: [Applications] kmail2 Reporter: Wouter Van Hemel <wouter-kde>
Component: crypto Assignee: Sandro Knauß <sknauss>
Status: RESOLVED FIXED    
Severity: normal CC: aheinecke, sknauss, t.glaser
Priority: NOR    
Version: 4.14.2   
Target Milestone: ---   
Platform: Debian unstable   
OS: Linux   
Latest Commit: Version Fixed In: 5.4.0
Sentry Crash Report:
Attachments: Testcase
An encrypted ISO-8859-15 text

Description Wouter Van Hemel 2010-08-16 14:20:10 UTC
Version:           1.13.5 (using KDE 4.4.5) 
OS:                Linux

Hello,

The message preview pane sometimes shows utf-8 rendered as latin1 (iso-8859-1 or iso-8859-15). One message I just read was multipart/mime with text/plain utf-8 encoding in the main message part, but the preview pane renders it as latin1, so it shows two funny characters per unicode character.
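
The "two funny characters" symptom can be reproduced outside KMail. The following is a minimal Python sketch, with a made-up sample string, of what happens when UTF-8 bytes are decoded as Latin-1; it does not touch KMail or IMAP at all:

```python
# Hypothetical sample text; any string with non-ASCII characters works.
text = "äöü"

# Encode as UTF-8: each of these characters becomes two bytes.
raw = text.encode("utf-8")

# Decoding the same bytes as Latin-1 maps every byte to one character,
# so each original character shows up as a pair of unrelated characters.
garbled = raw.decode("latin-1")

print(len(text), len(raw), len(garbled))  # 3 6 6
print(garbled == text)                    # False
```

Re-encoding the garbled string as Latin-1 and decoding it as UTF-8 recovers the original, which fits the observation that the data itself is intact and only the chosen charset is wrong.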

When I press 'v' to view the source and headers, that window shows the message correctly, and at the same time the preview pane in the background re-renders as utf-8, showing the message correctly too.

My mailbox and the relevant message are on an IMAP server. The problem, and hence the solution, might be related to partial fetching/caching or other IMAP specifics.

Reproducible: Always

Steps to Reproduce:
1. View message (on IMAP server) in message preview pane.
2. Notice that the message has pairs of funny characters because it renders unicode characters as latin1.
3. Press 'v' for viewing the headers and original source.
4. When the source view window opens, the preview pane also switches to unicode and both windows show the message correctly.

Actual Results:  
Message preview pane renders utf-8 as latin1, producing garbled output for unicode characters.

Opening a message source window also corrects the message preview window in the background.

Expected Results:  
Since the message is clearly marked multipart/mime and the main message part has text/plain and utf-8 charset, the preview window should show the message correctly as the source window and message window do after opening the message.

It's quite strange the message preview pane renders unicode as latin1, but on opening a message source view window, it suddenly realises that and re-renders it to utf-8.

OS: Linux (i686) release 2.6.32-5-686
Compiler: cc
Comment 1 Thorsten Glaser 2014-08-15 08:11:34 UTC
Confirmed, although I primarily notice the issue with encrypted eMails.

「V̲iew → S̲et Encoding → A̲uto」 in the menu shows the broken behaviour.

「V̲iew → S̲et Encoding → Unicode ( UTF̲-8 )」 manually forces the correct one.

See also: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=754265
which also lists the three versions in which I already noticed this
bug (4.11 4.12 and 4.13 are affected, at least).
Comment 2 Thorsten Glaser 2014-08-18 09:08:04 UTC
Created attachment 88293
Testcase

By request from one of the Mozilla developers, I just made a testcase for this bug. I attached it here, as I can confirm it also triggers the bug in Kontact.

The passphrase for the PGP key I attached is 123123.
Comment 3 Laurent Montel 2015-04-12 10:25:33 UTC
Thank you for taking the time to file a bug report.

KMail2 was released in 2011, and the entire code base went through significant changes. We are currently in the process of porting to Qt5 and KF5. It is unlikely that these bugs are still valid in KMail2.

We welcome you to try out KMail 2 with the KDE 4.14 release and give your feedback.
Comment 4 Thorsten Glaser 2015-04-12 13:53:52 UTC
(In reply to Laurent Montel from comment #3)

> We welcome you to try out KMail 2 with the KDE 4.14 release and give your
> feedback.

RECONFIRMED: The bug still happens with kmail 4:4.14.2-2 (Debian unstable).

Dude, what gives? It took me a minute to re-check that. Please reopen.
Comment 5 Sandro Knauß 2015-04-18 15:46:09 UTC
The problem is that inline PGP is not a standard at all, and it is not defined which charset to use when displaying the mail. For KMail we decided that the charset field in the mail indicates which charset the decrypted message has, because the encrypted part is always ASCII. This is what most email clients do.

RFC 2440 only says:
"Charset", a description of the character set that the plaintext
       is in. Please note that OpenPGP defines text to be in UTF-8 by
       default. An implementation will get best results by translating
       into and out of UTF-8. However, there are many instances where
       this is easier said than done. Also, there are communities of
       users who have no need for UTF-8 because they are all happy with
       a character set like ISO Latin-5 or a Japanese character set. In
       such instances, an implementation MAY override the UTF-8 default
       by using this header key. An implementation MAY implement this
       key and any translations it cares to; an implementation MAY
       ignore it and assume all text is UTF-8.

-> The best would be if alpine had used the Armor Header Key "Charset".

To not break the existing behaviour, I would only switch to UTF-8 decoding by default if the surrounding charset is ASCII :) because every ASCII text is identical in UTF-8...
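
The claim that ASCII text is unchanged under UTF-8 can be checked directly. This is only a small Python illustration of that property, not KMail code:

```python
# Every 7-bit byte value, i.e. the complete US-ASCII range.
ascii_bytes = bytes(range(128))

as_ascii = ascii_bytes.decode("ascii")
as_utf8 = ascii_bytes.decode("utf-8")

# Both decoders yield the same text, so a message that really is
# pure ASCII is unaffected by being decoded as UTF-8 instead.
print(as_ascii == as_utf8)  # True
```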
Comment 6 Thorsten Glaser 2015-04-20 08:01:17 UTC
(In reply to Sandro Knauß from comment #5)

PGP Inline is perfectly fine standardised: the display agent has to use the charset indicated by the PGP message, and discard any charset/encoding information of the surrounding message.

It works like this:

Encode: secret message (charset/encoding A) → PGP → ASCII-armoured thing ⇒ MUA → MUA’s own encoding (charset B, possibly encoding Quoted-Printable) → RFC822

Decode: RFC822 → MUA decode (e.g. QP) → ASCII-armoured thing ⇒ PGP → secret message

The double-stroked arrows mark steps that someone can do manually. For example, the standard/initial way of doing Inline PGP was to edit the message in $EDITOR, then run it through PGP, then paste the resulting .asc into the MUA editor. In the reverse direction, one saves the message (*after* MUA decoding! this is what some get wrong!) as .asc and then calls PGP on it on the command line.

> -> The best would be if apine would have used the Armor Header Key "Charset".

Alpine has no concept of PGP. I write my messages by telling alpine to invoke an external editor (always, as I loathe pico) in which I type the message, then pipe it through e.g. “gpg --clearsign” or “gpg -seatr foo@bar.com”.

PGP actually *does* have the “Charset” header in the ASCII armour. GnuPG just doesn’t write the header if it’s the default value, namely, UTF-8. (If I write the secret message in, say, latin1, and then tell GnuPG that it’s latin1, then the “Charset” header is there. But I use UTF-8 everywhere.)

> To not break the existing way, i would only switch to default utf8 decoding if the surrounding charset is ascii :) Because every ascii text is the same in utf8...

That is actually the correct fix for my scenario.

The more broad correct fix is to do this, for Inline PGP:

① decode the RFC822 message using the MIME content-type, content-transfer-encoding
② if it has a “Charset” header in the PGP ASCII armour, note that down for later
③ decode the MIME-decoded message through GnuPG
④ use the charset noted down earlier, or UTF-8 if none, for displaying the armoured part (i.e. the part within the green and/or blue boxes; for anything outside of them, if any, keep using the MIME charset information; this is important for mixed content!)
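
The four steps above could be sketched roughly as follows in Python. This is an illustration under assumptions, not KMail's implementation: `armor_charset` and `display_text` are hypothetical names, the input is assumed to be already MIME-decoded (step ①), and decryption (step ③) is passed in as a callable so that no particular GnuPG binding is implied.

```python
from typing import Callable, Optional


def armor_charset(armored: str) -> Optional[str]:
    """Step 2: return the 'Charset' armor header value, or None if absent."""
    lines = armored.splitlines()
    start = next(
        (i for i, line in enumerate(lines) if line.startswith("-----BEGIN PGP ")),
        None,
    )
    if start is None:
        return None
    for line in lines[start + 1:]:
        if not line.strip():
            break  # a blank line ends the armor headers
        key, sep, value = line.partition(": ")
        if sep and key == "Charset":
            return value.strip()
    return None


def display_text(armored: str, decrypt: Callable[[str], bytes]) -> str:
    """Steps 2 to 4 for one armored block that is already MIME-decoded."""
    charset = armor_charset(armored) or "utf-8"  # step 4: UTF-8 if none given
    plaintext = decrypt(armored)                 # step 3: hypothetical callable
    return plaintext.decode(charset, errors="replace")


# Made-up armored block and a fake decryptor, for demonstration only.
sample = "\n".join([
    "-----BEGIN PGP MESSAGE-----",
    "Charset: ISO-8859-15",
    "",
    "(armored data elided)",
    "-----END PGP MESSAGE-----",
])
fake_decrypt = lambda _armored: "test äöü test".encode("iso-8859-15")

print(armor_charset(sample))                # ISO-8859-15
print(display_text(sample, fake_decrypt))   # test äöü test
```

Text outside the armored block would still be decoded with the MIME charset, as step ④ requires; the sketch leaves that part to the caller.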

Thanks!
Comment 7 Andre Heinecke 2016-06-24 09:30:11 UTC
> PGP Inline is perfectly fine standardised: the display agent has to use the charset indicated by the PGP
> message, and discard any charset/encoding information of the surrounding message.

No it's not. Especially the Encoding handling is very problematic and not standardised. See: https://debian-administration.org/users/dkg/weblog/108  ( https://dkg.fifthhorseman.net/notes/inline-pgp-harmful/ )

Basically your mail says that it is ASCII-encoded, but the content is actually UTF-8 after decryption. I would argue that this is not a KMail bug but that your mail is broken. For proper encoding handling you need to use PGP/MIME; proper encoding handling is one of its advantages. KMail uses the Content-Type charset of the PGP message, which would be correct.
GnuPG / GPGME itself does not do any re-encoding; it just decrypts the "bytes" of the message.

The Armor Header from RFC 2440 is, afaik, not used in practice. Since changing the encoding can change the meaning, and the armor headers themselves are neither signed nor encrypted, this offers little advantage over the Content-Type.
Instead, you would have an even more fragile implementation, because you would have to handle mixed encodings in a message with multiple PGP message parts. And you would have to treat PGP clearsigned messages differently.

As a "workaround" / to improve compatibility with broken MUA's I like Sandro's idea to treat PGP Messages as UTF-8 if the specified Charset is 7Bit ASCII. I think that would be a good solution to fix your bug.

Although I would suggest to use a proper MUA with PGP/MIME support.
Comment 8 Thorsten Glaser 2016-06-24 10:24:27 UTC
(In reply to Andre Heinecke from comment #7)
> > PGP Inline is perfectly fine standardised: the display agent has to use the charset indicated by the PGP
> > message, and discard any charset/encoding information of the surrounding message.
> 
> No it's not. Especially the Encoding handling is very problematic and not
> standardised. See: https://debian-administration.org/users/dkg/weblog/108  (

It is, and especially the encoding is trivial. It’s just often misunderstood or implemented wrong.
Citing someone who doesn’t fully understand it doesn’t help (I knew that posting).

Inline PGP is easy: the MIME-level encoding is valid for the “outer” part of the message; for
example, if MIME says quoted-printable then those ‘=’ in the ASCII armour of the PGP message
are encoded as “=3D”.
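
A small Python sketch of this outer, MIME-level step, using the standard `quopri` module; the armor fragment is made up for illustration:

```python
import quopri

# Hypothetical tail of an ASCII-armoured message; the checksum line
# of the armour starts with '='.
armor_tail = b"=njUN\n-----END PGP MESSAGE-----\n"

# What the MUA would put on the wire with quoted-printable ...
encoded = quopri.encodestring(armor_tail)

# ... and what the receiving MUA must undo before PGP sees the data.
decoded = quopri.decodestring(encoded)

print(encoded.startswith(b"=3D"))  # True
print(decoded == armor_tail)       # True
```

Only after this round trip is the armoured block byte-identical to what pgp/gpg produced, which is why the MIME decoding has to happen first.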

The “inner” part of the message, i.e. the output of pgp/gpg decrypting it, is *completely*
independent of the MIME message surrounding it, and for displaying it, *only* the rules
that the command-line utilities use are valid; this means, that the OpenPGP-level encoding
is used (which is always 8bit not quoted-printable or base64, and in absence of an explicit
charset selection is UTF-8).

The reason for this is easy: Inline PGP works, basically (i.e. without explicit MUA support),
by someone writing a plaintext file, throwing that through pgp or gpg, and copy/pasting
that into their MUA’s composer. Anything an MUA does to integrate Inline PGP support
*must* behave *exactly the same*.

> Basically your Mail says that it's ASCII Encoded but then actually has UTF-8
> encoding in the content after decryption. I would argue that this is not a

See above, “after decryption” when Inline PGP is used means you *have* to
*forget* anything from the previous container.

Yes, this is different than what PGP/MIME requires. Yes, both are right, for
their respective scopes.

> KMail bug but that your Mail is broken. For proper encoding Handling you
> need to use PGP/MIME. One of the Advantages of PGP/MIME is proper encoding

This sounds half like a sales pitch, half like “KMail doesn’t handle encoding in
Inline PGP correctly” – which is *exactly my point*.

> GnuPG / GPGME itself does not do any reencoding it just decrypts the "bytes"
> of the message.

It does *record* the charset of the message.

> As a "workaround" / to improve compatibility with broken MUA's I like
> Sandro's idea to treat PGP Messages as UTF-8 if the specified Charset is
> 7Bit ASCII. I think that would be a good solution to fix your bug.

That would help in the specific case, but still leave KMail a broken MUA
claiming to support Inline PGP and not doing it correctly.

However, as a first step, it’s okay; please do so. Actually, why haven’t
you done so yet…

> Although I would suggest to use a proper MUA with PGP/MIME support.

No, PGP/MIME often breaks, interestingly enough, with encoding-related
issues, and with mailing lists. Its interoperability is also limited to MUAs
supporting it, whereas interoperability of Inline PGP is maximal.
Comment 9 Sandro Knauß 2016-06-24 11:50:05 UTC
(In reply to Thorsten Glaser from comment #8)
> (In reply to Andre Heinecke from comment #7)
> > > PGP Inline is perfectly fine standardised: the display agent has to use the charset indicated by the PGP
> > > message, and discard any charset/encoding information of the surrounding message.
> > 
> > No it's not. Especially the Encoding handling is very problematic and not
> > standardised. See: https://debian-administration.org/users/dkg/weblog/108  (
> 
> It is, and especially the encoding is trivial. It’s just often misunderstood
> or implemented wrong.
> Citing someone who doesn’t fully understand it doesn’t help (I knew that
> posting).
dkg and Andre know what they are talking about; search the internet for references to them and to what they do inside the OpenPGP project.
You will find a lot of references to them.

> 
> Inline PGP is easy: the MIME-level encoding is valid for the “outer” part of
> the message; for
> example, if MIME says quoted-printable then those ‘=’ in the ASCII armour of
> the PGP message
> are encoded as “=3D”.
> 
In your comment you often mix up different encodings. In the mail context we have two:
- Content-Transfer-Encoding: this is how text that is not 7-bit ASCII is transformed to be 7-bit. This is quoted-printable, base64, or plain.
It is beyond question that we first have to decode this before handling the content. This is the "=3D" -> "=" step.

The encoding of the text is more problematic :) We have one field where we can set the encoding of the MIME part: the Content-Type header of the MIME part, with its charset setting:

   Content-Type: text/plain; charset="UTF-8"
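
For illustration, this is how that charset setting can be read with Python's standard `email` package. The message below is a made-up minimal example, and this is not how KMail itself parses mail:

```python
from email import message_from_string

# Hypothetical minimal message carrying the header shown above.
raw = 'Content-Type: text/plain; charset="UTF-8"\n\nbody\n'
msg = message_from_string(raw)

print(msg.get_content_type())     # text/plain
print(msg.get_content_charset())  # utf-8

# Without a charset parameter the call returns None, and the
# reader of the mail is left to guess.
no_charset = message_from_string("Content-Type: text/plain\n\nbody\n")
print(no_charset.get_content_charset())  # None
```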

The problem now is that you are arguing that GnuPG has a defined input/output charset, so that we should ignore the charset setting of the MIME part after we have piped the content through GnuPG. But this is not true.
GnuPG only processes a byte stream and does no charset handling at all. The only thing is that GnuPG suggests that you SHOULD use UTF-8, but it does not enforce this.

It only works for you because alpine is a command-line MUA that writes its output to your console, and your console uses UTF-8 encoding; if you switched to something else, you could not read the text successfully.

> The “inner” part of the message, i.e. the output of pgp/gpg decrypting it,  is *completely* independent of the MIME message surrounding it, and for displaying it,  *only* the rules that the command-line utilities use are valid; this means, that the OpenPGP-level encoding is used (which is always 8bit not quoted-printable or base64, and in absence of an explicit charset selection is UTF-8).

Well, the problem is that there is no "OpenPGP-level encoding". There is no API to ask GnuPG about the encoding (if there were an API, Andre would know, because he is one of the authors of the GnuPG APIs :) ).

> The reason for this is easy: Inline PGP works, basically (i.e. without explicit MUA support), by someone writing a plaintext file, throwing that through pgp or gpg, and copy/pasting  that into their MUA’s composer. Anything an MUA does to integrate Inline PGP support *must* behave *exactly the same*.

Try the experiment: change the charset of your Konsole, take a text document with a different encoding, encrypt it, and look at the output in your normal console (UTF-8). You will see that it is broken. This all works for you because you have a consistent UTF-8 environment. But for mails we cannot say what the sender's encoding is; we can only guess.

> > GnuPG / GPGME itself does not do any reencoding it just decrypts the "bytes"
> > of the message.
> 
> It does *record* the charset of the message.

But maybe everyone else is wrong and you are right: give me a link to the documentation, or a script/snippet showing how to detect the correct charset of the decrypted mail, and I'll fix this in KMail instantly.

Okay here is my console test:

% LANG=C luit  -encoding ISO-8859-15 gpg --encrypt -a -o test.enc
You did not specify a user ID. (you may use "-r")

Current recipients:

Enter the user ID.  End with an empty line: 0x36FD5E35D1D8EFD2
gpg: 0x36FD5E35D1D8EFD2: There is no assurance this key belongs to the named user

pub  1024R/0x36FD5E35D1D8EFD2 2014-08-18 Test for Mozilla bug#1054187
 Primary key fingerprint: 8D15 3316 76F4 6081 1A99  DB56 36FD 5E35 D1D8 EFD2

It is NOT certain that the key belongs to the person named
in the user ID.  If you *really* know what you are doing,
you may answer the next question with yes.

Use this key anyway? (y/N) y

Current recipients:
1024R/0x36FD5E35D1D8EFD2 2014-08-18 "Test for Mozilla bug#1054187"

Enter the user ID.  End with an empty line: 
test äöü test
% LANG=C luit -encoding ISO-8859-15 gpg -d test.enc

You need a passphrase to unlock the secret key for
user: "Test for Mozilla bug#1054187"
1024-bit RSA key, ID 0x36FD5E35D1D8EFD2, created 2014-08-18

gpg: encrypted with 1024-bit RSA key, ID 0x36FD5E35D1D8EFD2, created 2014-08-18
      "Test for Mozilla bug#1054187"
test äöü test

^^ yeah that matches :D

% LANG=C gpg -d test.enc

You need a passphrase to unlock the secret key for
user: "Test for Mozilla bug#1054187"
1024-bit RSA key, ID 0x36FD5E35D1D8EFD2, created 2014-08-18

gpg: encrypted with 1024-bit RSA key, ID 0x36FD5E35D1D8EFD2, created 2014-08-18
      "Test for Mozilla bug#1054187"
test  test

^^ argh, this is not what I entered. You can see here that GnuPG on the command line has no handling for encoding; it just uses the default encoding of the console. The information that the output has to be interpreted as ISO-8859-15 is lost.
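
The same effect can be shown without GnuPG. This Python sketch only models the byte handling: the "decrypted" bytes are ISO-8859-15, and nothing in them says so.

```python
original = "test äöü test"

# What a tool would hand back after decrypting: plain bytes, no charset label.
raw = original.encode("iso-8859-15")

# Interpreted with the sender's charset, the text is intact.
print(raw.decode("iso-8859-15") == original)  # True

# Interpreted as UTF-8 (the console default here), the umlaut bytes are invalid.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)  # False

# A lenient decoder substitutes U+FFFD for the bytes it cannot interpret.
lossy = raw.decode("utf-8", errors="replace")
print("\ufffd" in lossy)  # True
```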
Comment 10 Sandro Knauß 2016-06-24 11:51:14 UTC
Created attachment 99676
An encrypted ISO-8859-15 text
Comment 11 Sandro Knauß 2016-06-24 11:53:53 UTC
Just to make it clear: my console is also UTF-8 by default. luit is a program that translates from/to the specified encoding, so within the command everything behaves as if input and output were ISO-8859-15.
Comment 12 Thorsten Glaser 2016-06-27 14:03:46 UTC
(In reply to Sandro Knauß from comment #9)

> Make the experiment - change the charset of you konsole/ and use a text
> document with a different encoding and encrypt it and look at the output in
> your normal console ( utf-8). You will see that this is broken. This all
> works for you because you have a consistent utf8 environment. But for mails

Possibly, but ISTR that OpenPGP still stores the encoding of the message,
so I’d have a way to know what charset to pass to iconv(1) to be able to
read it, and I’m not talking about the ASCII armour pseudo-header either.

I’ll search for it when I have more time.

> > > GnuPG / GPGME itself does not do any reencoding it just decrypts the "bytes"
> > > of the message.
> > 
> > It does *record* the charset of the message.
> 
> But maybe all are wrong and you are right - give me the link to the
> documentation or a script/snippset, how It detect the correct charset of the
> decrypted mail i'll fix this instantly in kmail.

OK.
Comment 13 Andre Heinecke 2016-07-07 09:30:31 UTC
Btw. I've asked about armor headers as part of another issue regarding gpgme_data_identify and the maintainer of gnupg also says that they should not be used and are not used by gnupg: https://bugs.gnupg.org/gnupg/issue2314
Comment 14 Sandro Knauß 2016-07-18 07:50:32 UTC
Git commit 04334e2f8390b967fc5b1c4ecde8caacf4787238 by Sandro Knauß.
Committed on 18/07/2016 at 07:49.
Pushed by knauss into branch 'Applications/16.08'.

Fix: Message with wrong charset

MUAs sometimes fail to set the correct character encoding.
If they set us-ascii, we can help a little bit by setting it to utf-8.
Because utf-8 is a superset of us-ascii we do not break anything.
FIXED-IN: 5.4.0

A  +34   -0    mimetreeparser/autotests/data/openpgp-inline-wrong-charset-encrypted.mbox
A  +47   -0    mimetreeparser/autotests/data/openpgp-inline-wrong-charset-encrypted.mbox.html
A  +4    -0    mimetreeparser/autotests/data/openpgp-inline-wrong-charset-encrypted.mbox.tree
M  +8    -1    mimetreeparser/src/viewer/nodehelper.cpp

http://commits.kde.org/messagelib/04334e2f8390b967fc5b1c4ecde8caacf4787238
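
The idea in the commit message can be sketched in Python as follows. This is not the actual nodehelper.cpp change, which is C++ and not shown in this report; `effective_charset` is a hypothetical name, and using `codecs.lookup` to recognise aliases of us-ascii is an assumption of this sketch.

```python
import codecs


def effective_charset(declared: str) -> str:
    """Return the charset to decode with, upgrading us-ascii to utf-8."""
    try:
        canonical = codecs.lookup(declared).name
    except LookupError:
        return declared  # unknown name: leave it for the caller to handle
    # utf-8 is a superset of us-ascii, so this cannot break real ASCII mail.
    return "utf-8" if canonical == "ascii" else declared


print(effective_charset("us-ascii"))     # utf-8
print(effective_charset("iso-8859-15"))  # iso-8859-15

# A mail that claims us-ascii but actually carries UTF-8 now decodes correctly.
body = "test äöü test".encode("utf-8")
print(body.decode(effective_charset("us-ascii")))  # test äöü test
```

Any other declared charset is left alone, which matches the stated goal of not breaking existing behaviour.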