Bug 360205 - Inconsistent TAB parsing in Subject header
Summary: Inconsistent TAB parsing in Subject header
Status: RESOLVED INTENTIONAL
Alias: None
Product: trojita
Classification: Unmaintained
Component: Desktop GUI (show other bugs)
Version: git
Platform: Other Linux
: NOR minor
Target Milestone: ---
Assignee: Trojita default assignee
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-03-07 10:54 UTC by Erik Quaeghebeur
Modified: 2019-09-07 19:31 UTC (History)
1 user (show)

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Erik Quaeghebeur 2016-03-07 10:54:12 UTC
TAB characters may appear in Subject headers. Currently, TAB characters are correctly shown in the message list view, but are replaced by a SPACE character in the envelope view and the titles of windows (composer, message view window).

Reproducible: Always

Steps to Reproduce:
1. Create message with TAB in Subject header
2. Send it, so that it is in an IMAP mailbox
3. Fetch the message and view it with Trojita

Actual Results:  
TAB is replaced by SPACE in envelope view and in window titles

Expected Results:  
TAB is not replaced by SPACE in envelope view and in window titles, but is kept as-is, as in the message list view.

I looked at the source, but could not find the function actually doing the incorrect replacement.
Comment 1 Thomas Lübking 2016-03-07 11:07:30 UTC
"toHtmlEscaped()"
The concept of a tabulator doesn't exist in html and I *assume* the function won't translate it to a stack of  's

Kinda "wontfix" (notably for window titles, WMs will usually simplify all whitespace and other funky junk, just assume there was a couple of newlines in it... ;-)
Comment 2 Erik Quaeghebeur 2016-03-07 11:29:10 UTC
(In reply to Thomas Lübking from comment #1)
> "toHtmlEscaped()"
> The concept of a tabulator doesn't exist in html and I *assume* the function
> won't translate it to a stack of  's

I understand. Another option might be to translate TAB to  , an em space, to better approximate the visual intent of a TAB.

> Kinda "wontfix" (notably for window titles, WMs will usually simplify all
> whitespace and other funky junk, just assume there was a couple of newlines
> in it... ;-)

Ok, I see why window titling code may do its own pre-processing.

Feel free to close as WONFIX.
Comment 3 Jan Kundrát 2016-03-07 12:14:30 UTC
I wonder what is a correct rendering of \t in a GUI with non-monospace 
fonts. How do you expect to see them, and what it the use case behind 
showing them in a different way than "just space"?
Comment 4 Erik Quaeghebeur 2016-03-07 12:31:08 UTC
(In reply to Jan Kundrát from comment #3)
> I wonder what is a correct rendering of \t in a GUI with non-monospace 
> fonts. How do you expect to see them, and what it the use case behind 
> showing them in a different way than "just space"?

Assume the TAB was intentionally placed, so that a space would have been used if that was meant instead. Then it was meant to convey meaning. That meaning could be ‘a longer stretch of white space’, e.g, for obtaining some visual effect. It cannot be ‘alignment’, because line-breaks in subjects are not allowed. The use case of not using a space here is to convey that meaning. (I currently do not see another meaning than ‘a longer stretch of white space’ in the context of a Subject header.)

If it is ‘a longer stretch of white space’, then using an em-space would be more appropriate than a normal space.
Comment 5 Pali Rohár 2016-03-07 12:49:38 UTC
On Monday 07 March 2016 12:31:08 Erik Quaeghebeur via KDE Bugzilla wrote:
> line-breaks in subjects are not allowed.

Hm... is there restriction in RFC for this? Just asking.
Comment 6 Jan Kundrát 2016-03-07 13:10:20 UTC
> If it is ‘a longer stretch of white space’, then using an em-space would be
> more appropriate than a normal space.

Sounds like plan -- please submit a patch for this.

> Hm... is there restriction in RFC for this? Just asking.

None that I'm aware of.
Comment 7 Erik Quaeghebeur 2016-03-07 13:17:45 UTC
(In reply to Pali Rohár from comment #5)
> On Monday 07 March 2016 12:31:08 Erik Quaeghebeur via KDE Bugzilla wrote:
> > line-breaks in subjects are not allowed.
> 
> Hm... is there restriction in RFC for this? Just asking.

In RFC 5322, it is said that Subject is an unstructured field, which is described in Section 2.2.1 as

« Some field bodies in this specification are defined simply as
"unstructured" (which is specified in section 3.2.5 as any printable
US-ASCII characters plus white space characters) with no further
restrictions. These are referred to as unstructured field bodies.
Semantically, unstructured field bodies are simply to be treated as a
single line of characters with no further processing (except for
"folding" and "unfolding" as described in section 2.2.3). »

More generally, as far as I've understood (and modulo folding), a header value MUST NOT contain any line breaks.
Comment 8 Jan Kundrát 2016-03-07 13:30:21 UTC
> More generally, as far as I've understood (and modulo folding), 
> a header value MUST NOT contain any line breaks.

Line breaks are special when they appear as-is because RFC5322 assigns a 
special meaning to them.

There is RFC2047 which specifies how to send Unicode. It gets triggered by 
presence of LFs in my testing, and I think that this usage is actually 
kosher -- but i'll be happy to be proven wrong by a quite from some other 
RFC.
Comment 9 Pali Rohár 2016-03-07 13:33:03 UTC
On Monday 07 March 2016 13:17:45 Erik Quaeghebeur via KDE Bugzilla wrote:
> https://bugs.kde.org/show_bug.cgi?id=360205
> 
> --- Comment #7 from Erik Quaeghebeur <kdebugs@equaeghe.nospammail.net> ---
> (In reply to Pali Rohár from comment #5)
> > On Monday 07 March 2016 12:31:08 Erik Quaeghebeur via KDE Bugzilla wrote:
> > > line-breaks in subjects are not allowed.
> > 
> > Hm... is there restriction in RFC for this? Just asking.
> 
> In RFC 5322, it is said that Subject is an unstructured field, which is
> described in Section 2.2.1 as
> 
> « Some field bodies in this specification are defined simply as
> "unstructured" (which is specified in section 3.2.5 as any printable
> US-ASCII characters plus white space characters) with no further
> restrictions. These are referred to as unstructured field bodies.
> Semantically, unstructured field bodies are simply to be treated as a
> single line of characters with no further processing (except for
> "folding" and "unfolding" as described in section 2.2.3). »
> 
> More generally, as far as I've understood (and modulo folding), a header value
> MUST NOT contain any line breaks.
> 

This is truth, but RFC 2047 allows you to include any unicode character
into header body (via encoding unicode character to UTF-8 and encoding
it to either QP or Base64). It means that also newline can be encoded
into Base64. From RFC 5322 point of view, Base64 is single line (after
unfolding) but you have funny unicode and also ascii whitespaces in
header body...

Or is there some restriction that above is violation of MIME or
something else?
Comment 10 Erik Quaeghebeur 2016-03-07 15:05:24 UTC
(In reply to Jan Kundrát from comment #8)
> > More generally, as far as I've understood (and modulo folding), 
> > a header value MUST NOT contain any line breaks.
> 
> There is RFC2047 which specifies how to send Unicode. It gets triggered by 
> presence of LFs in my testing, and I think that this usage is actually 
> kosher -- but i'll be happy to be proven wrong by a quite from some other 
> RFC.

The token definition in RFC822 that allowed bare CR and bare LF was updated in RFC2822 to preclude them:

RFC2047 Section 5. (1) on p.7:
« An ’encoded-word’ may replace a ’text’ token (as defined by RFC 822) in any Subject or Comments header field [...] »

RFC822 Section 3.3 on p.10:
« text = <any CHAR, including bare CR & bare LF, but NOT including CRLF> »

RFC2822 Section 3.2.1 on p.10:
« text = %d1-9 / %d11 / %d12 / %d14-127 / obs-text     ; Characters excluding CR and LF »

(RFC5322 doesn't reintroduce them.)

So digesting them is needed for dealing with old messages, but producing them is, in my understanding, against the latest RFC.
Comment 11 Pali Rohár 2016-03-07 15:16:39 UTC
On Monday 07 March 2016 15:05:24 Erik Quaeghebeur via KDE Bugzilla wrote:
> https://bugs.kde.org/show_bug.cgi?id=360205
> 
> --- Comment #10 from Erik Quaeghebeur <kdebugs@equaeghe.nospammail.net> ---
> (In reply to Jan Kundrát from comment #8)
> > > More generally, as far as I've understood (and modulo folding), 
> > > a header value MUST NOT contain any line breaks.
> > 
> > There is RFC2047 which specifies how to send Unicode. It gets triggered by 
> > presence of LFs in my testing, and I think that this usage is actually 
> > kosher -- but i'll be happy to be proven wrong by a quite from some other 
> > RFC.
> 
> The token definition in RFC822 that allowed bare CR and bare LF was updated in
> RFC2822 to preclude them:
> 
> RFC2047 Section 5. (1) on p.7:
> « An ’encoded-word’ may replace a ’text’ token (as defined by RFC 822) in any
> Subject or Comments header field [...] »

So it means that UTF-8 encoded unicode character "newline" in MIME
Base64 is valid content for Subject header. Right?

> RFC822 Section 3.3 on p.10:
> « text = <any CHAR, including bare CR & bare LF, but NOT including CRLF> »

Ah, so old RFC 822 allows directly "\n" in Subject. I did not know that.

> RFC2822 Section 3.2.1 on p.10:
> « text = %d1-9 / %d11 / %d12 / %d14-127 / obs-text     ; Characters excluding
> CR and LF »
> 
> (RFC5322 doesn't reintroduce them.)
> 
> So digesting them is needed for dealing with old messages, but producing them
> is, in my understanding, against the latest RFC.

Right, my original question about newlines was because I wanted to know
if compliant email client needs to deal with such thing. So answer is
yes, "\n" in Subject is valid character.
Comment 12 Erik Quaeghebeur 2016-03-07 23:17:16 UTC
(In reply to Jan Kundrát from comment #6)
> > If it is ‘a longer stretch of white space’, then using an em-space would be
> > more appropriate than a normal space.
> 
> Sounds like plan -- please submit a patch for this.

It seems unicode em-spaces get converted to plain spaces by QString::toHtmlEscaped, or my font doesn't have the em-spaces. Is there an easy way to have a look at the html of a message view in Trojita?

Another observation: it seems that tabs in message bodies get converted to 4 spaces within the message view. Where should I look for that code? I guess it is best to treat tabs in the subject the same as in the body.
Comment 13 Thomas Lübking 2016-03-07 23:57:37 UTC
The message works here (it's pre-wrapped) and lead me to a solution that should work for you:

diff --git a/src/Gui/EnvelopeView.cpp b/src/Gui/EnvelopeView.cpp
index 1850456..14c049a 100644
--- a/src/Gui/EnvelopeView.cpp
+++ b/src/Gui/EnvelopeView.cpp
@@ -87,7 +87,7 @@ void EnvelopeView::setMessage(const QModelIndex &index)
     if (e.date.isValid()) {
         subDate = QStringLiteral("<table style=\"margin:0px; margin-left:4em; float:right;\"><tr style=\"margin:0px;\"><td style=\"margin:0px;\">%1</td></tr></table>").arg(e.date.toLocalTime().toString(Qt::SystemLocaleLongDate));
     }
-    subDate += QStringLiteral("<span style=\"font:bold large;\">%1</span>").arg(e.subject.toHtmlEscaped());
+    subDate += QStringLiteral("<span style=\"font:bold large; white-space:pre;\">%1</span>").arg(e.subject.toHtmlEscaped());
     auto lbl = new QLabel(subDate, this);
     SET_LABEL_OPTIONS(lbl)
     layout()->addWidget(lbl);
Comment 14 Jan Kundrát 2016-03-08 15:23:46 UTC
> -    subDate += QStringLiteral("<span style=\"font:bold
> large;\">%1</span>").arg(e.subject.toHtmlEscaped());
> +    subDate += QStringLiteral("<span style=\"font:bold large;
> white-space:pre;\">%1</span>").arg(e.subject.toHtmlEscaped());

While you're changing this, try to consider a situation where the subject 
consists of just one overly long sequence of characters with no spaces in 
the middle -- just a reminder for future.

On topic of preserving whitespace, let's be realistic, please. We already 
wrap the subject in funny ways around the date, and I *like* this. If we 
want to go full RFC-lawyering here, nothing defines how to render these 
subjects, and wrapping them into several lines is an obvious usability 
feature. I do not think that distinguishing between tabs, spaces, emspaces 
and newlines makes a lot of sense in subjects. If anything, my today's vote 
:) is to ensure that newlines do not lead to visible line breaks (think 
possible information hiding) and that spaces are normalized to a reasonable 
amount of whitespace.

> It seems unicode em-spaces get converted to plain spaces by
> QString::toHtmlEscaped

I've checked the source of that method, and it only replaces <, >, & and ".

> Is there an easy way to have a look at the html of a message view in Trojita?

Add a call to qDebug() to an appropriate place. The markup is asked from 
from within src/Gui/SimplePartWidget.cpp.
Comment 15 Thomas Lübking 2016-03-08 16:20:35 UTC
(In reply to Jan Kundrát from comment #14)

> While you're changing this, try to consider a situation where the subject 
> consists of just one overly long sequence of characters with no spaces in 
> the middle -- just a reminder for future.

=> "white-space:pre-line;"

Personally, I'm not overly interested in showing special whitespace *anywhere* (including the message list) since it only means that








































































































valuable information ends up being out of sight ;-P
Comment 16 Erik Quaeghebeur 2016-03-09 22:08:22 UTC
(In reply to Jan Kundrát from comment #14)
>
> [...] If anything, my today's vote 
> :) is to ensure that newlines do not lead to visible line breaks (think 
> possible information hiding) and that spaces are normalized to a reasonable 
> amount of whitespace.

(In reply to Thomas Lübking from comment #15)
> 
> Personally, I'm not overly interested in showing special whitespace
> *anywhere* (including the message list) [...]

1. I agree that information-hiding should be avoided.
2. I agree that stretches of white-space should not be overly long.

I feel that 1. includes faithful representation of the message subject and content, including tabs. If a tab is there, it is either by intent (which we should consider), or by some error (which we could guess at... but I don't think we want to, as tabs in subjects are rather rare). Currently, I think that in the message list, information hiding may take place. (My patch does not address this, but I feel that there actually, it is OK to deviate from faithful representation.) My patch should not give rise to information-hiding, perhaps the tooltip case needs to be treated differently from the envelope view; let me know.

Regarding 2.: I tried to set the tab length to something smaller than 8 characters, but alas, it seems that the ‘tab-size’ CSS is not yet supported. In any case, now there is some consistency between what is in the message list and what is in the envelope (which is actually what prompted my report/patch). In the future, this should ‘start to work’, once the Qt html viewer is brought up-to-date.
Comment 17 Jan Kundrát 2016-03-09 23:18:42 UTC
> I feel that 1. includes faithful representation of the message subject and
> content, including tabs. If a tab is there, it is either by intent (which we
> should consider), or by some error (which we could guess at... but I don't
> think we want to, as tabs in subjects are rather rare). 

I've, unfortunately, seen totally botched subjects, including a case where 
a single Unicode codepoint which gets encoded into two bytes in UTF-8 was 
separted into two encoded-words bby the sender's RFC2047 generator. Based 
on this experience and on a quick consultation with an oracle (the 
grapefruit-flavored Salzburger Stiegel was particularly helpful), I think 
that it's going to be much more common to see tabs as an accident.

Otherwise, consider the question of there to draw a line. Should we support 
line breaks for the same reason, for example? What about Unicode characters 
for drawing boxes?

> In the future, this should ‘start to work’, once the Qt html viewer is brought 
> up-to-date.

We're using a simple QLabel. It will likely never come with a full-blown 
HTML renderer. The supported subset of HTML is documented somewhere in Qt's 
docs, but it seems that it is actualy richer than what is documented.
Comment 18 Thomas Lübking 2016-03-09 23:31:08 UTC
Fwwi, mutt displays "?" for tabs (didn't try newline)

I actually wonder about the trigger of this - few MUAs will (easily) allow to enter a tab into the subject (but jump to the next GUI element), ie. you'll copy it there (from a webpage)?
Comment 19 Erik Quaeghebeur 2016-03-10 22:44:05 UTC
(In reply to Jan Kundrát from comment #17)
>
> [...], I think  that it's going to be much more common to see tabs as an accident.

(In reply to Thomas Lübking from comment #18)
> 
> I actually wonder about the trigger of this - few MUAs will (easily) allow
> to enter a tab into the subject (but jump to the next GUI element), [...]

I'm convinced. I suggest you close this WONTFIX. I hope you don't mind the effort spent too much.
Comment 20 Thomas Lübking 2016-03-11 00:01:41 UTC
(In reply to Erik Quaeghebeur from comment #19)
> I hope you don't mind the effort spent too much.
*Your* efforts? ;-)
Questioning things and justification is never wasted time ever.

And: we figured to maybe re-check on the display in the message list.