Bug 344849 - PDF metadata is displayed incorrectly in File -> Properties
Summary: PDF metadata is displayed incorrectly in File -> Properties
Status: RESOLVED UPSTREAM
Alias: None
Product: okular
Classification: Applications
Component: general
Version: 0.20.0
Platform: Ubuntu Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Okular developers
URL: https://bugs.kde.org/attachment.cgi?i...
Keywords:
Depends on:
Blocks:
 
Reported: 2015-03-04 23:11 UTC by 4aa7f31e
Modified: 2015-03-15 12:34 UTC
CC List: 2 users

See Also:
Latest Commit:
Version Fixed In:


Attachments
PDF file exhibiting the bug (10.65 KB, application/force-download)
2015-03-04 23:14 UTC, 4aa7f31e
PDF file without problems despite "problematic" characters (10.66 KB, application/force-download)
2015-03-04 23:15 UTC, 4aa7f31e

Description 4aa7f31e 2015-03-04 23:11:51 UTC
If certain characters are present in the metadata of a PDF file (fields "Title", "Author", "Subject", "Keywords"), they are displayed incorrectly in the properties window of the document which can be opened by selecting "File" -> "Properties".
For example, in a PDF file whose "Author" field contains the characters "–‰" (U+2013 U+2030), Okular displays two boxes "[NEL][PLD]" instead. Copying the two characters reveals that they are the control characters U+0085 and U+008B ("NEXT LINE" and "PARTIAL LINE FORWARD"). Opening the same PDF file with Evince 3.14.1 produces the expected output ("–‰"), so the problem does not seem to be in the PDF file itself. An example PDF file can be found at:
http://s000.tinyupload.com/index.php?file_id=22397336241734990640

Strangely, in some circumstances (which I have been unable to determine thus far), the output in Okular is correct even if the same characters are present in the PDF metadata. For example, if the "Author" field of the PDF file contains "–‰‒" (U+2013 U+2030 U+2012), the characters are correctly displayed in both Okular and Evince. An example PDF document for this phenomenon can be found at:
http://s000.tinyupload.com/index.php?file_id=22609422767022144297

(I wasn't sure where to file this report, as it might be purely a display issue or a problem with the PDF backend.)

Reproducible: Always

Steps to Reproduce:
1. Open a PDF file whose metadata (e.g. the "Author" field) contains (for example) the characters "–‰".
2. Click on "File" in the menu bar and select the item "Properties".
3. Observe (in this case) the line "Author:".

Actual Results:  
(In this case:) The control characters U+0085 and U+008B ("NEXT LINE" and "PARTIAL LINE FORWARD") are displayed.

Expected Results:  
(In this case:) The characters "–‰" (U+2013 U+2030) should be displayed.
Comment 1 4aa7f31e 2015-03-04 23:14:24 UTC
Created attachment 91426 [details]
PDF file exhibiting the bug
Comment 2 4aa7f31e 2015-03-04 23:15:21 UTC
Created attachment 91427 [details]
PDF file without problems despite "problematic" characters
Comment 3 Albert Astals Cid 2015-03-05 00:43:24 UTC
It's not a bug: the first document simply has wrongly encoded characters. The Author field in the info dictionary is a text string, so it can be either Latin1 or UTF-16BE. Since that text is not Latin1, it must be UTF-16BE, but a UTF-16BE string must start with the two bytes 254 and 255, which is not the case here. The file is therefore broken and there's nothing we can do to display it "right".
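(For illustration only, not Okular or poppler code: a minimal Python sketch of the decoding rule described above, assuming the viewer falls back to a plain Latin-1 interpretation when the UTF-16BE byte order mark is missing. That assumed fallback would produce exactly the U+0085/U+008B control characters reported above.)

# Hypothetical sketch of the rule described above: UTF-16BE if the string
# starts with the bytes 254 followed by 255, something else otherwise.
# The Latin-1 fallback below is an assumption made only to illustrate the symptom.
author_bytes = b"\x85\x8b"  # raw bytes stored in the broken document's Author field

if author_bytes[:2] == b"\xfe\xff":               # 254 followed by 255
    decoded = author_bytes[2:].decode("utf-16-be")
else:
    decoded = author_bytes.decode("latin-1")      # assumed fallback

print([hex(ord(c)) for c in decoded])  # ['0x85', '0x8b'], i.e. the NEL and PLD boxes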
Comment 4 4aa7f31e 2015-03-05 16:11:10 UTC
So I suppose it's a bug in the software which produced the document (sejda)?

Still, I wonder why Evince seems to have no problems with it.
Comment 5 Albert Astals Cid 2015-03-05 16:27:13 UTC
Correct, it's a bug in whatever created it. Evince must be using a different code path for broken documents that in this case happens to luckily work.
Comment 6 Jason Crain 2015-03-05 18:07:31 UTC
FWIW, both Adobe Reader and poppler's pdfinfo also show the correct information '–‰'. pdfinfo passes the string through TextStringToUCS4, which looks up each character in pdfDocEncoding. Per PDFDocEncoding, character codes 0x85 and 0x8B should map to U+2013 and U+2030.
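(A rough sketch in Python, for illustration: this is not poppler's TextStringToUCS4, and the lookup table below contains only the two PDFDocEncoding entries mentioned in this comment.)

PDFDOC_OVERRIDES = {
    0x85: 0x2013,  # EN DASH
    0x8B: 0x2030,  # PER MILLE SIGN
}

def pdfdoc_byte_to_unicode(byte: int) -> str:
    # Bytes without an override entry are passed through unchanged here;
    # the full mapping is the PDFDocEncoding table.
    return chr(PDFDOC_OVERRIDES.get(byte, byte))

print("".join(pdfdoc_byte_to_unicode(b) for b in b"\x85\x8b"))  # prints '–‰'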
Comment 7 4aa7f31e 2015-03-05 20:10:52 UTC
I have reported the problem to the authors of Sejda, which is the program I used to generate the example PDF documents. The relevant report is here: https://github.com/torakiki/sejda/issues/170

However, the author of Sejda says that its behavior conforms to the PDF specification. Additionally, a few other PDF viewers also display the expected characters (unlike Okular). Unfortunately, I don't consider myself knowledgeable enough to decide who is correct about this issue.
Comment 8 4aa7f31e 2015-03-15 03:29:22 UTC
I have read some parts of the PDF standard (ISO 32000-1:2008) and can only confirm the assessment in the Sejda bug report (which has been closed in the meantime).

According to section 7.9.2.2 "Text String Type" of ISO 32000-1:2008, fields such as the "Author" field in the example document must be represented as a PDF "text string", which can be encoded either as UTF-16BE with a byte order mark or in PDFDocEncoding. PDFDocEncoding can encode all Latin1 characters; however, it is NOT the same as either ISO Latin1 or Windows-1252!

The mapping of PDFDocEncoding bytes to characters is defined in Annex D, table D.2 "Latin Character Set and Encodings". Note that both PDFDocEncoding and Windows-1252 can in fact encode the characters "–‰". Thus, the string need not be encoded as UTF-16BE, and the provided PDF document is valid (the characters "–‰" are correctly encoded as "0x85 0x8B" in PDFDocEncoding). It seems that Okular does not correctly parse PDFDocEncoded text strings.

(The other example document works correctly because U+2012 cannot be encoded in PDFDocEncoding, so UTF-16BE was used, which is correctly read by Okular.)
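(To make this concrete, a minimal Python sketch of the rule in section 7.9.2.2 as summarised above: UTF-16BE if the string starts with the byte order mark, PDFDocEncoding otherwise. The PDFDocEncoding table here is deliberately partial, covering only the bytes relevant to this report; the full mapping is in Annex D, table D.2.)

PDFDOC_OVERRIDES = {0x85: "\u2013", 0x8B: "\u2030"}  # EN DASH, PER MILLE SIGN (table D.2)

def decode_pdf_text_string(raw: bytes) -> str:
    if raw[:2] == b"\xfe\xff":                     # UTF-16BE with byte order mark
        return raw[2:].decode("utf-16-be")
    # Otherwise PDFDocEncoding (partial table; unmapped bytes passed through)
    return "".join(PDFDOC_OVERRIDES.get(b, chr(b)) for b in raw)

# First example document: the Author field is stored in PDFDocEncoding.
print(decode_pdf_text_string(b"\x85\x8b"))  # '–‰'

# Second example document: U+2012 has no PDFDocEncoding code, so the writer
# had to fall back to UTF-16BE with a BOM, which Okular already reads correctly.
print(decode_pdf_text_string(b"\xfe\xff" + "\u2013\u2030\u2012".encode("utf-16-be")))  # '–‰‒'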
Comment 9 Albert Astals Cid 2015-03-15 12:34:15 UTC
Right, I misread the spec; the fix will be in the next poppler release:
http://cgit.freedesktop.org/poppler/poppler/commit/?id=bc8076d8f638ccb44f8e3b94aaae96850b025deb