If certain characters are present in the metadata of a PDF file (fields "Title", "Author", "Subject", "Keywords"), they are displayed incorrectly in the document's properties window, which can be opened via "File" -> "Properties". For example, in a PDF file whose "Author" field contains the characters "–‰" (U+2013 U+2030), Okular instead displays two boxes, "[NEL][PLD]". Copying the two characters reveals that they are actually U+0085 and U+008B ("NEXT LINE" and "PARTIAL LINE FORWARD"). Opening the same PDF file with Evince 3.14.1 produces the expected output ("–‰"), so the problem does not seem to be in the PDF file itself. An example PDF file can be found at: http://s000.tinyupload.com/index.php?file_id=22397336241734990640

Strangely, in some circumstances (which I have been unable to determine thus far), the output in Okular is correct even though the same characters are present in the PDF metadata. For example, if the "Author" field of the PDF file contains "–‰‒" (U+2013 U+2030 U+2012), the characters are displayed correctly in both Okular and Evince. An example PDF document for this phenomenon can be found at: http://s000.tinyupload.com/index.php?file_id=22609422767022144297

(I wasn't sure where to file this report, as it might be purely a display issue or a problem with the PDF backend.)

Reproducible: Always

Steps to Reproduce:
1. Open a PDF file whose metadata (e.g. the "Author" field) contains, for example, the characters "–‰".
2. Click on "File" in the menu bar and select the item "Properties".
3. Observe (in this case) the line "Author:".

Actual Results:
(In this case:) The characters U+0085 and U+008B ("NEXT LINE" and "PARTIAL LINE FORWARD") are displayed.

Expected Results:
(In this case:) The characters "–‰" (U+2013 U+2030) should be displayed.
Created attachment 91426: PDF file exhibiting the bug
Created attachment 91427: PDF file without problems despite "problematic" characters
It's not a bug; the first document just has wrongly encoded characters. The Author field in the info dictionary is a text string, so it can be either Latin1 or UTF-16BE. Since that text is not Latin1, it must be UTF-16BE, but for UTF-16BE the first two bytes must be 254 followed by 255, which is not the case here. So the file is broken and there's nothing we can do to display it "right".
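For illustration, a minimal sketch (in Python; the function name is mine, not poppler's) of the BOM test described above:

def looks_like_utf16be(raw: bytes) -> bool:
    # A text string is UTF-16BE only if it starts with the byte order
    # mark 254 followed by 255 (0xFE 0xFF).
    return raw[:2] == b"\xfe\xff"

print(looks_like_utf16be(b"\x85\x8b"))  # False for the first sample file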
So I suppose it's a bug in the software that produced the document (Sejda)? Still, I wonder why Evince seems to have no problems with it.
Correct, it's a bug in whatever created it. Evince must be using a different codepath for broken documents that in this case happens to luckily work.
FWIW, both Adobe Reader and poppler's pdfinfo also show the correct information, '–‰'. pdfinfo passes the string through TextStringToUCS4, which looks each character up in pdfDocEncoding. Per PDFDocEncoding, character codes 0x85 and 0x8B should map to U+2013 and U+2030.
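To make that concrete, here is a rough sketch (Python, not poppler's actual TextStringToUCS4; only the two code points from this report are mapped, the full table is in Annex D.2 of the spec):

# Fragment of the PDFDocEncoding table: bytes whose meaning differs
# from Latin-1. In Latin-1, 0x85 and 0x8B are the C1 control codes
# NEL and PLD, which is exactly the discrepancy seen in Okular.
PDFDOC_TO_UNICODE = {
    0x85: 0x2013,  # EN DASH
    0x8B: 0x2030,  # PER MILLE SIGN
}

def pdfdoc_decode(raw: bytes) -> str:
    # Fall back to the identity mapping for bytes not in the fragment.
    return "".join(chr(PDFDOC_TO_UNICODE.get(b, b)) for b in raw)

print(pdfdoc_decode(b"\x85\x8b"))  # -> '–‰'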
I have reported the problem to the authors of Sejda, the program I used to generate the example PDF documents. The relevant report is here: https://github.com/torakiki/sejda/issues/170

However, the author of Sejda says that its behavior conforms to the PDF specification. Additionally, a few other PDF viewers display the desired characters as well (unlike Okular). Unfortunately, I don't consider myself knowledgeable enough to decide who is correct about this issue.
I have read some parts of the PDF standard (ISO 32000-1:2008) and can only confirm the assessment in the Sejda bug report (which has been closed in the meantime).

According to section 7.9.2.2 "Text String Type" of ISO 32000-1:2008, fields such as the "Author" field in the example document must be represented as a PDF "text string", which can be encoded either as UTF-16BE with a byte order mark or in PDFDocEncoding. PDFDocEncoding can encode all Latin1 characters; however, it is NOT the same as either ISO Latin1 or Windows-1252! The mapping of PDFDocEncoding bytes to characters is defined in Annex D, table D.2 "Latin Character Set and Encodings". Note that both PDFDocEncoding and Windows-1252 can in fact encode the characters "–‰". Thus, the string need not be encoded as UTF-16BE, and the provided PDF document is valid (the characters "–‰" are correctly encoded as 0x85 0x8B in PDFDocEncoding). It seems that Okular does not correctly parse PDFDocEncoded text strings.

(The other example document works correctly because U+2012 cannot be encoded in PDFDocEncoding, so UTF-16BE was used, which Okular reads correctly.)
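To make the difference between the two example files concrete, here is a rough Python sketch of what a conforming producer presumably does when writing a text string (hypothetical code, again using only a fragment of the Annex D.2 table):

# Inverse fragment of the Annex D.2 mapping used for decoding.
UNICODE_TO_PDFDOC = {0x2013: 0x85, 0x2030: 0x8B}

def encode_text_string(s: str) -> bytes:
    out = bytearray()
    for ch in s:
        cp = ord(ch)
        if cp in UNICODE_TO_PDFDOC:
            out.append(UNICODE_TO_PDFDOC[cp])
        elif 0x20 <= cp < 0x7F:
            # Printable ASCII coincides with PDFDocEncoding.
            out.append(cp)
        else:
            # Not representable in PDFDocEncoding (e.g. U+2012):
            # fall back to UTF-16BE with a byte order mark.
            return b"\xfe\xff" + s.encode("utf-16-be")
    return bytes(out)

print(encode_text_string("–‰").hex())   # 858b             (first file)
print(encode_text_string("–‰‒").hex())  # feff201320302012 (second file)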
Right, I misread the spec. The fix will be in the next poppler release: http://cgit.freedesktop.org/poppler/poppler/commit/?id=bc8076d8f638ccb44f8e3b94aaae96850b025deb