Bug 395660 - okular cannot preserve annotations in some pdf files.
Status: RESOLVED FIXED
Alias: None
Product: okular
Classification: Applications
Component: PDF backend
Version: unspecified
Platform: Other Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Okular developers
Reported: 2018-06-20 16:11 UTC by iuri soter viana segtovich
Modified: 2018-11-21 22:32 UTC

Attachments
An original PDF created in LibreOffice and two pages split using pdfsam (9.55 KB, application/gzip)
2018-06-20 16:15 UTC, iuri soter viana segtovich

Description iuri soter viana segtovich 2018-06-20 16:11:11 UTC
Hi, I noticed a problem when annotating pdf files with okular and noticed that it only occurred with the pdf files that I had processed in pdfsam. I thought it might concern the okular developers, but I have posted this issue on the pdfsam issue tracker as well. The problem is described as follows:

* pdfsam version 3.3.2, when using the Split tool, creates PDF files that okular 0.24.2 can open and annotate, but okular cannot preserve the annotations in these files when using the "save-as" or "export to archive" functions.
* okular worked fine annotating and preserving annotations in the original file, before splitting with pdfsam.
* The problem persisted after updating pdfsam to 3.3.5.
Comment 1 iuri soter viana segtovich 2018-06-20 16:15:27 UTC
Created attachment 113468
An original PDF created in LibreOffice and two pages split using pdfsam

uploading a "good" and a "bad" pdf file.
okular can annotate in the original file (Untitled 2.pdf) and preserve annotations upon "save as" or "export to archive".
okular can annotate on the files processed in pdfsam, but these are not preserved upon "save as" or "export to archive".
Comment 2 Oliver Sander 2018-06-21 07:13:35 UTC
Hi Iuri,

okular 0.24.2 is quite old.  There are reasons to believe that new versions handle these annotations better.  Could you please try that first?

Thanks,
Oliver
Comment 3 Albert Astals Cid 2018-06-21 09:13:40 UTC
It does fail on current version, would need someone to investigate why, probably a poppler issue
Comment 4 Tobias Deiminger 2018-06-21 22:08:06 UTC
(In reply to Albert Astals Cid from comment #3)
> It does fail on current version, would need someone to investigate why,
> probably a poppler issue
I reproduced the error with a standalone poppler application to rule out errors in Okular. Poppler immediately gave the first hints about what's wrong:
"Error: Couldn't find trailer dictionary"
"Error: Invalid XRef entry"

Looking a bit deeper, two characteristics of 'Untitled 1.pdf' make poppler fail:
- The document has an "XRef stream" instead of an "XRef table". XRef streams are available since PDF 1.5 and legitimately have no "trailer" keyword.
- The first object in the XRef stream is 1 (see "17 0 obj <<... /Index [1 17] ...>>") instead of 0.

The start-at-1 thing causes XRef::entries[0].type = xrefEntryNone (see initialization in XRef::resize).

Then, upon document save, PDFDoc::saveIncrementalUpdate iterates over entries ranging from 0 to (getNumObjects-1). Accessing entries[0] where type == xrefEntryNone causes poppler to think this is a damaged file, and it tries to reconstruct the xref table with XRef::constructXRef. XRef::constructXRef wants a "trailer" keyword, but there is none in the file (that's not an error, because we've got a PDF 1.5 XRef stream). XRef::constructXRef can't work without it and bails out with an error.
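
The failure path described above can be sketched as a toy model. This is not poppler code: the names below are hypothetical stand-ins for the C++ members mentioned in this comment, and the table layout is an assumption made only to illustrate why an unlisted slot 0 looks like damage when iteration starts at 0.

```python
# Toy model of the control flow described above; NOT poppler code.
# All names are hypothetical stand-ins for the C++ members named in the comment.
XREF_ENTRY_NONE = "none"
XREF_ENTRY_USED = "used"


def build_entries(first_obj, count):
    """Size the table to first_obj + count and leave unlisted slots as 'none'."""
    entries = [XREF_ENTRY_NONE] * (first_obj + count)
    for num in range(first_obj, first_obj + count):
        entries[num] = XREF_ENTRY_USED
    return entries


def first_unset_entry(entries, start=0):
    """Return the first object number whose entry is 'none', or None if all are set."""
    for num in range(start, len(entries)):
        if entries[num] == XREF_ENTRY_NONE:
            return num
    return None


entries = build_entries(first_obj=1, count=17)  # models /Index [1 17]
print(first_unset_entry(entries, start=0))  # 0: the slot that triggers reconstruction
print(first_unset_entry(entries, start=1))  # None: starting at 1 finds nothing wrong
```

In this model, a table that does list object 0 (for example `build_entries(0, 23)`) never reports an unset entry, which matches the observation that the file with objects 0..22 saves fine.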

I believe there are two things to fix in poppler:
- XRef::constructXRef should support PDF 1.5 XRef streams without trailer dictionary.
- First object number > 0 doesn't indicate a damaged file, but it's valid (am unsure about this). No need to reconstruct XRef at all. Actually, everything works fine if I trick poppler to start iteration at 1 in saveIncrementalUpdate.

There's no problem with the second document 'Untitled 2.pdf', because it uses XRef table with trailer dictionary and has objects 0..22.

Albert, does this sound reasonable? This was my first play with XRef, so the observations may be somewhat wrong. Anyway, we should open a bug at poppler.
Comment 5 Tobias Deiminger 2018-06-24 21:41:08 UTC
(In reply to Tobias Deiminger from comment #4)
> - First object number > 0 doesn't indicate a damaged file, but it's valid
> (am unsure about this)
After investigating a bit more, now I think not having an object 0 is invalid. This would mean '1_PDFsam_Untitled 2.pdf' is invalid, and poppler is NOT to blame (maybe poppler could provide a workaround, though).

Standard section 7.5.4 is explicit that an old fashioned XRef table needs a special object 0:
"The first entry in the table (object number 0) shall always be free and shall have a generation number of 65,535; it is shall be the head of the linked list of free objects."

Now '1_PDFsam_Untitled 2.pdf' has no XRef table but an XRef stream, and it seems a bit ambiguous whether the above statement about object 0 applies to XRef streams too. This needs to be clarified before we can actually blame either poppler or pdfsam. Maybe ask at the Adobe forum, or on the poppler list?

The XRef stream in '1_PDFsam_Untitled 2.pdf' looks like this (needed to decode /Filter /FlateDecode first)

$ dd if=1_PDFsam_Untitled\ 2.pdf ibs=1 skip=5841 count=64 | python3 -c 'import sys, zlib; sys.stdout.buffer.write(zlib.decompress(sys.stdin.buffer.read()))' | hexdump -e '5/1 " %02X" "\n"'
 01 13 73 00 00 # Object 1. Type 1 (used, not compressed), object offset = 0x1373, generation 0
 01 00 0F 00 00 # Object 2. Type 1 (used, not compressed), object offset = 0xf, generation 0
 02 00 01 00 00 # Object 3. Type 2 (compressed), stored in object nr.1, index in object stream 0
 02 00 01 00 01 # Object 4. Type 2 (compressed), stored in object nr.1, index in object stream 1
 02 00 01 00 02 # Object 5. Type 2 (compressed), stored in object nr.1, index in object stream 2
 02 00 01 00 03 # Object 6. Type 2 (compressed), stored in object nr.1, index in object stream 3
 02 00 01 00 04 # Object 7. Type 2 (compressed), stored in object nr.1, index in object stream 4
 02 00 01 00 05 # Object 8. Type 2 (compressed), stored in object nr.1, index in object stream 5
 02 00 01 00 06 # Object 9. Type 2 (compressed), stored in object nr.1, index in object stream 6
 01 00 6F 00 00 # Object 10. Type 1 (used, not compressed), object offset = 0x6f, generation 0
 02 00 01 00 07 # Object 11. Type 2 (compressed), stored in object nr.1, index in object stream 7
 02 00 01 00 08 # Object 12. Type 2 (compressed), stored in object nr.1, index in object stream 8
 02 00 01 00 09 # Object 13. Type 2 (compressed), stored in object nr.1, index in object stream 9
 01 01 0D 00 00 # Object 14. Type 1 (used, not compressed), object offset = 0x10d, generation 0
 01 02 3E 00 00 # Object 15. Type 1 (used, not compressed), object offset = 0x23e, generation 0
 01 15 ED 00 00 # Object 16. Type 1 (used, not compressed), object offset = 0x15ed, generation 0
 01 16 01 00 00 # Object 17. Type 1 (used, not compressed), object offset = 0x1601, generation 0

You see, no special object 0 here. If present, it would look something like this:
 00 00 00 FF FF # Object 0. Type 0 (member of linked list of free objects), generation nr. 65535
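
The annotated dump above can also be produced programmatically. The following is a minimal sketch, assuming the field widths are 1, 2 and 2 bytes (inferred from the 5-byte rows in the dump, not read from the file's /W entry) and that numbering starts at 1 as given by /Index [1 17]. The function names are made up for illustration.

```python
def parse_xref_rows(data, widths=(1, 2, 2), first_obj=1):
    """Split decoded XRef stream bytes into (object number, type, field2, field3).

    widths and first_obj are assumptions taken from the dump above; a real
    parser would read them from the stream dictionary (/W and /Index).
    """
    row_len = sum(widths)
    entries = []
    for i in range(0, len(data) - row_len + 1, row_len):
        row = data[i:i + row_len]
        fields, pos = [], 0
        for width in widths:
            # Fields are stored big-endian (high-order byte first).
            fields.append(int.from_bytes(row[pos:pos + width], "big"))
            pos += width
        entries.append((first_obj + i // row_len, *fields))
    return entries


def describe(num, kind, field2, field3):
    """Render one entry in words, following the three entry types of an XRef stream."""
    if kind == 0:
        return f"obj {num}: free, next free obj {field2}, generation {field3}"
    if kind == 1:
        return f"obj {num}: in use at offset {field2:#x}, generation {field3}"
    if kind == 2:
        return f"obj {num}: compressed in object stream {field2}, index {field3}"
    return f"obj {num}: unknown type {kind}"


# First three rows of the dump above.
sample = bytes.fromhex("0113730000" "01000F0000" "0200010000")
for entry in parse_xref_rows(sample):
    print(describe(*entry))
```

Passing `first_obj=0` and a row such as `00 00 00 FF FF` would yield a type 0 (free) entry, which is exactly the entry missing from this file.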
Comment 6 Albert Astals Cid 2018-06-24 23:02:32 UTC
I think the important question is, does Adobe Reader let you save stuff in that broken file? If so we should try to do the same, and if we can't make it happen i guess we'd need some kind of visual warning (we have one in the command line when saving fails, but that's hardly enough)
Comment 7 Tobias Deiminger 2018-06-25 21:05:41 UTC
(In reply to Albert Astals Cid from comment #6)
> I think the important question is, does Adobe Reader let you save stuff in
> that broken file?
Yes, Adobe Reader can save annotations in '1_PDFsam_Untitled 1.pdf'. Okular can view the saved file afterwards. See details below.

> If so we should try to do the same, and if we can't make
> it happen i guess we'd need some kind of visual warning (we have one in the
> command line when saving fails, but that's hardly enough)
Nothing is impossible:) I'd take it as learning story, with open end and no guarantees. As this may take a looooong time, let's better add the visual warning as interim solution. Or are there some experienced poppler guys out there to join? 

Some details.

On full rewrite ("Save As..."), Adobe Reader created a new XRef stream for objects 0..13. So there was an object 0 after save.

On incremental update ("Save"), Adobe Reader instead added a new XRef stream with /Index[2 2 6 1 18 11] to the end of the file.
The original XRef stream with /Index [1 17] was preserved. In that case there was still no object 0 after save.
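
To see which object numbers such an /Index array covers, the pairs can be expanded. This is a small sketch, assuming the usual reading of /Index as consecutive (first object number, count) pairs; `expand_index` is a name made up for this example.

```python
def expand_index(index):
    """Expand an /Index array of (first, count) pairs into object numbers."""
    if len(index) % 2:
        raise ValueError("/Index must contain an even number of integers")
    numbers = []
    for first, count in zip(index[0::2], index[1::2]):
        numbers.extend(range(first, first + count))
    return numbers


print(expand_index([2, 2, 6, 1, 18, 11]))
# [2, 3, 6, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28]
print(0 in expand_index([1, 17]))
# False
```

Neither the incremental /Index nor the original /Index [1 17] covers object 0, consistent with there being no object 0 after "Save".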

The content of the full rewrite XRef looked as follows:
$ dd if='1_PDFsam_Untitled 1.pdf' ibs=1 skip=12306 count=52 | ./unpredict_png.py | hexdump -e '4/1 " %02X" "\n"'
 00 00 00 00 # obj 0 free, next free object = 0, use gen 0 if reused
 01 1D FB 00
 01 20 D8 00
 01 2D 8A 00
 01 2E 59 00
 01 2F 3E 00
 02 00 01 00
 02 00 01 01
 02 00 01 02
 02 00 01 03
 02 00 03 00
 02 00 03 01
 02 00 03 02
 02 00 04 00

Adobe saves the stream with /DecodeParms<</Columns 4/Predictor 12>> /Filter/FlateDecode.
So to analyze it, one has to decode and unpredict the PNG prediction first. I used this quick and dirty python script:

Listing unpredict_png.py

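#!/usr/bin/python3
# Undo the PNG "Up" prediction (/Predictor 12, /Columns 4) on a Flate-compressed stream read from stdin.
import sys
import zlib

COLUMNS = 4  # data bytes per row; each row is preceded by one filter-type byte

predicted = zlib.decompress(sys.stdin.buffer.read())
# Drop the leading filter-type byte of every row; this assumes every row uses the Up filter (type 2).
rows = [predicted[i + 1:i + 1 + COLUMNS] for i in range(0, len(predicted), COLUMNS + 1)]
prev = bytearray(COLUMNS)  # the row above the first row counts as all zeros
for row in rows:
    for col, value in enumerate(row):
        prev[col] = (value + prev[col]) & 0xFF  # Up filter: add the byte from the row above
    sys.stdout.buffer.write(prev)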
Comment 8 Tobias Deiminger 2018-06-27 21:33:55 UTC
(In reply to Tobias Deiminger from comment #7)
> guarantees. As this may take a looooong time, let's better add the visual
> warning as interim solution.
Probably it's not that bad, here's a poppler patch: https://bugs.freedesktop.org/show_bug.cgi?id=107057
It's sufficient to fix the bug, if the approach is valid.

Another related poppler issue would be to support XRef streams, and the discovery of objects inside object streams, in XRef::constructXRef. I did some experiments that partially work, but it's more difficult and I'm not sure it's worth the while.
Comment 9 Albert Astals Cid 2018-11-21 22:32:25 UTC
I'm going to close assuming that https://gitlab.freedesktop.org/poppler/poppler/issues/139 fixed it. 

Tobias please complain if it isn't correct.