Hi,

I noticed a problem when annotating pdf files with okular, and it only occurred with pdf files that I had processed in pdfsam. I thought it might concern the okular developers, but I have posted this issue on the pdfsam issue tracker as well.

The problem is as follows:

* pdfsam version 3.3.2, when using the Split tool, creates pdf files which okular 0.24.2 can open and annotate, but okular cannot preserve annotations in these files when using the "save-as" or "export to archive" functions.
* okular worked fine annotating and preserving annotations in the original file, before splitting with pdfsam.
* the problem persisted after updating pdfsam to 3.3.5.
Created attachment 113468
An original pdf created in libreoffice and two pages split using pdfsam

Uploading a "good" and a "bad" pdf file. okular can annotate the original file (Untitled 2.pdf) and preserve annotations upon "save as" or "export to archive". okular can annotate the files processed in pdfsam, but the annotations are not preserved upon "save as" or "export to archive".
Hi Iuri, okular 0.24.2 is quite old. There are reasons to believe that new versions handle these annotations better. Could you please try that first? Thanks, Oliver
It does fail on current version, would need someone to investigate why, probably a poppler issue
(In reply to Albert Astals Cid from comment #3)
> It does fail on current version, would need someone to investigate why,
> probably a poppler issue

I reproduced the error with a standalone poppler application, to rule out errors in Okular. Poppler immediately gave first hints about what's wrong:

"Error: Couldn't find trailer dictionary"
"Error: Invalid XRef entry"

Looking a bit deeper, two characteristics of 'Untitled 1.pdf' make poppler fail:
- The document has an "XRef stream" instead of an "XRef table". XRef streams are available since PDF 1.5 and legitimately have no "trailer" keyword.
- The first object in the XRef stream is 1 (see "17 0 obj <<... /Index [1 17] ...>>") instead of 0.

Starting at 1 leaves XRef::entries[0].type = xrefEntryNone (see the initialization in XRef::resize). Then, upon document save, PDFDoc::saveIncrementalUpdate iterates over entries ranging from 0 to (getNumObjects-1). Accessing entries[0] where type == xrefEntryNone causes poppler to think this is a damaged file, and it tries to reconstruct the xref table with XRef::constructXRef. Now XRef::constructXRef wants a "trailer" keyword. But there is no "trailer" keyword in the file (that's not an error, because we've got a PDF 1.5 XRef stream). XRef::constructXRef can't work without it and bails out with an error.

I believe there are two things to fix in poppler:
- XRef::constructXRef should support PDF 1.5 XRef streams without a trailer dictionary.
- A first object number > 0 doesn't indicate a damaged file, but is valid (am unsure about this). No need to reconstruct the XRef at all. Actually, everything works fine if I trick poppler into starting the iteration at 1 in saveIncrementalUpdate.

There's no problem with the second document 'Untitled 2.pdf', because it uses an XRef table with a trailer dictionary and has objects 0..22.

Albert, does this sound reasonable? This was my first play with XRef, so the observation may be somewhat wrong. Anyway, we should open a bug at poppler.
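The failure path described above can be sketched as a toy model. This is not poppler's code: the names load_entries and first_unset_entry are made up for illustration, and the table size of 18 is an assumption derived from /Index [1 17] (objects 1..17). It only shows why an iteration that starts at 0 runs into an entry that was never initialized, while one that starts at 1 does not.

```python
# Toy model of the behaviour described above; not poppler's actual code.
# NONE stands in for xrefEntryNone, USED for any entry the XRef stream filled in.
NONE, USED = "none", "used"


def load_entries(index, size):
    """Build an entry table from /Index pairs; slots not covered stay NONE."""
    entries = [NONE] * size
    for first, count in zip(index[0::2], index[1::2]):
        for num in range(first, first + count):
            entries[num] = USED
    return entries


def first_unset_entry(entries, start=0):
    """Return the first object number whose entry is still NONE, or None."""
    for num in range(start, len(entries)):
        if entries[num] == NONE:
            return num
    return None


# /Index [1 17] covers objects 1..17, so slot 0 is never filled in.
entries = load_entries([1, 17], 18)
print(first_unset_entry(entries))           # iteration from 0 hits object 0
print(first_unset_entry(entries, start=1))  # iteration from 1 finds nothing unset
```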
(In reply to Tobias Deiminger from comment #4)
> - First object number > 0 doesn't indicate a damaged file, but it's valid
> (am unsure about this)

After investigating a bit more, I now think not having an object 0 is invalid. This would mean '1_PDFsam_Untitled 2.pdf' is invalid, and poppler is NOT to blame (maybe poppler could provide a workaround, though).

Standard section 7.5.4 is explicit that an old-fashioned XRef table needs a special object 0:
"The first entry in the table (object number 0) shall always be free and shall have a generation number of 65,535; it shall be the head of the linked list of free objects."

Now '1_PDFsam_Untitled 2.pdf' has no XRef table but an XRef stream, and it seems a bit ambiguous whether the above statement about object 0 applies to XRef streams too. This needs to be clarified before we can actually blame either poppler or pdfsam. Maybe ask at the Adobe forum, or on the poppler list?

The XRef stream in '1_PDFsam_Untitled 2.pdf' looks like this (needed to decode /Filter /FlateDecode first):

    $ dd if=1_PDFsam_Untitled\ 2.pdf ibs=1 skip=5841 count=64 | python3 -c 'import zlib;import sys;sys.stdout.buffer.write(zlib.decompress(sys.stdin.buffer.read()))' | hexdump -e '5/1 " %02X" "\n"'
     01 13 73 00 00   # Object 1. Type 1 (used, not compressed), object offset = 0x1373, generation 0
     01 00 0F 00 00   # Object 2. Type 1 (used, not compressed), object offset = 0xf, generation 0
     02 00 01 00 00   # Object 3. Type 2 (compressed), stored in object nr.1, index in object stream 0
     02 00 01 00 01   # Object 4. Type 2 (compressed), stored in object nr.1, index in object stream 1
     02 00 01 00 02   # Object 5. Type 2 (compressed), stored in object nr.1, index in object stream 2
     02 00 01 00 03   # Object 6. Type 2 (compressed), stored in object nr.1, index in object stream 3
     02 00 01 00 04   # Object 7. Type 2 (compressed), stored in object nr.1, index in object stream 4
     02 00 01 00 05   # Object 8. Type 2 (compressed), stored in object nr.1, index in object stream 5
     02 00 01 00 06   # Object 9. Type 2 (compressed), stored in object nr.1, index in object stream 6
     01 00 6F 00 00   # Object 10. Type 1 (used, not compressed), object offset = 0x6f, generation 0
     02 00 01 00 07   # Object 11. Type 2 (compressed), stored in object nr.1, index in object stream 7
     02 00 01 00 08   # Object 12. Type 2 (compressed), stored in object nr.1, index in object stream 8
     02 00 01 00 09   # Object 13. Type 2 (compressed), stored in object nr.1, index in object stream 9
     01 01 0D 00 00   # Object 14. Type 1 (used, not compressed), object offset = 0x10d, generation 0
     01 02 3E 00 00   # Object 15. Type 1 (used, not compressed), object offset = 0x23e, generation 0
     01 15 ED 00 00   # Object 16. Type 1 (used, not compressed), object offset = 0x15ed, generation 0
     01 16 01 00 00   # Object 17. Type 1 (used, not compressed), object offset = 0x1601, generation 0

You see, no special object 0 here. It would look something like this:

     00 00 00 FF FF   # Object 0. Type 0 (member of linked list of free objects), generation nr. 65535
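For anyone who wants to reproduce the annotations in the dump above, here is a minimal sketch of how the decoded rows map to objects. The function name parse_xref_rows is made up, and the field widths (1, 2, 2) are an assumption inferred from the 5-byte rows and the offsets quoted in the comments; the stream's actual /W array isn't quoted in this thread. The sample bytes are the first three rows of the dump.

```python
def parse_xref_rows(data, index, widths=(1, 2, 2)):
    """Split decoded XRef stream bytes into (object number, type, field 2, field 3).

    `index` is the /Index array as a flat list of (first, count) pairs.
    `widths` plays the role of /W; (1, 2, 2) is inferred from the dump above.
    """
    row_len = sum(widths)
    numbers = [num
               for first, count in zip(index[0::2], index[1::2])
               for num in range(first, first + count)]
    rows = []
    for i, num in enumerate(numbers):
        row = data[i * row_len:(i + 1) * row_len]
        fields, pos = [], 0
        for width in widths:
            # Fields are stored as big-endian unsigned integers.
            fields.append(int.from_bytes(row[pos:pos + width], "big"))
            pos += width
        rows.append((num, *fields))
    return rows


# First three rows of the dump: objects 1, 2 and 3.
sample = bytes.fromhex("0113730000" "01000F0000" "0200010000")
for num, kind, field2, field3 in parse_xref_rows(sample, [1, 3]):
    if kind == 1:
        print(f"object {num}: uncompressed, offset {field2:#x}, generation {field3}")
    elif kind == 2:
        print(f"object {num}: in object stream {field2}, index {field3}")
    else:
        print(f"object {num}: free, next free {field2}, generation {field3}")
```

Because the numbering comes from /Index, a stream that starts at 1 never produces a row for object 0, which matches what the dump shows.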
I think the important question is, does Adobe Reader let you save stuff in that broken file? If so, we should try to do the same, and if we can't make it happen I guess we'd need some kind of visual warning (we have one on the command line when saving fails, but that's hardly enough).
(In reply to Albert Astals Cid from comment #6)
> I think the important question is, does Adobe Reader let you save stuff in
> that broken file?

Yes, Adobe Reader can save annotations in '1_PDFsam_Untitled 1.pdf'. Okular can view the saved file afterwards. See details below.

> If so we should try to do the same, and if we can't make
> it happen i guess we'd need some kind of visual warning (we have one in the
> command line when saving fails, but that's hardly enough)

Nothing is impossible :) I'd take it as a learning story, with an open end and no guarantees. As this may take a looooong time, let's better add the visual warning as an interim solution. Or are there some experienced poppler guys out there to join?

Some details. On full rewrite ("Save As..."), Adobe Reader created a new XRef stream for objects 0..13. So there was an object 0 after save.

On incremental update ("Save"), Adobe Reader instead added a new XRef stream with /Index[2 2 6 1 18 11] to the end of the file. The original XRef stream with /Index [1 17] was preserved. In that case there was still no object 0 after save.

The content of the full rewrite XRef looked as follows:

    $ dd if='1_PDFsam_Untitled 1.pdf' ibs=1 skip=12306 count=52 | ./unpredict_png.py | hexdump -e '4/1 " %02X" "\n"'
     00 00 00 00   # obj 0 free, next free object = 0, use gen 0 if reused
     01 1D FB 00
     01 20 D8 00
     01 2D 8A 00
     01 2E 59 00
     01 2F 3E 00
     02 00 01 00
     02 00 01 01
     02 00 01 02
     02 00 01 03
     02 00 03 00
     02 00 03 01
     02 00 03 02
     02 00 04 00

Adobe saves the stream with /DecodeParms<</Columns 4/Predictor 12>> /Filter/FlateDecode. So to analyze it, one has to inflate it and undo the PNG prediction first.
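I used this quick and dirty python script:

Listing unpredict_png.py

    #!/usr/bin/python3
    import zlib
    import sys

    # Inflate the stream, then undo PNG "Up" prediction (Predictor 12, Columns 4).
    predicted = zlib.decompress(sys.stdin.buffer.read())
    # Each 5-byte row is one predictor tag byte followed by 4 data bytes; drop the tag.
    rows = [predicted[i+1:i+5] for i in range(0, len(predicted), 5)]
    prev = bytearray(4)  # decoded previous row, all zeros before the first row
    for row in range(len(rows)):
        for byte in range(len(rows[row])):
            # "Up": decoded byte = stored byte + byte directly above it (mod 256)
            prev[byte] = (rows[row][byte] + prev[byte]) & 0xFF
        sys.stdout.buffer.write(prev)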
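A quick way to sanity-check a decoder like this is to round-trip a few known rows. The sketch below is only an illustration: the function names png_up_encode and png_up_decode are made up, and it assumes every row uses the PNG "Up" filter (tag byte 2), which is what Predictor 12 implies. The sample rows are the first three rows of the full-rewrite dump quoted above.

```python
import zlib


def png_up_encode(rows):
    """Apply PNG 'Up' filtering: tag byte 2, then each byte minus the byte above."""
    prev = bytes(len(rows[0]))  # the row above the first row counts as all zeros
    out = bytearray()
    for row in rows:
        out.append(2)  # filter type 2 = Up
        out.extend((cur - up) & 0xFF for cur, up in zip(row, prev))
        prev = row
    return bytes(out)


def png_up_decode(data, columns):
    """Undo 'Up' filtering; assumes every row carries tag byte 2."""
    prev = bytearray(columns)
    out = bytearray()
    for start in range(0, len(data), columns + 1):
        row = data[start + 1:start + 1 + columns]  # skip the tag byte
        for j, value in enumerate(row):
            prev[j] = (value + prev[j]) & 0xFF
        out.extend(prev)
    return bytes(out)


# First three rows of the full-rewrite dump: object 0 (free) and two used objects.
rows = [bytes.fromhex(h) for h in ("00000000", "011DFB00", "0120D800")]
stream = zlib.compress(png_up_encode(rows))
decoded = png_up_decode(zlib.decompress(stream), columns=4)
print(decoded == b"".join(rows))
```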
(In reply to Tobias Deiminger from comment #7)
> guarantees. As this may take a looooong time, let's better add the visual
> warning as interim solution.

Probably it's not that bad, here's a poppler patch: https://bugs.freedesktop.org/show_bug.cgi?id=107057

It's sufficient to fix the bug, if the approach is valid.

Another related poppler issue would be to support XRef streams, and discovery of objects inside object streams, in XRef::constructXRef. I did some experiments, partially working, but it's more difficult and I'm not sure if it's worth the while.
I'm going to close assuming that https://gitlab.freedesktop.org/poppler/poppler/issues/139 fixed it. Tobias please complain if it isn't correct.