Bug 436738 - docdata duplicated each time pdf is edited
Summary: docdata duplicated each time pdf is edited
Status: REPORTED
Alias: None
Product: okular
Classification: Applications
Component: general
Version: 20.12.3
Platform: Arch Linux Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Okular developers
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-05-07 15:59 UTC by pbs3141
Modified: 2023-02-15 04:48 UTC
CC List: 2 users

See Also:
Latest Commit:
Version Fixed In:


Attachments

Description pbs3141 2021-05-07 15:59:39 UTC
SUMMARY

In a typical TeX workflow, a user works on a tex file which they periodically use to generate a pdf file. It is common to have the pdf file open in Okular, so that when the pdf file is regenerated, it automatically reloads.

The bug is that each time this happens, Okular writes a new xml file to the docdata folder. This results in docdata being filled with huge numbers of identical files; mine has now reached an impressive 25,895 of them.

The same problem also occurs when using LyX according to its intended workflow. Each time one hits Ctrl+R to view the document, or Ctrl+Shift+R to refresh the document if already open, one gets a new docdata entry.

STEPS TO REPRODUCE

Simply run the following bash script, which simulates the effect of a user making regular updates to a pdf file.

#!/bin/bash

# Build an initial PDF and open it in Okular.
pdftex "\TeX \end" > /dev/null
okular ./texput.pdf &

# Regenerate the PDF once per second with slightly different content
# (pwgen supplies a random word), then count the docdata entries that
# Okular has accumulated for it.
while true; do
    sleep 1
    pdftex "\TeX $(pwgen -N 1) \end" > /dev/null
    find ~/.local/share/okular/docdata -name "*.texput.pdf.xml" | wc -l
done

OBSERVED RESULT

The script prints an ever-increasing sequence of numbers, indicating that xml files are piling up in docdata.

EXPECTED RESULT

The script should repeatedly print "1".
Comment 1 Albert Astals Cid 2021-05-07 18:07:35 UTC
I don't see how this is a bug; we need to save the settings. What would you expect us to do?
Comment 2 pbs3141 2021-05-07 22:49:33 UTC
I would expect the old xml file to be overwritten with the new one. After all, Okular clearly already knows that the new pdf file has overwritten the old one; that's why it reloads the pdf and keeps the same viewing position in the way it does. Given this, it doesn't make sense to keep the old xml file around. It's like Okular's saying "I know you just overwrote your file, but I'm going to keep the viewing parameters around for the old version just in case you want to have another look at it."
Comment 3 pbs3141 2021-05-07 23:02:11 UTC
Here is another way to put it. The only way the old viewing parameters could ever come in useful again is if I secretly kept a copy of the old version, and overwrote the new version with it at some later date. Then Okular could say "aha, let me take you back to where you were". It would then reset the viewing position to how it was when I was last looking at it, which might be different to the current viewing position. But what does the user want in this case? They'd rather stay where they are. So the saved information proves to be of no use.
Comment 4 Laura David Hurka 2021-05-07 23:26:43 UTC
To me this appears clear. There is no point in storing old versions of these files.

My suggestion is to delete old files that describe the same document, and also delete the old file when the document is reloaded automatically.

If we don’t want to save file names but hashes of the file content instead, only my second suggestion would apply. Delete/migrate old descriptions when the document is reloaded.

Question: What do these numbers mean? For example, I have these two files:

13107254.pgfplots.pdf.xml
13572671.pgfplots.pdf.xml

They point to the same file under the same path, but do not store any timestamp. If I remove all but one of them, nothing at all will happen, right?
Comment 5 Albert Astals Cid 2021-05-08 08:33:27 UTC
We only store filesize and filename so you can move files around and your settings are kept.

The fact that you overwrote this instance of (filesize,filename) doesn't mean you don't have other copies of (filesize,filename) in your filesystem, so no, I don't see why we should assume that (filesize,filename) is not useful anymore.
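
For illustration, a minimal sketch of that lookup key, assuming the <size>.<name>.xml naming visible in the file names from comment 4 (GNU stat; nothing here is Okular code):

#!/bin/bash
# Hypothetical reconstruction of a document's docdata entry name from
# (filesize, filename), following the <size>.<name>.xml pattern of comment 4.
f="$1"
echo ~/.local/share/okular/docdata/"$(stat -c %s "$f").$(basename "$f").xml"
# e.g. docdata/13107254.pgfplots.pdf.xml for the first file in comment 4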
Comment 6 pbs3141 2021-05-08 20:09:37 UTC
Albert's point is that I could make many copies of my pdf file, then overwrite it. Under my proposal, all the copies would have their viewing positions reset.

(That is, assuming all the copies have been made in different directories sharing the same filename. And also assuming that the new version is viewed immediately after updating it.)

Given that Okular already resets the viewing position in far more everyday situations like file rename, I don't see how resetting the viewing position in this exotic situation is such a big deal.

Still, my proposal would probably annoy that one user who takes regular snapshots of their system and regularly looks back at old versions of their pdf documents in old snapshots, who would now find that their viewing position keeps resetting.

It's a question of balancing breaking one person's workflow by changing something they shouldn't be relying on anyway (https://xkcd.com/1172), given that Okular doesn't preserve viewing position that well in general, against littering everyone else's systems with thousands of tiny harmless files which, while not taking up very much space, are certainly far from optimal.

I leave it up to you!
Comment 7 Laura David Hurka 2021-05-08 23:12:17 UTC
This is the content of my 13572671.pgfplots.pdf.xml (if formatting doesn’t break here):

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE documentInfo>
<documentInfo url="/home/david/Literatur/pgfplots.pdf">
 <generalInfo>
  <history>
   <oldPage viewport="183;C2:0.499555:0.997468:1"/>
   [...]
   <current viewport="90;C2:0.499555:0.368987:1"/>
  </history>
  <views>
   <view name="PageView">
    <zoom mode="0" value="1.009"/>
    <continuous mode="1"/>
    <viewMode mode="0"/>
    <trimMargins value="0"/>
   </view>
  </views>
 </generalInfo>
</documentInfo>

So it is not just (filesize,filename), but actually stores the full file path.

When I overwrite one instance of (filesize,path), that means that there can’t be another instance of (filesize,path) somewhere in the system. Except when the user actively restores an old version of the file.

Is the url attribute ignored when docdata files for an opened document are searched? In that case it is true that there may be another file which fits this docdata file.

The workflow originally addressed by this bug report was that Okular automatically reloads the document, and so actively creates these duplicates. Now we are talking a bit more generally. Should we split this into two bug reports?
Comment 8 Albert Astals Cid 2021-05-09 10:44:01 UTC
> So it is not just (filesize,filename), but actually stores the full file path.

Is that full file path used anywhere for anything? (Yes, yes, I know: why are we storing the url if we don't use it? Good question.)

> The workflow originally addressed by this bug report was that Okular automatically reloads the document, and so actively creates these duplicates. Now we are talking a bit more generally. Should we split this into two bug reports?

My question is: how can Okular know that the user wants to remove the file? For humans, who are intelligent beings, it's quite easy to see, but for an app, I don't see how it can figure out that the scenario is one in which the data doesn't have to be stored.
Comment 9 pbs3141 2021-05-13 00:03:35 UTC
> Now we are talking a bit more generally. Should we split this into two bug reports?

If you're considering enlarging the scope of the bug report, then this may provide an opportunity to fix both this issue and the file-rename issue all at once.

How about storing a docdata file whose name is the hash of the file size and the first 4kB (or middle 4kB, or whatever)? Assume this is as good as a hash of the whole file, though obviously less expensive to compute.
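
For concreteness, a minimal sketch of such a partial hash; SHA-256, the 4kB chunk, and the partial_hash name are illustrative choices, not Okular code:

#!/bin/bash
# Hypothetical partial hash of a document: its file size plus its first 4kB.
partial_hash() {
    { stat -c %s "$1"; head -c 4096 "$1"; } | sha256sum | cut -d' ' -f1
}
partial_hash texput.pdf   # the docdata entry would be docdata/<this hash>.xml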

Store in the docdata file all full filepaths where this document has been opened from. (This is already done, according to Comment 7.)

Purge from each docdata file any filepaths that have been deleted, and purge any docdata files that have had no filepaths for 6 months (or some configurable expiration period). Do this in an amortised / randomised fashion, only checking a few files on each startup, to keep the I/O negligible.

That fixes file rename. To deal with modifications, create soft links "full path" -> "docdata file" in the docdata directory. If a file is opened with no matching docdata file for its hash, search instead by filename, and if one is found, use that. (And write out a new docdata file named by the hash.) Purge old links whose path no longer exists, in an amortised manner as before.
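
A minimal sketch of that lookup order, reusing the hypothetical partial_hash from above; the by-path subdirectory and all other names are assumptions, not existing Okular behaviour:

#!/bin/bash
# Hypothetical lookup for an opened document:
#   1. docdata/<hash>.xml           (content hash matches)
#   2. docdata/by-path/<full path>  (soft link left by an earlier session)
partial_hash() {
    { stat -c %s "$1"; head -c 4096 "$1"; } | sha256sum | cut -d' ' -f1
}
docdata=~/.local/share/okular/docdata
f=$(realpath "$1")
entry="$docdata/$(partial_hash "$f").xml"
if [ ! -e "$entry" ] && [ -e "$docdata/by-path$f" ]; then
    # The file was modified in place: migrate the old settings to the new hash.
    cp -- "$(readlink -f "$docdata/by-path$f")" "$entry"
fi
mkdir -p "$docdata/by-path$(dirname "$f")"
ln -sfn "$entry" "$docdata/by-path$f"   # keep the per-path link fresh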

Has something like this been previously considered and ruled out?
Comment 10 Laura David Hurka 2021-05-13 13:33:45 UTC
In Bug 317856, it was requested to store the file name only as a hash.

Related, in the merge request https://invent.kde.org/graphics/okular/-/merge_requests/422#note_238154 (Create “Welcome screen” that replaces window where nearly all widgets are in disabled state) Jiří Wolker writes:

> And there is also privacy risk – users sometimes do not want to store
> thumbnails of their documents. (Example: Home directory incl. config
> files is not encrypted, Documents is. When user opens file from Documents,
> the thumbnail gets stored in the home directory. This makes unencrypted
> image of part of the file.)

My idea would be to store docdata (maybe including thumbnails) hashed by the file name/path/content, and encrypted with a hash of the file content, so they can only be read with read access to the document file (or a copy of it).
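
A minimal sketch of that encryption step, assuming SHA-256 for the content hash and OpenSSL's AES-256-CBC as the cipher; both are illustrative stand-ins, not a committed design:

#!/bin/bash
# Hypothetical: derive the docdata encryption key from the document content,
# so the docdata can only be decrypted with read access to the document
# (or a copy of it).
doc="$1"; docdata="$2"
key=$(sha256sum "$doc" | cut -d' ' -f1)
openssl enc -aes-256-cbc -pbkdf2 -pass "pass:$key" \
    -in "$docdata" -out "$docdata.enc"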

To delete old docdata files, there could be a list of the last 5000 used docdata files. This list is updated every time a file is opened/closed, and docdata files no longer in the list are deleted. Docdata files so old that they leave the “Open Recent” list are stripped of their thumbnails.

But this way, there wouldn’t be a reliable way to detect duplicate docdata files.
Comment 11 Albert Astals Cid 2021-05-13 20:28:59 UTC
I fear that what you're suggesting would create too much I/O. Each time I open a PDF I have never opened before, I would have to read all the filenames in the docdata folder in case some of them have a matching sha.

Doesn't sound like it would work well at scale.
Comment 12 Laura David Hurka 2021-05-15 10:29:16 UTC
Every time I open a document, Okular checks whether the file “<size>.<filename>.xml” exists. Is that different to checking whether the file “<hash>.xml” exists?

Or am I misunderstanding something?
Comment 13 Albert Astals Cid 2021-05-15 20:42:27 UTC
(In reply to David Hurka from comment #12)
> Every time I open a document, Okular checks whether the file
> “<size>.<filename>.xml” exists. Is that different to checking whether the
> file “<hash>.xml” exists?
> 
> Or am I misunderstanding something?

My answer was to pbs3141, who was suggesting something different, as far as I understand.

I don't like the idea of identifying the docdata exclusively by hash.

Hashes by definition will have collisions, and so will filename+filesize, but it's much easier to explain that two documents "share" their docdata because of that (and if the user actually has two files with the same filename and size that are not the same, she can rename one of the files) than to explain that they share the hash of their first N bytes, which is something that no one "normal" can really understand, and even if they understand it, they can't fix it.
Comment 14 Laura David Hurka 2021-05-15 22:16:48 UTC
> My answer was to pbs3141, who was suggesting something
> different, as far as I understand.
Ok, now I understand. pbs3141 and I suggested mostly the same thing, just that my suggestion does not use any filepaths, and so does not need to process docdata files to decide whether to delete them.

The fear of collisions is probably real for .txt files and similar, where two different documents can easily have the same first 4kB (the text on the first two pages is the same). The SHA algorithms don’t produce collisions that we should care about. :)
Comment 15 pbs3141 2021-05-17 14:42:57 UTC
> My idea would be to store docdata (maybe including thumbnails) hashed by the file name/path/content, and encrypted with a hash of the file content, so they can only be read with read access to the document file (or a copy of it).

So, you're suggesting the encryption to address the privacy issue mentioned in that thread. Would it not be simpler to make docdata only user-readable?

> I fear that what you're suggesting would create too much I/O. Each time I open a PDF I have never opened before, I would have to read all the filenames in the docdata folder in case some of them have a matching sha.
>
> Doesn't sound like it would work well at scale.

No, I only suggested testing the existence of docdata/$HASH and docdata/$FULLPATH, which takes constant I/O. (I think David already said the same thing in the next comment.)

The only potential source of large I/O in my suggestion was the amortised deletion of stale files. The difficulty is in randomly selecting k files from a directory containing N files, where say k ~ 5 and N ~ 5000. I think this can still be done quickly, in O(k) I/O rather than O(N), by walking the linked list returned by opendir but only reading a random selection of k files. But I'll need to benchmark / read up on disk formats to be sure.
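
A minimal sketch of that amortised purge: listing the directory is the cheap part, and only k files are actually read. The staleness test here (the recorded url no longer exists) is a simplified stand-in for the 6-month expiry proposed in comment 9:

#!/bin/bash
# Hypothetical amortised purge: walk the directory listing (cheap), but
# open only k randomly chosen docdata files per startup.
docdata=~/.local/share/okular/docdata
k=5
shuf -e -n "$k" "$docdata"/*.xml | while read -r f; do
    # Extract the url attribute recorded in the docdata (see comment 7).
    url=$(sed -n 's/.*documentInfo url="\([^"]*\)".*/\1/p' "$f")
    # Delete the entry if its document no longer exists at that path.
    [ -n "$url" ] && [ ! -e "$url" ] && rm -- "$f"
done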

> I don't like the idea of identifying the docdata exclusively by hash.

It's good enough for git! Surely it should be good enough here?

> Hashes by definition will have collisions, and so will filename+filesize, but it's much easier to explain that two documents "share" their docdata because of that (and if the user actually has two files with the same filename and size that are not the same, she can rename one of the files) than to explain that they share the hash of their first N bytes, which is something that no one "normal" can really understand, and even if they understand it, they can't fix it.

I don't like hashing the whole file. Users may want to open some pretty large PDFs. I've personally needed to view a PDF of a long slideshow with many large pictures that was over 1GB. I shouldn't have to hash the whole lot just to view a small part of it.

For PDF, the hash of the filesize + a couple of 4kB chunks spread throughout the file would surely be good enough. For some formats I can imagine users might want to change small bits of the file in a way this can't detect, but PDF isn't one of them.
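
A minimal sketch of such a multi-chunk hash; the chunk positions (start, middle, end) and SHA-256 are arbitrary illustrative choices:

#!/bin/bash
# Hypothetical multi-chunk hash: file size plus 4kB chunks taken from the
# start, the middle, and the end of the file.
f="$1"
size=$(stat -c %s "$f")
{ echo "$size"
  head -c 4096 "$f"
  dd if="$f" bs=4096 skip=$(( size / 8192 )) count=1 2>/dev/null
  tail -c 4096 "$f"
} | sha256sum | cut -d' ' -f1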

> Ok, now I understand. pbs3141 and I suggested mostly the same thing, just that my suggestion does not use any filepaths, and so does not need to process docdata files to decide whether to delete them.

What is lacking in an implementation that doesn't use filepaths is that if you overwrite a PDF in-place, you will lose the viewing data if the file is not currently open in Okular. (I encounter this problem frequently when using LyX.)
Comment 16 Albert Astals Cid 2021-05-18 21:48:22 UTC
> The SHA algorithms don’t produce collisions that we should care about. :)

I don't understand what you mean by this?

> For PDF, the hash of the filesize + a couple of 4kB chunks spread throughout the file would surely be good enough. For some formats I can imagine users might want to change small bits of the file in a way this can't detect, but PDF isn't one of them.

So I open a file, add an annotation, save it as filea.pdf then move the annotation around and save it as fileb.pdf. Your algorithm would think it's the same file.
Comment 17 pbs3141 2021-05-18 23:12:22 UTC
> So I open a file, add an annotation, save it as filea.pdf then move the
> annotation around and save it as fileb.pdf. Your algorithm would think it's
> the same file.

Yes, except in the rare case that the annotation coordinates lie in one of the hashed 4kB chunks. This indeterminacy alone is enough to make me dislike my partial hashing scheme, unless it can be guaranteed that the annotation data will always be included.

Maybe the hash of the whole file is the way to go. Possibly falling back to partial hashing only for huge files.

By the way, if I add and save an annotation on a huge PDF file (> 1GB), does the whole PDF get rewritten out, or just the annotation?
Comment 18 Albert Astals Cid 2021-05-19 21:10:39 UTC
(In reply to pbs3141 from comment #17)
> > So I open a file, add an annotation, save it as filea.pdf then move the
> > annotation around and save it as fileb.pdf. Your algorithm would think it's
> > the same file.
> 
> Yes, except in the rare case that the annotation coordinates lie in one of
> the hashed 4kB chunks. This indeterminacy alone is enough to make me
> dislike my partial hashing scheme, unless it can be guaranteed that the
> annotation data will always be included.
> 
> Maybe the hash of the whole file is the way to go. Possibly falling back to
> partial hashing only for huge files.
> 
> By the way, if I add and save an annotation on a huge PDF file (> 1GB), does
> the whole PDF get rewritten out, or just the annotation?

Normal behaviour is to append at the end, so only the new annotation data is written.