In the following document: http://ww1.microchip.com/downloads/en/devicedoc/61143h.pdf I'm trying to find the string "tad". Okular returns 0 results, but I've tried Evince, Foxit Reader and Acrobat Reader, and all of them return several results: on pages 124, 181, 182, 183, etc.

While trying, I found that I can search for "t ad" (i.e. add a space), and then Okular finds some garbage items plus the real occurrences. Please note this is NOT an acceptable workaround: I need to be sure that my reader is able to find what I need without any "hacks", because I don't know what else it is unable to find.

Reproducible: Always

Steps to Reproduce:
1. Download the file: http://ww1.microchip.com/downloads/en/devicedoc/61143h.pdf (this is a Microchip microcontroller datasheet)
2. Open it in Okular
3. Try to find the string "tad" (without quotes): there will be 0 results
4. Open the same file in Evince (or Foxit Reader, or Acrobat Reader)
5. Try to find the same string "tad": there will be several occurrences

Actual Results: Okular doesn't find the string "tad" in this document, but other readers do.

Expected Results: Okular should find the occurrences of the string "tad".
Created attachment 86345 PDF document in which Okular failed to find occurrences of the string "tad"
Unfortunately, PDF files mostly don't include "text"; they just include characters and their positions, and it's up to the client to "guess" the text that those characters and positions form. Sometimes that is not as easy as we would like.
Jaan, do you think you could have a look at this?
The problem with this file is that the bounding boxes of "T" and "A" overlap. Okular's layout detection algorithm only considers two glyphs to belong to the same word if the second one's bounding box touches the first one's right side exactly (rounded to integer pixels at a certain resolution), not if there is an overlap or a gap. I think I can write a small patch to solve it: accept an overlap (or maybe also a gap) within a percentage of the width of the following character.

In the long run, layout detection is something that will never be 100% perfect, and in particular the XY Cut layout detection approach that Okular uses has some fundamental limitations. I think the layout detection in Okular would benefit from a major refactoring to 1) use existing text flow info in the file if available (Tagged PDF, ePUB, OpenDocument etc.) and 2) for files where text flow data is really missing, reuse algorithms from other similar projects to save the research & development effort. For the current file, however, 1) would not help since it is not a Tagged PDF, i.e. it is one of the kind that Albert described in his comment.
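To make the difference between the two rules concrete, here is a minimal sketch in Python. It is not Okular's actual code: the `Glyph` type, the function names, the 25% tolerance and the coordinates are all assumptions chosen for illustration, and only the horizontal extent of each bounding box is modelled.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Glyph:
    """Hypothetical glyph record: a character and its horizontal extent in pixels."""
    char: str
    left: float
    right: float


def same_word_strict(a: Glyph, b: Glyph) -> bool:
    # Rule as described in the comment: b must start exactly where a ends,
    # after rounding both edges to integer pixels.
    return round(a.right) == round(b.left)


def same_word_tolerant(a: Glyph, b: Glyph, tolerance: float = 0.25) -> bool:
    # Proposed rule: accept an overlap or a gap of up to `tolerance`
    # times the width of the following glyph (the 25% value is an assumption).
    slack = tolerance * (b.right - b.left)
    return abs(b.left - a.right) <= slack


def group_words(glyphs: List[Glyph],
                same_word: Callable[[Glyph, Glyph], bool]) -> List[str]:
    """Join consecutive glyphs into words using the given adjacency rule."""
    words: List[str] = []
    current = ""
    prev = None
    for g in glyphs:
        if prev is not None and not same_word(prev, g):
            words.append(current)
            current = ""
        current += g.char
        prev = g
    if current:
        words.append(current)
    return words


# Made-up coordinates: "A" overlaps the right edge of "T" by one pixel.
glyphs = [Glyph("T", 0.0, 10.0), Glyph("A", 9.0, 14.0), Glyph("D", 14.0, 22.0)]

print(group_words(glyphs, same_word_strict))    # ['T', 'AD']
print(group_words(glyphs, same_word_tolerant))  # ['TAD']
```

With the strict rule the overlap splits the text into "T" and "AD", which is consistent with a search for "t ad" matching while "tad" does not. The right tolerance value would have to be found by testing against real documents.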
Albert, Jaan, thank you for your comments! Jaan, even though (1) would not help for the current file, you said that you can modify Okular's layout detection algorithm so that it will be able to detect text like the "tad" in this document. That would be very nice, really. Okular is a really nice viewer, and I'd be happy to use it as my everyday PDF viewer. For now, I have to use the Windows version of Foxit Reader in Wine; it doesn't work perfectly, but after a lot of research it's the best way I have found. I really hope for the patch! Thank you.
Jaan, poppler supports tagged PDF; you can always have a look at it. As for improvements that make this bug go away, they're always welcome :-)
When testing with some PDF documents on my hard drive, I found that fixing this bug would cause a regression for some PDFs (OCR'ed papers) from JSTOR which have slightly wrong bounding rectangles; for those documents the current rule "two glyphs belong to the same word iff their bounding box edges exactly match" works best. (An example is http://www.jstor.org/stable/1970717 but unfortunately they want money for downloading the PDF unless you belong to a university that has a contract with them.)

However, those JSTOR PDFs are Tagged PDFs and their Tagged PDF actual text content (which can be obtained by copying text from Acrobat Reader) is good. So, in order to avoid regressions, Tagged PDF support (i.e., not doing layout detection for Tagged PDFs) should also be added to Okular when fixing this bug.

However, I didn't find a method returning the Tagged PDF actual text in the Qt4 interface of poppler. The only promising one was Poppler::Page::textList(), which is also currently used by Okular (but Okular does some layout detection chemistry on top of it). From testing with poppler (0.26.0 and 0.26.1), I found that textList() still doesn't return the Tagged PDF text but the results of layout detection done by poppler.
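The "no layout detection for Tagged PDFs" idea could be sketched roughly as follows. This is only an illustration in Python with hypothetical names: `tagged_text` stands in for text that a tagged document would provide, and `detect_layout` for whatever glyph-based word detection is used otherwise. Neither corresponds to an existing poppler or Okular call.

```python
from typing import Callable, List, Optional


def extract_words(tagged_text: Optional[str],
                  detect_layout: Callable[[], List[str]]) -> List[str]:
    """Prefer the document's own text; guess from glyph positions only as a fallback.

    Both parameters are hypothetical placeholders for illustration.
    """
    if tagged_text is not None:
        # Tagged document: trust the embedded text and skip layout detection.
        return tagged_text.split()
    # Untagged document: fall back to guessing words from glyph positions.
    return detect_layout()
```

In this arrangement, a looser adjacency rule would only affect untagged files such as the Microchip datasheet, while the JSTOR-style tagged files would bypass it.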
I am pretty sure poppler supports tagged PDF; maybe it's just not exported. Do you think you could have a look? I'm also the poppler maintainer, so it should not be a problem getting patches in (any more than my usual "I'm very busy and thus slow reviewing stuff").
Thank you for the bug report. As this report hasn't seen any changes in 5 years or more, we ask if you can please confirm that the issue still persists. If this bug is no longer persisting or relevant please change the status to resolved.
(In reply to Justin Zobel from comment #9)
> Thank you for the bug report.
>
> As this report hasn't seen any changes in 5 years or more, we ask if you can
> please confirm that the issue still persists.
>
> If this bug is no longer persisting or relevant please change the status to
> resolved.

No, this bug is still not fixed in version 22.04.0. Both Firefox and Adobe Reader find 13 instances of "tad", while Okular finds only one.