Created attachment 126616 [details]
Test file constructed from excerpt of PDF document.

SUMMARY
The Find function misses target occurrences that wrap to the next line of text.

STEPS TO REPRODUCE
1. Press Control-F to initiate Find. Type in "one-third". The "one-third" in mid-sentence is found. The "one-third" in "not less than one-[carriage return] third that of the larger conductor" is not. Find also misses other phrases used to describe numerical fractions, for example two-thirds, three-quarters, etc.
2. Copy "when not[carriage return] part of the wiring". Press Control-F to initiate Find. Paste the string into Find. Press Next. The text is found. Press Control-F again. Type "when not part of the wiring" in Find. Press Next. The text is not found. Okular appears to miss any typed-in text string that wraps from one line to the next.

OBSERVED RESULT
Misses occurrences of the target string that wrap to the next line of text.

EXPECTED RESULT
Should find target text strings that wrap from one line to the next.

SOFTWARE/OS VERSIONS
Windows: Windows 10, build 1909.
macOS:
Linux/KDE Plasma:
(available in About System)
KDE Plasma Version:
KDE Frameworks Version:
Qt Version:

ADDITIONAL INFORMATION
See attached test file.
Proposed patch at https://invent.kde.org/kde/okular/-/merge_requests/139
Git commit 9694113a961cb5a5d6ef18ce0beeaa975a8c6db3 by Albert Astals Cid.
Committed on 28/03/2020 at 13:59.
Pushed by aacid into branch 'master'.

Let the user type the hyphen if he wants when searching

It happens that sometimes the hyphen is actually "part of the word", like in one-third, so if there's one- at the end of a line and third at the beginning of the next, we should still match and not force the user to type onethird, even though we will also match onethird, since there's no way to know whether a "hyphen at end of line" is supposed to be part of the word or not.

M +16 -0 autotests/searchtest.cpp
M +129 -114 core/textpage.cpp

https://invent.kde.org/kde/okular/commit/9694113a961cb5a5d6ef18ce0beeaa975a8c6db3
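The matching rule described in the commit message can be illustrated with a small sketch. This is not the code from core/textpage.cpp (which works on Okular's own text structures); it is a plain-string approximation using a regular expression, and the names `line_break_tolerant_pattern` and `find_all` are made up for this illustration. It assumes the page text is a single string in which `\n` marks a line break.

```python
import re


def line_break_tolerant_pattern(query: str) -> "re.Pattern[str]":
    """Build a pattern that treats a hyphen at the end of a line as optional."""
    parts = []
    for ch in query:
        if ch == "-":
            # A typed hyphen matches a hyphen, whether or not a line break follows.
            parts.append(r"-\n?")
        else:
            parts.append(re.escape(ch))
    # Between any two typed characters, tolerate a hyphen that ends a line.
    return re.compile(r"(?:-\n)?".join(parts))


def find_all(query: str, page_text: str) -> list:
    """Return every matched substring of page_text, in order."""
    pattern = line_break_tolerant_pattern(query)
    return [m.group(0) for m in pattern.finditer(page_text)]


page = (
    "about one-third of the load, not less than one-\n"
    "third that of the larger conductor"
)
print(find_all("one-third", page))  # mid-line and wrapped occurrence
print(find_all("onethird", page))   # wrapped occurrence only
```

With this rule, a hyphen in the middle of a line is still significant (typing onethird does not match a mid-line "one-third"), while a hyphen directly before a line break matches with or without a typed hyphen, which is the ambiguity the commit message accepts.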
*** Bug 376692 has been marked as a duplicate of this bug. ***
(In reply to Albert Astals Cid from comment #2)
This contribution is excellent, thanks!

However, justified text, which is used in most papers, articles, etc., frequently introduces hyphenation. I think accepting a few false positives (if this change applied to hyphens) is justified by the high number of false negatives to expect when hyphens are omitted from the search: it's much more likely that a word is broken as "di-vided" (and all such occurrences would still be missed) than that someone searches for two independent hyphenated words, such as one-third.

I'm not asking to make it the default, but based on that, could you please provide it as an option so that users are able to omit end-of-line hyphens? That would help some of us greatly.
di-vided has worked and still works, have you even tried it?
(In reply to avlas from comment #4)
To be more specific, under that option:
- typing divided would find divided, di- \n vided, divid- \n ed, etc.
- typing one-third would find one-third and one- \n third
- typing onethird would find onethird and one- \n third
(In reply to Albert Astals Cid from comment #5)
> di-vided has worked and still works, have you even tried it?

I use Okular 19.12, and searching for "proposed" detects proposed but does not detect pro- \n posed.

Searching for "pro-posed" does not detect pro- \n posed either.
(In reply to avlas from comment #7)
OK, I tested this in a second PDF and it works as you mentioned.

The problem I had was specific to a 2-column PDF file that is wrongly treated as 1 column. I assume the problem is with that specific PDF's format, as other 2-column PDFs work just fine in Okular.

I assume there is no simple heuristic to work around these wrongly formatted PDFs, which strongly affect features such as searching, highlighting, and selecting/extracting text.

But that's an entirely different issue from the one fixed here. Again, thanks for your contribution!
(In reply to avlas from comment #8)
Investigating that wrongly formatted PDF file further, I found the following behavior when searching for "circum there": https://i.imgur.com/92SWRjo.png

Does this mean Okular detects a line break and nevertheless jumps to the other column instead of staying in the same column and jumping to the next line?

I assume this is a problem with the PDF and not with Okular, but the behavior seems very strange. I thought the same line covered both columns (with no line break in between), but the hyphen is omitted, which only happens at line breaks, right?
> I assume there is no simple heuristic to work around these
> wrongly formatted PDFs, which strongly affect features such
> as searching, highlighting and selecting/extracting text.

It’s that TextEntity reordering thing.

@avlas Can you search for

will overshadowing would apply

(in the Thumbnails panel, not in the search bar), so we can see the geometry of the TextEntity objects? If the words are clearly separated between the columns, it’s a problem with Okular.

Okular breaks the document apart into single letters and then reorders them based on their positions. It uses XY-Cut to separate columns, so it needs some horizontal space between them. That’s pretty useful for many PDFs that are around on the web (like MeanWell datasheets...), but sometimes it doesn’t work.

It looks like it’s a scanned paper. If it isn’t aligned perfectly vertically, the columns overlap, and XY-Cut fails.

https://phabricator.kde.org/source/okular/browse/master/core/textpage.cpp;9694113a961cb5a5d6ef18ce0beeaa975a8c6db3$1890 if you are interested...

Of course it may still be a problem with the PDF. To check that, you can open it in e.g. Firefox and select some text.
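To illustrate why overlapping columns defeat this kind of splitting, here is a minimal sketch of a single vertical-cut step: project all text boxes onto the x axis and look for empty gaps that are wide enough. It is not Okular's implementation; the `Box` layout, the function name `column_cuts`, and the `min_gap` threshold are assumptions made for this example, and a full XY-Cut would apply such cuts recursively in both directions.

```python
from typing import List, Tuple

# Hypothetical box layout for this sketch: (left, top, right, bottom).
Box = Tuple[float, float, float, float]


def column_cuts(boxes: List[Box], min_gap: float) -> List[float]:
    """Return x positions where a vertical line separates the boxes.

    A cut is reported wherever the horizontal projection of all boxes
    has an empty gap at least min_gap wide.
    """
    intervals = sorted((left, right) for left, _, right, _ in boxes)
    if not intervals:
        return []
    cuts = []
    covered_until = intervals[0][1]
    for left, right in intervals[1:]:
        if left - covered_until >= min_gap:
            # Place the cut in the middle of the empty gap.
            cuts.append((covered_until + left) / 2)
        covered_until = max(covered_until, right)
    return cuts


# Two well-aligned columns: a clear gap between x=42 and x=50.
aligned = [(0, 0, 40, 10), (0, 12, 42, 22), (50, 0, 90, 10), (50, 12, 92, 22)]
# A skewed scan: lower lines drift right, so the projections overlap.
skewed = [(0, 0, 40, 10), (12, 100, 52, 110), (50, 0, 90, 10), (62, 100, 102, 110)]

print(column_cuts(aligned, min_gap=5))
print(column_cuts(skewed, min_gap=5))
```

In the skewed case the left column's lower lines extend past the x position where the right column's upper lines begin, so no empty vertical strip exists and the page is treated as one column, which matches the behavior described above.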
(In reply to David Hurka from comment #10)
Please see: https://i.imgur.com/OV7BLRx.png

I checked it in Chromium and it seems to work fine. Please see the previous example when typing "circumstances": https://i.imgur.com/8vn1Kpp.png

This is an official paper that I downloaded from a journal, but the paper is from 1975, so I'm not sure about the underlying technicalities of the PDF. Still, text handling seems to work just fine (selecting, highlighting, etc.), at least everything that does not involve line breaks and columns; those fail in Okular but seem to work just fine in Chromium. So it might be down to the heuristic in Okular compared to the one in Chromium, perhaps.
The bug still occurs if there is no hyphenation (checked ePUB and fb2).
Created attachment 145232 [details] bug reproduced with a simple word-wrap in ePUB
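For the non-hyphenated case, the expected behavior can be sketched as whitespace normalization: a typed space should match a line break. This is only an illustration of the expected result, not Okular code; `normalize_whitespace` and `contains_phrase` are hypothetical names, and the page text is assumed to be a plain string with `\n` at each wrap.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (including line breaks) to one space."""
    return re.sub(r"\s+", " ", text).strip()


def contains_phrase(page_text: str, query: str) -> bool:
    """Search with line breaks and spaces treated as equivalent."""
    return normalize_whitespace(query) in normalize_whitespace(page_text)


page = "conductors when not\npart of the wiring"
print("when not part of the wiring" in page)                   # naive search
print(contains_phrase(page, "when not part of the wiring"))    # normalized search
```

The naive substring test fails because the typed space does not equal the line break, while the normalized comparison finds the phrase regardless of where the text wraps.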
Another person who noticed this bug (with examples): https://forum.kde.org/viewtopic.php?f=251&t=173120&sid=750110aba8447386711dbb49d12a1bf5