Bug 418520 - Find function misses occurrences of target string that wrap from one line to next line of document text.
Status: REOPENED
Alias: None
Product: okular
Classification: Applications
Component: general
Version: 1.9.2
Platform: Microsoft Windows
Importance: NOR normal
Target Milestone: ---
Assignee: Okular developers
URL:
Keywords:
Duplicates: 376692
Depends on:
Blocks:
 
Reported: 2020-03-06 00:18 UTC by Richard Ferguson
Modified: 2022-05-22 10:59 UTC (History)
CC List: 6 users

See Also:
Latest Commit:
Version Fixed In: 20.08.0


Attachments
Test file constructed from excerpt of PDF document. (26.92 KB, application/pdf)
2020-03-06 00:18 UTC, Richard Ferguson
bug reproduced with a simple word-wrap in ePUB (136.62 KB, image/webp)
2022-01-08 16:55 UTC, soshial

Description Richard Ferguson 2020-03-06 00:18:26 UTC
Created attachment 126616
Test file constructed from excerpt of PDF document.

SUMMARY
The Find function misses target occurrences that wrap to the next line of text.

STEPS TO REPRODUCE
1. Press Ctrl+F to open Find and type "one-third". The "one-third" in mid-sentence is found. The "one-third" in "not less than one-[carriage return]
third that of the larger conductor" is not. Other phrases describing numerical fractions (for example two-thirds, three-quarters, etc.) are missed in the same way.
2. Copy "when not[carriage return]
part of the wiring". Press Ctrl+F, paste the string into Find, and press Next. The text is found.
Press Ctrl+F again and type "when not part of the wiring" into Find. Press Next. The text is not found. Okular appears to miss any typed-in text string that wraps from one line to the next.

OBSERVED RESULT
Misses occurrences of target string that wrap to next line of text.

EXPECTED RESULT
Should find target text strings that wrap from one line to the next.

SOFTWARE/OS VERSIONS
Windows: Windows 10, build 1909.
ADDITIONAL INFORMATION
See attached test file.
Comment 1 Albert Astals Cid 2020-03-14 10:46:41 UTC
Proposed patch at https://invent.kde.org/kde/okular/-/merge_requests/139
Comment 2 Albert Astals Cid 2020-03-28 14:35:13 UTC
Git commit 9694113a961cb5a5d6ef18ce0beeaa975a8c6db3 by Albert Astals Cid.
Committed on 28/03/2020 at 13:59.
Pushed by aacid into branch 'master'.

Let the user type the hyphen if he wants when searching

It happens that sometimes the hyphen is actually "part of the word", as
in one-third, so if there's one- at the end of a line
and third at the beginning of the next, we should still match and not
force the user to type onethird. We will also match onethird, since
there's no way to know whether a "hyphen at end of line" is supposed to be part
of the word or not.

M  +16   -0    autotests/searchtest.cpp
M  +129  -114  core/textpage.cpp

https://invent.kde.org/kde/okular/commit/9694113a961cb5a5d6ef18ce0beeaa975a8c6db3
Comment 3 Nate Graham 2020-03-28 18:15:55 UTC
*** Bug 376692 has been marked as a duplicate of this bug. ***
Comment 4 avlas 2020-03-29 13:40:54 UTC
(In reply to Albert Astals Cid from comment #2)

This contribution is excellent, thanks!

However, justified text, which is what most papers, articles, etc. use, very frequently introduces hyphenation. I think accepting a few false positives (if this change applied to hyphens) is justified by the expected high number of false negatives when omitting hyphens: it is much more likely that a word is split as "di-vided" (and all of those would still be missed) than that someone searches for a genuinely hyphenated compound such as one-third.

I'm not asking to make it the default, but based on that, could you please provide an option so users are able to omit end-of-line hyphens? That would help some of us greatly.
Comment 5 Albert Astals Cid 2020-03-29 13:44:51 UTC
di-vided has worked and still works, have you even tried it?
Comment 6 avlas 2020-03-29 13:49:13 UTC
(In reply to avlas from comment #4)

To be more specific, under that option:

- typing divided would find divided, di- \n vided, divid- \n ed, etc
- typing one-third would find one-third and one- \n third
- typing onethird would find onethird and one- \n third
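
The three cases above can be sketched in a few lines of Python. This is only an illustration of the requested behavior, not Okular's implementation (which lives in core/textpage.cpp and is C++). The function names are made up for this example, and it assumes the page text is available as a plain string with "\n" at line breaks.

```python
import re

def build_hyphen_tolerant_pattern(query: str):
    """Compile a pattern that tolerates an end-of-line hyphen inside the query.

    Hypothetical helper for illustration; not part of Okular.
    """
    parts = []
    for ch in query:
        if ch == "-":
            # A typed hyphen matches a hyphen, optionally followed by a line break.
            parts.append(r"-\n?")
        else:
            parts.append(re.escape(ch))
    # Between any two query characters, allow a hyphen plus line break ("-\n").
    return re.compile(r"(?:-\n)?".join(parts), re.IGNORECASE)

def find_all(query: str, text: str):
    """Return every matched substring, in document order."""
    return [m.group(0) for m in build_hyphen_tolerant_pattern(query).finditer(text)]

text = "divided di-\nvided divid-\ned one-third one-\nthird onethird"
print(find_all("divided", text))    # ['divided', 'di-\nvided', 'divid-\ned']
print(find_all("one-third", text))  # ['one-third', 'one-\nthird']
print(find_all("onethird", text))   # ['one-\nthird', 'onethird']
```

Note that "onethird" does not match "one-third" on a single line here; only a hyphen at the end of a line is treated as optional.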
Comment 7 avlas 2020-03-29 13:51:35 UTC
(In reply to Albert Astals Cid from comment #5)
> di-vided has worked and still works, have you even tried it?

I use Okular 19.12, and searching for "proposed" finds "proposed" but does not find "pro- \n posed".

Searching for "pro-posed" does not find "pro- \n posed" either.
Comment 8 avlas 2020-03-29 14:14:37 UTC
(In reply to avlas from comment #7)

OK, I tested this in a second PDF and it works as you mentioned.

The problem I had was specific to a 2-column PDF file that is wrongly treated as 1 column. I assume the problem lies with that specific PDF, as other 2-column PDFs work just fine in Okular.

I assume there is no simple heuristic to work around these wrongly formatted PDFs, which badly affect features such as searching, highlighting, and selecting/extracting text.

But that's an entirely different issue from the one fixed here. Again, thanks for your contribution!
Comment 9 avlas 2020-03-29 14:23:12 UTC
(In reply to avlas from comment #8)

Investigating that wrongly formatted PDF file further, I found the following behavior when searching for "circum there":

https://i.imgur.com/92SWRjo.png

Does this mean Okular detects a line break and nevertheless jumps to the other column, instead of staying in the same column and jumping to the next line?

I assume this is a problem with the PDF and not with Okular, but the behavior seems very strange. I thought the same line spanned the two columns (no line break in between), but the hyphen is omitted, which only happens at line breaks, right?
Comment 10 David Hurka 2020-03-29 18:51:20 UTC
> I assume there is no simple heuristic to workaround these
> wrongly formatted pdfs, which highly affect features such
> as searching, highlighting and selecting/extracting text.

It’s that TextEntity reordering thing.

@avlas Can you search for

    will overshadowing would apply

(in the Thumbnails panel, not in the search bar), so we can see the geometry of the TextEntity objects? If the words are clearly separated between the columns, it’s a problem with Okular.

Okular breaks the document apart into single letters, and then reorders them based on their positions. It uses XY-Cut to separate columns, so it needs some horizontal space between them. That’s pretty useful for many PDFs which are around on the web (like MeanWell datasheets...), but sometimes it doesn’t work.

It looks like it’s a scanned paper. If it isn’t aligned perfectly vertically, the columns overlap, and XY-Cut fails.
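
To make the failure mode concrete, here is a rough Python sketch of the vertical-cut step of an XY-Cut style layout analysis. It is a simplified illustration under stated assumptions, not the code in core/textpage.cpp: the function name, the (x_min, x_max) box representation, and the min_gap threshold are all invented for this example.

```python
def find_column_gaps(boxes, min_gap):
    """Return (left, right) x-ranges covered by no box and at least min_gap wide.

    boxes: iterable of (x_min, x_max) horizontal extents of text boxes.
    Hypothetical helper for illustration; not Okular's implementation.
    """
    gaps = []
    covered_until = None
    for x_min, x_max in sorted(boxes):
        # A wide enough stretch with no text is a candidate column separator.
        if covered_until is not None and x_min - covered_until >= min_gap:
            gaps.append((covered_until, x_min))
        covered_until = x_max if covered_until is None else max(covered_until, x_max)
    return gaps

# Two cleanly separated columns: a gap is found, so the page can be cut in two.
clean = [(10, 45), (12, 48), (55, 90), (56, 92)]
print(find_column_gaps(clean, min_gap=5))   # [(48, 55)]

# A skewed scan: lines of the left column reach past the start of the right one,
# so the horizontal projections overlap and no cut is possible.
skewed = [(10, 45), (14, 52), (50, 90), (56, 92)]
print(find_column_gaps(skewed, min_gap=5))  # []
```

When no gap is found, the page is handled as a single column, so a text line is read straight across both columns.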

https://phabricator.kde.org/source/okular/browse/master/core/textpage.cpp;9694113a961cb5a5d6ef18ce0beeaa975a8c6db3$1890 if you are interested...

Of course it may still be a problem with the PDF. To check that, you can open it in, e.g., Firefox and select some text.
Comment 11 avlas 2020-03-29 19:10:05 UTC
(In reply to David Hurka from comment #10)

Please see:

https://i.imgur.com/OV7BLRx.png

I checked it in Chromium and it seems to work fine. Please see the previous example when typing "circumstances":

https://i.imgur.com/8vn1Kpp.png

This is an official journal paper that I downloaded, but it dates from 1975, so I'm not sure about the underlying technicalities of the PDF. Still, text handling (selecting, highlighting, etc.) seems to work just fine as long as line breaks and columns are not involved; those cases fail in Okular but seem to work just fine in Chromium. So it might come down to the heuristic in Okular compared to the one in Chromium, perhaps.
Comment 12 soshial 2022-01-08 16:52:50 UTC
The bug still occurs if there is no hyphenation (checked ePUB and fb2).
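
For the non-hyphenated case, the expected behavior amounts to letting a space in the query match a line break in the text. A small Python sketch of that idea follows. It is an illustration only, not how Okular or its ePUB/fb2 backends work, and it assumes the page text is a plain string with "\n" where a line wraps; the function name is made up.

```python
import re

def find_across_line_breaks(query, text):
    """Find query in text, letting each gap between query words match any whitespace run.

    Hypothetical helper for illustration; not part of Okular.
    """
    words = query.split()
    if not words:
        return []
    # Join the escaped words with \s+ so a space in the query also matches "\n".
    pattern = r"\s+".join(re.escape(word) for word in words)
    return [m.group(0) for m in re.finditer(pattern, text)]

text = "the conductors when not\npart of the wiring system"
print("when not part of the wiring" in text)  # False: a plain substring search misses it
print(find_across_line_breaks("when not part of the wiring", text))
# ['when not\npart of the wiring']
```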
Comment 13 soshial 2022-01-08 16:55:40 UTC
Created attachment 145232
bug reproduced with a simple word-wrap in ePUB
Comment 14 soshial 2022-01-16 05:47:33 UTC
Another person who noticed this bug: https://forum.kde.org/viewtopic.php?f=251&t=173120&sid=750110aba8447386711dbb49d12a1bf5 (with examples)