Bug 300992 - Searching for a phrase does not find occurrences split by a line break / newline in PDF
Status: RESOLVED FIXED
Alias: None
Product: okular
Classification: Applications
Component: PDF backend
Version: 0.14.3
Platform: openSUSE Linux
Importance: NOR wishlist
Target Milestone: ---
Assignee: Okular developers
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-06-01 11:47 UTC by Volker Lukas
Modified: 2016-11-25 05:00 UTC
5 users

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:


Attachments
PDF containing two copies of the same phrase, one of which continues onto a second line (7.37 KB, application/pdf)
2012-06-04 11:49 UTC, Volker Lukas

Description Volker Lukas 2012-06-01 11:47:54 UTC
When pressing Ctrl + F to search in a PDF document and entering two words separated by a space in the search bar, occurrences of that phrase which span the end of a line are not found.

Example: One enters "british library" in the search bar, including the space character between the words (excluding the quotes). This will highlight the phrase in the document if both words are found on a single line. But if the word "british" is the last word in a line and the word "library" follows immediately at the beginning of the next line, the phrase is not found / highlighted at that place.

Searching for "britishlibrary" (without the space) will not find the phrase separated by a line break either.

What does work is to insert a line break character (for example by copy and paste from KWrite) between the words in the search bar.

This is not strictly a bug; after all, a newline character is different from a space character. But it makes finding all occurrences of a phrase in a document more complex than necessary.


Reproducible: Always




Using Okular version 0.14.3 (this version was not found in the list-box above)
Comment 1 Albert Astals Cid 2012-06-03 17:38:23 UTC
Can you please attach a pdf and tell us exactly what you are looking for so it's easier for us to develop/check if it works with the development version?
Comment 2 Volker Lukas 2012-06-04 11:49:21 UTC
Created attachment 71566
PDF containing two copies of the same phrase, one of which continues onto a second line
Comment 3 Volker Lukas 2012-06-04 12:01:47 UTC
To reproduce, open the file I just attached with Okular. Then open a search bar and type "abc uvw". This will find only one occurrence of that phrase in the document: the first one is highlighted, but clicking the next button will not find the second occurrence, the one which continues on the second line.

On the other hand, if "abc\nuvw" is entered in the search bar, where the "\n" signifies a newline character, the second occurrence in the document is found (but not the first one).

I suggest adding an option to the search function in Okular which contracts multiple consecutive whitespace characters (including newline) into exactly one space character so that, for example, searching for "abc uvw" will find both "abc uvw" and "abc\nuvw" (or "abc      uvw", etc.).
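
The whitespace contraction suggested above could look roughly like the following. This is only a minimal sketch in standard C++, not Okular code: the helper names collapseWhitespace and containsPhrase are hypothetical, and a real implementation would have to work on the text entities Okular extracts from the page rather than on a plain string.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper (not part of Okular): collapse every run of
// whitespace (spaces, tabs, newlines) into exactly one space character.
std::string collapseWhitespace(const std::string &text)
{
    std::string out;
    bool inRun = false;
    for (unsigned char c : text) {
        if (std::isspace(c)) {
            if (!inRun) {
                out.push_back(' ');
            }
            inRun = true;
        } else {
            out.push_back(static_cast<char>(c));
            inRun = false;
        }
    }
    return out;
}

// Hypothetical helper: true if the query occurs in the page text once
// both sides have been normalized with collapseWhitespace().
bool containsPhrase(const std::string &pageText, const std::string &query)
{
    return collapseWhitespace(pageText).find(collapseWhitespace(query))
           != std::string::npos;
}
```

With this normalization, a query of "abc uvw" would match page text containing "abc uvw", "abc\nuvw" or "abc      uvw" alike. The drawback is that match positions refer to the normalized text, so highlighting would need a mapping back to the original offsets.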
Comment 4 Florian Moretz 2013-02-19 07:53:50 UTC
This bug is also present in Evince. My comment in the bug report there (https://bugzilla.gnome.org/show_bug.cgi?id=622160) fully applies to Okular as well:

The problem here, as shown in the attached screenshot, seems to be twofold:

1.) sentences spanning across line breaks are not recognized as continuous and
aren't taken up by the inbuilt search (lower part of screenshot)

2.) single phrases spanning across line breaks aren't recognized as being
continuous, either. There does not seem to be any difference between hyphenated
and regular phrases in this. Searching for "main-tenance" in the example above
doesn't return any results, either.

Neither of these problems exist in proprietary solutions such as Adobe Reader
or Foxit. I think it can be argued that fixing this issue is quite important as
it greatly diminishes the inbuilt search capabilities.
Comment 5 Florian Moretz 2013-02-19 07:55:06 UTC
This is the screenshot I was referring to: http://bugzilla-attachments.gnome.org/attachment.cgi?id=236718

Evince's behaviour is identical to that of Okular.
Comment 6 Hans Chen 2014-08-06 00:23:35 UTC
Searching for words that span two lines works for me now (see http://tsdgeos.blogspot.com/2012/02/okular-now-with-hyphen-aware-search.html); however, searching for phrases (with spaces) or words with hyphens doesn't work.

Example:

This is an ex-
ample text

Searching for:
example - works
an example - doesn't work
ex-ample - doesn't work
Comment 7 Albert Astals Cid 2014-08-06 19:02:14 UTC
Hans, could you please attach the pdf you're using?
Comment 8 Christoph Feck 2014-09-13 20:34:28 UTC
Hans, if you can provide the information requested in comment #7, please add it.
Comment 9 Hans Chen 2014-09-16 16:06:23 UTC
Sorry about the late reply, I've been traveling during the summer and been generally busy. The pdf file I used was an academic paper that's behind a paywall so I can't share it here. Unfortunately I don't remember which paper I used to test, but I can't reproduce the following case anymore:

Example:

This is an ex-
ample text

Searching for:
an example - doesn't work -> actually works

So you can forget about this case unless someone gives you a test case. That leaves us with the following:

ex-ample - doesn't work

You can try this with e.g. the KDE Dev Guide: http://en.flossmanuals.net/kde-guide/_booki/kde-guide/kde-guide.pdf
Go to page 4 and you'll see the following text:
"graphics, design, communication, translations, documentation, testing, bug-reporting and bug-
hunting, system administration, and coding."
Searching for "bughunting" works, but "bug-hunting" isn't found.

Also, I can still reproduce the originally reported bug that words separated by newline but not connected by a hyphen aren't found.
Example: search for "This includes" in the KDE Dev Guide, it should be found in the third sentence on page 3.

(Okular 0.19.2)
Comment 10 Albert Astals Cid 2014-09-16 17:52:34 UTC
Why would bug-hunting be found?

The - is just there because there's a line break, so no one would search for it willingly.
Comment 11 Hans Chen 2014-09-16 18:04:19 UTC
Because some words could be spelled with a "-". An example in the KDE Dev Manual is "cross-platform", which is spelled like that consistently throughout the document. However, searching for "cross-platform" misses the word on page 14 because it's broken by a newline.

For what it's worth Adobe Reader also fails in this aspect.
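
To illustrate the two readings of a line-end hyphen discussed here, a search could try both: the hyphen as a line-breaking artifact ("bug-\nhunting" read as "bughunting") and the hyphen as part of the word ("cross-\nplatform" read as "cross-platform"). The following is a rough sketch in standard C++ under that assumption; replaceAll and hyphenAwareContains are hypothetical names and this is not how Okular's hyphen-aware search is actually implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: replace every occurrence of `from` in `text` with `to`.
std::string replaceAll(std::string text, const std::string &from, const std::string &to)
{
    if (from.empty()) {
        return text; // nothing sensible to replace
    }
    std::size_t pos = text.find(from);
    while (pos != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos = text.find(from, pos + to.size());
    }
    return text;
}

// Hypothetical helper: search the page text under both readings of a
// hyphen that is directly followed by a line break.
bool hyphenAwareContains(const std::string &pageText, const std::string &query)
{
    // Reading 1: the hyphen only exists because of the line break.
    const std::string joined = replaceAll(pageText, "-\n", "");
    // Reading 2: the hyphen belongs to the word itself.
    const std::string kept = replaceAll(pageText, "-\n", "-");
    return joined.find(query) != std::string::npos
        || kept.find(query) != std::string::npos;
}
```

Under this sketch both "bughunting" and "bug-hunting" would be found in the KDE Dev Guide sentence quoted in comment 9, at the cost of searching the text twice.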
Comment 12 Albert Astals Cid 2014-09-16 18:25:11 UTC
Hmmm, ok, I see what you mean. It's a bit corner case-y but I guess it should not be that hard to fix, *but* I think that should go to a different bug report since it's really a specialization of the originally reported bug. I'll open a new bug about it myself.
Comment 13 Albert Astals Cid 2014-09-16 18:27:15 UTC
https://bugs.kde.org/show_bug.cgi?id=339126
Comment 14 Hans Chen 2014-09-16 23:00:03 UTC
Agreed that it's a separate bug, thanks for creating a report for it!
Comment 15 Marduk 2016-11-13 19:25:08 UTC
This bug is still not fixed (at least in the version included in Ubuntu 16.04). Try searching for "viewed on" in this file http://www.pdf995.com/samples/pdf.pdf

Firefox's JS-based PDF viewer does find the phrase.
Comment 16 Albert Astals Cid 2016-11-14 23:20:17 UTC
Open a different bug, *this* one was fixed.
Comment 17 Marduk 2016-11-15 19:28:50 UTC
(In reply to Albert Astals Cid from comment #16)
> Open a different bug, *this* one was fixed.

I respectfully disagree. My bug report would have exactly the same title and the same description. Honestly, the discussion in this bug report does not make clear that it was fixed. What is clear is that searching for hyphenated words works, which indeed is the case, but that is not the subject of the bug report.

If for technical reasons the status or the title cannot be changed, then it makes sense to open a new bug report.
Comment 18 Marduk 2016-11-24 23:33:32 UTC
This bug bothers me so much that I solved it.

In core/textpage.cpp I appended to the definition of CaseInsensitiveCmpFn and CaseSensitiveCmpFn the following:

// Treat a trailing newline in the page text as a match for a trailing space in the query.
if ( from.endsWith(QLatin1Char('\n')) && to.endsWith(QLatin1Char(' ')) ) {
    return true;
}

This means that a space in the query will match both a space (already handled by QString::compare) and a newline in the PDF.

Could you please include this in the next release?
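
For readers without the Okular sources at hand, the idea behind the change can be sketched in standard C++ as follows. This is an illustration under assumptions, not the actual patch: caseInsensitiveMatch is a hypothetical stand-in for the comparison functions named above, it uses std::string instead of Qt types, and unlike the snippet above it accepts a newline for a space at any position rather than only at the end of the compared fragment.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a comparison callback (not Okular's real
// signature). `from` is a fragment of page text, `to` is the fragment
// of the query it is compared against.
bool caseInsensitiveMatch(const std::string &from, const std::string &to)
{
    if (from.size() != to.size()) {
        return false;
    }
    for (std::size_t i = 0; i < from.size(); ++i) {
        const unsigned char a = static_cast<unsigned char>(from[i]);
        const unsigned char b = static_cast<unsigned char>(to[i]);
        // A newline in the document is accepted where the query has a space.
        if (a == '\n' && b == ' ') {
            continue;
        }
        if (std::tolower(a) != std::tolower(b)) {
            return false;
        }
    }
    return true;
}
```

The relaxation is deliberately one-directional: a space typed in the query matches a newline in the document, but a newline pasted into the query still only matches a newline.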
Comment 19 Oliver Sander 2016-11-25 05:00:57 UTC
Could you please upload your patch to https://git.reviewboard.kde.org/ ?

That will make reviewing it much easier.