Summary: | Searching for a phrase does not find occurrences split by a line break / newline in Pdf | ||
---|---|---|---|
Product: | [Applications] okular | Reporter: | Volker Lukas <vlukas> |
Component: | PDF backend | Assignee: | Okular developers <okular-devel> |
Status: | RESOLVED FIXED | ||
Severity: | wishlist | CC: | aacid, comeniusmar, hanswchen, mardukbp, oliver.sander |
Priority: | NOR | ||
Version: | 0.14.3 | ||
Target Milestone: | --- | ||
Platform: | openSUSE | ||
OS: | Linux | ||
Latest Commit: | Version Fixed In: | ||
Sentry Crash Report: | |||
Attachments: | Pdf containing two copies of the same phrase, one spreads to second line |
Description
Volker Lukas
2012-06-01 11:47:54 UTC
Can you please attach a pdf and tell us exactly what you are looking for, so it's easier for us to develop a fix or check whether it works with the development version?

Created attachment 71566
Pdf containing two copies of the same phrase, one spreads to second line

To reproduce, open the file I just attached with Okular. Then open a search bar and type "abc uvw". This will find only one occurrence of that phrase in the document: the first one is highlighted, but clicking the next button will not find the second occurrence, the one that continues on the second line. On the other hand, if "abc\nuvw" is entered in the search bar, where the "\n" signifies a newline character, the second occurrence in the document is found (but not the first one).

I suggest adding an option to the search function in Okular which contracts multiple consecutive whitespace characters (including newline) into exactly one space character, so that for example searching for "abc uvw" will find both "abc uvw" and "abc\nuvw" (or "abc  uvw" with two spaces, etc.).

This bug is also present in evince. My comment in the bug report there (https://bugzilla.gnome.org/show_bug.cgi?id=622160) fully applies to Okular as well:

The problem here, as shown in the attached screenshot, seems to be twofold:
1.) sentences spanning across line breaks are not recognized as continuous and aren't picked up by the inbuilt search (lower part of screenshot)
2.) single phrases spanning across line breaks aren't recognized as being continuous, either.
There does not seem to be any difference between hyphenated and regular phrases in this. Searching for "main-tenance" in the example above doesn't return any results, either.
Neither of these problems exists in proprietary solutions such as Adobe Reader or Foxit. I think it can be argued that fixing this issue is quite important, as it greatly diminishes the inbuilt search capabilities.

This is the screenshot I was referring to: http://bugzilla-attachments.gnome.org/attachment.cgi?id=236718
Evince's behaviour is identical to that of Okular.
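
The whitespace contraction suggested above could look roughly like the following. This is only a minimal sketch in standard C++ and not Okular code: the function names `collapseWhitespace` and `countPhrase` are hypothetical, and it assumes the page text is available as one plain string.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: replace every run of whitespace
// (space, tab, newline, ...) with exactly one space.
std::string collapseWhitespace(const std::string &text)
{
    std::string out;
    bool inRun = false;
    for (unsigned char c : text) {
        if (std::isspace(c)) {
            if (!inRun) {
                out.push_back(' ');
            }
            inRun = true;
        } else {
            out.push_back(static_cast<char>(c));
            inRun = false;
        }
    }
    return out;
}

// Hypothetical helper: count non-overlapping occurrences of the query
// after contracting whitespace in both the page text and the query.
std::size_t countPhrase(const std::string &text, const std::string &query)
{
    const std::string haystack = collapseWhitespace(text);
    const std::string needle = collapseWhitespace(query);
    if (needle.empty()) {
        return 0;
    }
    std::size_t count = 0;
    for (std::size_t pos = haystack.find(needle);
         pos != std::string::npos;
         pos = haystack.find(needle, pos + needle.size())) {
        ++count;
    }
    return count;
}
```

With the text of the attached test case, `countPhrase("abc uvw xyz abc\nuvw", "abc uvw")` would count both occurrences. A real implementation would also have to map match positions back to the original text in order to highlight them, which this sketch leaves out.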

Searching for words that span two lines works for me now (see http://tsdgeos.blogspot.com/2012/02/okular-now-with-hyphen-aware-search.html); however, searching for phrases (with spaces) or words with hyphens doesn't work.

Example:
This is an ex-
ample text

Searching for:
example - works
an example - doesn't work
ex-ample - doesn't work

Hans, could you please attach the pdf you're using?

Hans, if you can provide the information requested in comment #7, please add it.

Sorry about the late reply, I've been traveling during the summer and been generally busy.

The pdf file I used was an academic paper that's behind a paywall, so I can't share it here. Unfortunately I don't remember which paper I used to test, but I can't reproduce the following case anymore:

Example:
This is an ex-
ample text

Searching for:
an example - doesn't work -> actually works

So you can forget about this case unless someone gives you a test case. That leaves us with the following:

ex-ample - doesn't work

You can try this with e.g. the KDE Dev Guide: http://en.flossmanuals.net/kde-guide/_booki/kde-guide/kde-guide.pdf

Go to page 4 and you'll see the following text:

"graphics, design, communication, translations, documentation, testing, bug-reporting and bug- hunting, system administration, and coding."

Searching for "bughunting" works, but "bug-hunting" isn't found.

Also, I can still reproduce the originally reported bug that words separated by a newline but not connected by a hyphen aren't found. Example: search for "This includes" in the KDE Dev Guide; it should be found in the third sentence on page 3.

(Okular 0.19.2)

Why would bug-hunting be found? The - is just there because there's a line break, so no one would search for it willingly.

Because some words could be spelled with a "-". An example in the KDE Dev Manual is "cross-platform", which is spelled like that consistently throughout the document.
However, searching for "cross-platform" misses the word on page 14 because it's broken by a newline. For what it's worth, Adobe Reader also fails in this respect.

Hmmm, ok, I see what you mean, it's a bit corner case-y but I guess it should not be that hard to fix, *but* I think that should go to a different bug report since it's really a specialization of the originally reported bug. I'll open a new bug about it myself.

Agreed that it's a separate bug, thanks for creating a report for it!

This bug is still not fixed (at least in the version included in Ubuntu 16.04). Try searching for "viewed on" in this file:

http://www.pdf995.com/samples/pdf.pdf

Firefox's JS-based PDF viewer does find the phrase.

Open a different bug, *this* one was fixed.

(In reply to Albert Astals Cid from comment #16)
> Open a different bug, *this* one was fixed.

I respectfully disagree. My bug report would have exactly the same title and the same description. Honestly, the discussion in this bug report does not make clear that it was fixed. What is clear is that searching for hyphenated words works, which indeed is the case, but that is not the subject of the bug report. If for technical reasons the status or the title cannot be changed, then it makes sense to open a new bug report.

This bug bothers me so much that I solved it. In core/textpage.cpp I appended the following to the definitions of CaseInsensitiveCmpFn and CaseSensitiveCmpFn:

    if ( from.endsWith(QLatin1Char('\n')) && to.endsWith(QLatin1Char(' ')) ) {
        return true;
    }

This means that a space in the query will match both a space in the PDF (done by QString::compare) and a newline in the PDF.

Could you please include this in the next release?

Could you please upload your patch to https://git.reviewboard.kde.org/ ? That will make reviewing it much easier.
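
For readers following the thread, the idea behind the proposed patch (a space in the query also matches a newline in the document text) can be illustrated outside of Okular. The following is a minimal sketch in standard C++ rather than Qt; `charMatches` and `findPhrase` are hypothetical names, and the sketch does not reproduce the actual comparator signatures in core/textpage.cpp.

```cpp
#include <cstddef>
#include <string>

// Hypothetical comparator: a query character matches a document character
// if they are equal, or if the query has a space where the document has a
// newline. The reverse does not hold: a newline in the query only matches
// a newline in the document.
bool charMatches(char queryChar, char textChar)
{
    if (queryChar == textChar) {
        return true;
    }
    return queryChar == ' ' && textChar == '\n';
}

// Hypothetical search: return the index of the first match at or after
// `start`, or std::string::npos if there is none.
std::size_t findPhrase(const std::string &text, const std::string &query,
                       std::size_t start = 0)
{
    if (query.empty() || query.size() > text.size()) {
        return std::string::npos;
    }
    for (std::size_t i = start; i + query.size() <= text.size(); ++i) {
        std::size_t j = 0;
        while (j < query.size() && charMatches(query[j], text[i + j])) {
            ++j;
        }
        if (j == query.size()) {
            return i;
        }
    }
    return std::string::npos;
}
```

Unlike collapsing whitespace up front, this approach leaves the document text untouched, so match positions refer directly to the original string. It does not handle runs of several whitespace characters, which the patch quoted above does not address either.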