Bug 103621 - searching pdf files does not find "ff" correctly
Status: RESOLVED FIXED
Product: kpdf
Classification: Applications
Component: general
Version: 0.4
Platform: unspecified Linux
Importance: NOR wishlist
Target Milestone: ---
Assignee: Albert Astals Cid
Duplicates: 126678
Reported: 2005-04-10 20:50 UTC by Markus Waibel
Modified: 2012-06-11 21:12 UTC
Attachments
sample file containing some words with "ff" and "fi"... (11.82 KB, application/pdf)
2005-04-10 22:06 UTC, Markus Waibel

Description Markus Waibel 2005-04-10 20:50:40 UTC
Version:           0.4 (using KDE 3.4.0, Debian Package 4:3.4.0-0pre3 (3.1))
Compiler:          gcc version 3.3.5 (Debian 1:3.3.5-8)
OS:                Linux (i686) release 2.6.8-1-386

Words containing a double f (effect, efficiency, ...) are not found by the find command. I guess the reason is that TeX substitutes these with a nicer, special character. If this is indeed the reason, I am sure there are other cases where the same applies. Unfortunately I don't know enough about TeX to help with this.
Comment 1 Thiago Macieira 2005-04-10 21:48:42 UTC
That means there is no "ff" to be found.
Comment 2 Markus Waibel 2005-04-10 22:06:44 UTC
Created attachment 10571
sample file containing some words with "ff" and "fi"...
Comment 3 Markus Waibel 2005-04-10 22:07:36 UTC
Actually, the same happens with "fi" ... hope the sample file helps!
Comment 4 Thiago Macieira 2005-04-10 22:11:22 UTC
Acrobat Reader can't find "ff" in the file either. As I said, there are no two consecutive "f" letters in the file. There's just a special glyph that looks like them.

Changing to wishlist. Albert can tell you if it's possible to find those at all.
Comment 5 Markus Waibel 2005-04-10 22:55:49 UTC
OK, thanks!
It would really make sense to be able to include these glyphs in the search...
Comment 6 Albert Astals Cid 2005-04-10 23:06:03 UTC
You can include these glyphs; the problem is typing them. If you use kcharselect to copy the ff ligature (page 251, Unicode character U+FB00) and paste it into the search field, the search works.
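
A minimal Python sketch of why the workaround works, and where it stops working. The `extracted` string is a made-up stand-in for text pulled from a TeX-typeset PDF, assuming the file really does map its ligature glyphs to the Unicode ligature code points; it is not KPDF's actual search code.

```python
# Hypothetical extracted text: "efficiency" uses the single code point
# U+FB03 (ffi ligature), "effect" uses U+FB00 (ff ligature).
extracted = "The e\ufb03ciency of the e\ufb00ect"

# A plain two-letter query finds nothing, because no two "f" are adjacent.
print("ff" in extracted)

# Pasting the ligature itself (U+FB00) does match "effect"...
print("\ufb00" in extracted)

# ...but it still misses "efficiency", which uses a different ligature.
print("\ufb00" in "e\ufb03ciency")
```

So pasting U+FB00 only helps for words set with exactly that ligature; "ffi", "fi" and the others would each need their own query.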
Comment 7 Markus Waibel 2005-04-12 00:10:27 UTC
Thanks for the tip! I already did that by copying the text out of the document, but this means you have to find an occurrence of the word manually first. Plus there's not just "ff" but also "fi" (and more?).

Actually, I was thinking more about building support for these special characters into the program as a feature ... I work in science and practically all the people I know (physicists, theoretical biologists, but also computer scientists and mechatronics engineers) use TeX to create their PDFs and hence have nice, typeset Unicode characters (OpenOffice does not suffer from the same problem, incidentally).

I stumbled upon the problem trying to search for the word "efficiency" - it was only after I could not find a single occurrence in 13 out of 15 papers that I grew suspicious ... I haven't done a real review, but I bet more than half the scientific literature out there has an ff/fi problem...

If you do decide to consider adding support, also think about the reverse, copying out of PDFs - I just had a problem with this... (BibTeX does not appreciate "efﬁciency")

I just switched to KPDF 2 weeks ago and am already a big fan! Keep up the good work!
Comment 8 Thiago Macieira 2005-04-12 00:50:55 UTC
We had a lengthy discussion on IRC yesterday about how to properly handle these cases.

First of all, PDF makes no promises about Unicode or codepoint assignments. Since it can embed a font, the characters used in that font don't have to have any meaning. Only when the font isn't embedded can you find a relation between what's on the PDF and the real text.

The consequence of that is that the "ff" ligature, which is represented in Unicode by U+FB00, can be any character! We cannot be sure that a character of value 0xfb00 is an "ff", or that another character in the file isn't "ff". This is also quite common in non-Latin-1 PDF files: the codepoints reserved for the Latin-1 Supplement (0x0080 to 0x00ff) are reused for the other characters in the script.

Moreover, TeX also uses some "ugly hacks" in order to produce non-ASCII text. For instance, the letter "é" will be represented by U+00B4 (ACUTE ACCENT), with an "e" on the line below. If you try to select a text like my middle name ("José"), instead of 4 characters, you'll end up with 9! (Jos', newline, three spaces, e).
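
As a rough illustration in Python: the string below is a hand-written stand-in for what selecting "José" might yield under the layout described above (it is not output from a real extractor). Canonical normalization cannot repair it, because U+00B4 is a spacing accent rather than a combining mark, and it precedes the letter instead of following it.

```python
import unicodedata

# Hypothetical selection result: "Jos", ACUTE ACCENT (U+00B4), newline,
# three spaces, then the "e" from the line below.
extracted = "Jos\u00b4\n   e"
print(len(extracted))  # 9

# NFC leaves the string untouched: nothing here composes into U+00E9.
print(unicodedata.normalize("NFC", extracted) == extracted)

# For comparison, "e" followed by COMBINING ACUTE ACCENT (U+0301)
# does compose into U+00E9 (LATIN SMALL LETTER E WITH ACUTE).
print(unicodedata.normalize("NFC", "e\u0301") == "\u00e9")
```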

Now, that doesn't happen on OpenOffice.org-generated PDFs. Those use their proper codepoints, so not only will you find "é" as U+00E9 (LATIN SMALL LETTER E WITH ACUTE), but you will not find the ff-ligature, unless you explicitly typed it in your .sxw source.

To complicate things, some generated PDFs contain no text at all. I can reproduce that with all my KDE-generated (Qt?) PDFs, for instance. There's simply no text to be found.

Possible solutions:
- search for U+FB00 when the user types "ff" in the search field. Problem: there are many other ligatures possible.
- apply Unicode NFKC to any characters found in the Unicode ligature range, so that they are transformed into their basic compat forms. Problems: as I said, there's no guarantee that U+FB00 is actually the ff-ligature, however likely.
- apply Unicode NFKC on the whole text. Problems: same as above, but with way more problems.

Of those, I think #2 is probably the best.
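
Option #2 could look roughly like the Python sketch below. The function names and the choice of U+FB00 to U+FB06 (the Latin ligatures in the Alphabetic Presentation Forms block) as "the ligature range" are assumptions made for illustration, not KPDF's implementation, and the sketch inherits the caveat above: it trusts that a code point in that range really is the ligature.

```python
import unicodedata

# Assumed "ligature range": Latin ligatures ff, fi, fl, ffi, ffl, long-s-t, st.
LIGATURE_FIRST, LIGATURE_LAST = 0xFB00, 0xFB06

def fold_ligatures(text: str) -> str:
    """Apply NFKC only to characters in the ligature range (option #2)."""
    return "".join(
        unicodedata.normalize("NFKC", ch)
        if LIGATURE_FIRST <= ord(ch) <= LIGATURE_LAST
        else ch
        for ch in text
    )

def contains_folded(text: str, query: str) -> bool:
    """Substring search after folding ligatures on both sides."""
    return fold_ligatures(query) in fold_ligatures(text)

page = "The e\ufb03ciency of the e\ufb00ect"   # hypothetical extracted text
print(fold_ligatures(page))
print(contains_folded(page, "efficiency"))
print(contains_folded(page, "ff"))
```

Restricting the fold to the ligature range is what separates this from option #3: NFKC over the whole text would also rewrite unrelated compatibility characters, for example turning a superscript two (U+00B2) into a plain "2". A real viewer would additionally need to map match positions in the folded text back to glyph positions on the page for highlighting, which this sketch does not attempt.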
Comment 9 Markus Waibel 2005-04-14 16:41:40 UTC
Thanks for the effort! I agree with your suggestions and conclusion.

I guess this would also be something to discuss with the TeX guys ... in the end they are the ones causing the problem, even if it's obviously all done with good intent! Maybe they have some productive input on the subject...
Comment 10 Robin Green 2006-04-21 02:46:13 UTC
*** This bug has been confirmed by popular vote. ***
Comment 11 Albert Astals Cid 2006-05-03 19:49:36 UTC
*** Bug 126678 has been marked as a duplicate of this bug. ***
Comment 12 Christoph 2007-03-03 06:55:34 UTC
I think this is no longer an issue. The current version of KPDF seems to handle searching for ligatures fine. Copying and pasting them is still a problem, though, and I've filed a new bug report for that.
Comment 13 Albert Astals Cid 2012-06-11 21:12:54 UTC
This works fine in Okular, the KDE 4 successor of KPDF. Thanks for taking the time to create this suggestion; we hope you can update to Okular and enjoy this feature.