Version: 0.4 (using KDE 3.4.0, Debian Package 4:3.4.0-0pre3 (3.1))
Compiler: gcc version 3.3.5 (Debian 1:3.3.5-8)
OS: Linux (i686) release 2.6.8-1-386

Words containing a double f (effect, efficiency, ...) are not found by the find command. I guess the reason is that TeX substitutes these with a nicer, special character. If this is indeed the reason, I am sure there are other cases where the same applies - unfortunately I don't know enough about TeX to help on this.
That means there is no "ff" to be found.
Created attachment 10571: sample file containing some words with "ff" and "fi"...
actually, the same happens with "fi" ... hope the sample file helps!
Acrobat Reader can't find "ff" in the file either. As I said, there are no two consecutive "f" letters in the file. There's just a special glyph that looks like them. Changing to wishlist. Albert can tell you if it's possible to find those at all.
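To illustrate the point, here is a minimal Python sketch. It assumes the text extracted from the PDF exposes the ligature as the single Unicode character U+FB00, which is a common case but not guaranteed; the variable name `extracted` is made up for the example.

```python
import unicodedata

# Hypothetical extracted text: "effect" with the two f's stored as one
# ligature character (U+FB00), as a TeX-generated PDF might expose it.
extracted = "e\ufb00ect"

# The word has 5 characters, not 6: e, <ff ligature>, e, c, t.
print(len(extracted))

# A plain substring search for two "f" letters therefore fails...
print("ff" in extracted)

# ...while a search for the ligature character itself succeeds.
print("\ufb00" in extracted)

# Official Unicode name of the character.
print(unicodedata.name("\ufb00"))
```

Under that assumption, a search for "ff" has nothing to match against, which is consistent with what both KPDF and Acrobat Reader show.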
ok, tx! Would really make sense to be able to include these glyphs in the search...
You can include these glyphs; the problem is typing them. If you use kcharselect to copy the ff ligature (page 251, Unicode char FB00), the search works.
tx for the tip! I already did that by copying the text - it's just that this means you have to find a word manually first. plus there's not just "ff" but also "fi" (and more?)

actually, i thought more about including these special characters into the program as a feature ... i work in science and practically all ppl I know (physicists, theor. biology, but also computer science and mechatronics) use TeX to create their pdfs and hence have nice, typeset unicode characters (openoffice does not suffer the same problems, incidentally). i stumbled upon the problem trying to search for the word efficiency - it was only after i could not find a single occurrence in 13 out of 15 papers that I grew suspicious ... I haven't done a real review, but I bet more than half the scientific literature out there has an ff/fi problem...

if you do decide to consider adding support, also think about the reverse - copying out of pdfs - just had a problem with this... (bibtex does not appreciate "efﬁciency")

i just switched to kpdf 2 weeks ago and am already a big fan! keep up the good work!
We had a lengthy discussion on IRC yesterday about how to properly handle these cases.

First of all, PDF makes no promises about Unicode or codepoint assignments. Since it can embed a font, the characters used in that font don't have to have any meaning. Only when the font isn't embedded can you find a relation between what's in the PDF and the real text.

The consequence is that the "ff" ligature, which is represented in Unicode by U+FB00, can be any character! We cannot be sure that a character of value 0xfb00 is an "ff", or that another character in the file isn't "ff". This is also quite common in non-Latin-1 PDF files: the codepoints reserved for the Latin-1 Supplement (0x0080 to 0x00ff) are reused for the other characters in the script.

Moreover, TeX also uses some "ugly hacks" in order to produce non-ASCII text. For instance, the letter "é" will be represented by U+00B4 (ACUTE ACCENT), with an "e" on the line below. If you try to select a text like my middle name ("José"), instead of 4 characters, you'll end up with 9! (Jos´, newline, three spaces, e).

Now, that doesn't happen with OpenOffice.org-generated PDFs. Those use their proper codepoints, so not only will you find "é" as U+00E9 (LATIN SMALL LETTER E WITH ACUTE), but you will not find the ff-ligature, unless you explicitly typed it in your .sxw source.

To complicate things, some generated PDFs contain no text at all. I can reproduce that with all my KDE-generated (Qt?) PDFs, for instance. There's simply no text to be found.

Possible solutions:
1. Search for U+FB00 when the user types "ff" in the search field. Problem: there are many other ligatures possible.
2. Apply Unicode NFKC to any characters found in the Unicode ligature range, so that they are transformed into their basic compat forms. Problem: as I said, there's no guarantee that U+FB00 is actually the ff-ligature, however likely.
3. Apply Unicode NFKC to the whole text. Problems: same as above, but affecting far more characters, since NFKC also rewrites every other compatibility character (superscripts, full-width forms and so on).
Of those, I think #2 is probably the best.
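A rough Python sketch of the second option (NFKC applied only to the ligature range), for illustration. This is not KPDF's code. The function name `expand_ligatures` is invented, and limiting the range to the Latin ligatures U+FB00..U+FB06 is an assumption about what "the Unicode ligature range" should cover. It also assumes the PDF really maps its ligature glyphs to those codepoints, which, as said above, is likely but not guaranteed.

```python
import unicodedata

# Assumption: only the Latin ligatures of the Alphabetic Presentation
# Forms block (ff, fi, fl, ffi, ffl, long-s-t, st) are expanded.
LIGATURE_FIRST = 0xFB00
LIGATURE_LAST = 0xFB06


def expand_ligatures(text: str) -> str:
    """Apply NFKC only to characters inside the ligature range."""
    out = []
    for ch in text:
        if LIGATURE_FIRST <= ord(ch) <= LIGATURE_LAST:
            # NFKC replaces the ligature by its plain letters.
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            # Everything else is left exactly as extracted.
            out.append(ch)
    return "".join(out)


page_text = "e\ufb03ciency and e\ufb00ect, x\u00b2"
searchable = expand_ligatures(page_text)

print("efficiency" in searchable)  # the ffi ligature is now three letters
print("effect" in searchable)      # the ff ligature is now two letters
print("\u00b2" in searchable)      # the superscript two is untouched
```

Running NFKC over the whole string instead would also turn the superscript two into a plain "2", which is the kind of side effect that makes the third option riskier. Note that the expanded text is longer than the original, so a real viewer would need to map match positions back to the original glyphs for highlighting.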
Thanks for the effort! I agree with your suggestions and conclusion. I guess this would also be something to discuss with the TeX guys ... in the end they are the ones causing the problem, even if it's obviously all done with good intent! Maybe they have some productive input on the subject...
*** This bug has been confirmed by popular vote. ***
*** Bug 126678 has been marked as a duplicate of this bug. ***
I think this is no longer an issue. The current version of KPDF seems to handle searching for ligatures fine. Copying and pasting them is still a problem, though, and I've filed a new bug report for that.
This works fine in Okular, the KDE 4 successor of KPDF. Thanks for taking the time to file this suggestion; we hope you can update to Okular and enjoy this feature.