419447 – Search does not match any accented character (diacritics)

Summary: Search does not match any accented character (diacritics)

Status:	RESOLVED NOT A BUG

Alias:	None

Product:	okular
Classification:	Applications
Component:	general (other bugs)
Version First Reported In:	1.7.0
Platform:	Mageia RPMs Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	Okular developers

URL:
Keywords:

Depends on:
Blocks:

Reported:	2020-03-31 09:36 UTC by Jerome
Modified:	2020-04-07 07:57 UTC
CC List:	2 users

See Also:
Latest Commit:
Version Fixed/Implemented In:
Sentry Crash Report:

Attachments
Screenshot showing search result for "e" (6.70 KB, image/png) 2020-04-01 09:12 UTC, Jerome
PDF file with a series of accented characters (11.93 KB, application/pdf) 2020-04-01 09:12 UTC, Jerome
Search for "à" in the document using evince (7.19 KB, image/png) 2020-04-05 19:45 UTC, Jerome

Description Jerome 2020-03-31 09:36:43 UTC

SUMMARY

Searching a PDF document in Okular does not match any accented character (diacritics)

STEPS TO REPRODUCE

1. Open PDF document containing an accented character like à
2. Search for "à"

OBSERVED RESULT

Search finds no match.

EXPECTED RESULT

Find a match.

SOFTWARE/OS VERSIONS
Linux/KDE Plasma: Mageia 7
KDE Frameworks Version: 5.57.0
Qt Version: Qt 5.12.6 (built against 5.12.2)

ADDITIONAL INFORMATION

This is a different problem from Bug #274933.

Comment 1 Albert Astals Cid 2020-03-31 21:57:50 UTC

file please?

Comment 2 Jerome 2020-04-01 09:12:20 UTC

Created attachment 127151 [details]
Screenshot showing search result for "e"

Searching for e in case-insensitive mode seems to show that the accented character is interpreted as two characters, with only the first one matching.

Comment 3 Jerome 2020-04-01 09:12:54 UTC

Created attachment 127152 [details]
PDF file with a series of accented characters

Comment 4 Jerome 2020-04-01 09:16:01 UTC

Another point I noticed: when using the Selection tool and selecting an accented character, the context menu offers to copy two characters.

Comment 5 Yuri Chornoivan 2020-04-01 09:18:08 UTC

All the needed data supplied.

Conflicting bug report:

https://bugs.kde.org/show_bug.cgi?id=274933

Comment 6 Jerome 2020-04-01 09:29:31 UTC

To connect this with bug #274933:

if the text contains "aé éa" then searching for "ae" matches the first word (and highlights the "a" and only one half of the "é" character as in the screenshot).
Searching for "ea" does not find a match, I assume because of the virtual, unmatchable second character of "é".
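
The asymmetry described above can be reproduced with a small sketch. It assumes, purely for illustration, that the viewer extracts each "é" as the letter "e" followed by a separate spacing acute accent (U+00B4); the exact characters stored in the attached PDF may differ.

```python
# Assumption: "é" is extracted as "e" followed by a separate spacing
# acute accent (U+00B4). This is a stand-in, not verified against the PDF.
ACUTE = "\u00b4"

# Models the extracted text of "aé éa" under that assumption.
extracted = f"ae{ACUTE} e{ACUTE}a"

# "ae" matches the first word: the letters "a" and "e" are adjacent,
# and the accent that follows is simply left out of the match.
ae_pos = extracted.find("ae")

# "ea" does not match the second word: the accent sits between
# the "e" and the "a", so the two letters are never adjacent.
ea_pos = extracted.find("ea")

print(ae_pos, ea_pos)
```

Under this model the partial highlight in the screenshot corresponds to a match that covers the base letter but not the accent character that follows it.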

Comment 7 Albert Astals Cid 2020-04-05 17:25:36 UTC

I'm sorry, but this is not a bug. The PDF is simply not created correctly: it is created with an A and then a ` on top of it as two separate characters, and not with a single À character.

That's why search fails and why copy&paste gives you two characters: because there are two characters.

I have not been able to find any PDF viewer that can search à in this document. (Adobe Reader cheats: since it can't find any à, it decides to match every a in the document, and also matches Ä, for example.)
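
A minimal sketch of why a plain text search cannot find "à" in such a document, using Python's standard `unicodedata` module. It assumes the extracted text holds the letter followed by a spacing grave accent (U+0060); that is an illustration of the "two characters" case, not a dump of the attached file.

```python
import unicodedata

precomposed = "\u00e0"   # "à" as one code point, what the user types
combining = "a\u0300"    # "a" + COMBINING GRAVE ACCENT
spacing = "a\u0060"      # "a" + spacing GRAVE ACCENT (assumed extraction result)


def nfc(text: str) -> str:
    """Return the NFC (canonically composed) form of text."""
    return unicodedata.normalize("NFC", text)


# A letter plus a *combining* accent composes to the single character.
combining_matches = nfc(combining) == precomposed

# A letter plus a *spacing* accent stays two characters after normalization.
spacing_matches = nfc(spacing) == precomposed

# So a substring search for "à" in the extracted text finds nothing.
found = precomposed in nfc(spacing)

print(combining_matches, spacing_matches, found)
```

Unicode normalization alone therefore does not help here; a viewer would need extra logic to treat the letter and the spacing accent as one unit.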

Comment 8 Jerome 2020-04-05 19:45:24 UTC

The PDF is generated by pdflatex.

I still think that is a bug, because one way I can start such a search is by copying accented characters from the document and pasting them into the search box. I don't know if that is one or two characters but whatever that string is in the PDF, I'd like to search for it.

> I have not been able to find any PDF viewer that can search à in this document

I have. It was the first one I tried: evince.

Comment 9 Jerome 2020-04-05 19:45:58 UTC

Created attachment 127309 [details]
Search for "à" in the document using evince

Comment 10 Albert Astals Cid 2020-04-06 21:06:32 UTC

(In reply to Jerome from comment #8)
> The PDF is generated by pdflatex.

I know LaTeX has too many configuration options, and one of them stupidly does it wrong. Search for it, because there's one that does it right and writes a single character.

> 
> I still think that is a bug, because one way I can start such a search is by
> copying accented characters from the document and pasting them into the
> search box. I don't know if that is one or two characters but whatever that
> string is in the PDF, I'd like to search for it.
> 
> > I have not been able to find any PDF viewer that can search à in this document
> 
> I have. It was the first one I tried: evince.

Comment 11 Jerome 2020-04-07 07:57:53 UTC

Indeed that PDF document was encoded as OT1, which is not recommended, and the search works with the document encoded with T1.

What I find strange is that when copied and pasted into the search box, the pair of characters (letter + diacritic) is correctly interpreted and, I assume, converted to its Unicode equivalent.
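
One way such a conversion could work is sketched below. This is a hypothetical illustration, not Okular's or Qt's actual behavior: the helper `fold_accents` and its mapping table are invented for this example, and the letter-then-accent ordering is an assumption.

```python
import unicodedata

# Hypothetical table: spacing accent -> corresponding combining mark.
SPACING_TO_COMBINING = {
    "\u0060": "\u0300",  # grave
    "\u00b4": "\u0301",  # acute
    "\u005e": "\u0302",  # circumflex
    "\u007e": "\u0303",  # tilde
    "\u00a8": "\u0308",  # diaeresis
}


def fold_accents(text: str) -> str:
    """Turn letter + spacing accent pairs into precomposed characters.

    A spacing accent that directly follows a letter is replaced by its
    combining mark; NFC normalization then merges the pair where a
    precomposed character exists.
    """
    out = []
    for ch in text:
        if ch in SPACING_TO_COMBINING and out and out[-1].isalpha():
            out.append(SPACING_TO_COMBINING[ch])
        else:
            out.append(ch)
    return unicodedata.normalize("NFC", "".join(out))


print(fold_accents("voila\u0060"))
```

A real implementation would need more care, because a character such as `^` or `~` after a letter is often literal text and not an accent.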