Bug 436791

Summary: baloo treats similar letters as different - follows Unicode standard
Product: [Frameworks and Libraries] frameworks-baloo Reporter: Amanda99 <amanda99>
Component: general Assignee: Stefan Brüns <stefan.bruens>
Status: RESOLVED NOT A BUG    
Severity: normal CC: baloo-bugs-null, nate
Priority: NOR    
Version First Reported In: 5.80.0   
Target Milestone: ---   
Platform: Ubuntu   
OS: Linux   
Latest Commit: Version Fixed/Implemented In:
Sentry Crash Report:

Description Amanda99 2021-05-08 19:36:38 UTC
SUMMARY

Hi!
Thanks for making Baloo; its solid indexing features are one of the things that make Linux better suited for me.

I do, however, have an issue with the Baloo indexer.

Sadly, it sometimes treats similar letters (national ones, such as Polish) as different. While they are different, it is not uncommon to avoid using them in filenames to save yourself some problems (admittedly, that's more of an old habit, from when code pages were a massive PITA). Anyway, the letters 'l' and 'ł' ('l' with a stroke) are not considered similar enough (even though 'l' is sometimes used when typing 'ł' is inconvenient), yet the letters 'e' and 'ę' ('e' with a tail) are considered similar (i.e. the search results for words with 'ę', like "się", also include the phrase 'sie').

Operating System: Kubuntu 21.04
KDE Plasma Version: 5.21.4
KDE Frameworks Version: 5.80.0
Qt Version: 5.15.2
Kernel Version: 5.11.0-16-generic
OS Type: 64-bit
Graphics Platform: X11
Processors: 4 × AMD PRO A12-9800B R7, 12 COMPUTE CORES 4C+8G
Memory: 14.6 GiB of RAM
Graphics Processor: AMD Radeon R7 Graphics

KDE installed from the official repository
Comment 1 Stefan Brüns 2021-05-08 19:48:03 UTC
Baloo relies on decomposition according to the Unicode standard. E.g. the letter 'ä' has an equivalent decomposition 'a + diaeresis' (diaeresis: "dots"). 'ł' has no equivalent decomposition.

You can see all the equivalents either in the Unicode standard, or with KCharSelect.
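
The decompositions can also be checked programmatically. The following is only an illustrative sketch using Python's standard `unicodedata` module, not Baloo's actual code (Baloo is C++/Qt); the helper name `strip_marks` is hypothetical. It assumes the folding amounts to "decompose, then drop combining marks", which matches the behaviour described in this report but is not confirmed against the Baloo sources.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Hypothetical helper: decompose (NFD), then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Show the decomposition mapping recorded in the Unicode Character Database.
# An empty string means the character has no decomposition.
for ch in "äęł":
    print(ch, repr(unicodedata.decomposition(ch)))

# 'ę' folds to 'e' because it decomposes into 'e' + combining ogonek;
# 'ł' is left alone because it has no decomposition at all.
print(strip_marks("się"))
print(strip_marks("łódź"))
```

Under that assumption, 'ę' and 'ä' fold to their base letters, while 'ł' passes through unchanged, which is the asymmetry described in the report.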

If you think this is wrong, please report it to the Unicode Consortium. Baloo is not able to, and thus won't, maintain a list of exceptions to the ever-evolving Unicode standard.