Bug 484828

Summary: [Enhancement] Have Baloo split camelCase words
Product: [Frameworks and Libraries] frameworks-baloo
Reporter: Kody <kodyvonbargen>
Component: general
Assignee: baloo-bugs-null
Status: REPORTED ---
Severity: normal
CC: postix, stefan.bruens, tagwerk19
Priority: NOR    
Version First Reported In: 6.0.0   
Target Milestone: ---   
Platform: Arch Linux   
OS: Linux   
Latest Commit:
Version Fixed/Implemented In:
Sentry Crash Report:

Description Kody 2024-03-31 17:05:55 UTC
SUMMARY
Baloo splits file and directory names into searchable words when they use one of the following naming schemes:
1. kebab-case
2. snake_case
3. space separated

However, Baloo does not handle camelCase well. Having used camelCase for years, I had chalked this up to Baloo being bad at searching filenames, when it was really only this naming scheme that it handled poorly.

STEPS TO REPRODUCE
1. In Dolphin, create the following in some subdirectory: "oneFileTest.txt", "one_file_test.txt", "one-file-test.txt", and "one file test.txt"
2. Use Baloo in Dolphin via `ctrl+f` and search for the strings "one", "file", and "test", filtering them with the "Filter" feature (pressing `/` with the preview pane focused) if needed

OBSERVED RESULT
All 4 files show up when you search for "one", but "oneFileTest.txt" does not show up when you search for "file" or "test".

EXPECTED RESULT
All 4 of the results would ideally show up in all the searches
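
A toy model that reproduces the observed result, in case it helps the discussion. This is not Baloo's tokenizer: the functions `terms`, `matches` and `search` are hypothetical, and the sketch assumes names are lowercased, split on non-alphanumeric characters, and matched by term prefix.

```python
import re

FILES = [
    "oneFileTest.txt",
    "one_file_test.txt",
    "one-file-test.txt",
    "one file test.txt",
]


def terms(filename: str) -> list[str]:
    # Assumption: lowercase the name, then split on anything that is
    # not a letter or digit (underscore counts as a separator here).
    return [t for t in re.split(r"[\W_]+", filename.lower()) if t]


def matches(query: str, filename: str) -> bool:
    # Assumption: a query matches when it is a prefix of some term.
    return any(t.startswith(query.lower()) for t in terms(filename))


def search(query: str) -> list[str]:
    return [f for f in FILES if matches(query, f)]
```

Under these assumptions "oneFileTest.txt" yields only the terms "onefiletest" and "txt", so "one" matches as a prefix while "file" and "test" do not.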

SOFTWARE/OS VERSIONS
Linux/KDE Plasma: Arch 6.8.2-arch2-1
KDE Plasma Version: 6.0.2
KDE Frameworks Version: 6.0.0
Qt Version: 6.6.2

ADDITIONAL INFORMATION
For me, Baloo had a bit of a bad rep because I'd search for sub-strings of filenames I knew existed, but it would not find them because I had been using camelCase for years. I feel that if Baloo handled camelCase and PascalCase the way it handles the others, a lot of people would have a much better time using it.
Comment 1 tagwerk19 2024-04-14 10:17:40 UTC
What a nice idea!

A vote of support :-)
Comment 2 Stefan Brüns 2024-04-15 10:57:00 UTC
CamelCase is not actually something which can be trivially split.

Yes, it would work for the cases you presented, but there are too many cases where it would not work, e.g. mixed-case acronyms. While these are not very common among English acronyms, Baloo also has to work for other languages. Also, trademark names often use mixed case (either because they are actually acronyms, to make them stand out, or just to make them trademark-able at all).
Comment 3 tagwerk19 2024-04-15 11:30:26 UTC
(In reply to Stefan Brüns from comment #2)
> ... e.g. mixed-case acronyms ... trademark names ...
Hmmm....

So things like "iPad" and "NaN" ...
    ... or "LaTeX" :-/
    ... or "McArthur"

It probably wouldn't matter if you found "iPad" when searching for "pad". I think whatever algorithm is used would still need to index "ipad", "nan" and "latex". On the plus side, the benefit of just being able to search C++ code would be remarkable.

I think a list of "awkward edge cases" (or is that awkwardEdgeCases?) would be needed to see if there are useful patterns or traps...
Comment 4 tagwerk19 2024-04-16 10:08:42 UTC
(In reply to Stefan Brüns from comment #2)
> ... While these are not very common for english acronyms ...
If I look through:
    https://en.wikipedia.org/wiki/Lists_of_acronyms
there are a small handful. I haven't read it all...

I would say, provided that Baloo indexes the whole name, it could helpfully split on the "camelCase" boundaries. It would need to avoid splitting off single letters (mW, MiB, IoT etc.).

The question is what "traps for unwary" look like in other languages...
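
A minimal sketch of that idea, to make the edge cases concrete. It is not Baloo code: `camel_terms` and its `min_len` parameter are hypothetical names, and a "boundary" is assumed to be a lowercase letter directly followed by an uppercase letter.

```python
def camel_terms(word: str, min_len: int = 2) -> list[str]:
    """Return the whole word (lowercased) plus its camelCase parts."""
    parts = []
    start = 0
    for i in range(1, len(word)):
        # Assumed boundary: lowercase letter followed by uppercase letter.
        # str.islower()/isupper() also cover non-ASCII letters such as "ö".
        if word[i - 1].islower() and word[i].isupper():
            parts.append(word[start:i])
            start = i
    parts.append(word[start:])

    terms = [word.lower()]  # the whole name is always indexed
    # Add the parts only if none of them is shorter than min_len, so that
    # "mW", "MiB", "IoT" and "iPad" are left unsplit.
    if len(parts) > 1 and all(len(p) >= min_len for p in parts):
        terms.extend(p.lower() for p in parts)
    return terms
```

With these rules "oneFileTest" gives "onefiletest", "one", "file" and "test", while "iPad", "NaN", "LaTeX" and "MiB" are kept whole. "McArthur" would still be split into "mc" and "arthur", so the single-letter rule alone does not cover every awkward case.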
Comment 5 Stefan Brüns 2024-04-16 13:13:47 UTC
(In reply to tagwerk19 from comment #4)
> (In reply to Stefan Brüns from comment #2)

> The question is what "traps for unwary" look like in other languages...

German: BAFöG, MwSt, GmbH ;-)
Comment 6 tagwerk19 2024-04-16 14:28:55 UTC
(In reply to Stefan Brüns from comment #5)
> German: BAFöG, MwSt, GmbH ;-)
OK, I'll give you MwSt :-)

If I look through:
    https://en.wikipedia.org/wiki/List_of_German_abbreviations
I also get KaDeWe, DuÖAV, HTBLuVA, KfzPflVV, StGB, StVO

If we get too many exceptions, we could have a list of "known acronyms", look this up and avoid splitting those words. An option to have in reserve perhaps...
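
A rough sketch of what such a lookup could look like. Everything here is hypothetical (`KNOWN_ACRONYMS`, `split_camel`, `index_terms`); the list entries are just examples from this thread, and the split rule is the simple lowercase-to-uppercase boundary, not anything Baloo implements.

```python
# Hypothetical exception list, seeded with examples from this thread.
KNOWN_ACRONYMS = frozenset({"MwSt", "KaDeWe", "KfzPflVV", "LaTeX"})


def split_camel(word: str) -> list[str]:
    # Split wherever a lowercase letter is followed by an uppercase one.
    parts, start = [], 0
    for i in range(1, len(word)):
        if word[i - 1].islower() and word[i].isupper():
            parts.append(word[start:i])
            start = i
    parts.append(word[start:])
    return parts


def index_terms(word: str, exceptions: frozenset = KNOWN_ACRONYMS) -> list[str]:
    terms = [word.lower()]  # the whole name is always indexed
    if word not in exceptions:  # known acronyms are never split
        parts = split_camel(word)
        if len(parts) > 1:
            terms.extend(p.lower() for p in parts)
    return terms
```

The lookup is case-sensitive on purpose, so a file that happens to be named "mwSt" would still be split. Whether that is the right trade-off is an open question.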