Bug 405094 - Symbols in name break search
Summary: Symbols in name break search
Status: CONFIRMED
Alias: None
Product: frameworks-baloo
Classification: Frameworks and Libraries
Component: Engine (other bugs)
Version First Reported In: 5.50.0
Platform: Kubuntu Linux
Importance: NOR normal
Target Milestone: ---
Assignee: baloo-bugs-null
Reported: 2019-03-04 22:03 UTC by nathan.figueroa
Modified: 2025-01-23 12:09 UTC
Description nathan.figueroa 2019-03-04 22:03:05 UTC
Summary: A user creates files whose names mix symbols, spaces, and alphanumerics. When they type the intuitive portions of the file name, Baloo sometimes fails to retrieve the file. More specifically, it appears that Baloo does not like numbers that are part of a word.


STEPS TO REPRODUCE
1. Create txt files named 
'AB - ABC - 1st'
'AB - ABC - 1 Filler'
'AB - ABC - 1st Filler'
2. Search for 'AB'
3. Search for 'AB 1'


OBSERVED RESULT
Searching for 'AB' finds #1-3
Searching for 'AB 1' finds only #2


EXPECTED RESULT
Both should return all 3.

SOFTWARE/OS VERSIONS

Kubuntu 18.10
Plasma 5.50.0

ADDITIONAL INFORMATION
There seems to be another bug surrounding how blank spaces are treated if you try to search for multiple terms in a file name without spaces (e.g. AB-ABC-1st does not show up with 'AB 1' or 'AB+1', but does show up with 'AB' or 'AB '). I intuitively feel that these are related, but I honestly have nothing to back that up.
Comment 1 Igor Poboiko 2019-06-30 12:58:27 UTC
The problem here is that short terms (of length < 3) are only matched *exactly*. This is intentional: if we matched them by prefix (as we do with longer terms), there would most likely be far too many matches.

So in your case 'AB 1' doesn't match files #1 and #3 because they don't contain "1" exactly. But all of them do contain "AB" exactly.
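
The rule described above can be modeled with a few lines of shell. This is only an illustrative sketch, not Baloo's code: the function names (name_terms, term_matches, name_matches) are hypothetical, and the tokenization (lowercase, split on anything that is not a letter or digit) is an assumption made for the example.

```shell
#!/bin/sh
# Illustrative model only; this is NOT how Baloo is implemented.

# Assumption: a name is lowercased and split on non-alphanumeric characters.
name_terms() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | tr -cs '[:alnum:]' '\n' | sed '/^$/d'
}

# term_matches TERM NAME
# TERM must already be lowercase. A short term (fewer than 3 characters)
# has to equal one of the name's terms; a longer term only has to be a
# prefix of one of them.
term_matches() {
    qt=$1
    for t in $(name_terms "$2"); do
        if [ "${#qt}" -lt 3 ]; then
            [ "$t" = "$qt" ] && return 0
        else
            case $t in "$qt"*) return 0 ;; esac
        fi
    done
    return 1
}

# name_matches QUERY NAME: every term of the query must match the name.
name_matches() {
    for q in $(name_terms "$1"); do
        term_matches "$q" "$2" || return 1
    done
    return 0
}

for f in 'AB - ABC - 1st' 'AB - ABC - 1 Filler' 'AB - ABC - 1st Filler'; do
    if name_matches 'AB 1' "$f"; then
        echo "match:    $f"
    else
        echo "no match: $f"
    fi
done
```

Under these assumptions only 'AB - ABC - 1 Filler' is reported as a match for 'AB 1', because "1" is the only short term that appears as a whole term; the other two names contain "1st" instead.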
Comment 2 Ceaus 2020-05-02 20:57:36 UTC
I'm facing the same problem. Files are not found even though they are there.

The logic of Igor Poboiko (comment #1) just complicates things:
1. If I search for files with very common letters in their name, such as 'a' or 'e', baloo reports just a handful, while I have hundreds in my home dir. At this point I have zero clue how to interpret the search results. I see some files, but many others are missing. How do I interpret this list now? I know the list is incorrect (because I know the file exists). But to what extent? Are there other files not in the search results? And if so, why?

2. If I search for single non-alphabet characters, such as filenames with a '-', '_' or ' ', baloo returns zero results. That contradicts the behavior in (1). So now the question becomes even more difficult to answer: what is this list I am looking at?

Apparently A-Z characters are first class citizens, whereas the other characters are estranged cousins.  To me this sounds rather arbitrary. baloo should simply return them all. If there is a genuine concern for the list being too long, then why not raise a warning: "Hey, are you sure you want a list containing 137K filenames?" 

BTW: I'm on openSUSE Leap 15.1
Baloo = 5.55.0
Comment 3 Igor Poboiko 2020-05-04 10:01:30 UTC
It's not my logic, it's the logic of Baloo and its original developer :)

The logic is quite straightforward though. Most likely, the user is searching for some particular document. If the search term is contained in "137K files", it wouldn't help at all; such a term might as well be dropped. If those are the only terms the user is looking for, they won't be able to find anything; if the query contains other terms, those are more likely to help Baloo identify the document the user is looking for.
I believe short terms are mostly there just to be able to search over filename extensions (like "filename.jpg") and e-mail/domains (like "johndoe@example.org"). In both cases, the "exact match" logic would suffice.

> [...] Apparently A-Z characters are first class citizens, whereas the other characters are estranged cousins.  

That's intentional. Remember that Baloo provides search over file contents too. With that in mind, it doesn't sound that arbitrary: letters and words (not necessarily A-Z: also numbers and other languages) contain the most information to build an index upon. What are the chances the user is going to search for a document that has "." or " " or "_" somewhere inside? And what are the chances it will help to identify the document uniquely?
Not to mention that restricting itself to the alphabet reduces the size of the index by a large factor.

If you're looking for a file with a name you know precisely, and which mostly contains non-alphanumeric characters, then "find" / KFind or any other filesystem crawler will most likely do better.

> Baloo = 5.55.0
I also couldn't help but notice that the version your distribution ships is a bit outdated. There were a large number of improvements to Baloo around 5.60+ (unrelated to this particular issue, though).
Comment 4 Ceaus 2020-05-05 16:14:20 UTC
My apologies to Igor if I sounded as if I was assigning blame. That was certainly not my intention.

Although I understand your logic, it is not a real defense against an improvement in this area:
1. Looking at the home page of Baloo, the top of the architecture page says: "Baloo is a metadata and search framework by KDE". The fact that metadata is mentioned is a giveaway that filenames should be supported to their maximum extent, special characters notwithstanding.

2. The 137K was a purely arbitrary number. I could also have said 5. The special characters should not be held hostage to support the argument of file content searches, or the problem of a list of results which is too long.

3. There is no mention at all, in any form or in any MMI, of the restrictions on the possible search parameters. If you cannot use certain characters, or the search string must be of a certain minimum size, then it should say so. You cannot confront the end user with search results which are incorrect and for which no explanation is given. In my case I was finally able to understand the incorrect search results (that got me here in this bug report). But it could be much worse: the end user is confronted with incorrect search results, but s/he is unaware. That can lead to detrimental consequences on her/his part: taking action because s/he thinks the document(s) do not exist.

4. It is extremely silly to integrate Baloo in the Dolphin file manager, making it indistinguishable from Dolphin, and then only partly support a typical task for a file manager: searching for files! Referring users to a third-party app (KFind) to search for files is inexplicable to me.

5. The chosen technical solution (preferring an index over in-situ search) should not exceed the importance of a normal use case. If the size of the index becomes too big, then "we" have done something wrong on a technical level. That burden should not be put on the end user. 

6. Now that I know that Baloo gives incorrect search results in certain circumstances, I question if and how I can "trust" Baloo in the future. That undermines the whole purpose of its existence. How do I know I can trust Baloo?

My real life use case:
Last week I was called by my friend. She had to copy her home directory (XFS) to an external hard drive (VFAT) for backup. That failed, as VFAT does not support filenames containing ':' and '?'. My friend had about 25 of those files. As she does not have root access to the laptop, installing KFind was not an option for her. Over the phone we had to resort to a very complicated session in which I tried to explain to her how to use 'find' (command line) and how to rename the listed files. I would rather never, ever do that again.
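
For reference, the kind of 'find' invocation involved in that use case can be kept fairly short. The sketch below is an illustration rather than a record of what was actually typed during that call; the function name list_vfat_unsafe is hypothetical, and it only checks for the two characters mentioned above (':' and '?').

```shell
#!/bin/sh
# list_vfat_unsafe [DIR]
# Print every file or directory below DIR (default: $HOME) whose name
# contains ':' or '?'. No root access or extra packages are needed.
list_vfat_unsafe() {
    # -name matches the last path component against a shell pattern;
    # the bracket expression [?:] stands for either character.
    find "${1:-$HOME}" -name '*[?:]*' -print
}

# Usage (not run here because scanning a whole home directory can be slow):
#   list_vfat_unsafe "$HOME"
```

The listed paths can then be renamed by hand or from a file manager before retrying the copy.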
Comment 5 tagwerk19 2021-03-30 08:39:02 UTC
Can confirm that this is still the case. Repeating the test:

cd ~/Documents
echo "Hello Penguin" > 'AB - ABC - 1st'
echo "Hello Penguin" > 'AB - ABC - 1 filler' 
echo "Hello Penguin" > 'AB - ABC - 1st filler'

balooshow -x 'AB - ABC - 1st'

    1010f90000fc01 64513 1052921 AB - ABC - 1st [/home/test/Documents/AB - ABC - 1st]
            Mtime: 1617093004 2021-03-30T10:30:04
            Ctime: 1617093004 2021-03-30T10:30:04
            Cached properties:
                    Line Count: 1

    Internal Info
    Terms: Mplain Mtext T5 T8 X20-1 hello penguin 
    File Name Terms: F1st Fab Fabc 
    XAttr Terms: 
    lineCount: 1

balooshow -x 'AB - ABC - 1 filler' 

    1019d70000fc01 64513 1055191 AB - ABC - 1 filler [/home/test/Documents/AB - ABC - 1 filler]
            Mtime: 1617093011 2021-03-30T10:30:11
            Ctime: 1617093011 2021-03-30T10:30:11
            Cached properties:
                    Line Count: 1

    Internal Info
    Terms: Mplain Mtext T5 T8 X20-1 hello penguin 
    File Name Terms: F1 Fab Fabc Ffiller 
    XAttr Terms: 
    lineCount: 1

balooshow -x 'AB - ABC - 1st filler' 

    102bb60000fc01 64513 1059766 AB - ABC - 1st filler [/home/test/Documents/AB - ABC - 1st filler]
            Mtime: 1617093015 2021-03-30T10:30:15
            Ctime: 1617093015 2021-03-30T10:30:15
            Cached properties:
                    Line Count: 1

    Internal Info
    Terms: Mplain Mtext T5 T8 X20-1 hello penguin 
    File Name Terms: F1st Fab Fabc Ffiller 
    XAttr Terms: 
    lineCount: 1

baloosearch AB
    /home/test/Documents/AB - ABC - 1st filler
    /home/test/Documents/AB - ABC - 1 filler
    /home/test/Documents/AB - ABC - 1st
    Elapsed: 1,9261 msecs

baloosearch 'AB 1'
    /home/test/Documents/AB - ABC - 1 filler
    Elapsed: 0,269976 msecs

See also
    Bug 434589

With
    Neon Testing
    Plasma: 5.21.3
    Frameworks : 5.81.0
    Qt : 5.15.2