Bug 457522

Summary: Filesearch runner does not find files that don't have any category assigned
Product: [Plasma] krunner Reporter: Schlaefer <openmail+kde>
Component: filesearch Assignee: baloo-bugs-null
Status: RESOLVED FIXED    
Severity: normal CC: alexander.lohnau, bonirar716, natalie_clarius, nate, plasma-bugs, tagwerk19, yvan
Priority: NOR    
Version: 5.25.4   
Target Milestone: ---   
Platform: Other   
OS: Linux   
See Also: https://bugs.kde.org/show_bug.cgi?id=464583
Latest Commit: Version Fixed In: 6.0

Description Schlaefer 2022-08-05 12:19:09 UTC
It seems that the baloo runner is doing additional filtering for certain types [1]. So if a file doesn't fall into one of those categories it never shows up in a search.

A checkbox to toggle that behavior on and off would be great.

STEPS TO REPRODUCE
1. Navigate to a directory indexed by baloo
2. run "touch myfile.foobar"
3. run "baloosearch myfile" --> file is found by baloo
4. start krunner and type "myfile"

OBSERVED RESULT

No search result shows up in krunner.

EXPECTED RESULT

The file "myfile.foobar" should be found by krunner.


SOFTWARE/OS VERSIONS
Operating System: NixOS 22.11
KDE Plasma Version: 5.25.3
KDE Frameworks Version: 5.96.0
Qt Version: 5.15.5
Kernel Version: 5.15.58 (64-bit)
Graphics Platform: Wayland
Graphics Processor: AMD Radeon RX 5500 XT

ADDITIONAL INFORMATION

[1] https://invent.kde.org/plasma/plasma-workspace/-/blob/40d6e18b0f153464a64b3e21c1224e13511632d2/runners/baloo/baloosearchrunner.cpp#L85
Comment 1 Alexander Lohnau 2022-08-09 12:08:49 UTC
I do not know how to fix this issue since I do not see a way to get all results and find the category for them. We for sure want to use the categories to group the results in the UI.

Maybe sb. who knows baloo better can give this a look.
Comment 2 Natalie Clarius 2022-08-09 14:03:46 UTC
I think the problem is that Baloo does not assign the file any category, and consequently it won't show up in the file search results that specifically query the individual categories in the code you linked to.

Note the information in Terms:

> $ touch myfile.foobar
> $ balooshow -x myfile.foobar
> 94995300010303 66307 9738579 yourfile.foobar [/home/natalie/myfile.foobar]
>         Mtime: 1660051514 2022-08-09T15:25:14
>         Ctime: 1660051514 2022-08-09T15:25:14
> 
> Internal Info
> Terms: Mapplication Moctet Mstream 
> File Name Terms: Ffoobar Fmyfile 
> XAttr Terms: 

As opposed to a .txt file:

> $ touch theirfile.txt
> $ balooshow -x theirfile.txt
> 94984b00010303 66307 9738315 test.foobar [/home/natalie/theirfile.txt]
>         Mtime: 1660050158 2022-08-09T15:02:38
>         Ctime: 1660050158 2022-08-09T15:02:38
> 
> Internal Info
> Terms: Mplain Mtext T5 T8 
> File Name Terms: Ftheirfile Ftxt 
> XAttr Terms: 

"T5" and "T8" indicate categories Document and Text, see https://invent.kde.org/frameworks/kfilemetadata/-/blob/master/src/types.h#L20. 

"Mplain" and "Mtext" indicate the mime type.

> $ file --mime-type theirfile.txt
> theirfile.txt: text/plain
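
To make the reading of the Terms line concrete, here is a minimal sketch that splits such a line into its "M" (mime type) tokens and "T" (category) terms. The function name and the lookup table are illustrative only; just T5 (Document) and T8 (Text) are named because those are the two confirmed in this comment, and any other number is printed unnamed.

```python
# Only T5 and T8 are confirmed in this thread; other type numbers stay unnamed.
KNOWN_TYPES = {5: "Document", 8: "Text"}


def decode_terms(terms_line):
    """Split a balooshow 'Terms:' line into mime tokens and category names."""
    # Drop the leading "Terms:" label if it is present.
    rest = terms_line.split("Terms:", 1)[-1]
    tokens = rest.split()
    # "M" terms carry the pieces of the mime type, e.g. Mtext + Mplain.
    mime_tokens = [t[1:] for t in tokens if t.startswith("M")]
    # "T" terms carry the numeric category, e.g. T5, T8.
    type_ids = [int(t[1:]) for t in tokens if t.startswith("T") and t[1:].isdigit()]
    categories = [KNOWN_TYPES.get(i, f"type #{i}") for i in type_ids]
    return mime_tokens, categories


print(decode_terms("Terms: Mplain Mtext T5 T8 "))
print(decode_terms("Terms: Mapplication Moctet Mstream "))
```

The second call returns an empty category list, which is the situation this bug is about.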

When a file is created empty with an unknown file extension, it gets mime type "inode/x-empty":

> $ file --mime-type myfile.foobar
> myfile.foobar: inode/x-empty

This does not get mapped to any category by Baloo: https://github.com/KDE/baloo/blob/509dcd8da6b2e21723838f27003ac72d9a267a1a/src/file/basicindexingjob.cpp#L63

When the file has text content, even with an unknown file extension, it gets mime type "text/plain":

> $ echo "test" > myfile.foobar
> $ file --mime-type myfile.foobar
> myfile.foobar: text/plain

But for some reason Baloo does not represent this in the term information: the *.foobar file has mime type ("M") "application/octet-stream" and no categories ("T"), even if the file is created non-empty and has mime type text/plain to begin with.

So something already seems to go wrong on the Baloo side, which does not set the term information correctly.

But independently of that, it may be useful to also be able to retrieve results that do not have any category, either through an additional query in the runner for Baloo results with empty type (if such a query is possible), or by adding to Baloo a generic fallback type for any file that has not received any other type and querying for that type in the runner.
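
The effect of querying one category at a time, and of a fallback for uncategorised files, can be illustrated with a toy model. This is only a sketch: the index, the category list and the "Other" fallback name are all made up for illustration, and it does not claim to show how the runner or the linked merge request actually work.

```python
# Toy index: file name -> categories assigned (empty set = no T term at all).
INDEX = {
    "theirfile.txt": {"Document", "Text"},
    "myfile.foobar": set(),
}

# Illustrative category list, not the runner's real one.
CATEGORIES = ["Image", "Document", "Text"]


def search_per_category(index, needle, categories):
    """One query per category: a file only shows up under a category it carries."""
    results = {}
    for category in categories:
        hits = sorted(
            name for name, cats in index.items()
            if needle in name and category in cats
        )
        if hits:
            results[category] = hits
    return results


def search_with_fallback(index, needle, categories, fallback="Other"):
    """Same as above, plus a hypothetical catch-all group for anything left over."""
    results = search_per_category(index, needle, categories)
    found = {name for hits in results.values() for name in hits}
    leftovers = sorted(
        name for name in index if needle in name and name not in found
    )
    if leftovers:
        results[fallback] = leftovers
    return results


print(search_per_category(INDEX, "myfile", CATEGORIES))   # uncategorised file is missed
print(search_with_fallback(INDEX, "myfile", CATEGORIES))  # fallback group picks it up
```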
Comment 3 Bug Janitor Service 2022-08-09 14:55:37 UTC
A possibly relevant merge request was started @ https://invent.kde.org/plasma/plasma-workspace/-/merge_requests/2006
Comment 5 Alexander Lohnau 2022-08-09 17:22:19 UTC
*** Bug 442898 has been marked as a duplicate of this bug. ***
Comment 6 Bug Janitor Service 2022-08-13 17:37:10 UTC
A possibly relevant merge request was started @ https://invent.kde.org/frameworks/kfilemetadata/-/merge_requests/62
Comment 7 Bug Janitor Service 2022-08-13 17:37:15 UTC
A possibly relevant merge request was started @ https://invent.kde.org/frameworks/baloo/-/merge_requests/85
Comment 8 tagwerk19 2022-08-20 15:57:40 UTC
There's an extra twist...

You did a:
    touch myfile.foobar
    balooshow -x myfile.foobar
and got
    Terms: Mapplication Moctet Mstream

If I try this (on Neon), I get:
    Terms: Mapplication Mx Mzerosize

If I do the same for myfile.txt
    touch myfile.txt
    balooshow -x myfile.txt
I still get:
    Terms: Mapplication Mx Mzerosize

The "twist" seems to be if baloo is doing content indexing it flags all(?) empty files as "application/x-zerosize". If I purge and reindex without content, I get "application/octet-stream" for the empty myfile.foobar (and "text/plain" for myfile.txt)

If I look a bit closer with kmimetypefinder, I get:

    $ kmimetypefinder myfile.txt
    text/plain
    $ kmimetypefinder myfile.foobar
    application/x-zerosize

and also:

    $ echo "Hello Penguin" > myfile.foobar
    $ kmimetypefinder myfile.foobar
    text/plain

and that seems reasonable: For an empty file, if the filename indicates a mimetype, use it; if not, say application/x-zerosize.

I'd say baloo ought to give the same results here irrespective of whether it is content indexing or not and it would probably make sense if it follows kmimetypefinder logic, so:

    "text/plain" for an empty myfile.txt
    "application/x-zerosize" for an empty unrecognised filetype (myfile.foobar, in this case)
    "text/plain" for an unrecognised filetype with (text) content

Krunner would then list an empty myfile.txt but not an empty myfile.foobar. Maybe this is good enough? or am I missing something?
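
The three cases proposed above can be written down as a small decision function. This is a simplified sketch of the proposed behaviour, not the logic kmimetypefinder or Baloo really use: the extension table is a made-up stand-in for the mime database, "is valid UTF-8" stands in for real content sniffing, and the handling of a non-empty file with a known extension is an assumption not covered by the list above.

```python
import os

# Hypothetical, tiny stand-in for the shared mime database.
EXTENSION_TYPES = {".txt": "text/plain"}


def guess_mime(filename, content):
    """Return a mime type for a file name and its raw bytes."""
    ext = os.path.splitext(filename)[1].lower()
    by_name = EXTENSION_TYPES.get(ext)
    if not content:
        # Empty file: trust the name if it says something, else "zerosize".
        return by_name or "application/x-zerosize"
    if by_name:
        # Assumption: a recognised extension wins for non-empty files too.
        return by_name
    try:
        # Crude content check: anything that decodes as UTF-8 counts as text.
        content.decode("utf-8")
        return "text/plain"
    except UnicodeDecodeError:
        return "application/octet-stream"


print(guess_mime("myfile.txt", b""))
print(guess_mime("myfile.foobar", b""))
print(guess_mime("myfile.foobar", b"Hello Penguin\n"))
```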
Comment 9 tagwerk19 2022-08-30 14:56:19 UTC
(In reply to Natalie Clarius from comment #4)
> Probably the same:  
> https://bugs.kde.org/show_bug.cgi?id=442898
> https://bugs.kde.org/show_bug.cgi?id=420339
Yes, I think so.

Thanks, I'll flag as a duplicate.

*** This bug has been marked as a duplicate of bug 420339 ***
Comment 10 Schlaefer 2022-09-15 09:12:11 UTC
Is this a duplicate? The original issue doesn't depend on an empty file; that's just a coincidence of the simplified example?
Comment 11 Natalie Clarius 2022-09-16 00:07:50 UTC
Yes. The issue is that the files don't have a category assigned. One of the cases where this happens is for empty files; matlab, IPython notebook files and the like are other instances of the same problem: Baloo doesn't assign a T term, so KRunner doesn't retrieve them.
Comment 12 tagwerk19 2022-09-22 19:37:54 UTC
I think there should be an easy way to "open up" the search criteria in krunner to show all results, something like "Show more?". It makes me a bit uncomfortable that krunner and baloosearch can give different sets of answers, for me that goes against the principle of "least surprise".

The behaviour with empty files muddles the issue and it would be nice to sort out (within baloo)
Comment 13 tagwerk19 2022-09-22 20:01:12 UTC
(In reply to Natalie Clarius from comment #11)
> ... matlab, IPython notebook files and the like ...
At the risk of going down rabbit holes and on the assumption that others have a better understanding of what's happening:

Matlab files:

    Ought to be recognised although "*.m" also matches "text/x-objcsrc" in the freedesktop.org
    mimetype list. kmimetypefinder depends on "magic"
    For a file "test.m", baloo indexes content and baloosearch finds the file by name and content.
    Krunner lists it as text (Neon Testing)

IPython Notebook files:

    kmimetypefinder shows a "test.ipynb" file as "application/x-ipynb+json"
    Baloo does not index content and baloosearch only finds the file by name. Krunner
    does not list the file.
Comment 14 Natalie Clarius 2022-09-22 20:09:41 UTC
(In reply to tagwerk19 from comment #12)
> I think there should be an easy way to "open up" the search criteria in
> krunner to show all results, something like "Show more?". It makes me a bit
> uncomfortable that krunner and baloosearch can give different sets of
> answers, for me that goes against the principle of "least surprise".
> 
> The behaviour with empty files muddles the issue and it would be nice to
> sort out (within baloo)

That's what we're trying to do. That some files are missing is a bug which the open MR is intended to solve, not an intentional restriction which should be kept or worked around by further complicating the UI.
Comment 15 Natalie Clarius 2022-09-22 20:12:31 UTC
Of course there is still the fact that KRunner will cap the overall number of results shown, but that's not specific to the Baloo runner and a topic for a different thread, if at all.
Comment 16 Natalie Clarius 2022-09-23 03:19:05 UTC
(In reply to tagwerk19 from comment #13)

> Matlab files:
> 
>     Krunner lists it as text (Neon Testing)

Ah, right, that was a different bug (files of category "text" not found) that got fixed with https://invent.kde.org/plasma/plasma-workspace/-/merge_requests/1658.
Comment 17 tagwerk19 2022-09-23 07:46:20 UTC
(In reply to Natalie Clarius from comment #16)
> ... Ah, right, that was a different bug (files of category "text" not found) ...
Think there might still be some bits to untangle...

For a "test.m" file, an empty one to start with, and without indexing file content, in an up to date Neon Testing:
    Plasma: 5.25.5
    Frameworks: 5.97.0
    Qt: 5.15.5

I get:

    $ touch test.m
    $ kmimetypefinder test.m
    text/x-matlab
    $ balooshow -x test.m
    Terms: Mmatlab Mtext Mx T8
    $ krunner test.m
    Listed. Categorised as "Text"

Then a file that does not match any of the "magic" in the mimetypes list:

    $ echo "Hello Penguin" > test.m
    $ kmimetypefinder test.m
    text/x-objcsrc
    $ balooshow -x test.m
    Terms: Mmatlab Mtext Mx T8 (*1)
    $ krunner test.m
    Listed. Categorised as "Text"

Then one that matches the "magic":

    $ echo "##Hello Penguin" > test.m
    $ kmimetypefinder test.m
    text/x-matlab
    $ balooshow -x test.m
    Terms: Mmatlab Mtext Mx T8
    $ krunner test.m
    Listed. Categorised as "Text"

Not sure what baloo is doing in "*1" above but the rest seems OK.

Purging and reindexing with file content: for an empty "test.m" file:

    $ rm test.m; touch test.m
    $ kmimetypefinder test.m
    text/x-matlab
    $ balooshow -x test.m
    Terms: Mapplication Mx Mzerosize (*2)
    $ krunner test.m
    Not Listed, except as one of the "Recent Files"

Then a file that does not match any of the "magic" in the mimetypes list:

    $ echo "Hello Penguin" > test.m
    $ kmimetypefinder test.m
    text/x-objcsrc
    $ balooshow -x test.m
    Terms: Mmatlab Mtext Mx T8 X20-1 hello penguin (*3)
    $ krunner test.m
    Not Listed, except as one of the "Recent Files" (*4)

Then one that does match the "magic":

    $ echo "##Hello Penguin" > test.m
    $ kmimetypefinder test.m
    text/x-matlab
    $ balooshow -x test.m
    Terms: Mmatlab Mtext Mx T8 X20-1 hello penguin
    $ krunner test.m
    Not Listed, except as one of the "Recent Files" (*4)

I think the application/x-zerosize (the *2) that baloo seems to add is a shame. I think this needs a fix, but I can see that in this case krunner wouldn't list the file (as it's not text).

Not sure what's happening with "*3", it's the same behaviour as "*1" further up. There might be some double guessing going on. I think I'd be happier trusting the mime type data.

Also not sure what's happening with "*4", that also doesn't seem right. According to Comment 2, the T8 implies the file is Text and my assumption is that Krunner should then list it. The ambiguity in the mime type (text/x-objcsrc or text/x-matlab) probably doesn't matter here as they are both "text". It might matter in other cases. It's disturbing that krunner gives you better results if baloo is not indexing content 8-]

I realise this is an edge case and this writeup is a bit long but I've deliberately dug down as it might pinpoint something underlying. I would be happy to repeat with other examples and see if there are patterns (I think that's part of triaging...)
Comment 18 Natalie Clarius 2022-09-23 12:29:15 UTC
For files to which baloo doesn't assign a T-term, such as those with mime type application/x-zerosize, this is what the currently open MR would fix.

For the files that are only listed as recent files, this doesn't mean that the baloo runner doesn't find them. It's just that KRunner filters out duplicates, and the same file found both by baloo and among recent files is such a case. I'm not sure about the logic that decides which of the two results (the baloo runner one or the recent files one) wins, but that would be a separate issue. The main point is that KRunner overall will find the file.
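
As an illustration of the duplicate filtering described above, here is a minimal sketch in which two runners report the same file and only one entry survives. The match fields, the relevance numbers and the "higher relevance wins" rule are assumptions made for the example; as noted, the real rule that decides which result wins is not established in this thread.

```python
def deduplicate(matches):
    """Keep one match per file path; higher relevance wins (an assumed rule)."""
    best = {}
    for match in matches:
        path = match["path"]
        if path not in best or match["relevance"] > best[path]["relevance"]:
            best[path] = match
    return list(best.values())


# Hypothetical matches for the same file coming from two different runners.
matches = [
    {"path": "/home/user/test.m", "runner": "baloo",
     "category": "Text", "relevance": 0.7},
    {"path": "/home/user/test.m", "runner": "recent files",
     "category": "Recent Files", "relevance": 0.8},
]

for match in deduplicate(matches):
    print(match["runner"], match["category"])
```

With the recent files entry removed from the input, the baloo match is the only one left and the file is listed under "Text".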
Comment 19 Natalie Clarius 2022-09-23 12:35:20 UTC
I haven't actually done a test run with file content indexing disabled but you could test my recent files hypothesis (i.e. the file is found by the baloo runner, it just gets outrun by the recent files match) by disabling the recent files plugin and seeing if then the file shows up as a text file result.

Thanks for the help in figuring this out!
Comment 20 tagwerk19 2022-09-23 14:28:24 UTC
(In reply to Natalie Clarius from comment #19)
> ... disabling the recent files plugin and seeing if then the file shows up as a text file ...
Good catch.

Yes, if I disable the Recent Files plugin, I see the test.m file listed as Text.

However, I would expect "Recent Files" to work the same way, independent of whether baloo is indexing content or not. We might have explained "*4" but maybe we now have a "*5" :-)
Comment 21 tagwerk19 2022-09-23 14:45:15 UTC
(In reply to Natalie Clarius from comment #18)
> For the files that baloo doesn't assign a T-term like application/x-zerosize, this is what
> the currently open MR would fix...
I'll stick with my view that baloo filename search results should not depend on whether content indexing is enabled or not. In this instance, baloo should consider an empty "test.txt" as "Text".

However Krunner does seem to be doing something extra:
    $ krunner test.txt
lists an empty "test.txt" as "Document" ...
Comment 22 Natalie Clarius 2022-09-23 18:32:23 UTC
The baloo and recent files runner plugins don't change their behavior depending on whether content indexing is enabled. If there are differences in the runner results, it's due to Baloo sending different matches.

If content indexing is enabled, Baloo may find matches other than the text file, which can change the relative ranking of the file match, and might explain why in this situation it loses against the recent files result.

I'm not sure it's unexpected that in general, type assignment can be influenced by also taking content into account. Specifically that "application/x-zerosize" is preferred over "text/plain" for an empty text file is perhaps less ideal. That's an issue on the side of the indexing service rather than the runner plugin though, so if you think that's an issue I would suggest filing a bug report for Baloo.

But in any event, the runner plugin doesn't do anything extra to the type assignment. If it reports a file as being type document, then that's information it got from Baloo.
Comment 23 tagwerk19 2022-10-03 09:30:10 UTC
(In reply to Natalie Clarius from comment #22)
> ... If there are differences in the runner results, it's due to Baloo sending
> different matches... If it reports a file as being type document, then that's
>  information it got from Baloo ...
Is this something that can be seen by setting debug flags? I tried creating

    ~/.config/QtProject/qtlogging.ini

with:

    [rules]
    kf.*.debug=true

This gave some information but not results from a baloo "lookup".
Comment 24 Natalie Clarius 2022-10-03 22:32:32 UTC
How much debug output you see depends on how many debug statements have been set in the source code; Baloo and the runner are currently not very verbose in that respect, so if you want to dig deeper and generate more info about what's going on, you'd have to build baloo and plasma-workspace from source and set some debug statements yourself.

The types you can get from the T-terms with balooshow -x, as you've already done. "Document" is type #5 (i.e. T5); see https://invent.kde.org/frameworks/kfilemetadata/-/blob/master/src/types.h#L20.  

For the matches Baloo finds (and reports to the runner plugin), you can run `baloosearch`.
Comment 25 Natalie Clarius 2023-01-27 18:09:06 UTC
*** Bug 464583 has been marked as a duplicate of this bug. ***
Comment 26 Natalie Clarius 2023-02-08 23:07:56 UTC
Fixed with https://invent.kde.org/plasma/plasma-workspace/-/merge_requests/2006