Bug 422085 - Files excluded by mimetype could have their filenames indexed by default without issue
Status: RESOLVED FIXED
Alias: None
Product: frameworks-baloo
Classification: Frameworks and Libraries
Component: general
Version: 5.70.0
Platform: Arch Linux Linux
Importance: NOR wishlist
Target Milestone: ---
Assignee: Stefan Brüns
URL:
Keywords: usability
Depends on:
Blocks:
 
Reported: 2020-05-26 08:47 UTC by olivier
Modified: 2020-06-10 23:27 UTC
CC List: 1 user

See Also:
Latest Commit:
Version Fixed In: 5.72


Description olivier 2020-05-26 08:47:38 UTC
Hi,
Baloo doesn't index some files by default, such as source code (cpp files...), text...
I know that it is possible to remove the mimetype exclusion to make it work, but that is not the behavior someone expects from a normal search.
There may be a reason for that behavior.
I think everything should be indexed by default: maybe not the contents of the files, but at least the file names.
Sincerely
Comment 1 Nate Graham 2020-06-10 22:01:01 UTC
The devil's in the details here. You probably do not want the filenames of all the millions of hidden files in the .git directories of your git repos indexed, for example.

But I agree that we could perhaps index the filenames of all the filetypes excluded by mimetype, as they are typically excluded because their contents are useless to index, or so large that they blow up the index.
Comment 2 Stefan Brüns 2020-06-10 22:15:59 UTC
@Nate: https://phabricator.kde.org/D29207
Comment 3 Stefan Brüns 2020-06-10 22:18:40 UTC
.git would be excluded by default, as long as the user does not deliberately change the config.
Comment 4 Nate Graham 2020-06-10 22:24:31 UTC
How did I miss that!?
Comment 5 Stefan Brüns 2020-06-10 23:16:37 UTC
Git commit 24b1392e0094a954bb15c99d71cb0ccf527e88ea by Stefan Brüns.
Committed on 10/06/2020 at 23:16.
Pushed by bruns into branch 'master'.

[Indexers] Ignore name-based mimetype for initial indexing decisions

Summary:
The name based mime type is inaccurate, so it should not be used to
decide if a file should be indexed. In case a specific extension should
be skipped this can still be done accurately by the name based filters,
e.g. instead of "image/png" "*.png" can be used, or the whole directory
can be excluded.

This inaccuracy is also confusing for the user, as a file without
extension will be added to the index, but adding an extension removes
the file from the index. The file extension may also be ambiguous.

This also matches the current list of excluded mime types, which are
source files for various languages. These blow up the full text index
and thus should be excluded (by default), but just adding the file names
increases the index size only marginally.

The 'inability' to find files is a recurring user complaint.

Depends on D28932

Reviewers: #baloo, ngraham

Reviewed By: #baloo, ngraham

Subscribers: kde-frameworks-devel

Tags: #frameworks, #baloo

Differential Revision: https://phabricator.kde.org/D29207

M  +2    -0    src/file/extractor/app.cpp
M  +0    -3    src/file/firstrunindexer.cpp
M  +0    -3    src/file/modifiedfileindexer.cpp
M  +0    -3    src/file/newfileindexer.cpp
M  +0    -3    src/file/unindexedfileiterator.cpp

https://invent.kde.org/frameworks/baloo/commit/24b1392e0094a954bb15c99d71cb0ccf527e88ea
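
The behavior described in the commit message can be sketched roughly as follows. This is an illustration only, not code from the commit or from Baloo itself: the class, the function names, and the filter and mimetype values are all hypothetical. It assumes that name-based filters decide whether a file enters the index at all, while the excluded mimetype list only decides whether the file's contents are indexed.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


@dataclass
class IndexerConfig:
    """Hypothetical stand-in for the indexer configuration (illustrative values)."""
    exclude_filters: list = field(default_factory=lambda: [".git", "*.o", "*~"])
    exclude_mimetypes: set = field(default_factory=lambda: {"text/x-c++src"})


def should_index_name(path: str, config: IndexerConfig) -> bool:
    """Decide from the path alone whether the file enters the index.

    Every path component is checked against the name-based filters, so a
    file anywhere below an excluded directory such as .git is skipped.
    """
    parts = PurePosixPath(path).parts
    return not any(
        fnmatchcase(part, pattern)
        for part in parts
        for pattern in config.exclude_filters
    )


def should_index_content(mimetype: str, config: IndexerConfig) -> bool:
    """Decide whether full-text content should be extracted.

    The mimetype passed in is assumed to be determined accurately (for
    example from the file's content), not guessed from its extension.
    """
    return mimetype not in config.exclude_mimetypes
```

With these assumed values, /home/user/project/main.cpp would have its file name indexed but its contents skipped, while anything below a .git directory would not be indexed at all.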
Comment 6 Nate Graham 2020-06-10 23:27:52 UTC
This is effectively all fixed by the above commit.

Great job, Stefan!