| Summary: | Baloo scans content from too many files | ||
|---|---|---|---|
| Product: | [Frameworks and Libraries] frameworks-baloo | Reporter: | Dave <sunny.bed7466> |
| Component: | Baloo File Daemon | Assignee: | Dave <sunny.bed7466> |
| Status: | RESOLVED NOT A BUG | ||
| Severity: | normal | CC: | heri+kde, tagwerk19 |
| Priority: | NOR | ||
| Version First Reported In: | unspecified | ||
| Target Milestone: | --- | ||
| Platform: | Other | ||
| OS: | Linux | ||
| See Also: | https://bugs.kde.org/show_bug.cgi?id=427455 | ||
| Latest Commit: | Version Fixed/Implemented In: | ||
| Sentry Crash Report: | |||
Baloo does restrict itself. It indexes plain text files. If some files are wrongly classified as mimetype text/plain, this is not Baloo's fault, but an omission from the (cross-desktop, system-wide) shared-mime-info database. File a bug report with shared-mime-info upstream.

Wait, so does Baloo restrict itself to the system shared mime info database? Does it not use the user mimetype database? Because that file is registered by Wine. See:
$ xdg-mime query filetype "Wine/Daz/drive_c/users/Public/Documents/My DAZ 3D Library/data/DAZ 3D/Genesis 8/Male/Morphs/DAZ 3D/Base Pose Head/alias_head_eCTRLEyelidsUpperUp-DownL.dsf"
application/x-wine-extension-dsf
.local/share/mime/application/x-wine-extension-dsf.xml
.local/share/mime/packages/x-wine-extension-dsf.xml
$ cat .local/share/mime/packages/x-wine-extension-dsf.xml
<?xml version="1.0" encoding="UTF-8"?>
<mime-info xmlns="http://www.freedesktop.org/standards/shared-mime-info">
<mime-type type="application/x-wine-extension-dsf">
<glob pattern="*.dsf"/>
<comment>DSON Support File</comment>
</mime-type>
</mime-info>
$ cat .local/share/mime/application/x-wine-extension-dsf.xml
<?xml version="1.0" encoding="utf-8"?>
<mime-type xmlns="http://www.freedesktop.org/standards/shared-mime-info" type="application/x-wine-extension-dsf">
<!--Created automatically by update-mime-database. DO NOT EDIT!-->
<glob pattern="*.dsf"/>
<comment>DSON Support File</comment>
</mime-type>
Furthermore, text/plain probably should be handled as a special case. shared-mime-info isn't anywhere close to being a complete database.

Baloo does not "restrict itself". It uses QMimeDatabase.

I'm sorry, I don't know Baloo internals. I explain my experience from my point of view. I may have a little experience programming and know a tiny bit of KDE libraries but in this situation, I am but a final user. I don't have any mean intent. Baloo, by means of QMimeDatabase, I stand corrected, seems to ignore the user mime database.
From the point of view of a final user this seems to be an oversight. I think that you might be thinking I should file a report against QMimeDatabase about this. The problem is I am not an adequate person for that. I have no idea of what QMimeDatabase is, how it is used, how to develop with it. Please do what you think is best in this case, I trust your judgement better than my own. By making bug reports I hope to better the Baloo experience for myself and everyone else. I think this has the potential to help a lot of people.

Furthermore, I want to kindly reiterate. Plain text doesn't imply meaningful user readable data. Please reconsider accepting all of text/plain data as indexable content. The text/plain mimetype reach is too broad.

Excluding given mime types works; adding:
exclude mimetypes=text/x-csrc
to the .config/baloofilerc and reindexing means that baloo stops finding *.c files.
C.f. creating
.local/share/mime/packages/x-mime-extension-dsf.xml
and adding application/x-wine-extension-dsf to excluded mimetypes:
exclude mimetypes=text/x-csrc,application/x-wine-extension-dsf
and reindexing. This does not work.
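The edit to the exclude list described above can be scripted. The following is a minimal sketch, not Baloo tooling: it assumes the "exclude mimetypes" key lives under a [General] group in .config/baloofilerc and that every key sits under some group header (both are assumptions about the file layout), and it works on a sample string rather than the real file. The function name add_excluded_mimetype is hypothetical.

```python
import configparser
import io

# Sample contents only; the [General] group name is an assumption
# about how baloofilerc is laid out.
SAMPLE_RC = """[General]
exclude mimetypes=text/x-csrc
"""


def add_excluded_mimetype(rc_text, mimetype, group="General"):
    """Return rc_text with `mimetype` appended to 'exclude mimetypes'."""
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.optionxform = str  # keep key names exactly as written
    parser.read_string(rc_text)
    if not parser.has_section(group):
        parser.add_section(group)
    current = parser.get(group, "exclude mimetypes", fallback="")
    entries = [entry for entry in current.split(",") if entry]
    if mimetype not in entries:
        entries.append(mimetype)
    parser.set(group, "exclude mimetypes", ",".join(entries))
    out = io.StringIO()
    # Write "key=value" without spaces around "=", as in the lines above.
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


updated_rc = add_excluded_mimetype(SAMPLE_RC, "application/x-wine-extension-dsf")
print(updated_rc)
```

As in the steps above, the change would only be expected to take effect after the index is purged and rebuilt.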
Some ad-hoc troubleshooting...
Copying the 'x-csrc' specific xml from
/usr/share/mime/packages/freedesktop.org.xml
into a local test file
.local/share/mime/packages/test.xml
for comparison and editing the files so the two definitions converge, purging and reindexing after each change.
When:
<sub-class-of type="text/plain"/>
is copied to the wine extension definition, giving:
<?xml version="1.0" encoding="UTF-8"?>
<mime-info xmlns="http://www.freedesktop.org/standards/shared-mime-info">
<mime-type type="application/x-wine-extension-dsf">
<sub-class-of type="text/plain"/>
<glob pattern="*.dsf"/>
<comment>DSON Support File</comment>
</mime-type>
</mime-info>
the dsf files are no longer indexed.
Looks like it's not an issue with local versus system wide definitions; QMimeDatabase does seem to be picking up user specific info. Also looks like xdg-mime and the QMimeDatabase can give different results.
Neon Testing
Plasma: 5.20.90
Frameworks: 5.79.0
Qt: 5.15.2
Disclaimer - no idea if the "sub-class-of" affects anything else. Need to ask someone who knows...
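The manual edit described above (adding a sub-class-of element to the Wine-generated definition) could also be done programmatically. This is a sketch using only Python's standard xml.etree.ElementTree; the function name add_text_plain_parent is hypothetical, and it operates on a copy of the XML shown in this report rather than on the file under .local/share/mime/packages/.

```python
import xml.etree.ElementTree as ET

SMI_NS = "http://www.freedesktop.org/standards/shared-mime-info"

# Copy of the Wine-generated package definition quoted in this report.
WINE_DSF_XML = """<?xml version="1.0" encoding="UTF-8"?>
<mime-info xmlns="http://www.freedesktop.org/standards/shared-mime-info">
  <mime-type type="application/x-wine-extension-dsf">
    <glob pattern="*.dsf"/>
    <comment>DSON Support File</comment>
  </mime-type>
</mime-info>
"""


def add_text_plain_parent(xml_text, mime_type):
    """Return the XML with <sub-class-of type="text/plain"/> added to mime_type."""
    # Serialize with a default namespace instead of an "ns0:" prefix.
    ET.register_namespace("", SMI_NS)
    root = ET.fromstring(xml_text.encode("utf-8"))
    for node in root.findall(f"{{{SMI_NS}}}mime-type"):
        if node.get("type") != mime_type:
            continue
        parents = [p.get("type") for p in node.findall(f"{{{SMI_NS}}}sub-class-of")]
        if "text/plain" not in parents:  # leave the file alone if already declared
            parent = ET.Element(f"{{{SMI_NS}}}sub-class-of", {"type": "text/plain"})
            node.insert(0, parent)
    return ET.tostring(root, encoding="unicode")


patched_xml = add_text_plain_parent(WINE_DSF_XML, "application/x-wine-extension-dsf")
print(patched_xml)
```

The same disclaimer applies as above: whether the added parent has other side effects is not known here, and the mime cache would presumably need to be regenerated (for example with update-mime-database) before anything picks up the change.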
By omitting the 'sub-class-of=text/plain', the mimetype declaration essentially tells "this is a binary file". Obviously this is not true for the affected files, so 'text/plain' is the better match here. By adding the sub-class-of, the content and extension no longer contradict each other, and the more-specific mime-type is chosen.

Can I ask why the file content is being checked? If the user says it is a DSON file, it is. The user says it is by means of the entry in the user mime type database. Why is the accuracy of the type subclass important? Wine has no means to get to know this information. This omission doesn't make it incorrect, it simply is missing its ideal place in the text/plain type tree. The shared mime type info spec recommends to first look for types by glob matching and if the result isn't ambiguous, take it. If the mime type definition for the file exists and it is found, in what moment does Baloo decide to check the file content anyway? This is an expensive operation, it should be avoided as much as possible. If it isn't, it might be a source of considerable disk reading.

(In reply to David Palacio from comment #8)
> ... This omission doesn't make it incorrect ...

I think the omission means (here) that you cannot choose whether to index that particular mimetype or not ...
As a heads-up, different distributions have different defaults for indexing. I know in Fedora you get:
folders[$e]=$HOME/Documents/,$HOME/Music/,$HOME/Pictures/,$HOME/Videos/
So, just the four "well known folders". I think there are other distributions that skip the full text indexing.

Because globs are too ambiguous. Just because you have an entry in the mimetype database which says *.dsf is some format it does not mean *all* files with a dsf extension adhere to this format. It *may* be a file from DAZ, it may be something completely different. This is essentially the case outlined in SMI as "if multiple globs match, continue with content based detection".
Using globs works only if the output is purely informational. Past experience shows relying on globs is broken. Also note, the rationale for using globs in the first place does not apply, as the indexer *does* open the files anyway. The determined mimetype text/plain is completely correct - dsf files are json files, which are a subclass of ecmascript, which is a subclass of text/plain. If you want a different behavior, submit a proper mimetype to SMI. EOD.

(In reply to Stefan Brüns from comment #10)
> Because globs are too ambiguous.

Globs *may* be ambiguous but not always and usually not. This is taken into account in the instructions I referred to in my previous comment.

> Just because you have an entry in the
> mimetype database which says *.dsf is some format it does not mean *all*
> files with a dsf extension adhere to this format. It *may* be a file from
> DAZ, it may be something completely different.

How is this a problem? Having the type defined in SMI doesn't eliminate this potential problem. In fact, if the user database has the type defined with the correct subclass then Baloo is already happy to discard this type as content indexable, as evidenced by comment #6.

> Using globs works only if the output is purely informational. Past
> experience shows relying on globs is broken. Also note, the rationale for
> using globs in the first place does not apply, as the indexer *does* open
> the files anyway.

This is wrong. I wonder if Baloo opens every single file in my home folder. It would be no surprise because it takes several hours to scan it.

> The determined mimetype text/plain is completely correct - dsf files are
> json files, which are a subclass of ecmascript, which is a subclass of
> text/plain.
>
> If you want a different behavior, submit a proper mimetype to SMI.

If I want a different behavior all I need to do is add a subclass to the type.
It doesn't matter if the type is text readable, or a script, it has been shown that Baloo will ignore a type it doesn't know about even if it inherits text/plain. So what difference does it make if it doesn't inherit it? You say that Baloo should scan a file because it is text/plain but Baloo won't do it if it is a type that inherits text/plain. So either what you are saying is wrong or what Baloo does is wrong.

Whatever ... Fedora restricts Baloo to the music, pictures, movies, documents directories:
https://src.fedoraproject.org/rpms/kf5-baloo/blob/f37/f/baloo-5.67.0-baloofile_config.patch
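The disagreement in the comments above turns on how a glob match and a content-based match are reconciled. The following toy model only restates the behaviour as the comments describe it; it is not Baloo's or QMimeDatabase's actual code, and the function names, the parent table, and the rule that only text/plain descendants get content-indexed are all simplifying assumptions.

```python
# Toy model only: parent links and sniffing results are simplified stand-ins.

def inherits(mime, ancestor, parents):
    """True if `mime` equals `ancestor` or reaches it via declared parents."""
    seen = set()
    stack = [mime]
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, []))
    return False


def resolve(glob_type, sniffed_type, parents):
    """Keep the glob match if it is consistent with the sniffed type."""
    if glob_type is None:
        return sniffed_type
    if sniffed_type is None or inherits(glob_type, sniffed_type, parents):
        return glob_type
    return sniffed_type  # glob and content contradict: content wins


def is_content_indexed(mime, excluded, parents):
    """Assumed rule: index text/plain descendants unless explicitly excluded."""
    return mime not in excluded and inherits(mime, "text/plain", parents)


DSF = "application/x-wine-extension-dsf"
excluded = {"text/x-csrc", DSF}

# Wine's definition as generated: no parent declared for the dsf type.
without_parent = resolve(DSF, "text/plain", {})
# Definition after adding <sub-class-of type="text/plain"/>.
with_parent_table = {DSF: ["text/plain"]}
with_parent = resolve(DSF, "text/plain", with_parent_table)

print(without_parent, is_content_indexed(without_parent, excluded, {}))
print(with_parent, is_content_indexed(with_parent, excluded, with_parent_table))
```

Under these assumptions the file resolves to text/plain and is indexed when no parent is declared, so excluding the dsf type has no effect; once the parent is declared, the more specific type is chosen and the exclusion applies. That matches the observations in the troubleshooting comment.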
I have disabled file content indexing because it not only takes a great toll on I/O disk usage in my system, but it scans and indexes useless program data files content. I have a few Wine prefixes in plain view in unhidden folders in my home, so quite a lot of data files are accessible to Baloo with a default configuration. I have caught Baloo scanning and indexing keywords of a Daz Studio data file. For example:
$ balooshow -x "/home/user/Wine/Daz/drive_c/users/Public/Documents/My DAZ 3D Library/data/DAZ 3D/Genesis 8/Male/Morphs/DAZ 3D/Base Pose Head/alias_head_eCTRLEyelidsUpperUp-DownL.dsf"
425600051801229316 2052 99092734 /home/user/Wine/Daz/drive_c/users/Public/Documents/My DAZ 3D Library/data/DAZ 3D/Genesis 8/Male/Morphs/DAZ 3D/Base Pose Head/alias_head_eCTRLEyelidsUpperUp-DownL.dsf
Mtime: 1503348208 2017-08-21T15:43:28
Ctime: 1567044300 2019-08-28T21:05:00
Cached properties:
Line Count: 44
Internal Info
Terms: 0.2784314 0.3254902 0.3764706 0.6.0.0 06 1 1.0 2017 203d 208 20head 20pose 21t23 27 34z 3d Mplain Mtext T5 T8 X20-44 alias asset author base channel colors com contributor controls data daz daz3d description down downl dsf ectrleyelidsupperup email eyelids eyes file genesis genesis8male group head http icon id info label large left library male modified modifier modifiers morphs name parent pose presentation revision scene support target type up upper url value version website www
File Name Terms: Falias Fdownl Fdsf Fectrleyelidsupperup Fhead
XAttr Terms:
lineCount: 44

I can't imagine the amount of program data it might have indexed from my home folder. In my opinion, Baloo should restrict itself to a very limited selection of files to extract keywords from. There's bug #358098 that is related to this issue. I disagree strongly with it. Sure, it might interest a few people to scan more files but that is a potentially harmful default for most users. Unknown data should be skipped, source code should be skipped. There should be a simpler default.
An extension blacklist isn't the appropriate solution, a whitelist is.

SOFTWARE/OS VERSIONS
Linux/KDE Plasma: Debian unstable
KDE Plasma Version: 5.78.0
KDE Frameworks Version: 5.20.5
Qt Version: 5.15.2