Summary: | Baloo generating bloated indexes for .obj files | ||
---|---|---|---|
Product: | [Frameworks and Libraries] frameworks-baloo | Reporter: | contact |
Component: | Baloo File Daemon | Assignee: | baloo-bugs-null |
Status: | RESOLVED FIXED | ||
Severity: | major | CC: | nate, postix, tagwerk19 |
Priority: | NOR | ||
Version: | 6.3.0 | ||
Target Milestone: | --- | ||
Platform: | Other | ||
OS: | Linux | ||
Latest Commit: | | Version Fixed In: | 6.8 |
Sentry Crash Report: | |||
Attachments: | an example obj |
(In reply to contact from comment #0)
> ... obj files are a legacy yet still commonly used File format to exchange 3D
> Models. As it is a plain text format, baloo will generate huge indexes of
> these files ...

For me, the .obj file is recognised as mime-type "model/obj":

    kmimetypefinder suzanna.obj

The section from the freedesktop definitions that identifies it is below (this is from Fedora; this segment doesn't appear in Neon):

    <mime-type type="model/obj">
      <comment>OBJ 3D model</comment>
      <sub-class-of type="text/plain"/>
      <magic>
        <match type="string" value=" OBJ File: '" offset="0:64"/>
        <match type="string" value="mtllib " offset="0:256"/>
      </magic>
      <glob pattern="*.obj"/>
    </mime-type>

The logic is pretty basic (the file contains "mtllib " within the first 256 characters). It seems to fit, although there is definitely a chance of false positives. I see there is also another ".obj" with different "magic" that gives "application/x-tgif".

"model/obj" is flagged as a subclass of "text/plain" and I'm guessing that's enough for Baloo to index it.

As Fedora and Neon have different versions of the freedesktop mime-type definitions, the behaviour may be distro dependent. You'll need to see what NixOS does.

Ways of fixing...

Baloo has a set of filename exclusions, files it does not index. That appears in the ~/.config/baloofilerc file as an "exclude filters" line. I've checked a couple of systems and they have ".obj" in the exclusions, but not "*.obj".

Maybe that's a bug or at least strange.

However, Baloo can also exclude mime types, see https://community.kde.org/Baloo/Configuration. Edit your ~/.config/baloofilerc and add a line under [General]:

    exclude mimetypes=model/obj

I don't know about doing this declaratively, but you can script it with:

    balooctl config add excludeMimetypes model/obj

This seems to let baloo index the filename but not index the content.

> DESIRED RESULT
> The file name, maybe the names of objects included in the files and some metadata.
If you want to parse the files, that's a bigger job: a dedicated KFileMetadata extractor...

> Baloo has a set of filename exclusions, files it does not index. That
> appears in the ~/.config/baloofilerc file as an "exclude filters" line. I've
> checked a couple of systems and they have ".obj" in the exclusions, but not
> "*.obj".
>
> Maybe that's a bug or at least strange.

Hm, has this whole rabbit hole been caused by a single typo?

> However Baloo can also exclude mime types, see
> https://community.kde.org/Baloo/Configuration. Edit your
> ~/.config/baloofilerc and add a line under [General]:
>
> exclude mimetypes=model/obj

Thanks for the advice. I may fix the configuration for my setup. But what is really important to me is making sure that no other users have to discover that their baloo has been generating bloated indexes.

> If you want to parse the files, that's a bigger job, a dedicated KFileMetadata extractor...

I mostly want it not to duplicate the numerical value of every data point in an obj, as I suspect this has been one of the factors contributing to the weirdly huge indexes in #488446. Object names were just something I could imagine being useful to include in an index of .obj files for human search. I do not have strong feelings on this and would consider it a low priority enhancement.

The contents of these files indeed do not make sense to index, and that should be prevented somehow.

(In reply to contact from comment #2)
> Hm has this whole rabbit hole been caused by a singe typo?

Don't know... I don't know when the exclusion was added.

A plain ".obj" exclusion in the baloofilerc *also* stops Baloo indexing any ".obj" directories, and I'm not able to say whether any dev tools create/use/expect these.

I don't think there's any chance a folder named .obj would have any content worth indexing in it.

(In reply to Nate Graham from comment #5)
> I don't think there's any chance a folder named .obj would have any content
> worth indexing in it.
Not sure how much cleaning and tidying is needed here...

There are a few "exclude filters" like ".obj" in baloofilerc; that is, just a plain ".obj" and not a "*.obj". These don't (seem to) exclude files with the given extension. Having this as a deliberate choice seems unlikely, but who knows...

A "*.obj" exclusion seems to exclude both a "file.obj" file and a ".obj" folder.

An exclusion based on the mimetype has the side effect that the filename is indexed but the content is not. That's rather nice, but there may be reasons to avoid "exclude mimetype"; there may be history.

(In reply to Nate Graham from comment #5)
> I don't think there's any chance a folder named .obj would have any content
> worth indexing in it.

I stumbled across the source...

https://invent.kde.org/frameworks/baloo/-/blob/master/src/file/fileexcludefilters.cpp

It seems that adding ".obj" as a filter for directories was deliberate. The source also shows a load of quite well hidden filters on mimetypes.

The good news here is that excluding based on mimetype seems quite proper, so we can add "model/obj" to the list. I think this is best done not as suggested in comment 1, by editing the baloofilerc file by hand, but by using:

    balooctl config add excludeMimetypes model/obj

as this adds an exclusion (and saves the whole, updated list to baloofilerc).

A possibly relevant merge request was started @ https://invent.kde.org/frameworks/baloo/-/merge_requests/208

Git commit 3e74d49259ffdd519d14dd56888bb1d1c6a1be94 by Christoph Cullmann, on behalf of Archaeopteryx Lithographica.
Committed on 11/10/2024 at 17:01.
Pushed by cullmann into branch 'master'.

[excludeMimeTypes] Exclude model/obj and text/rust from content indexing

3D models, mimetype model/obj, are subtypes of text/plain and are content indexed by default. These files can be large and full of numbers.

Rust source seems to be content indexed by default; by comparison, C and C++ source is excluded on the basis of their mime types.
Excluding these by mimetype means that Baloo will continue to index the filenames but not the content.

(The model/obj mimetype seems to be a recent addition to the freedesktop.org.xml list; it may not be present in LTS distros.)

M +7 -3 src/file/fileexcludefilters.cpp

https://invent.kde.org/frameworks/baloo/-/commit/3e74d49259ffdd519d14dd56888bb1d1c6a1be94
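The behaviour discussed in the comments (a plain ".obj" filter versus "*.obj", and filename filters versus mimetype exclusions) can be modelled with a minimal Python sketch. This is not Baloo's code: the function `classify` and its return strings are hypothetical, and Python's `fnmatch` is only assumed to be close enough to Baloo's wildcard matching to reproduce what was observed in this thread.

```python
from fnmatch import fnmatchcase


def classify(name, mimetype, exclude_filters, exclude_mimetypes):
    """Hypothetical model of the two exclusion stages discussed in this report.

    name              -- file or folder name (no path)
    mimetype          -- detected mime type, e.g. "model/obj"
    exclude_filters   -- wildcard patterns, as in the "exclude filters" line
    exclude_mimetypes -- mime types, as in the "exclude mimetypes" line
    """
    # Stage 1: a matching filename filter means the entry is not indexed at all.
    if any(fnmatchcase(name, pattern) for pattern in exclude_filters):
        return "skipped"
    # Stage 2: an excluded mime type keeps the filename but drops the content.
    if mimetype in exclude_mimetypes:
        return "filename only"
    return "filename and content"


# A plain ".obj" filter only matches something literally named ".obj".
print(classify("suzanne.obj", "model/obj", [".obj"], []))
print(classify(".obj", "inode/directory", [".obj"], []))
# "*.obj" matches both "suzanne.obj" and a folder named ".obj".
print(classify("suzanne.obj", "model/obj", ["*.obj"], []))
# Excluding the mime type keeps the filename searchable.
print(classify("suzanne.obj", "model/obj", [".obj"], ["model/obj"]))
```

Under these assumptions, the mimetype exclusion is the only option of the three that keeps "suzanne.obj" findable by name while leaving the numeric content out of the index, which matches the approach taken in the commit above.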
Created attachment 170525
an example obj

SUMMARY
obj files are a legacy yet still commonly used file format for exchanging 3D models. As it is a plain text format, baloo will index the full content of these files. This results in indexes that can grow to multiple GB in size, causing baloo to run out of memory, never finish indexing, and use up tons of compute. I assume obj files are not the only type of file that can cause this issue. This is one of the culprits for issue #488446.

STEPS TO REPRODUCE
1. Add the attached .obj to a directory indexed by baloo
2. Wait for indexing
3. $ baloosearch6 -i suzanne.obj
4. Copy the ID
5. $ balooshow6 -x [the ID you copied]

OBSERVED RESULT
A huge stack of numbers

DESIRED RESULT
The file name, maybe the names of objects included in the files and some metadata.

SOFTWARE/OS VERSIONS
Operating System: NixOS 24.11
KDE Plasma Version: 6.0.5
KDE Frameworks Version: 6.2.0
Qt Version: 6.7.1
Kernel Version: 6.9.3 (64-bit)
Graphics Platform: Wayland
Processors: 16 × AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
Memory: 27.2 GiB of RAM
Graphics Processor: AMD Radeon Graphics

ADDITIONAL INFORMATION
I know this is technically expected behavior. I know how to configure baloo so this does not happen on my machine. I still hope it is understandable why I think this needs to be resolved upstream.
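To illustrate why a plain-text model inflates a full-text index, here is a rough Python sketch that builds a synthetic OBJ-like text and counts its distinct whitespace-separated tokens. Everything in it is an assumption made for illustration: the helper names `synthetic_obj` and `distinct_terms` are hypothetical, the vertex data is random, and Baloo's real tokenizer splits text differently, so the numbers only show the trend.

```python
import random


def synthetic_obj(vertex_count, seed=0):
    """Build OBJ-like text: one object name line plus 'v x y z' vertex lines."""
    rng = random.Random(seed)
    lines = ["o Suzanne"]
    for _ in range(vertex_count):
        x, y, z = (rng.uniform(-1, 1) for _ in range(3))
        lines.append(f"v {x:.6f} {y:.6f} {z:.6f}")
    return "\n".join(lines)


def distinct_terms(text):
    """Crude stand-in for an indexer: every distinct token is one term."""
    return set(text.split())


text = synthetic_obj(10_000)
terms = distinct_terms(text)
print(len(text.split()), "tokens,", len(terms), "distinct")
```

Almost every coordinate is a term that appears only once, so the term list grows roughly in step with the file size, while the only token a person would plausibly search for is the object name.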