Bug 488533

Summary: Baloo generating bloated indexes for .obj files
Product: [Frameworks and Libraries] frameworks-baloo Reporter: contact
Component: Baloo File Daemon Assignee: baloo-bugs-null
Status: RESOLVED FIXED    
Severity: major CC: nate, postix, tagwerk19
Priority: NOR    
Version: 6.3.0   
Target Milestone: ---   
Platform: Other   
OS: Linux   
Latest Commit: Version Fixed In: 6.8
Sentry Crash Report:
Attachments: an example obj

Description contact 2024-06-15 12:40:37 UTC
Created attachment 170525 [details]
an example obj

SUMMARY
obj files are a legacy yet still commonly used file format for exchanging 3D models. Because it is a plain text format, baloo indexes the full content of these files, which is mostly numbers. The resulting indexes can grow to multiple GB in size, and baloo can run out of memory, never finish indexing, and use up a lot of compute.

I assume obj files are not the only type of file that can cause this issue.

This is one of the culprits for issue #488446

STEPS TO REPRODUCE
1. add the attached .obj to a directory indexed by baloo
2. wait for indexing
3. $ baloosearch6 -i suzanne.obj
4. copy the ID
5. $ balooshow6 -x [the ID you copied]

OBSERVED RESULT
a huge stack of numbers

DESIRED RESULT
The file name, maybe the names of objects included in the files and some metadata.

SOFTWARE/OS VERSIONS
Operating System: NixOS 24.11
KDE Plasma Version: 6.0.5
KDE Frameworks Version: 6.2.0
Qt Version: 6.7.1
Kernel Version: 6.9.3 (64-bit)
Graphics Platform: Wayland
Processors: 16 × AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
Memory: 27.2 GiB of RAM
Graphics Processor: AMD Radeon Graphics

ADDITIONAL INFORMATION
I know this is technically expected behavior. I know how to configure baloo so this does not happen on my machine. I still hope it is understandable why I think this needs to be resolved upstream.
Comment 1 tagwerk19 2024-06-15 20:50:38 UTC
(In reply to contact from comment #0)
> ... obj files are a legacy yet still commonly used File format to exchange 3D
> Models. As it is a plain text format, baloo will generate huge indexes of
> these files ...
For me, the .obj file is recognised as mime-type "model/obj"

    kmimetypefinder suzanne.obj

The section from the freedesktop definitions that identifies it is here (this is from Fedora, this segment doesn't appear in Neon)

    <mime-type type="model/obj"> 
      <comment>OBJ 3D model</comment>
      <sub-class-of type="text/plain"/>
      <magic> 
        <match type="string" value=" OBJ File: '" offset="0:64"/>
        <match type="string" value="mtllib " offset="0:256"/>
      </magic>
      <glob pattern="*.obj"/> 
    </mime-type>
    
The logic is pretty basic (the file contains "mtllib " within the first 256 characters). It seems to fit, although there is definitely a chance of false positives. I see there is also another ".obj" with different "magic" that gives "application/x-tgif". "model/obj" is flagged as a subclass of "text/plain" and I'm guessing that's enough for Baloo to index it.
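
As a rough illustration of that magic rule, here is a minimal Python sketch. It is not the shared-mime-info or KDE implementation; the name `looks_like_obj` is made up for this example, and it assumes that `offset="0:N"` means the string may start at any byte offset from 0 to N inclusive and that sibling `<match>` entries are alternatives.

```python
# Hypothetical helper for illustration only; not part of shared-mime-info or Baloo.
# Each rule is (magic string, highest start offset), taken from the XML above.
MAGIC_RULES = [
    (b" OBJ File: '", 64),
    (b"mtllib ", 256),
]


def looks_like_obj(data: bytes) -> bool:
    """Return True if any magic string starts within its offset range."""
    for needle, max_offset in MAGIC_RULES:
        # A match may start at any offset from 0 to max_offset inclusive,
        # so only the first max_offset + len(needle) bytes are relevant.
        window = data[: max_offset + len(needle)]
        if needle in window:
            return True
    return False
```

Any text file that happens to mention "mtllib " near the top would pass this check too, which is where the chance of false positives comes from.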

As Fedora and Neon have different versions of the freedesktop mime-type definitions, the behaviour may be distro dependent. You'll need to see what NixOS does.

Ways of fixing...

Baloo has a set of filename exclusions, files it does not index. That appears in the ~/.config/baloofilerc file as an "exclude filters" line. I've checked a couple of systems and they have ".obj" in the exclusions, but not "*.obj".

Maybe that's a bug or at least strange.

However Baloo can also exclude mime types, see https://community.kde.org/Baloo/Configuration. Edit your ~/.config/baloofilerc and add a line under [General]:
    
   exclude mimetypes=model/obj
        
I don't know about doing this declaratively, but you can script it with:
    
   balooctl config add excludeMimetypes model/obj
        
This seems to let baloo index the filename but not index the content.

> DESIRED RESULT
> The file name, maybe the names of objects included in the files and some metadata.
If you want to parse the files, that's a bigger job, a dedicated KFileMetadata extractor...
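
To give an idea of the core logic such an extractor would need, here is a minimal sketch in Python. It is not KFileMetadata code, and `obj_object_names` is a made-up name; it assumes the names of interest come from the `o` (object) and `g` (group) statements of the OBJ format and ignores everything else.

```python
def obj_object_names(lines):
    """Collect names from 'o' and 'g' statements, in order, without duplicates."""
    names = []
    for line in lines:
        # Split off the statement keyword; the rest of the line is the name.
        parts = line.split(None, 1)
        # Blank lines, comments and the numeric v/vn/vt/f records are skipped.
        if len(parts) == 2 and parts[0] in ("o", "g"):
            name = parts[1].strip()
            if name and name not in names:
                names.append(name)
    return names
```

Passing an open text file (for example `open("suzanne.obj", errors="replace")`) as `lines` would return just the handful of names, rather than every vertex coordinate.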
Comment 2 contact 2024-06-15 22:39:35 UTC
> Baloo has a set of filename exclusions, files it does not index. That
> appears in the ~/.config/baloofilerc file as an "exclude filters" line. I've
> checked a couple of systems and they have ".obj" in the exclusions, but not
> "*.obj".
> 
> Maybe that's a bug or at least strange.

Hm, has this whole rabbit hole been caused by a single typo?
 
> However Baloo can also exclude mime types, see
> https://community.kde.org/Baloo/Configuration. Edit your
> ~/.config/baloofilerc and add a line under [General]:
>     
>    exclude mimetypes=model/obj

Thanks for the advice. I may fix the configuration for my setup.

But what is really important to me is making sure that other users do not have to discover that their baloo has been generating bloated indexes.

> If you want to parse the files, that's a bigger job, a dedicated KFileMetadata extractor...

I mostly want it not to duplicate the numerical value of every data point in an obj, as I suspect this has been one of the factors contributing to the weirdly huge indexes in #488446.

Object names were just something I could imagine being useful to include in an index of .obj files for human search. I do not have strong feelings on this and would consider it a low priority enhancement.
Comment 3 Nate Graham 2024-06-17 18:00:46 UTC
The contents of these files indeed do not make sense to index, and that should be prevented somehow.
Comment 4 tagwerk19 2024-06-25 06:56:20 UTC
(In reply to contact from comment #2)
> Hm has this whole rabbit hole been caused by a single typo?
Don't know...

I don't know when the exclusion was added. A plain ".obj" exclusion in the baloofilerc *also* stops Baloo indexing any ".obj" directories and I'm not able to say whether any dev tools create/use/expect these.
Comment 5 Nate Graham 2024-06-26 19:15:26 UTC
I don't think there's any chance a folder named .obj would have any content worth indexing in it.
Comment 6 tagwerk19 2024-06-27 08:15:07 UTC
(In reply to Nate Graham from comment #5)
> I don't think there's any chance a folder named .obj would have any content
> worth indexing in it.
Not sure how much cleaning and tidying is needed here...

There are a few "exclude filters" like ".obj" in baloofilerc; that is just a plain ".obj" and not a "*.obj". These don't (seem to) exclude files with the given extension. Having this as a deliberate choice seems unlikely but who knows...
    
A "*.obj" exclusion seems to exclude a "file.obj" file and a ".obj" folder.
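
The difference between the two patterns can be illustrated with shell-style wildcard matching in Python. This is only an analogy: it assumes the "exclude filters" behave like wildcard patterns matched against a single file or folder name, and it uses Python's `fnmatch` module, not Baloo's own matching code. The helper name `excluded` is made up for this example.

```python
from fnmatch import fnmatchcase


def excluded(name: str, pattern: str) -> bool:
    """True if a file or folder name matches a wildcard exclude pattern."""
    return fnmatchcase(name, pattern)


# Compare the plain ".obj" filter with the "*.obj" filter on both kinds of name.
for pattern in (".obj", "*.obj"):
    for name in ("file.obj", ".obj"):
        print(f"{pattern!r:8} vs {name!r:10} -> {excluded(name, pattern)}")
```

Under these assumptions the plain ".obj" pattern only matches something named exactly ".obj", while "*.obj" matches both "file.obj" and ".obj".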
        
An exclusion based on the mimetype has the side effect that the filename is indexed but the content is not. That's quite nice, but there may be reasons to avoid "exclude mimetype" - there may be history.
Comment 7 tagwerk19 2024-07-13 07:14:31 UTC
(In reply to Nate Graham from comment #5)
> I don't think there's any chance a folder named .obj would have any content
> worth indexing in it.
I stumbled across the source...
    https://invent.kde.org/frameworks/baloo/-/blob/master/src/file/fileexcludefilters.cpp
It seems that adding ".obj" as a filter for directories was deliberate.

It also shows a load of quite well hidden filters on mimetypes. The good news here is that excluding based on mimetype seems quite proper, so we can add "model/obj" to the list. I think this is best done not by editing the baloofilerc file by hand, as suggested in comment 1, but by using:
    balooctl config add excludeMimetypes model/obj
as this adds an exclusion (and saves the whole, updated, list to baloofilerc).
Comment 8 Bug Janitor Service 2024-10-04 05:30:31 UTC
A possibly relevant merge request was started @ https://invent.kde.org/frameworks/baloo/-/merge_requests/208
Comment 9 Christoph Cullmann 2024-10-11 17:10:21 UTC
Git commit 3e74d49259ffdd519d14dd56888bb1d1c6a1be94 by Christoph Cullmann, on behalf of Archaeopteryx Lithographica.
Committed on 11/10/2024 at 17:01.
Pushed by cullmann into branch 'master'.

[excludeMimeTypes] Exclude model/obj and text/rust from content indexing

3D models, mimetype model/obj, are subtypes of text/plain and are content
indexed by default. These files can be large and full of numbers.

Rust source seems to be content indexed by default, by comparison
C and C++ source is excluded on the basis of their mime types.

Excluding these by mimetype means that Baloo will continue to index
the filenames but not the content

(The model/obj mimetype seems to be a recent addition to the
freedesktop.org.xml list, it may not be present in LTS distros)

M  +7    -3    src/file/fileexcludefilters.cpp

https://invent.kde.org/frameworks/baloo/-/commit/3e74d49259ffdd519d14dd56888bb1d1c6a1be94