Bug 345406

Summary: Baloo attempts to index MPEG TS file as text
Product: [Unmaintained] Baloo
Reporter: Pontus Johannesson <hydrardraconis>
Component: General
Assignee: Vishesh Handa <me>
Status: RESOLVED FIXED    
Severity: normal    
Priority: NOR    
Version First Reported In: 4.13   
Target Milestone: ---   
Platform: Gentoo Packages   
OS: Linux   
Latest Commit:
Version Fixed/Implemented In: 5.3.1
Sentry Crash Report:

Description Pontus Johannesson 2015-03-21 21:42:58 UTC
I downloaded this UHD demo file to benchmark H265 decoding on my machine: http://demo-uhd3d.com/files/uhd4k/Demo_Samsung_2014_-_Iceland.zip (800MB .ts file)
As soon as it was extracted, baloo started indexing it, taking 100% CPU with RAM usage gradually growing towards 1.7-1.8GB before I killed it. I only have 4GB in this machine, so I did not want it to get swap locked.
file shows it as "MPEG Transport stream", while dolphin says it's a "Message catalog", which sounds off.

Reproducible: Always

Steps to Reproduce:
1. Download the file and extract it somewhere baloo is indexing
2. Check top and look for baloo_file_extractor using resources

Actual Results:  
baloo_file_extractor starts eating 1.5+GB of RAM and 100% CPU

Expected Results:  
The file extractor should quit within one or two seconds, since it's a binary file
Comment 1 Vishesh Handa 2015-03-23 19:40:04 UTC
Confirmed. This is even a problem with Qt5.

Fast Mimetype: text/vnd.trolltech.linguist
Slow Mimetype: text/vnd.trolltech.linguist

The fix will probably need to go into Qt.
Comment 2 Vishesh Handa 2015-05-13 14:07:57 UTC
Git commit c19b7a9ded994009c49007d8336afe92acf513cd by Vishesh Handa.
Committed on 13/05/2015 at 14:07.
Pushed by vhanda into branch 'Plasma/5.3'.

Only use the file's content during mimetype detection

During the first indexing phase, we only use the filename as we do not
want the overhead of reading the contents of the file.

During the second indexing phase, we are actually going to be indexing
the contents of the file. At this time, it's perfectly fine to read the
file's contents to determine the mimetype. We were using
QMimeDatabase::mimeTypeForFile with its default settings which takes
both the filename and file contents into consideration. This results in
interesting cases where if a file ends with '.ts' it is detected as a
'linguist' file, even though the magic byte mapping failed.

We want the mimetype to be as exact as possible. We now only use the
file's contents, and not the filename.
Related: bug 342312
FIXED-IN: 5.3.1

M  +1    -1    src/file/extractor/app.cpp
M  +2    -2    src/file/tests/indexerconfigtest.cpp

http://commits.kde.org/baloo/c19b7a9ded994009c49007d8336afe92acf513cd
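
To illustrate the distinction the commit message draws, here is a minimal standalone sketch of name-based versus content-based type guessing. It is not Baloo's or Qt's code: guessFromName and guessFromContent are hypothetical helpers, and the content checks are simplified assumptions (an MPEG transport stream is treated as 188-byte packets that each start with the sync byte 0x47, and a Qt Linguist file as text containing "<!DOCTYPE TS" near the start). In Qt itself, content-only matching is requested by passing QMimeDatabase::MatchContent to QMimeDatabase::mimeTypeForFile, which is presumably what the one-line change in app.cpp does.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: guess from the file name only (cheap, first phase).
// Any name ending in ".ts" is assumed to be a Qt Linguist file.
std::string guessFromName(const std::string& name) {
    const std::string ext = ".ts";
    if (name.size() >= ext.size() &&
        name.compare(name.size() - ext.size(), ext.size(), ext) == 0) {
        return "text/vnd.trolltech.linguist";
    }
    return "application/octet-stream";
}

// Hypothetical helper: guess from the leading bytes only (second phase).
std::string guessFromContent(const std::vector<unsigned char>& data) {
    constexpr std::size_t packetSize = 188;   // MPEG-TS packet length
    constexpr unsigned char syncByte = 0x47;  // first byte of every packet
    constexpr std::size_t packetsToCheck = 4; // assumption: 4 packets suffice

    if (data.size() >= packetSize * packetsToCheck) {
        bool allSync = true;
        for (std::size_t i = 0; i < packetsToCheck; ++i) {
            if (data[i * packetSize] != syncByte) {
                allSync = false;
                break;
            }
        }
        if (allSync) {
            return "video/mp2t";
        }
    }

    // Simplified check for a Linguist file: look for its DOCTYPE
    // within the first 256 bytes.
    const std::string marker = "<!DOCTYPE TS";
    const auto limit = static_cast<std::ptrdiff_t>(
        std::min<std::size_t>(data.size(), 256));
    const auto end = data.begin() + limit;
    if (std::search(data.begin(), end, marker.begin(), marker.end()) != end) {
        return "text/vnd.trolltech.linguist";
    }
    return "application/octet-stream";
}
```

With these helpers, a video named Iceland.ts is reported as a Linguist file by guessFromName, while guessFromContent classifies it by its packet structure, so a text extractor would never be chosen for it.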