345406 – Baloo attempts to index MPEG TS file as text

Status:	RESOLVED FIXED

Alias:	None

Product:	Baloo
Classification:	Unmaintained
Component:	General (other bugs)
Version First Reported In:	4.13
Platform:	Gentoo Packages Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	Vishesh Handa

Reported:	2015-03-21 21:42 UTC by Pontus Johannesson
Modified:	2015-05-13 14:07 UTC
CC List:	0 users

Latest Commit:	http://commits.kde.org/baloo/c19b7a9ded994009c49007d8336afe92acf513cd
Version Fixed/Implemented In:	5.3.1

Description Pontus Johannesson 2015-03-21 21:42:58 UTC

I downloaded this UHD demo file (an 800 MB .ts file) to benchmark H.265 decoding on my machine: http://demo-uhd3d.com/files/uhd4k/Demo_Samsung_2014_-_Iceland.zip
As soon as it was extracted, baloo started indexing it, taking 100% CPU with RAM usage gradually growing towards 1.7-1.8 GB before I killed it. I only have 4 GB in this machine, so I did not want it to start swapping.
file shows it as "MPEG Transport stream"; Dolphin says it's a "Message catalog", which sounds off.

Reproducible: Always

Steps to Reproduce:
1. Download the file and extract it somewhere baloo is indexing
2. Check top and look for baloo_file_extractor using resources

Actual Results:  
baloo_file_extractor starts eating 1.5+GB of RAM and 100% CPU

Expected Results:  
The file extractor should quit within a second or two, since it's a binary file

Comment 1 Vishesh Handa 2015-03-23 19:40:04 UTC

Confirmed. This is a problem even with Qt5.

Fast Mimetype: text/vnd.trolltech.linguist
Slow Mimetype: text/vnd.trolltech.linguist

The fix will probably need to go into Qt.

Comment 2 Vishesh Handa 2015-05-13 14:07:57 UTC

Git commit c19b7a9ded994009c49007d8336afe92acf513cd by Vishesh Handa.
Committed on 13/05/2015 at 14:07.
Pushed by vhanda into branch 'Plasma/5.3'.

Only use the file's content during mimetype detection

During the first indexing phase, we only use the filename as we do not
want the overhead of reading the contents of the file.

During the second indexing phase, we are actually going to be indexing
the contents of the file. At this time, it's perfectly fine to read the
file's contents to determine the mimetype. We were using
QMimeDatabase::mimeTypeForFile with its default settings which takes
both the filename and file contents into consideration. This results in
interesting cases where a file whose name ends with '.ts' is detected as a
'linguist' file, even though the magic byte mapping failed.

We want the mimetype to be as exact as possible. We now only use the
file's contents, and not the filename.
Related: bug 342312
FIXED-IN: 5.3.1

M  +1    -1    src/file/extractor/app.cpp
M  +2    -2    src/file/tests/indexerconfigtest.cpp

http://commits.kde.org/baloo/c19b7a9ded994009c49007d8336afe92acf513cd
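
For illustration only, the split between the two indexing phases described in the commit message can be sketched in plain C++ without Qt. This is not Baloo's or Qt's code: the function names mimeFromName and mimeFromContent, the tiny extension table, and the content checks are all hypothetical. The sketch assumes plain 188-byte MPEG-TS packets that each start with the sync byte 0x47, and it assumes a Qt Linguist file can be spotted by a "<!DOCTYPE TS>" line near its start. In Qt itself, the MatchMode argument of QMimeDatabase::mimeTypeForFile (MatchDefault, MatchExtension, MatchContent) selects between name-based and content-based matching.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Phase 1 (hypothetical): guess from the file name only; the file is never read.
std::string mimeFromName(const std::string& fileName)
{
    const std::size_t dot = fileName.rfind('.');
    const std::string ext =
        (dot == std::string::npos) ? std::string() : fileName.substr(dot);

    // Toy extension table. '.ts' maps to the linguist type, which is the
    // misdetection reported in this bug.
    if (ext == ".ts") {
        return "text/vnd.trolltech.linguist";
    }
    if (ext == ".txt") {
        return "text/plain";
    }
    return "application/octet-stream";
}

// Phase 2 (hypothetical): decide from the content only; the name is ignored.
std::string mimeFromContent(const std::vector<unsigned char>& data)
{
    constexpr std::size_t kPacketSize = 188;   // assumed plain MPEG-TS packet size
    constexpr unsigned char kSyncByte = 0x47;  // first byte of every TS packet

    // Two consecutive packets starting with the sync byte: treat as MPEG-TS.
    if (data.size() >= 2 * kPacketSize
        && data[0] == kSyncByte && data[kPacketSize] == kSyncByte) {
        return "video/mp2t";
    }

    // Assumption: a Linguist translation file declares its doctype early on.
    const std::size_t headLen = std::min<std::size_t>(data.size(), 256);
    const std::string head(data.begin(),
                           data.begin() + static_cast<std::ptrdiff_t>(headLen));
    if (head.find("<!DOCTYPE TS>") != std::string::npos) {
        return "text/vnd.trolltech.linguist";
    }

    return "application/octet-stream";
}
```

With these toy functions, mimeFromName("Demo.ts") yields the linguist type without touching the file, while mimeFromContent on data whose first two 188-byte packets start with 0x47 yields video/mp2t, so a text extractor would not be chosen for it.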