Bug 342312 - baloo_file_extractor misdetects a 125 MB binary shader file as text/calendar
Status: RESOLVED FIXED
Alias: None
Product: Baloo
Classification: Unmaintained
Component: General (other bugs)
Version First Reported In: 0.1
Platform: Kubuntu Linux
Importance: NOR minor
Target Milestone: ---
Assignee: Vishesh Handa
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-12-29 15:42 UTC by Juan Ases García
Modified: 2015-05-13 14:07 UTC (History)

See Also:
Latest Commit:
Version Fixed/Implemented In: 5.3.1
Sentry Crash Report:


Attachments
Truncated offending shader file (4.88 KB, text/calendar)
2014-12-29 15:47 UTC, Juan Ases García

Description Juan Ases García 2014-12-29 15:42:56 UTC
Baloo uses 100% CPU while trying to index a 125 MB binary shader file.
The file ("dearesther/platform/shaders/fxc/phong_ps20b.360.vcs") is from a game.
It seems to be detected as "text/calendar".

Reproducible: Always

Steps to Reproduce:
1. Install the game Dear Esther to get the conflicting file
2. Log Out and Log In
3. Make sure baloo is enabled

Actual Results:  
baloo_file_extractor goes to 100% CPU usage on every login for an indeterminate period of time

Expected Results:  
The file is detected correctly and its contents are ignored
Comment 1 Juan Ases García 2014-12-29 15:47:32 UTC
Created attachment 90158 [details]
Truncated offending shader file

The file was truncated with: truncate --size=5000 phong_ps20b.360.vcs
Comment 2 Juan Ases García 2014-12-29 17:00:18 UTC
The complete file (not truncated) can be downloaded from this link: https://app.box.com/s/qdx5bwrm34izvashtvvh

PS: Output of the file command for a real vCalendar file and for the shader file (a "Valve Left 4 Dead Shader file"):

file *
phong_ps20b.360.vcs: data
vcalendar.vcs:       vCalendar calendar file

file --mime-type *
phong_ps20b.360.vcs: application/octet-stream
vcalendar.vcs:       text/calendar
Comment 3 Vishesh Handa 2015-05-13 14:07:57 UTC
Git commit c19b7a9ded994009c49007d8336afe92acf513cd by Vishesh Handa.
Committed on 13/05/2015 at 14:07.
Pushed by vhanda into branch 'Plasma/5.3'.

Only use the file's content during mimetype detection

During the first indexing phase, we only use the filename as we do not
want the overhead of reading the contents of the file.

During the second indexing phase, we are actually going to be indexing
the contents of the file. At this time, it's perfectly fine to read the
file's contents to determine the mimetype. We were using
QMimeDatabase::mimeTypeForFile with its default settings which takes
both the filename and file contents into consideration. This results in
interesting cases where if a file ends with '.ts' it is detected as a
'linguist' file, even though the magic byte mapping failed.

We want the mimetype to be as exact as possible. We now only use the
file's contents, and not the filename.
Related: bug 345406
FIXED-IN: 5.3.1

M  +1    -1    src/file/extractor/app.cpp
M  +2    -2    src/file/tests/indexerconfigtest.cpp

http://commits.kde.org/baloo/c19b7a9ded994009c49007d8336afe92acf513cd
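
The difference between the two detection strategies described in the commit message above can be illustrated with a small standalone sketch. This is not the Baloo code: the helpers mimeFromExtension and mimeFromContent are hypothetical, use only the C++ standard library, and know about a single type. The only "magic" checked is the "BEGIN:VCALENDAR" header that vCalendar files start with, and the binary bytes in the self-check are placeholders, not the real shader contents. In Qt itself, content-only matching is what QMimeDatabase::mimeTypeForFile does when called with QMimeDatabase::MatchContent instead of its default mode; that this is the exact one-line change made in app.cpp is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: guess a MIME type from the file name alone.
// Any name ending in ".vcs" is reported as a calendar, whatever it contains.
std::string mimeFromExtension(const std::string& fileName) {
    const std::string ext = ".vcs";
    if (fileName.size() >= ext.size() &&
        fileName.compare(fileName.size() - ext.size(), ext.size(), ext) == 0) {
        return "text/calendar";
    }
    return "application/octet-stream";
}

// Hypothetical helper: guess a MIME type from the first bytes of the file.
// Only the vCalendar header is recognised (case-insensitively); everything
// else falls back to the generic binary type.
std::string mimeFromContent(const std::string& head) {
    const std::string magic = "BEGIN:VCALENDAR";
    if (head.size() >= magic.size() &&
        std::equal(magic.begin(), magic.end(), head.begin(),
                   [](char expected, char actual) {
                       return std::toupper(static_cast<unsigned char>(actual)) == expected;
                   })) {
        return "text/calendar";
    }
    return "application/octet-stream";
}

// Self-check mirroring the two files from the report.
void demo() {
    // Placeholder bytes standing in for the binary shader file.
    const std::string shaderHead("\x06\x01\x02\x03", 4);
    const std::string calendarHead = "BEGIN:VCALENDAR\r\nVERSION:1.0\r\n";

    // By name, both files look like calendars.
    assert(mimeFromExtension("phong_ps20b.360.vcs") == "text/calendar");
    assert(mimeFromExtension("vcalendar.vcs") == "text/calendar");

    // By content, only the real vCalendar file does.
    assert(mimeFromContent(shaderHead) == "application/octet-stream");
    assert(mimeFromContent(calendarHead) == "text/calendar");
}
```

With content-only matching, the shader file would fall back to a generic binary type, which agrees with the application/octet-stream result that file --mime-type reports in comment 2.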