Bug 136955 - image files are imported by filename only, not by filetype
Summary: image files are imported by filename only, not by filetype
Status: RESOLVED FIXED
Alias: None
Product: kphotoalbum
Classification: Applications
Component: Backend
Version: SVN (KDE3 branch)
Platform: unspecified Linux
Importance: NOR normal
Target Milestone: ---
Assignee: KPhotoAlbum Bugs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-11-06 17:14 UTC by Falk Krönert
Modified: 2012-02-13 20:29 UTC
CC List: 3 users

See Also:
Latest Commit:
Version Fixed In:


Attachments
checks filetype by mime, not filename (501 bytes, patch)
2006-11-12 04:35 UTC, Falk Krönert
Optionally find new images by mimetype only. (6.28 KB, patch)
2012-02-12 22:48 UTC, Johannes Zarl-Zierl

Description Falk Krönert 2006-11-06 17:14:02 UTC
Version:           SVN (using KDE 3.5.5 "release 19.1" , openSUSE )
Compiler:          Target: i586-suse-linux
OS:                Linux (i686) release 2.6.13-15.12-default

Files named pict000.jpg or Akademy2005.png are imported, even when they are misnamed and both are really image/tiff. Files named Sol-2002.01.04 or Judy.image (both image/png) are not imported.
Comment 1 Falk Krönert 2006-11-12 04:35:06 UTC
Created attachment 18505
checks filetype by mime, not filename

This fixes it for me as far as I can tell.
Comment 2 Shawn Willden 2007-05-11 15:41:05 UTC
Falk, in case you're not on the mailing list, your patch is being discussed there.  There is a concern about the performance implications of doing the MIME-based examination, which will require reading at least the first part of each file being scanned.

Jesper proposed a modification that first checks the file name and only bothers to check the MIME type if the file name check doesn't identify the file as an image.  That helps, but it's not enough because some users have lots of files (tens of thousands) in their image directories that should live with the image files but should not be organized by kphotoalbum.   At present that's not an issue, because KPA looks only at the file names to determine that the files should be ignored.  With a MIME-based type check, those files would be opened and read during every KPA startup.  We're constantly fighting to reduce KPA startup time, because it's already pretty costly.

What might allow MIME-based lookups to be used is some mechanism that tracks files to be excluded from KPA.  Then, KPA could test each filename it finds against that list and only do the MIME lookup if the file isn't excluded.  If the MIME analysis determines that the file is not an image, then perhaps that filename could be added to the exclusion list so it won't be examined again -- or maybe the exclusion list handling should be manual.
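
A minimal sketch of the scheme described above, in Python rather than KPhotoAlbum's own code. The names find_images, IMAGE_EXTENSIONS and looks_like_image are hypothetical, and the content check is passed in as a function so the sketch makes no claim about how KPA inspects a file. It shows the automatic variant, where a failed content check adds the file to the exclusion list.

```python
import os
from typing import Callable, Iterable, List, Set

# Hypothetical extension list; KPhotoAlbum's real list is not shown in this report.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def find_images(
    paths: Iterable[str],
    excluded: Set[str],
    looks_like_image: Callable[[str], bool],
) -> List[str]:
    """Return the paths accepted as images; `excluded` grows as a side effect."""
    found = []
    for path in paths:
        if path in excluded:
            continue  # known non-image: skipped without touching the file
        ext = os.path.splitext(path)[1].lower()
        if ext in IMAGE_EXTENSIONS:
            found.append(path)  # cheap path: the name alone is enough
        elif looks_like_image(path):
            found.append(path)  # expensive path: the content was inspected
        else:
            excluded.add(path)  # remember the miss so later scans skip it
    return found
```

In this sketch a file such as notes.txt costs one content check on the first scan and none afterwards, as long as the exclusion set is persisted between runs.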

Finally, it's questionable whether or not it's worth so much effort just so users don't have to use standard file extensions.

If you have something to add to the discussion, respond here or, better yet, on the mailing list.
Comment 3 Falk Krönert 2007-05-20 19:01:27 UTC
The motivation for the patch: I have tons of DVDs with badly named images, mostly in YEARMMDD.NUM filename format. Remastering them just to correct the names is not feasible. I linked /media to /home/pics, so with the patch KPA is able to find the images; I just change the DVD and rescan. It's faster than anything else.
Comment 4 Risto H. Kurppa 2010-07-18 07:56:05 UTC
So what's the result?
Comment 5 Johannes Zarl-Zierl 2012-02-12 22:48:17 UTC
Created attachment 68738
Optionally find new images by mimetype only.

Adds a configuration setting to ignore file extensions.
Comment 6 rlk 2012-02-12 23:10:59 UTC
This is OK (as long as the default stays false), but the performance implications of this are really substantial, particularly on rotating storage.

Looking at just the filename requires only reading the directory, which is already done.  Actually accessing the file requires at least two additional I/O operations:

1) Reading the inode from disk, to find the location of the file on disk.

2) Reading at least the first block of the file.

A typical disk these days has maybe 10 millisecond average access time, so this will require 20 milliseconds extra per file.  If you keep a lot of additional files around, that time could really add up in a hurry.
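
To illustrate why a content-based check has to touch the file itself, here is a small Python sketch that identifies a few image formats from their leading bytes. It is not the patch attached to this bug; the function name sniff_image_type and the short signature table are assumptions made for the example, and a real MIME database covers far more formats.

```python
from typing import Optional

# Leading "magic" bytes of a few common image formats.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"II*\x00", "image/tiff"),  # little-endian TIFF
    (b"MM\x00*", "image/tiff"),  # big-endian TIFF
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]


def sniff_image_type(path: str) -> Optional[str]:
    """Return an image MIME type based on the file's first bytes, or None."""
    try:
        # Opening the file needs its metadata; reading needs its first data block.
        with open(path, "rb") as handle:
            head = handle.read(16)
    except OSError:
        return None  # unreadable files are treated as "not an image"
    for magic, mime in SIGNATURES:
        if head.startswith(magic):
            return mime
    return None
```

With such a check, a PNG file named Judy.image is recognized regardless of its name, at the price of the open and read described above.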

Performance is one of the big advantages of KPA over a lot of other tools.
Comment 7 Falk Krönert 2012-02-13 00:20:49 UTC
I like that someone else sees the need for this as well.

And what good is performance if the files you want to use aren't found? Since this will only scan files that are not yet recognized (and not blocked by name), and normally you won't have that many non-image files alongside your images, it won't really affect speed. And on top of that, it's optional.

This is also an advantage over the competitor (digiKam): I proposed a patch there similar to the original KPA one, and it wasn't even considered by the developers.
Comment 8 rlk 2012-02-13 00:55:12 UTC
Again, I have no problem with this being an option, as long as it's not the default.

It may not take many non-image files to affect cold startup performance.  If you use the common directory scheme of 100 files per directory (what many cameras do), and there's one non-image file per directory, and you have 50,000 images, that would be 500 non-image files.  At 20 milliseconds per file, that would be an extra 10 seconds at cold startup, which is not negligible.  At warm start, of course, that time would be negligible.
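
The estimate above can be reproduced with a few lines of Python. The constants are the assumptions stated in this thread (about 10 milliseconds per disk access, two accesses per file), not measurements, and the function name is made up for the example.

```python
ACCESS_TIME_S = 0.010  # ~10 ms average access time, as assumed in comment 6
IO_OPS_PER_FILE = 2    # read the inode, then the first block of the file


def extra_cold_start_seconds(images, files_per_dir, non_images_per_dir):
    """Estimate the extra scan time caused by content-checking non-image files."""
    directories = images // files_per_dir
    non_image_files = directories * non_images_per_dir
    return non_image_files * IO_OPS_PER_FILE * ACCESS_TIME_S


# 50,000 images, 100 files per directory, one non-image file per directory
print(extra_cold_start_seconds(50_000, 100, 1))
```

With the numbers from the comment this gives 500 non-image files and roughly 10 seconds of extra time at cold startup.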
Comment 9 Miika Turkia 2012-02-13 16:52:16 UTC
Git commit 1a526f8ac9c7e8f56a3ec5f1570c531c5bcac884 by Miika Turkia, on behalf of Johannes Zarl.
Committed on 12/02/2012 at 23:44.
Pushed by mturkia into branch 'master'.

Optionally find new images by mimetype only.

M  +9    -0    Settings/FileVersionDetectionPage.cpp
M  +1    -0    Settings/FileVersionDetectionPage.h
M  +1    -0    Settings/SettingsData.cpp
M  +1    -0    Settings/SettingsData.h
M  +2    -1    Utilities/Util.cpp

http://commits.kde.org/kphotoalbum/1a526f8ac9c7e8f56a3ec5f1570c531c5bcac884
Comment 10 Johannes Zarl-Zierl 2012-02-13 19:38:12 UTC
(In reply to comment #8)

I guess I have to clarify things a bit:

> Again, I have no problem with this being an option, as long as it's not the
> default.
You can be at ease -- the default behaviour stays exactly the same as before, and it will stay that way because of the performance implications (see below) ...

> It may not take many non-image files to affect cold startup performance.  If
> you use the common directory scheme of 100 files per directory (what many
> cameras do), and there's one non-image file per directory, and you have 50,000
> images, that would be 500 non-image files.  At 20 milliseconds per file, that
> would be an extra 10 seconds at cold startup, which is not negligible.  At warm
> start, of course, that time would be negligible.

Today I tested this again by creating a directory structure with ~50000 non-image files in ~500 directories. The new-image-search then takes about 10 seconds on my PC. Good news: on a multi-core PC you probably won't notice, because this happens in a background thread. Bad news: every search has to examine those files again (they are neither in the DB nor blocked), so every new-image-search with this feature enabled will be slow.

Conclusion:
I guess that the set of people needing this feature and the set of people having lots of non-image files in their folders are mostly disjoint. Most people won't notice the difference.

The use-case in comment #3 can only work with this feature in place. For those people with non-image files, enabling this by default would certainly feel like a performance bug.
-> Add the feature, but disable it by default.
Comment 11 Andreas Neustifter 2012-02-13 20:09:25 UTC
Sorry to be so late with my comments, but my weekend was blocked...

I do not necessarily oppose the idea of scanning files with ill-named extensions for their type, but adding more and more options is exactly why new and inexperienced users are confused by KPhotoAlbum.

Would it be possible to do all the checks in increasing order of cost and return the result early if possible? 

Or alternatively we have to think of a way to bury all the options which could be assigned a sane default way deeper into the settings.
Comment 12 Johannes Zarl-Zierl 2012-02-13 20:29:46 UTC
(In reply to comment #11)
> Would it be possible to do all the checks in increasing order of cost and
> return the result early if possible? 

No, that's not possible. The earliest result which can be relied upon (and also the most costly) is the Mime type. The only difference here is that we either use so-called "fast mode" and ignore unlikely extensions from the start, or don't use fast mode and check the Mime type for every file.
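
One possible reading of the two modes, sketched in Python. This is an interpretation of the sentence above, not the change committed to Utilities/Util.cpp; is_candidate, LIKELY_EXTENSIONS and mime_is_image are hypothetical names, and the MIME check is passed in as a function.

```python
import os
from typing import Callable

# Hypothetical list of extensions considered "likely" to be images.
LIKELY_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def is_candidate(
    path: str,
    fast_mode: bool,
    mime_is_image: Callable[[str], bool],
) -> bool:
    """Decide whether `path` should be treated as an image."""
    if fast_mode:
        ext = os.path.splitext(path)[1].lower()
        if ext not in LIKELY_EXTENSIONS:
            return False  # rejected by name alone; the file is never opened
    # Either fast mode is off or the extension looked plausible:
    # the (costly) MIME check has the final say.
    return mime_is_image(path)
```

Under this reading, turning fast mode off is what lets a file such as Judy.image be found, and it is also what makes every unknown file cost a MIME check.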

> Or alternatively we have to think of a way to bury all the options which could
> be assigned a sane default way deeper into the settings.

That could be a good way to go. The settings dialog could use some love, for sure. Maybe discuss this topic on the mailing list?