Bug 376661 - When importing ~200,000 video files Digikam crashes in about 2-5 seconds of starting.
Status: RESOLVED FIXED
Alias: None
Product: digikam
Classification: Unclassified
Component: Import-Albums
Version: 5.5.0
Platform: Microsoft Windows
Importance: NOR crash
Target Milestone: ---
Assignee: Digikam Developers
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-02-19 08:15 UTC by Poz
Modified: 2018-02-28 10:45 UTC

See Also:
Latest Commit:
Version Fixed In: 6.0.0


Description Poz 2017-02-19 08:15:24 UTC
I added a bunch of folders that contain ~200,000 video files and hit refresh to scan them into the database. Digikam crashes after about 2-5 seconds. This is repeatable. Digikam will not add 200,000 video files.
Comment 1 caulier.gilles 2017-02-19 09:55:52 UTC
Is this reproducible with the 5.5.0 pre-release?

https://drive.google.com/drive/folders/0BzeiVr-byqt5Y0tIRWVWelRJenM

Gilles Caulier
Comment 2 Poz 2017-02-20 17:39:49 UTC
Yes, same thing happens with 5.5.0pre release.
Comment 3 caulier.gilles 2017-02-20 17:44:01 UTC
Maik,

Which solution can we apply to fix this entry?

1/ Disable autocompletion in the tree search field, and ask the Qt team to open up the QCompleter API so that we can use methods that are currently private.
2/ Reuse KCompletion by backporting its classes into the digiKam core, with an API adjusted for digiKam.

Gilles
Comment 4 Maik Qualmann 2017-02-20 19:50:49 UTC
Gilles,

I think you mean Bug 368468. This bug has a different cause, possibly a crash in Exiv2.

Regarding Bug 368468:
The QCompleter is not the performance problem; that is fixed by a QTimer. The main problem is that adding items to the QTreeView gets slower and slower.

Maik
Comment 5 Maik Qualmann 2017-02-20 19:55:49 UTC
An edit function for the first minutes after the comment would not be bad...

Maik
Comment 6 caulier.gilles 2017-02-21 09:59:22 UTC
Poz,

We need a debugger backtrace to investigate in detail.

See this page for details:

https://www.digikam.org/contrib

Gilles Caulier
Comment 7 caulier.gilles 2017-02-21 10:02:20 UTC
Maik,

In comment #4 you mention that adding items to the QTreeView gets slower.

Where exactly is the problem located? Did you profile the execution time with Valgrind? Is it in the digiKam tree view item widget implementation, in the digiKam model populated from the DB, in the DB interface that fetches the data shown in the widget, or in the Qt5 implementation?

Gilles
Comment 8 caulier.gilles 2017-02-21 10:07:34 UTC
Maik,

At my office I wrote a fast shared memory mapping viewer in Qt5 using the QTreeView/item classes. I create the items in the tree view with no data, and I populate all items in a separate thread because it takes a lot of time.
At the end I call a tree view update in the main thread (X11 is not re-entrant). It's very fast, even though the number of items in the tree view is huge (more than 1000 entries).

Can we do the same in digiKam?

Gilles
Comment 9 Poz 2017-02-22 05:12:36 UTC
Still running 5.5.0pre.
Okay, so I went to https://www.digikam.org/contrib and tried a few things with limited success. I will try more tomorrow.
First, gdb on Windows is not working well. I type in 'catch throw' and get back 'Catchpoint 1 (throw)', which seems good. Then I type in 'run' and get back:
-
Starting program:
No executable specified, use `target exec'.
-
Not sure what to do here??

Second thing I tried is the third party debug tool from system internals:
https://technet.microsoft.com/en-us/sysinternals/bb896647.aspx
Looks like some bad stuff happening for about 10.2 seconds before it crashes:
00000009	1.02899146	[17040] digikam.general: Trying to load Embedded preview with libraw	
00000010	1.02921200	[17040] digikam.rawengine: Failed to load embedded RAW preview	
00000011	1.02923596	[17040] digikam.general: Trying to load half preview with libraw	
00000012	1.02927971	[17040] digikam.general: Trying to load Embedded preview with Exiv2	
00000013	1.04443121	[17040] digikam.dimg: "Removed file path and name"  : QIMAGE file identified	
00000014	1.04464126	[17040] digikam.dimg.qimage: Can not load " "Removed file path and name" " using DImg::QImageLoader!	
00000015	1.04492271	[17040] digikam.general: mimetype =  ""  ext =  "MOV"	
00000016	1.04507148	[17040] digikam.general: Cannot create thumbnail for  "Removed file path and name"	
00000017	1.04512084	[17040] digikam.general: Thumbnail is null for  "Removed file path and name"

I removed the file path and name for privacy reasons.
This repeats for various videos until the crash; it takes about 2/10ths of a second per loop? (Judging from the snippet I gave you.) The video file types are various: avi, flv, mov, mp4, and more. The example above is just a mov.

This happens before the loops start when I hit refresh:
00000005	0.91890234	[17040] digikam.general: Using  8  CPU core to run threads	
00000006	0.91933465	[17040] digikam.general: Action Thread run  1  new jobs	
00000007	0.93396312	[17040] digikam.general: Cancel Main Thread	
00000008	0.93400776	[17040] digikam.general: One job is done
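
The DebugView timestamps (second column, in seconds) can be compared programmatically rather than by eye. The sketch below is only an illustration: it assumes each line has the shape "sequence number, timestamp, [pid] message" as in the excerpts above, and the function names are made up for this example. For the nine-line excerpt, the first and last timestamps are about 0.016 s apart.

```python
import re

# One DebugView line: sequence number, seconds since capture start, "[pid] message".
LINE_RE = re.compile(r"^(\d+)\s+(\d+\.\d+)\s+\[(\d+)\]\s+(.*)$")

def parse_line(line):
    """Return (seq, seconds, pid, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return int(m.group(1)), float(m.group(2)), int(m.group(3)), m.group(4)

def elapsed_seconds(lines):
    """Time between the first and the last parsable line."""
    times = [p[1] for p in map(parse_line, lines) if p is not None]
    return times[-1] - times[0] if times else 0.0

# First and last line of the excerpt posted above.
sample = [
    "00000009\t1.02899146\t[17040] digikam.general: Trying to load Embedded preview with libraw",
    "00000017\t1.04512084\t[17040] digikam.general: Thumbnail is null for  \"Removed file path and name\"",
]
print(f"{elapsed_seconds(sample):.4f} s")
```

Running this over a full capture, split at each "Trying to load Embedded preview" line, would give a per-file duration instead of an estimate.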

I will try to get more info tomorrow.

Also, two other questions. I turned off the album sync at startup because it was crashing. How do I start it manually? I thought that is what refresh does, but apparently refresh only updates the thumbnails.
Also, is it possible to do the FUZZY search on the thumbnails to find potential duplicates? This is my real intent. I want to cut those 200,000 videos down to 100,000.
If not, is this a future feature? Can it be one? High demand, I think.
Comment 10 Poz 2017-02-23 04:29:24 UTC
Spent some more time trying to figure out how to provide more data. while running the debugger I also found this line:
[11624] digikam.metaengine: Exiv2 ( 3 ) :  Xmp.video.Metadata dataLength was found to be larger than 5000  entries considered invalid; not read.


If there is anything else I can do to help debug this, let me know! Thank you.
Comment 11 caulier.gilles 2017-02-23 07:53:49 UTC
The XMP warning is not the problem.

But it is known that Exiv2 has many problems with video files.

I recommend not trying to scan your huge collection in one go.

Start with a fresh database and add video files in chunks, step by step, until the crash appears. The goal is to isolate the file that triggers the problem.
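
The chunk-by-chunk isolation can be organised as a bisection, which keeps the number of import attempts small. This is only a sketch of the idea: `import_crashes` is a hypothetical callback standing in for "import this batch into a fresh database and see whether digiKam crashes", and the approach assumes a single file triggers the crash on its own.

```python
def find_crashing_file(files, import_crashes):
    """Narrow a list of files down to one whose import crashes.

    `import_crashes(batch)` is a hypothetical callback: it should import
    `batch` into a fresh database and return True if the import crashed.
    Assumes one bad file is enough to trigger the crash by itself.
    """
    if not files or not import_crashes(files):
        return None
    candidates = list(files)
    while len(candidates) > 1:
        half = len(candidates) // 2
        first, second = candidates[:half], candidates[half:]
        # Keep whichever half still reproduces the crash.
        candidates = first if import_crashes(first) else second
    return candidates[0]

# Made-up file names, with one pretend bad file for demonstration.
files = [f"video_{i:06d}.mov" for i in range(200_000)]
bad = "video_123456.mov"
print(find_crashing_file(files, lambda batch: bad in batch))
```

With 200,000 files, halving needs at most 18 rounds after the initial check, compared with thousands of attempts for small fixed-size chunks.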

After that, report the problem to the Exiv2 bugzilla with the identified video file for investigation. As the digiKam Windows installer includes the current Exiv2 source code, we can rebuild digiKam for Windows with the latest fix from Exiv2.

For your problem with GDB under Windows: if the command line version won't start digiKam (even though it works on my VM with Windows 7), you need to open a console and go to the directory where the gdb and digikam executables are installed (it's the same directory).

After that it's simple. See this generic page for details:

http://stackoverflow.com/questions/4671900/how-do-i-use-the-mingw-gdb-debugger-to-debug-a-c-program-in-windows
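
The "No executable specified" message appears when gdb is started without a program to debug. As a rough sketch, gdb can also be driven non-interactively with the program named on the command line. The options `-batch`, `-ex` and `--args` are standard gdb options; the executable name `digikam.exe` and the helper names are assumptions for this example.

```python
import subprocess

def gdb_backtrace_command(executable):
    """Build a gdb command line that runs `executable` and prints backtraces.

    Naming the program after --args is what avoids the
    "No executable specified" error seen in an empty gdb session.
    """
    return [
        "gdb", "-batch",
        "-ex", "catch throw",          # stop when a C++ exception is thrown
        "-ex", "run",                  # start the program
        "-ex", "thread apply all bt",  # print a backtrace for every thread
        "--args", executable,
    ]

def run_under_gdb(executable, workdir):
    """Run gdb from `workdir` (the folder holding gdb and digiKam) and return its output."""
    result = subprocess.run(gdb_backtrace_command(executable), cwd=workdir,
                            capture_output=True, text=True)
    return result.stdout + result.stderr

print(" ".join(gdb_backtrace_command("digikam.exe")))  # assumed executable name
```

The printed command line can equally be typed directly into the console from the installation directory.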

Gilles Caulier
Comment 12 caulier.gilles 2017-02-23 21:08:05 UTC
> Also is it possible to do the FUZZY search on the thumb nails to file
> potential duplicates? This is my real intent. I want to cut that 200,000
> videos down to 100,000.
> If not, is this a future feature? Can it be one? High demand I think.

Poz,

Fuzzy Search currently works only with still images.

A similar function for video would need an algorithm that creates a fingerprint from the first frame of the video, to be compared later against the DB.

This is how the fuzzy tool currently works: a simplified wavelet matrix is computed from the still image, and the matrices are compared with each other to find similarities.

For video we need a new matrix with the spatial information of the video. Not impossible, but complex to write and test.
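
To give a feel for what a wavelet-based signature looks like, here is a toy sketch. It is not digiKam's implementation: it uses a plain Haar transform on a small square grayscale matrix (side length a power of two), keeps the positions and signs of the largest coefficients, and compares two signatures by overlap. All names and the `keep` parameter are assumptions for illustration.

```python
def haar_1d(values):
    """Full 1-D Haar decomposition of a list whose length is a power of two."""
    out = list(values)
    n = len(out)
    while n > 1:
        half = n // 2
        avg = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        diff = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:n] = avg + diff
        n = half
    return out

def haar_2d(image):
    """Transform every row, then every column of the row-transformed matrix."""
    rows = [haar_1d(r) for r in image]
    cols = [haar_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def signature(image, keep=8):
    """Positions and signs of the `keep` largest coefficients, average term excluded."""
    coeffs = haar_2d(image)
    size = len(coeffs)
    flat = [(abs(coeffs[y][x]), y, x)
            for y in range(size) for x in range(size) if (y, x) != (0, 0)]
    flat.sort(reverse=True)
    return {(y, x, coeffs[y][x] > 0) for mag, y, x in flat[:keep] if mag > 0}

def similarity(sig_a, sig_b):
    """Fraction of shared signed coefficients; 1.0 means identical signatures."""
    if not sig_a and not sig_b:
        return 1.0
    return len(sig_a & sig_b) / max(len(sig_a), len(sig_b))

left_right = [[0, 0, 200, 200]] * 4          # dark left half, bright right half
brighter = [[10, 10, 210, 210]] * 4          # same picture, slightly brighter
top_bottom = [[0] * 4, [0] * 4, [200] * 4, [200] * 4]
print(similarity(signature(left_right), signature(brighter)))
print(similarity(signature(left_right), signature(top_bottom)))
```

Because the average term is excluded, a uniform brightness change leaves the signature untouched, while a structurally different picture shares no coefficients.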

Gilles Caulier
Comment 13 Poz 2017-02-24 00:55:38 UTC
Are the thumbnails not readily available to do the fuzzy search on? I know they are not the biggest but I think they are big enough, or if there is a setting to render them a slightly higher resolution... That is how I imagined it would work anyways, since the thumbnails would already be generated, half the work is already done to fuzzy search videos...
Comment 14 Mario Frank 2017-02-24 07:27:03 UTC
(In reply to Poz from comment #13)
> Are the thumbnails not readily available to do the fuzzy search on? I know
> they are not the biggest but I think they are big enough, or if there is a
> setting to render them a slightly higher resolution... That is how I
> imagined it would work anyways, since the thumbnails would already be
> generated, half the work is already done to fuzzy search videos...

Hey Poz,

Sadly, it is not this easy.
The fuzzy search creates a signature from images. This does not carry over to videos. Videos are considerably more complex, as the signature must be created uniformly for all videos. But if videos have black frames at the beginning, the search would lead to results which are, let's say, rubbish. The most stable way I see is to take the first frame from every video that is not plain, i.e. single-coloured. But this means we would have to generate images until we find the first appropriate frame. This would slow down the fingerprint generation significantly.
A stable implementation is not trivial here. I will think about a way more closely over the weekend.
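
The "first frame that is not plain" idea could look roughly like the following. This is a sketch under assumptions: frames are represented as 2-D lists of grayscale values, "plain" is approximated as "all pixels within a small tolerance of each other", and the `tolerance` and `max_frames` values are arbitrary.

```python
def is_plain(frame, tolerance=8):
    """A frame (2-D list of grayscale values) is plain if all pixels are nearly equal."""
    pixels = [p for row in frame for p in row]
    return max(pixels) - min(pixels) <= tolerance

def first_non_plain_frame(frames, max_frames=250):
    """Return (index, frame) of the first non-plain frame, or None.

    `frames` may be a lazy iterator, so decoding can stop as soon as a usable
    frame is found; `max_frames` caps the cost for videos that stay black.
    """
    for index, frame in enumerate(frames):
        if index >= max_frames:
            break
        if not is_plain(frame):
            return index, frame
    return None

black = [[0, 0], [0, 0]]
picture = [[0, 120], [200, 30]]
print(first_non_plain_frame(iter([black, black, picture])))
```

The cap matters for the slowdown mentioned above: without it, a video with a long black lead-in would be decoded frame by frame until something appears.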

Best,
Mario
Comment 15 Mario Frank 2017-02-24 08:32:48 UTC
Hi again,

This will be a quite long text - sorry. But I want to make the problems as clear as possible.

I thought about the fuzzy search for videos a bit more during my train travel.
In fact, even the first non-plain frame is worthless. If a user really wants to use digiKam as a catalog for videos (which is not the primary scope of digiKam, IMHO), he will potentially have videos that share the same beginning, i.e. intro, but are different videos. Thus, the first non-plain frame will also potentially lead to rubbish. I remember that I found some tools to find video duplicates. The process they applied was to take the first n images of a video and compare them to all others. A quite bad process IMO, as with m videos you generate n*m images and then have to make a comparison. This is awfully bad from the view of complexity theory. And in practice, this process is, as can be expected, awfully slow.

Nevertheless, this process is probably the best way to really recognise duplicate videos. So, a way could be to generate a fingerprint over the first or last n images (which slows down fingerprint generation extremely). This is still not robust, as many videos may have the same intro (at least the first m seconds, i.e. about m*25 frames). Usual intros take many seconds, so a *rather* stable approach would be to take 1000 frames. As you can imagine, this is a big amount of data to compute fingerprints for. Just imagine your 200,000 videos. Fingerprinting them would mean generating 200,000,000 images. Every image must be generated, which is not a constant-time process but at least linear time. So, even with 1000 videos, I would expect computation time to be measured in hours, not minutes.
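
The numbers above follow from a simple multiplication, spelled out here. The throughput of 100 frames per second used for the time estimate is purely an assumed figure for illustration, not a measurement.

```python
def frames_to_fingerprint(videos, frames_per_video):
    """Total number of frames that must be decoded and fingerprinted (n * m)."""
    return videos * frames_per_video

def hours_needed(total_frames, frames_per_second):
    """Wall-clock hours at a given sustained decode-and-fingerprint rate."""
    return total_frames / frames_per_second / 3600

total = frames_to_fingerprint(200_000, 1000)  # the full collection
small = frames_to_fingerprint(1000, 1000)     # the 1000-video case
print(total, small)

ASSUMED_RATE = 100  # frames per second, an arbitrary guess
print(round(hours_needed(small, ASSUMED_RATE), 1), "hours for 1000 videos")
print(round(hours_needed(total, ASSUMED_RATE) / 24, 1), "days for 200,000 videos")
```

At 25 frames per second, 1000 frames also correspond to only 40 seconds of video, which is why long shared intros remain a problem even at this cost.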

Let's take a look from the other side, outros are far more distinct than intros. So, a lower number n can be taken, e.g. 100. This reduces the time quite a lot. But is probably still not satisfying.

If no or only short intros/outros are there, only few images should be sufficient and the process could work quite good.

But we cannot predict how the videos are structured. The FPS count may/will differ from video to video, so working on frames explicitly may again lead to low-quality results. The best way would therefore be to take the first/last n seconds, and then the complexity cannot really be estimated.
Also, I think, users should decide themselves, how many seconds are taken (configuration) and if beginning or ending should be taken (configuration again). 

So, *if* this feature should be implemented, I see the following options for users:
1) Take the first non-plain frame for fingerprinting (fast, probably not suitable for e.g. cinema movies)
2) Take the n first seconds for fingerprinting (probably awfully slow, may be suitable for e.g. cinema movies, overkill for self-produced movies)
3) Take the n last seconds for fingerprinting (probably slow, probably suitable for e.g. cinema movies, less overkill for self-produced movies)

In a more precise algorithmic way, we would need an adaptation of the fingerprints maintenance stage:
Option 1: take the first non-plain frame for video fingerprints
Option 2: take the Option(number n) Option(first,last) seconds for video fingerprinting.
Changing the current options *must* trigger deletion of the current video fingerprints, as otherwise fingerprints
generated in different ways would coexist, which leads to wrong results - except when rebuilding all fingerprints is chosen.
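
The invalidation rule can be expressed compactly. The sketch below is hypothetical: the class, its field names, and the function are invented to mirror Option 1 and Option 2 and do not correspond to existing digiKam settings.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VideoFingerprintOptions:
    """Hypothetical settings mirroring Option 1 and Option 2 above."""
    mode: str = "first_non_plain_frame"  # or "seconds"
    seconds: int = 0                     # n, only meaningful when mode == "seconds"
    from_end: bool = False               # False: first n seconds, True: last n seconds

    def key(self):
        """Only the fields that actually influence the generated fingerprint."""
        if self.mode == "seconds":
            return (self.mode, self.seconds, self.from_end)
        return (self.mode,)

def must_drop_video_fingerprints(old, new, rebuild_all=False):
    """True if stored video fingerprints were produced with different settings.

    When a full rebuild was requested, nothing needs to be dropped separately
    because every fingerprint is regenerated anyway.
    """
    if rebuild_all:
        return False
    return old.key() != new.key()

old = VideoFingerprintOptions()
new = VideoFingerprintOptions(mode="seconds", seconds=30, from_end=True)
print(must_drop_video_fingerprints(old, new))
```

Comparing only the fields that matter avoids needless regeneration when, for example, the number of seconds is edited while the first-frame mode is active.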

Then, the fuzzy search could probably work without adaptations - but I am not completely sure if it would work out of the box.

Best,
Mario
Comment 16 caulier.gilles 2017-02-24 09:54:00 UTC
Mario,

At my office we capture infrared, fixed-shot sequences of events in a Tokamak to catch physical malfunctions during experiments.

A video can last more than 2 minutes in HD, but not much more. More than 20 experiments can be done in a day. All videos are stored losslessly in a database.

There are no camera movements. Only the plasma inside the machine changes the content. Depending on the experiment parameters, the video content will be different.

We have a process to recognize similar videos in the database. It is written in Matlab. As far as I know, the process cuts the first frames where there is nothing (black hole) until the light begins. After that, a wavelet fingerprint is computed from a flat image taken from some frames inside the video. The whole video is not analyzed; instead the algorithm tries to detect the edges of change and adjusts the fingerprint by parsing a section of the movie. This is how the spatial (temporal) dimension is processed.

For each file, the fingerprint gives the average similarity of the video compared to the others. When physicists want to look into experiments, they take a video made with given Tokamak settings and check whether another one is similar. The goal is to see whether physical events are similar even if the parameters are different.

Of course, it's a special use case, as the videos are static shots with changing content, but I think the process is not too bad if we want to apply it to a small section of DSC movies.

Note : I know just the theory. The code is not available of course.
Comment 17 Poz 2017-02-25 23:50:20 UTC
Wow the discussion here is fantastic. Thank you for the time and thought!

So yes, the approach I suggested of just using the thumbnails is clearly not robust enough given the wide array of video content out there.
I think a lot of the problems come from very uniform videos, for example standard intros or outros. My case has very non-uniform videos (without any intros or outros), where I can browse in Windows Explorer and find duplicates myself simply by looking at the thumbnails, so I know at least 20% are duplicates just from simple observation. The problem is that it is too much to go through that many files and click each one individually. I have used digiKam before on photos for duplicates and was amazed at how well it worked, so naturally I thought, 'man, I wish I could get digiKam to access these thumbnails for me, I could get rid of 95%+ of these duplicates in a day'. I know there could be false positives, but I could live with 1% or something like that. To further reduce false positives there could be a video length option of +-X seconds (default at 2 or something).
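
The +-X seconds idea works as a cheap pre-filter before any image comparison. A minimal sketch, assuming the video lengths are already known; the function name, the sample file names, and the durations are invented for this example.

```python
def plausible_pairs(durations, tolerance=2.0):
    """Yield pairs of videos whose lengths differ by at most `tolerance` seconds.

    `durations` maps a file name to its length in seconds. Sorting first means
    each video is only compared with neighbours of similar length.
    """
    ordered = sorted(durations.items(), key=lambda item: item[1])
    for i in range(len(ordered)):
        j = i + 1
        while j < len(ordered) and ordered[j][1] - ordered[i][1] <= tolerance:
            yield ordered[i][0], ordered[j][0]
            j += 1

# Invented sample data.
durations = {"a.mp4": 60.0, "b.avi": 61.5, "c.mov": 300.0, "d.flv": 62.5}
print(list(plausible_pairs(durations)))
```

Only the surviving pairs would then need a thumbnail or fingerprint comparison, which keeps the expensive step away from videos that cannot be duplicates by length alone.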

I currently use http://www.alldup.de/alldup_help/alldup.php
The content method works very well, I would say less than 0.001% false positives. But it misses a great deal. It can take up to 48 hours to run, but it builds a database so it only compares new files added to the search. I even use the file size method; for large files, this works very well. Smaller files (<10 MB?) tend to have more false positives. Unfortunately, due to different compression and file types, this does not catch them all either.

I think in the end, until computer hardware is faster, video duplicate searches will require a number of different methods and some user input. Until then, that is what we have to work with/around. I was just hoping for another way to slim down this video database. Thumbnails seemed like low-hanging fruit.
Comment 18 caulier.gilles 2017-03-11 13:57:28 UTC
Ok,

I disabled video metadata support in the Exiv2 shared library used with the Windows installer. The new version can be downloaded from the GDrive repository in a few minutes:

https://drive.google.com/drive/folders/0BzeiVr-byqt5Y0tIRWVWelRJenM

Can you reproduce the problem with this version?

Typically, the video files will be registered in the database, but video metadata will not be parsed to populate the database.

Thanks in advance for your feedback

Gilles Caulier
Comment 19 Poz 2017-03-11 20:10:29 UTC
I tried the version with disabled video metadata support in Exiv2 shared library that you just posted.
It allows me to import all of the video files! Success! However they all appear to be gray boxes with no thumbnails. Perhaps this is a separate issue?
Comment 20 caulier.gilles 2017-03-11 21:34:11 UTC
Typically no.

I recommend stopping digiKam, deleting the thumbnail-digikam.db file, and restarting.

Force a thumbnail rebuild with the F5 key when you are in an album with videos. Does this fix the problem?

Gilles Caulier
Comment 21 Poz 2017-03-12 01:39:12 UTC
I closed Digi, deleted thumbnail-digikam.db, and started Digi again. I hit F5 and it rebuilt the thumbnails in a few minutes. Everything flashed, and then there were still only gray video boxes.
Comment 22 caulier.gilles 2017-03-12 07:46:21 UTC
Ok,

I suspect a missing ffmpeg codec for your videos.

Can you share some video samples through the cloud to reproduce the problem?

Can you install the DebugView program, run digiKam, and press F5 in a video album? DebugView will capture all debug statements from digiKam. See this page for details:

https://www.digikam.org/contrib

Gilles Caulier
Comment 23 Poz 2017-03-12 21:37:11 UTC
These are the codecs I have installed: https://www.codecguide.com/download_kl.htm
The mega version and updated.

I will look into sharing some video samples through the cloud to reproduce the problem. However, I believe it is a sheer numbers problem, as sometimes a few thumbnails load; even up to 20-30 videos show thumbnails. As soon as I scroll, they flip back to gray. There is no particular video in the group that is causing the issue.

Here is the output from debugview:
I edited the path names and video locations for privacy reasons. Also, this is a snippet from starting digiKam with the thumbnail database removed and after hitting F5. For each video file it simply repeats, making a very large file where the only thing that differs is the file name.

http://pastebin.com/VEwQHS4x

Let me know if you have trouble viewing that pastebin and I will copy the text here.
Comment 24 caulier.gilles 2017-03-12 21:41:47 UTC
No. The codecs that you have installed are not used by digiKam.

We compile the ffmpeg codecs for the QtAV player used in digiKam.

This kind of error is explicit:

avcore\npsm\localprovider\baseprovider\lib\baseprovider.cpp(604)\NPSMDesktopProvider.dll!00007FFA59332140: (caller: 00007FFA593326E5) ReturnHr(497) tid(2de8) 80070490 Element not found.  

A codec is missing for your video. avcore comes from the libav codecs in ffmpeg.

Exactly which video types do you use?

Gilles Caulier
Comment 25 Poz 2017-03-12 21:51:25 UTC
The video file types are various: avi, flv, mov, mp4, wmv, and more... A big mix.

Sorry, I do have those errors as well, I thought they were part of a different problem I am having with Oculus Rift cameras because they occur at about the same time.

00000351	2.07417393	[10636] avcore\npsm\localprovider\baseprovider\lib\baseprovider.cpp(604)\NPSMDesktopProvider.dll!00007FFA59332140: (caller: 00007FFA593326E5) ReturnHr(495) tid(2de8) 80070490 Element not found. 	
00000352	2.07450104	[10636] shell\explorer\taskband2\taskband2.cpp(4148)\explorer.exe!00007FF6BC80792A: (caller: 00007FFA75E67DE3) ReturnHr(588) tid(2e28) 80004005 Unspecified error 	
00000353	5.56119251	[9784] shell\lib\bindctx.cpp(128)\explorerframe.dll!00007FFA3647F200: (caller: 00007FFA364A24EA) ReturnHr(21) tid(828) 80070057 The parameter is incorrect. 	
00000354	5.56124878	[9784] shell\lib\bindctx.cpp(128)\explorerframe.dll!00007FFA3647F200: (caller: 00007FFA364A24EA) ReturnHr(22) tid(bf0) 80070057 The parameter is incorrect. 	
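
One way to check which process emitted a given message is to group the DebugView lines by the process ID in square brackets. The sketch below is an illustration only and its function name is made up. In the excerpt above, PID 10636 appears both on the NPSMDesktopProvider.dll line and on a line naming explorer.exe, which suggests, without proving it, that those two messages come from the same process.

```python
import re
from collections import defaultdict

PID_RE = re.compile(r"\[(\d+)\]\s+(.*)$")

def group_by_pid(lines):
    """Map each process ID to the list of messages logged under it."""
    groups = defaultdict(list)
    for line in lines:
        m = PID_RE.search(line)
        if m:
            groups[int(m.group(1))].append(m.group(2).strip())
    return dict(groups)

# Three lines copied from the excerpt above (raw strings keep the backslashes).
lines = [
    r"00000351 2.07417393 [10636] avcore\npsm\localprovider\baseprovider\lib\baseprovider.cpp(604)\NPSMDesktopProvider.dll!00007FFA59332140: (caller: 00007FFA593326E5) ReturnHr(495) tid(2de8) 80070490 Element not found.",
    r"00000352 2.07450104 [10636] shell\explorer\taskband2\taskband2.cpp(4148)\explorer.exe!00007FF6BC80792A: (caller: 00007FFA75E67DE3) ReturnHr(588) tid(2e28) 80004005 Unspecified error",
    r"00000353 5.56119251 [9784] shell\lib\bindctx.cpp(128)\explorerframe.dll!00007FFA3647F200: (caller: 00007FFA364A24EA) ReturnHr(21) tid(828) 80070057 The parameter is incorrect.",
]
for pid, messages in group_by_pid(lines).items():
    print(pid, len(messages))
```

Comparing these PIDs with the one shown on the digikam.* lines of the same capture would show whether the messages belong to digiKam at all.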

How do I ensure I have the codecs installed correctly?
Comment 26 caulier.gilles 2017-03-12 21:53:58 UTC
The codecs are in digiKam, included at compilation time.

There is no way for the moment to know which codecs are available. I know that only GPL2 licensed codecs are installed. No GPL3 and no patented codecs are compiled for legal reasons.

Gilles Caulier
Comment 27 caulier.gilles 2017-04-16 20:19:09 UTC
A new 5.6.0 pre-release bundle is available here:

https://drive.google.com/drive/folders/0BzeiVr-byqt5Y0tIRWVWelRJenM

Please check if this problem is still reproducible with this version.

Thanks in advance

Gilles Caulier
Comment 28 caulier.gilles 2017-06-22 21:42:41 UTC
digiKam 5.6.0 is now released and available as bundle for Linux, MacOS and Windows.

https://www.digikam.org/news/2017-06-21-5.6.0-release-announcement/

Can you check if problem still exists with this version ?

Thanks in advance

Gilles Caulier
Comment 29 caulier.gilles 2017-07-23 18:27:50 UTC
New digiKam 5.7.0 pre-release bundles have been built from the current implementation:

https://drive.google.com/drive/folders/0BzeiVr-byqt5Y0tIRWVWelRJenM

Is the problem still reproducible?
Comment 30 caulier.gilles 2018-02-28 10:45:59 UTC
With 6.0.0, we now have a low-level FFmpeg metadata parser based on the libav C API for registering video files in the database.

The Exiv2 video support is not used anymore, as this code is buggy and nobody in the Exiv2 team seems motivated to finalize it.

The problem originally reported here should be fixed now, and video metadata support with ffmpeg should be enough to populate database entries.

Gilles Caulier