Bug 425723

Summary: Digikam crashes after a while when detecting faces using multiple CPU cores
Product: [Applications] digikam
Reporter: Simon Westersund <simon.westersund>
Component: Faces-Detection
Assignee: Digikam Developers <digikam-bugs-null>
Status: RESOLVED FIXED
Severity: crash
CC: caulier.gilles, metzpinguin
Priority: NOR
Version: 7.0.0
Target Milestone: ---
Platform: Manjaro
OS: Linux
Latest Commit:
Version Fixed In: 7.2.0
Sentry Crash Report:
Attachments:
gdb backtrace after the crash
gdb backtrace with digikam built from source
gdb 'thread apply all backtrace' with digikam built from source
single-threaded gdb backtrace with digikam built from source
gdb backtrace with patch b3baed6
gdb backtrace with latest master 00883ba
gdb backtrace with gsoc20-facesengine-recognition 477fd19
gdb backtrace with gsoc20 branch + b61d547a patch
gdb backtrace with gsoc20 branch + b61d547a and b3baed64 patches
gdb backtrace with gsoc20 branch + latest master 2fb8679
gdb backtrace with latest master 5102e32
gdb backtrace with latest gsoc20 branch c1861ba
gdb backtrace with latest master fb76325
gdb backtrace with latest master a06b3c5

Description Simon Westersund 2020-08-23 19:34:32 UTC
Created attachment 131132 [details]
gdb backtrace after the crash

SUMMARY

Digikam crashes when detecting faces using multiple CPU cores

STEPS TO REPRODUCE

Not sure if this is reproducible by all, but this is what I did.

1. Add a large collection of photos (over 27000 JPEG files) from an external hard-drive.
2. In the People tab, change the setting "Work on all processor cores" to true.
3. For workflow, use "Detect faces" and "Skip images already scanned"
4. Then start the scanning by clicking "Scan collection for faces".

OBSERVED RESULT

Digikam crashes after a while. The GUI showed nothing helpful. The window died.

EXPECTED RESULT

Digikam should not crash. If there are problems with a photo, or something else, it should probably just be skipped, so that the scan can complete.


SOFTWARE/OS VERSIONS
Windows: -
macOS: -
Linux/KDE Plasma: Linux kernel 5.8.1-3-MANJARO
(available in About System)
KDE Plasma Version: 5.19.4
KDE Frameworks Version: 5.73.0
Qt Version: 5.15.0


ADDITIONAL INFORMATION

The scan of my collection crashed digikam several times, so I restarted and tried again. Some scans could proceed further than others. E.g. the first scan crashed after going through less than 5% of the collection. The other scans covered some 20-30% each.

The crash comes from a segmentation fault. On the latest rerun, I used GDB to get a backtrace. See the backtrace attached. 

The crash did not seem to be due to memory pressure. I have 16 GB of memory and during the scan, the whole system only used around 4 GB.

I have not yet tested scanning without the "Work on all processor cores" option.
Comment 1 Maik Qualmann 2020-08-23 20:02:24 UTC
The backtrace is missing. After the crash, type "bt" and press the Return key in the terminal.

Maik
Comment 2 Simon Westersund 2020-08-23 20:44:31 UTC
@Maik, thanks for pointing this out. I tried to follow the instructions here: https://community.kde.org/Guidelines_and_HOWTOs/Debugging/How_to_create_useful_crash_reports#Retrieving_a_backtrace_with_GDB and only ran "thread apply all backtrace" and posted the output of that into the attachment.

I will rerun the detection another day and provide a better backtrace.
Comment 3 Simon Westersund 2020-08-24 20:04:26 UTC
Created attachment 131155 [details]
gdb backtrace with digikam built from source

I decided to try and compile digikam from source to provide the best possible backtrace.
So I built digikam from the GitLab sources' master branch (commit fa696a690fccf81a3b1ff70f1271e1cb0b19339e) with the bootstrap.local script. I made the following changes to the bootstrap.linux options, to avoid a build problem with a test, which I encountered:
 -DBUILD_TESTING=OFF
 -DENABLE_MYSQLSUPPORT=OFF
 -DENABLE_INTERNALMYSQL=OFF

If you need any extra info, please let me know :)

The backtrace is from a crash where I used the "Scan again and merge results" and "Work on all processor cores" options on the same collection that I scanned previously. The crash appeared at 6% completion.
Comment 4 Simon Westersund 2020-08-24 20:06:33 UTC
Created attachment 131156 [details]
gdb 'thread apply all backtrace' with digikam built from source
Comment 5 Simon Westersund 2020-08-24 20:09:36 UTC
Comment on attachment 131156 [details]
gdb 'thread apply all backtrace' with digikam built from source

This is the full print from when the face detection crashed and I ran 'thread apply all backtrace' in gdb.
Comment 6 Simon Westersund 2020-08-24 20:33:24 UTC
Created attachment 131158 [details]
single-threaded gdb backtrace with digikam built from source

I tested without the "Work on all processor cores" option too, and still got the crash, so it might not even be the multi-threading which is the cause? Or might it be my build which is broken?
Comment 7 Maik Qualmann 2020-08-24 20:39:38 UTC
Git commit b3baed64731a75097d2a50f53691c8c954eac26a by Maik Qualmann.
Committed on 24/08/2020 at 20:38.
Pushed by mqualmann into branch 'master'.

try without loading notification

M  +4    -0    core/libs/threadimageio/fileio/loadingcache.cpp

https://invent.kde.org/graphics/digikam/commit/b3baed64731a75097d2a50f53691c8c954eac26a
Comment 8 Maik Qualmann 2020-08-24 20:41:49 UTC
These backtraces are very well known to us. What is not clear is the cause, and why we cannot reproduce the crash. Please try the commit to see whether the problem can still be reproduced.

Maik
Comment 9 Simon Westersund 2020-08-25 18:31:55 UTC
Created attachment 131176 [details]
gdb backtrace with patch b3baed6

Thanks for your patch suggestion! I applied the patch (b3baed64731a75097d2a50f53691c8c954eac26a), rebuilt, and reran the face detection.

This time the face detection crashed at 8% instead of yesterday's 6%, so maybe something improved? Assuming, of course, that the face detection happens in roughly the same order each time :)

I attached a new backtrace from the crash.
Comment 10 Simon Westersund 2020-08-25 18:54:28 UTC
Created attachment 131178 [details]
gdb backtrace with latest master 00883ba

I went ahead and built the latest master branch (up until commit 00883ba716b80cbd0c419278a04ba5612c367e38), and that also crashes. This time at 6% again.
Comment 11 caulier.gilles 2020-08-25 18:56:12 UTC
Simon,

I am currently working with the student who is improving the face engine for the detection and recognition workflows.

For detection, the student, who develops under Ubuntu, has introduced plenty of mutexes to protect all OpenCV calls in the algorithm, as the relevant OpenCV API is not re-entrant.
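
As an illustration of the pattern described above (serializing calls into an API that is not re-entrant), here is a minimal C++ sketch. It is not the code from the student branch: FakeDetector, GuardedDetector, and scanWithWorkers are hypothetical names, the detector is a stand-in rather than a real OpenCV call, and std::mutex is used instead of Qt's mutex classes only to keep the example self-contained.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a detector whose detect() is not re-entrant:
// every call mutates shared scratch state.
class FakeDetector
{
public:
    int detect(int imageId)
    {
        m_scratch = imageId;      // shared state: unsafe if two threads interleave here
        ++m_calls;
        return m_scratch % 3;     // pretend this is the number of faces found
    }

    int calls() const { return m_calls; }

private:
    int m_scratch = 0;
    int m_calls   = 0;
};

// Wrapper that serializes every call into the detector with one mutex.
class GuardedDetector
{
public:
    int detect(int imageId)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_detector.detect(imageId);
    }

    int calls()
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_detector.calls();
    }

private:
    std::mutex   m_mutex;
    FakeDetector m_detector;
};

// Runs `threads` workers that each scan `perThread` images and
// returns the total number of detect() calls that were recorded.
int scanWithWorkers(int threads, int perThread)
{
    assert(threads > 0 && perThread >= 0);

    GuardedDetector detector;
    std::vector<std::thread> pool;

    for (int t = 0; t < threads; ++t)
    {
        pool.emplace_back([&detector, t, perThread]()
        {
            for (int i = 0; i < perThread; ++i)
            {
                detector.detect(t * perThread + i);
            }
        });
    }

    for (std::thread& worker : pool)
    {
        worker.join();
    }

    return detector.calls();
}
```

Calling scanWithWorkers(4, 1000) from main() runs four workers against one guarded detector. The trade-off of this approach is that the guarded call itself no longer runs in parallel; only the work done outside the lock does.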

The student works on a dedicated branch. The branch is stable and works like a charm for detection. I parsed my huge collection without any crash, and it detected more than 30,000 faces to identify.

So perhaps your crash is fixed in this development branch, named "gsoc20-facesengine-recognition".

It will be interesting to see whether this code still crashes on your computer. To switch from git master to the student branch, you can proceed like this:

git checkout -b gsoc20-facesengine-recognition remotes/origin/gsoc20-facesengine-recognition

Git will checkout the branch. Reconfigure, recompile and reinstall. Test and report.

To switch back to git master, just run "git checkout master". To go back to the development branch, run "git checkout gsoc20-facesengine-recognition"

Thanks in advance

Gilles Caulier
Comment 12 Simon Westersund 2020-08-25 19:45:25 UTC
Created attachment 131182 [details]
gdb backtrace with gsoc20-facesengine-recognition 477fd19

Thanks for the suggestion Gilles!

I did what you suggested and built the gsoc20-facesengine-recognition branch. Sadly, this branch also crashes for me. However, the backtrace looks more informative this time, so hopefully it will help you and the GSOC student :)

The crash occurred around 7% completion this time.
Comment 13 Maik Qualmann 2020-08-25 20:03:46 UTC
Git commit b61d547aac23ba68ee355d24960c1f148ece23d2 by Maik Qualmann.
Committed on 25/08/2020 at 20:02.
Pushed by mqualmann into branch 'master'.

move static functions from private shared data to a separate class

M  +7    -7    core/libs/dimg/dimg_fileio.cpp
M  +27   -20   core/libs/dimg/dimg_p.h
M  +2    -2    core/libs/dimg/dimg_props.cpp
M  +0    -4    core/libs/threadimageio/fileio/loadingcache.cpp

https://invent.kde.org/graphics/digikam/commit/b61d547aac23ba68ee355d24960c1f148ece23d2
Comment 14 Simon Westersund 2020-08-25 20:48:14 UTC
Created attachment 131183 [details]
gdb backtrace with gsoc20 branch + b61d547a patch

I applied the patch from Maik's comment, and now I got the crash with the pure virtual method call again. I could disable the notifications with the previously mentioned patch and try again.
Comment 15 Simon Westersund 2020-08-25 21:15:43 UTC
Created attachment 131186 [details]
gdb backtrace with gsoc20 branch + b61d547a and b3baed64 patches

So I tried to apply Maik's earlier patch to get around the pure virtual method call, but now it seems like I got a deadlock situation, so that's no good. Basically, digikam did nothing for 3-4 minutes when it reached around 9%. CPU usage was down to "nothing".
Anyway, this was just my experiment, so the result can be disregarded if you like :) I'll attach my backtrace with the interrupted execution.
Comment 16 Maik Qualmann 2020-08-26 17:15:57 UTC
Git commit 0308e3639955c0a07a9ec7100cdc6527841b6a12 by Maik Qualmann.
Committed on 26/08/2020 at 17:14.
Pushed by mqualmann into branch 'master'.

check if there is a loading thread

M  +1    -2    core/libs/dimg/dimg_p.h
M  +5    -2    core/libs/threadimageio/fileio/loadsavetask.cpp

https://invent.kde.org/graphics/digikam/commit/0308e3639955c0a07a9ec7100cdc6527841b6a12
Comment 17 Simon Westersund 2020-09-01 19:15:03 UTC
Created attachment 131350 [details]
gdb backtrace with gsoc20 branch + latest master 2fb8679

I tried this again with the gsoc20-facesengine-recognition branch + merged the latest master (2fb8679) on top of it. I still get the same pure virtual function call, see the attached backtrace.

If I'm testing this the wrong way, please let me know!
Comment 18 Maik Qualmann 2020-09-01 20:08:46 UTC
Git commit dd40e0f1912277dfef3967b56f32d44c01c05874 by Maik Qualmann.
Committed on 01/09/2020 at 20:07.
Pushed by mqualmann into branch 'master'.

protect loading QMap with a recursive mutex

M  +11   -1    core/libs/threadimageio/fileio/loadingcache.cpp

https://invent.kde.org/graphics/digikam/commit/dd40e0f1912277dfef3967b56f32d44c01c05874
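
The commit message above mentions protecting a map with a recursive mutex. The following is a minimal sketch of that general technique, not the contents of the commit: LoadingRegistry and its members are hypothetical names, and std::map with std::recursive_mutex stands in for the Qt classes so that the example is self-contained. A recursive mutex matters when a method that already holds the lock calls another method that locks again on the same thread.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <string>

// Hypothetical registry of in-flight loads, keyed by a cache key.
class LoadingRegistry
{
public:
    void add(const std::string& key, int processId)
    {
        std::lock_guard<std::recursive_mutex> lock(m_mutex);
        m_loading[key] = processId;
    }

    void remove(const std::string& key)
    {
        std::lock_guard<std::recursive_mutex> lock(m_mutex);
        m_loading.erase(key);
    }

    bool contains(const std::string& key)
    {
        std::lock_guard<std::recursive_mutex> lock(m_mutex);
        return m_loading.count(key) != 0;
    }

    // Takes the lock, then calls contains() and add(), which lock again on the
    // same thread. With a plain std::mutex this re-entry would be undefined
    // behavior (in practice usually a deadlock).
    bool addIfAbsent(const std::string& key, int processId)
    {
        std::lock_guard<std::recursive_mutex> lock(m_mutex);

        if (contains(key))
        {
            return false;
        }

        add(key, processId);
        return true;
    }

    std::size_t size()
    {
        std::lock_guard<std::recursive_mutex> lock(m_mutex);
        return m_loading.size();
    }

private:
    std::recursive_mutex       m_mutex;
    std::map<std::string, int> m_loading;
};
```

The lock only protects the map itself. It does not keep an object that an entry refers to alive after the entry has been read, so lifetime has to be handled separately.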
Comment 19 Simon Westersund 2020-09-07 19:50:34 UTC
Created attachment 131477 [details]
gdb backtrace with latest master 5102e32

I retested the master branch. This time my collection was processed until 26% and then crashed, so some improvements can definitely be seen compared to the earlier crashes :) I'll also test the latest gsoc20-facesengine-recognition branch.
Comment 20 Maik Qualmann 2020-09-07 19:56:31 UTC
There is no more development in the gsoc20-facesengine-recognition branch; everything has been added to the master branch.

Maik
Comment 21 Maik Qualmann 2020-09-07 20:07:24 UTC
Git commit 7619d77c98e3b3f955274805ba6720cd88fd714d by Maik Qualmann.
Committed on 07/09/2020 at 20:06.
Pushed by mqualmann into branch 'master'.

disable for a test ImageMagick

M  +4    -0    core/dplugins/dimg/imagemagick/dimgimagemagickplugin.cpp

https://invent.kde.org/graphics/digikam/commit/7619d77c98e3b3f955274805ba6720cd88fd714d
Comment 22 Simon Westersund 2020-09-07 21:27:10 UTC
Created attachment 131483 [details]
gdb backtrace with latest gsoc20 branch c1861ba

Thank you for the information Maik! I read your comment too late, so I tested it anyway.
The backtrace was quite different from the other ones. Apparently there was a malloc() error this time. The crash came at around 58% completion of my collection.
Comment 23 caulier.gilles 2020-09-08 03:50:13 UTC
Hi Simon,

Please also read the story from this bug:

https://bugs.kde.org/show_bug.cgi?id=426175

This touches code from git/master, for the next 7.2.0 release.

Best

Gilles Caulier
Comment 24 Maik Qualmann 2020-09-08 06:09:30 UTC
Right, in this function we call a pure virtual function, the "LoadingProcess() = 0". This LoadingProcess is overridden by a SharedLoadingTask. So why do we end up in the pure virtual function in this crash? Sorry, something is not right with the vtables on Ubuntu. By the way, this function is called constantly during the image loading process, and only sometimes is there a crash.

0x00007ffff40e2b65 in __cxxabiv1::__cxa_pure_virtual() () at /build/gcc/src/gcc/libstdc++-v3/libsupc++/pure.cc:50
#6  0x00007ffff648391f in Digikam::LoadingCache::notifyNewLoadingProcess(Digikam::LoadingProcess*, Digikam::LoadingDescription const&) (this=0x555555dbf6c0, process=0x5555e8c8d3a0, description=...)
    at /home/simon/Development/kde/digikam/core/libs/threadimageio/fileio/loadingcache.cpp:264
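
For context, one standard C++ way to reach a base-class virtual slot even though a derived class overrides it is to make the call while the object is being constructed or destroyed (calling through a pointer to an already destroyed object is undefined behavior and can look the same). Whether that is what happens here is not established by this backtrace. The sketch below only demonstrates the language rule: Process and SharedTask are hypothetical names, and the base virtual is given a body instead of "= 0" so that the effect can be observed without aborting.

```cpp
#include <cassert>
#include <string>

// Hypothetical base class. If name() were pure virtual, the call in the
// destructor below would end in "pure virtual method called" instead.
class Process
{
public:
    explicit Process(std::string* log) : m_log(log) {}

    virtual ~Process()
    {
        // The derived part is already destroyed at this point, so virtual
        // dispatch resolves to Process::name(), not SharedTask::name().
        *m_log = name();
    }

    virtual std::string name() const { return "base"; }

private:
    std::string* m_log;
};

class SharedTask : public Process
{
public:
    using Process::Process;

    std::string name() const override { return "derived"; }
};

// What name() resolves to through a base reference on a live object.
std::string nameWhileAlive()
{
    std::string log;
    SharedTask task(&log);
    const Process& ref = task;

    return ref.name();
}

// What name() resolves to when called from inside ~Process().
std::string nameDuringDestruction()
{
    std::string log;

    {
        SharedTask task(&log);
    }   // ~SharedTask() runs first, then ~Process() writes into log

    return log;
}
```

On a live object the override is used; inside the base destructor the base version is used. A second thread that calls a virtual on a task while that task is being torn down can therefore land in the base slot.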

Maik
Comment 25 Maik Qualmann 2020-09-08 10:40:11 UTC
Git commit cde3403938b82e0cfcccb1b557b6c2319ac8557e by Maik Qualmann.
Committed on 08/09/2020 at 10:37.
Pushed by mqualmann into branch 'master'.

add base classes initialization explicit in the constructor
Related: bug 423632, bug 426175

M  +2    -0    core/libs/threadimageio/fileio/loadsavetask.cpp
M  +2    -0    core/libs/threadimageio/fileio/loadsavetask.h

https://invent.kde.org/graphics/digikam/commit/cde3403938b82e0cfcccb1b557b6c2319ac8557e
Comment 26 Simon Westersund 2020-09-08 19:16:28 UTC
(In reply to caulier.gilles from comment #23)
> Hi Simon,
> 
> Please also read the story from this bug:
> 
> https://bugs.kde.org/show_bug.cgi?id=426175
> 
> This touches code from git/master, for the next 7.2.0 release.
> 
> Best
> 
> Gilles Caulier

Hi Gilles,

I read through the bug report you referred to. I have not encountered any noticeable memory leaks on my system. It's possible that there has been something minor, but I just haven't noticed since I have 16 GB of memory available.

Note that my system is Manjaro KDE (and not Ubuntu). I received a bunch of updates today, so I can't say what versions I had before, but now my system opencv is at 4.4.0-1.

If you would like me to test something more specific than the face detection, which I've been doing so far, do let me know! :)
Comment 27 Simon Westersund 2020-09-08 19:20:24 UTC
Created attachment 131491 [details]
gdb backtrace with latest master fb76325

I tested with the latest master, which includes the latest fixes which Maik referred to. This time the execution crashed within 30 seconds, so there seems to have been some regression.
Comment 28 Maik Qualmann 2020-09-08 20:06:22 UTC
Git commit a06b3c5dcc32a2f95c5fbf0ac2fa898931524cea by Maik Qualmann.
Committed on 08/09/2020 at 20:05.
Pushed by mqualmann into branch 'master'.

add static cast for loading notifikation
Related: bug 423632, bug 426175

M  +2    -1    core/libs/threadimageio/fileio/loadingcache.cpp

https://invent.kde.org/graphics/digikam/commit/a06b3c5dcc32a2f95c5fbf0ac2fa898931524cea
Comment 29 Simon Westersund 2020-09-09 05:01:03 UTC
Created attachment 131500 [details]
gdb backtrace with latest master a06b3c5

With the a06b3c5 commit from yesterday the face detection was able to proceed for a longer time again. I don't know an exact percentage, since I left digikam running on its own, but it was at least over 30%.

I noticed the "warning: Source file is more recent than executable." message in the backtrace. It is probably there because I did some branch switching while gdb was running. I can guarantee that the file digikam/core/libs/threadimageio/fileio/loadingcache.cpp, which the backtrace points to, was not modified while gdb was running.
Comment 30 Maik Qualmann 2020-09-09 05:57:34 UTC
Ok, my guess is that the pointer we call from the QMap is no longer valid and the task has already been deleted...
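
To make the suspected failure mode concrete: if a map stores raw pointers to tasks, a notification can dereference an entry whose task has been deleted in the meantime. One possible way to guard against that, shown here as a minimal sketch and not as the fix that was applied in digiKam, is to store weak references and check them before use. Task, TaskRegistry, and their members are hypothetical names, and the standard library is used instead of Qt to keep the example self-contained.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <string>

// Hypothetical task that counts how often it was notified.
struct Task
{
    int notified = 0;

    void notify() { ++notified; }
};

class TaskRegistry
{
public:
    void add(const std::string& key, const std::shared_ptr<Task>& task)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_tasks[key] = task;      // stores a weak reference only
    }

    // Returns true if a live task was notified, and false if the key is
    // unknown or the task has already been deleted.
    bool notify(const std::string& key)
    {
        std::shared_ptr<Task> alive;

        {
            std::lock_guard<std::mutex> lock(m_mutex);
            auto it = m_tasks.find(key);

            if (it == m_tasks.end())
            {
                return false;
            }

            alive = it->second.lock();   // empty if the task is gone

            if (!alive)
            {
                m_tasks.erase(it);       // drop the stale entry
                return false;
            }
        }

        // The shared_ptr keeps the task alive for the duration of the call.
        alive->notify();
        return true;
    }

private:
    std::mutex                                 m_mutex;
    std::map<std::string, std::weak_ptr<Task>> m_tasks;
};
```

With this layout a deleted task shows up as an expired entry that is skipped and removed, instead of as a dangling pointer.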

Maik
Comment 31 Maik Qualmann 2020-09-09 19:57:38 UTC
Git commit 901227fa96db807e02b71a84c933d34b97ce3ec3 by Maik Qualmann.
Committed on 09/09/2020 at 19:56.
Pushed by mqualmann into branch 'master'.

changes to setStatus() function
Related: bug 423632, bug 426175

M  +1    -1    core/libs/threadimageio/fileio/loadingcache.cpp
M  +19   -12   core/libs/threadimageio/fileio/loadsavetask.cpp
M  +1    -2    core/libs/threadimageio/thumb/thumbnailtask.cpp

https://invent.kde.org/graphics/digikam/commit/901227fa96db807e02b71a84c933d34b97ce3ec3
Comment 32 Maik Qualmann 2020-10-03 19:33:32 UTC
Fixed with bug 426175.

Maik