Bug 496691 - duplicate search has become much slower on the release version of digikam 8.5
Summary: duplicate search has become much slower on the release version of digikam 8.5
Status: RESOLVED FIXED
Alias: None
Product: digikam
Classification: Applications
Component: Searches-Similarity
Version: 8.5.0
Platform: Arch Linux (Linux)
Importance: NOR normal
Target Milestone: ---
Assignee: Digikam Developers
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-11-26 02:40 UTC by garfer
Modified: 2024-12-20 04:09 UTC
CC: 3 users

See Also:
Latest Commit:
Version Fixed In: 8.6.0
Sentry Crash Report:


Description garfer 2024-11-26 02:40:20 UTC
SUMMARY
duplicate search has become much slower on the release version of digikam 8.5

STEPS TO REPRODUCE
1. Right-click on an album.
2. Select "Find duplicates".

OBSERVED RESULT
The search takes much longer than on previous versions of digiKam, up to at least the weekly build "8.5.0-20240831T112634" (that is the latest weekly build I have for 8.5; some of the later ones might still be fine).

EXPECTED RESULT
A much faster duplicate search, on par with previous versions.

SOFTWARE/OS VERSIONS
KDE Plasma Version: 6.2.3
KDE Frameworks Version: 6.8.0
Qt Version: 6.8.0

ADDITIONAL INFORMATION

Using an NVIDIA GPU.

So far I have tested the repo, Flatpak, and AppImage versions of 8.5.0, plus the latest weekly build of 8.6.0, both with SQLite and MariaDB, all with the same results. I get no errors in the terminal output, and the only obvious visible difference between the old and new versions is that the new versions show thumbnails of the images being processed in the expanded progress bar (not sure if that is relevant, since turning the status bar off does not help).

On the same system with the same settings, database, and duplicate-search parameters, all versions of digiKam up to at least the weekly build 8.5.0-20240831T112634 are a lot faster; a duplicate search that takes only a couple of seconds on the old versions can take over 10 minutes on the new ones.

Also, if instead of using the "Find duplicates" tool I right-click on an image and use the "Find similar" tool, the speed seems to be on par with the old versions.
Comment 1 garfer 2024-11-26 03:21:42 UTC
Doing a debug trace on the newer versions, I get statements such as:

digikam.dimg.jpeg: RST0  ( 3 )
digikam.dimg.jpeg: RST1  ( 3 )
digikam.dimg.jpeg: RST2  ( 3 )
digikam.dimg.jpeg: RST3  ( 3 )
digikam.dimg.jpeg: RST4  ( 3 )
digikam.dimg.jpeg: RST5  ( 3 )
digikam.dimg.jpeg: RST6  ( 3 )
digikam.dimg.jpeg: RST7  ( 3 )

This does not happen with the older versions.
Comment 2 garfer 2024-11-26 03:54:30 UTC
I think I found the issue; sorry for the inconvenience.

Looking at the debug trace, it seemed that digiKam was recalculating the pHash of the images, and the sound of my HDD confirmed it. Sure enough, recalculating the fingerprints seems to solve the issue.
Comment 3 garfer 2024-12-12 23:07:08 UTC
I was wrong; the issue returns after closing and reopening the program. That is to say, if I search twice for duplicates among the same set of images within one session, the first run is extremely slow (in my test case over 10 minutes) and the second takes under 5 seconds. If I then close and reopen digiKam, it again takes over 10 minutes. Under the same conditions, the previous versions of digiKam always run the same search in under 5 seconds.

Something else I noticed: when looking for duplicates, the newer version audibly accesses the images on the HDD, while older versions seem to run the search using only the data in the DB, without ever directly accessing the images (I have the DBs on an SSD and the images on an HDD, so it is very noticeable).

All the previous findings still apply, but this time I am using two databases: one for the weekly AppImage "8.5.0-20240831T112634" and another for the release version of 8.5, just in case it had something to do with the new unique file hash.
Comment 4 Maik Qualmann 2024-12-13 06:59:14 UTC
It can have something to do with the new unique hash only to a limited extent; it is not slower than the previous one. If you have already switched to the new unique hash by executing the option in the digiKam setup, only the core database and thumbnail database are ported. The database with the fingerprints is not ported, so you have to update the fingerprints.

Maik
Comment 5 garfer 2024-12-13 19:24:29 UTC
I tried creating a new database altogether and calculated the fingerprints using both the maintenance tool and the "Update fingerprints" button in the duplicate-search interface, using the release version of 8.5, and the problem still persists: the first time I run the search it is very slow, with very high disk read rates (not only on the drive with the DBs but also on the one with the images). After restarting digiKam I have to do it all again.

Looking more closely at the logs, in addition to the previously mentioned messages I also get some other patterns that I am appending below. If I am not mistaken, it seems digiKam is reading the metadata from the images as it goes, which would explain both the high disk utilization and the low speed. For now I tried disabling everything related to metadata in digiKam, and at one point I also tried removing access to ExifTool, to no avail (in that attempt the ExifTool-related messages disappeared, but the Exiv2 ones and the "digikam.dimg.jpeg: RST..." ones remained). To be clear, on the previous digiKam versions I did not get these logs.

Here are some log chunks that seem fairly representative of what I get. Note that I replaced the real file paths with {imagepath} for privacy reasons.

digikam.metaengine: Loading metadata with "Exiv2" backend from "{imagepath}"

digikam.general: Try to get preview from "{imagepath}"
digikam.general: Preview quality:  0
digikam.metaengine: ExifToolProcess::readOutput(): ExifTool command completed
digikam.metaengine: ExifTool complete command for action "Load Chunks" with elasped time (ms): 23
digikam.metaengine: EXV chunk size: 0
digikam.metaengine: ExifTool parsed command for action "Load Chunks" 1 properties decoded
digikam.metaengine: ExifTool complete "Load Chunks" for "{imagepath}"
digikam.metaengine: Metadata chunk loaded with ExifTool
digikam.metaengine: Metadata chunk loaded with ExifTool has no data
digikam.metaengine: Check ExifTool availability: true
digikam.metaengine: ExifTool "Load Chunks" "-TagsFromFile {imagepath} -all:all -icc_profile -o -.exv"
digikam.metaengine: Loading metadata with "Exiv2" backend from "{imagepath}"
digikam.general: Try to load DImg preview from: "{imagepath}"
digikam.metaengine: Check ExifTool availability: true

digikam.dimg.jpeg: Adobe APP14 marker: version 100, flags 0x4000 0x0000, transform 1  ( 1 )
digikam.dimg.jpeg: Define Quantization Table 0  precision 0  ( 1 )
digikam.dimg.jpeg: RST3  ( 3 )
digikam.dimg.jpeg: Define Quantization Table 1  precision 0  ( 1 )
digikam.dimg.jpeg: Start Of Frame 0xc0: width=1200, height=1800, components=3  ( 1 )
digikam.dimg.jpeg:     Component 1: 1hx1v q=0  ( 1 )
digikam.dimg.jpeg:     Component 2: 1hx1v q=1  ( 1 )
digikam.dimg.jpeg:     Component 3: 1hx1v q=1  ( 1 )
digikam.dimg.jpeg: Define Restart Interval 150  ( 1 )
digikam.dimg.jpeg: Define Huffman Table 0x00  ( 1 )
digikam.dimg.jpeg:           0   0   7   1   1   1   1   1  ( 2 )
digikam.dimg.jpeg:           0   0   0   0   0   0   0   0  ( 2 )
digikam.dimg.jpeg: Define Huffman Table 0x01  ( 1 )
digikam.dimg.jpeg: RST4  ( 3 )
digikam.dimg.jpeg:           0   2   2   3   1   1   1   1  ( 2 )
digikam.dimg.jpeg:           1   0   0   0   0   0   0   0  ( 2 )
digikam.dimg.jpeg: Define Huffman Table 0x10  ( 1 )
digikam.dimg.jpeg:           0   2   1   3   3   2   4   2  ( 2 )
digikam.dimg.jpeg:           6   7   3   4   2   6   2 115  ( 2 )
digikam.dimg.jpeg: Define Huffman Table 0x11  ( 1 )
digikam.dimg.jpeg:           0   2   2   1   2   3   5   5  ( 2 )
digikam.dimg.jpeg:           4   5   6   4   8   3   3 109  ( 2 )
digikam.dimg.jpeg: Start Of Scan: 3 components  ( 1 )
digikam.dimg.jpeg:     Component 1: dc=0 ac=0  ( 1 )
digikam.dimg.jpeg:     Component 2: dc=1 ac=1  ( 1 )
digikam.dimg.jpeg: RST5  ( 3 )
digikam.dimg.jpeg:     Component 3: dc=1 ac=1  ( 1 )
digikam.dimg.jpeg:   Ss=0, Se=63, Ah=0, Al=0  ( 1 )
digikam.dimg.jpeg: RST6  ( 3 )
digikam.dimg.jpeg: RST0  ( 3 )
digikam.dimg.jpeg: RST1  ( 3 )
digikam.dimg.jpeg: RST7  ( 3 )
digikam.dimg.jpeg: RST2  ( 3 )
digikam.dimg.jpeg: RST3  ( 3 )
digikam.dimg.jpeg: RST0  ( 3 )
digikam.dimg.jpeg: RST4  ( 3 )
digikam.dimg.jpeg: Start of Image  ( 1 )
digikam.dimg.jpeg: JFIF APP0 marker: version 1.01, density 350x350  1  ( 1 )
digikam.dimg.jpeg: Define Quantization Table 0  precision 0  ( 1 )
digikam.dimg.jpeg: RST5  ( 3 )
digikam.dimg.jpeg: Define Quantization Table 1  precision 0  ( 1 )
digikam.dimg.jpeg: Start Of Frame 0xc0: width=700, height=1049, components=3  ( 1 )
digikam.dimg.jpeg: RST1  ( 3 )
digikam.dimg.jpeg: RST6  ( 3 )

*I noticed my system was using ExifTool 13.07, so I downgraded to 12.99, but the problem persists.
Comment 6 Maik Qualmann 2024-12-13 19:36:28 UTC
I did not say that you should create a new core database.
The log shows a scan process that does not occur in the duplicate search. This means that your collection and the thumbnails have not been completely scanned, the process is still running. You must wait until it is completed before performing a duplicate search. Note that scanning a large image collection can take hours or, in the case of network drives, days.

Maik
Comment 7 garfer 2024-12-17 13:56:01 UTC
Makes sense; I'll try recreating the thumbnails to see if that solves the issue. In any case, I would like to clarify that by "creating a new DB" I meant both creating the core DB and running the appropriate maintenance tasks, including finding new items, creating the thumbnails, and calculating the fingerprints (I also recalculated the fingerprints after I noticed it was still slow, but did not think of doing the same with the thumbnails).
Comment 8 garfer 2024-12-18 14:44:55 UTC
Nothing: same speed, logs, and disk usage. I am pretty sure the thumbnails are created and working, given the performance while browsing the galleries and the size of the thumbnails DB.

As a side note, testing on 8.4 with the fingerprints calculated but omitting the creation of thumbnails is still much faster; it just takes a couple of extra seconds to display the results. Disk usage on the drive with the images is negligible during the search itself, but it spikes once the search is completed and digiKam is trying to display the results.
Comment 9 garfer 2024-12-18 20:10:29 UTC
I have been compiling several versions of digiKam, trying to find the point at which this started happening, and I have narrowed it down to the commits made on Sep 29 2024. All I can say with certainty is that on my system commit b2981ed4 is several times faster at finding duplicates than commit a07cfdb7. I suspect commit f41fde1d, "Progress Manager - Find duplicates : show current item thumbnails, name, and album path.", of being the cause of the performance degradation.

In case this turns out to be the issue (I am not familiar at all with C++ or digiKam's codebase), and since this commit was made in response to a feature request (375521), meaning it could be considered a feature and not a bug, I would like to request the addition of an option to turn this feature off, given the possible impact it can have on both performance and HDD wear, at least on certain setups.
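The manual narrowing described in this comment can also be automated with `git bisect`. Below is a minimal sketch: `slow_check.sh` is a hypothetical helper, not part of this report, that would build the checked-out commit, time a duplicate search, and exit non-zero when it is slow. The runnable part demonstrates the same mechanics on a throwaway repository with a stand-in "speed" file.

```shell
# Against a digiKam clone, the idea would be (sketch only):
#   git bisect start
#   git bisect bad  a07cfdb7          # known slow (from this report)
#   git bisect good b2981ed4          # known fast (from this report)
#   git bisect run ./slow_check.sh    # hypothetical build-and-time script
#   git bisect reset

# The same mechanics on a tiny throwaway repository:
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "tester@example.com"
git config user.name "tester"
for n in 1 2 3 4 5; do
    echo "$n" > speed
    git add speed
    git commit -qm "commit $n"
done
git bisect start HEAD HEAD~4 > /dev/null 2>&1         # bad = newest, good = oldest
git bisect run sh -c '[ "$(cat speed)" -lt 3 ]' > /dev/null 2>&1
first_bad=$(git rev-parse refs/bisect/bad)            # first commit failing the check
git log -1 --format=%s "$first_bad"                   # prints "commit 3"
git bisect reset > /dev/null 2>&1
```

With roughly five commits between the known-good and known-bad points, bisect needs only two or three builds instead of compiling every revision.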
Comment 10 caulier.gilles 2024-12-18 20:18:05 UTC
Regarding this commit, which appends the thumbnail, path, and name of the current item processed by the find-duplicates tool: I think the costly part is the thumbnail extraction; the strings (path and name) are already available in the source code.

The thumbnail extraction uses the cache mechanism already present in digiKam, so it should be fast enough. The same is done with the face detection tool, so I am surprised by the performance degradation. At least here, when I test this feature, I do not see any latency.
Comment 11 Maik Qualmann 2024-12-18 20:42:04 UTC
Hi Gilles, we load a DImg from each image; most of them will not be in a cache and will be loaded from the DB, or created when a new collection is added. You have a fast computer, but it is clear that with the thousands of images the tool compares, performance drops significantly.

Maik
Comment 12 Maik Qualmann 2024-12-18 20:46:49 UTC
It is completely unnecessary to display every thumbnail in the progress bar; maybe only 10% of an item list would do, as the difference is not visible to the eye. This applies not just to the duplicates tool, but in general. We also lose render time in the GUI with so many events.

Maik
Comment 13 garfer 2024-12-19 14:21:33 UTC
Since this seems to be the issue, I will add some more specific details.

CPU: intel i5-12600KF

Time to find duplicates in a gallery containing 5994 images (similarity and thumbnail DBs created and up to date):

-- commit a07cfdb7, DB on SSD + images on HDD
total time: 1 min 58 s
peak RAM usage: 1900 MB
peak HDD read (kB/s): 12572
peak SSD read (kB/s): 10120

-- commit a07cfdb7, DB + images on SSD
total time: 22 s
peak RAM usage: 1800 MB
peak HDD read (kB/s): 0
peak SSD read (kB/s): 65640

-- commit b2981ed4, DB on SSD + images on HDD
total time: 2 s
peak RAM usage: 750 MB
peak HDD read (kB/s): 0
peak SSD read (kB/s): 9595

-- commit b2981ed4, DB + images on SSD
total time: 2 s
peak RAM usage: 750 MB
peak HDD read (kB/s): 0
peak SSD read (kB/s): 9466

*The usage stats are all approximate due to the way they were gathered, but fairly consistent. Disk stats were taken with iostat running at 1 s intervals; RAM usage was taken from fastfetch.
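A minimal sketch of the measurement method described above: sample with iostat at 1 s intervals while the search runs, then pull the peak read rate out of the log with awk. The device name `sda` and the stand-in samples below are illustrative, not from the report; in `iostat -d -k` output, column 3 of a device line is kB_read/s.

```shell
# While the duplicate search runs, something like this would record samples:
#   iostat -d -k 1 sda > iostat.log     # stop it when the search finishes
# Stand-in samples so the post-processing step below is runnable as-is:
printf 'sda 1.20 120.00 0.00\nsda 3.10 9595.00 0.00\nsda 0.40 50.00 0.00\n' > iostat.log
# Keep the maximum kB_read/s seen across all samples for the device:
awk '$1 == "sda" && $3 > max { max = $3 } END { printf "%d kB/s peak read\n", max }' iostat.log
# -> 9595 kB/s peak read
```

Taking the per-interval peak rather than an average makes the burst of image reads stand out even when the search spends most of its time waiting on the DB.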

From those numbers it seems the database usage is consistent across versions, but as of commit a07cfdb7 digiKam has started accessing the images themselves instead of only the DB.

I am including RAM stats because I also noticed very high RAM usage on the newer version when running a duplicate search on a large set of images; I think it reached over 9 GB before I stopped the program. I disregarded it at the time, but it might matter, seeing that some sort of cache is mentioned. Please note that I did not do any specific testing in that regard, so it might be nothing.
Comment 14 caulier.gilles 2024-12-20 04:09:22 UTC
Git commit 1452c08850eacccf41a2ead9d605efcde73276a9 by Gilles Caulier.
Committed on 20/12/2024 at 04:07.
Pushed by cgilles into branch 'master'.

Fix performance issue with loaded preview while duplicates search tool is running.
FIXED-IN: 8.6.0

M  +5    -6    core/libs/database/haar/haariface.cpp
M  +0    -1    core/libs/database/haar/haariface_p.h
M  +2    -2    core/utilities/maintenance/tools/duplicates/duplicatesfinder.cpp

https://invent.kde.org/graphics/digikam/-/commit/1452c08850eacccf41a2ead9d605efcde73276a9