Bug 497938 - Text-based Image Search
Summary: Text-based Image Search
Status: REPORTED
Alias: None
Product: digikam
Classification: Applications
Component: Searches-Advanced
Version: 8.6.0
Platform: Other Other
Importance: NOR wishlist
Target Milestone: ---
Assignee: Digikam Developers
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-12-27 11:17 UTC by chair-tweet-decal
Modified: 2025-03-21 04:54 UTC
CC List: 4 users

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:


Attachments

Description chair-tweet-decal 2024-12-27 11:17:06 UTC
SUMMARY

The proposal is to add a text-based image search feature in digiKam. The idea is to allow users to input a text query (e.g., "cat on a couch") and retrieve images that match based on their visual content, rather than relying on tags or metadata. This would enable more flexible and intuitive searches, particularly in large image collections where textual information may be limited or missing.

ADDITIONAL INFORMATION

To implement this feature, each image in the library would be associated with an embedding calculated when it is added to the database (or in batch). This calculation could be done using an AI model like CLIP (or similar models), which generates a vector representation of the image based on its visual content. These embeddings could be stored alongside the images (e.g., in the image metadata or in an associated database), ensuring that this information is preserved over time.
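
For illustration only, here is a minimal sketch of how such an embedding could be computed at import time with OpenCV's DNN module, assuming a CLIP image encoder already exported to ONNX (the model file name is hypothetical, and CLIP's per-channel normalization is omitted for brevity):

// Sketch: compute an image embedding with a CLIP-style image encoder exported to ONNX.
// NOTE: CLIP's per-channel mean/std normalization is omitted here for brevity;
// a real implementation would apply it before inference.
#include <opencv2/dnn.hpp>
#include <opencv2/imgcodecs.hpp>
#include <string>
#include <vector>

std::vector<float> computeImageEmbedding(const std::string& imagePath,
                                         cv::dnn::Net& encoder)
{
    cv::Mat img = cv::imread(imagePath, cv::IMREAD_COLOR);

    // Resize/center-crop to 224x224, scale to [0,1]; swapRB converts BGR (OpenCV) to RGB.
    cv::Mat blob = cv::dnn::blobFromImage(img, 1.0 / 255.0, cv::Size(224, 224),
                                          cv::Scalar(), /*swapRB=*/true, /*crop=*/true);

    encoder.setInput(blob);
    cv::Mat out = encoder.forward();   // 1 x N feature vector

    return std::vector<float>(out.begin<float>(), out.end<float>());
}

// Usage (hypothetical model file):
// cv::dnn::Net encoder = cv::dnn::readNetFromONNX("clip_image_encoder.onnx");
// std::vector<float> emb = computeImageEmbedding("photo.jpg", encoder);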

For the search, the idea would be to leverage a vector search engine that compares the stored image embeddings with those generated from the text queries. This would require the integration of a vector database like FAISS or similar, enabling fast and scalable search within large image collections. When a user submits a text query, an embedding is generated for the description and compared to the pre-calculated image embeddings, with the most relevant images returned based on vector similarity.
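
As a rough illustration of the ranking step only (not tied to FAISS or any particular engine), the comparison boils down to cosine similarity between the query embedding and the stored image embeddings; a dedicated vector index would simply replace the brute-force loop at scale:

// Sketch: rank stored image embeddings against a text-query embedding
// by cosine similarity (brute force; a vector index would replace the loop at scale).
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

static float cosineSimilarity(const std::vector<float>& a, const std::vector<float>& b)
{
    float dot = 0.0f, normA = 0.0f, normB = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot   += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (std::sqrt(normA) * std::sqrt(normB) + 1e-12f);
}

// Returns indices of the topK most similar images.
std::vector<std::size_t> search(const std::vector<float>& queryEmbedding,
                                const std::vector<std::vector<float>>& imageEmbeddings,
                                std::size_t topK)
{
    std::vector<std::pair<float, std::size_t>> scored;
    for (std::size_t i = 0; i < imageEmbeddings.size(); ++i)
        scored.emplace_back(cosineSimilarity(queryEmbedding, imageEmbeddings[i]), i);

    std::sort(scored.begin(), scored.end(),
              [](const auto& x, const auto& y) { return x.first > y.first; });

    std::vector<std::size_t> result;
    for (std::size_t i = 0; i < std::min(topK, scored.size()); ++i)
        result.push_back(scored[i].second);
    return result;
}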

The interface could include a new option in the search section of digiKam, allowing users to enter a textual description and see the corresponding images based on their content. This approach would enable a more powerful search system, going beyond simple keywords or tags associated with the images.


The technical details provided here are intended to illustrate what the feature might look like. However, as I am not familiar with the internal structure of digiKam, this is purely a conceptual framework, not a detailed action plan. Further discussion and adjustments will be required to fit the specific architecture of the software.
Comment 1 caulier.gilles 2024-12-27 11:29:21 UTC
Hi,

Sounds like a ChatGPT search engine wish...

In all cases, we will _NEVER_ open the door to web search services or share end users' private content with big data companies. Photos are proprietary content, and web AI stuff can burn in the center of the earth.

If we decide to write something like that, the DNN models and engines must be implemented locally, as with the face management.

Best

Gilles Caulier
Comment 2 chair-tweet-decal 2024-12-27 11:40:43 UTC
The CLIP model works locally, so there’s no need for an external system. CLIP is used in Stable Diffusion, so I assume its license would allow its use in DigiKam (though this needs to be confirmed, as I don't have experience with licensing).
Comment 3 Michael Miller 2024-12-27 11:49:39 UTC
(In reply to chair-tweet-decal from comment #2)
> The CLIP model works locally, so there’s no need for an external system.
> CLIP is used in Stable Diffusion, so I assume its license would allow its
> use in DigiKam (though this needs to be confirmed, as I don't have
> experience with licensing).

There are LLMs that can run locally.  The main problem with implementing an LLM search is that the LLM itself is typically huge.  digiKam is already quite large, and introducing an LLM will make it even bigger.  Even if we download the trained LLM after digiKam is installed, the LLM code is large by itself.

CLIP is nice, but it's Python-based, and digiKam is C++.  While it's possible to integrate the two, we don't want to introduce another dependency into digiKam.

As an AI/ML fan myself, I'm always looking for ways to assist the creative photography process through modern tools (but not generative AI).  If it becomes feasible to introduce LLM searching in digiKam, I will definitely look into it.

Cheers,
Mike
Comment 4 chair-tweet-decal 2024-12-27 12:18:36 UTC
Indeed, the fact that it's Python might complicate things. It would likely be possible to export the model to ONNX, but that would require hosting the converted model, as I don't think it's available in that format. Also, the ONNX runtime can range from 50MB to 300MB depending on the OS and whether there's GPU support.

Not to mention, using ONNX could add extra complexity.

There’s also LibTorch, which could be used to run the model without ONNX, but you would still need to convert the model, which adds another dependency.
Comment 5 Michael Miller 2024-12-27 12:22:58 UTC
(In reply to chair-tweet-decal from comment #4)
> Indeed, the fact that it's Python might complicate things. It would likely
> be possible to export the model to ONNX, but that would require hosting the
> converted model, as I don't think it's available in that format. Also, the
> ONNX runtime can range from 50MB to 300MB depending on the OS and whether
> there's GPU support.
> 
> Not to mention, using ONNX could add extra complexity.
> 
> There’s also LibTorch, which could be used to run the model without ONNX,
> but you would still need to convert the model, which adds another dependency.

Running an ONNX model isn't a problem.  We already use several ONNX models in the face recognition engine, and in the soon-to-be-released image classification engine.  Since OpenCV is built into digiKam, we use the OpenCV runtime for ONNX (and other model types like Caffe and DarkNet).

Cheers,
Mike
Comment 6 caulier.gilles 2024-12-27 17:10:05 UTC
Hi Michael,

I'm not in favor of mixing different programming languages in digiKam. This will increase the entropy and make maintenance hell.

One possible solution is to create a 3rd-party plugin. Krita already does this, for example, but in digiKam the plugin interface (especially the database) is not yet ready to be queried by Python code. It's technically possible of course, but it's not a priority for the moment.

Gilles
Comment 7 Michael Miller 2024-12-27 17:13:41 UTC
(In reply to caulier.gilles from comment #6)
> Hi Michael,
> 
> I'm not in favor of mixing different programming languages in digiKam. This
> will increase the entropy and make maintenance hell.
> 
> 
> Gilles

Hi Gilles, yes, I agree we shouldn't mix programming languages.  I think the OP is talking about turning the LLM into an ONNX model that we can use natively in digiKam without using a different programming language.  It's an interesting thought.

Cheers,
Mike
Comment 8 chair-tweet-decal 2024-12-30 16:26:07 UTC
I've tried some experiments on my side; here are the results. Unfortunately they are inconclusive, but perhaps they can help with the next steps.

For your information, I’m not very experienced in machine learning, so I’m probably missing some things.

What I hadn’t anticipated:

Before using the models, there is the preprocessing of the inputs.

For the image part, no matter which model is used, the CLIP preprocessing seems to be the same: https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPImageProcessor
Handling images is fairly simple; the difficulty doesn't seem to be here.
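
For what it's worth, here is a minimal sketch of that preprocessing in OpenCV; the 224x224 size and the mean/std values are the ones published for CLIP, the rest is purely illustrative:

// Sketch of CLIP-style image preprocessing: resize the shorter side to 224,
// center-crop 224x224, convert BGR->RGB, scale to [0,1], normalize per channel.
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <algorithm>
#include <cmath>
#include <vector>

cv::Mat clipPreprocess(const cv::Mat& bgr)
{
    const int target = 224;

    // Resize so the shorter side equals 224 (never smaller, to keep the crop valid).
    double scale = static_cast<double>(target) / std::min(bgr.cols, bgr.rows);
    int newW = std::max(target, static_cast<int>(std::round(bgr.cols * scale)));
    int newH = std::max(target, static_cast<int>(std::round(bgr.rows * scale)));
    cv::Mat resized;
    cv::resize(bgr, resized, cv::Size(newW, newH), 0, 0, cv::INTER_CUBIC);

    // Center crop 224x224.
    cv::Rect roi((resized.cols - target) / 2, (resized.rows - target) / 2, target, target);
    cv::Mat cropped = resized(roi).clone();

    // BGR -> RGB, float in [0,1].
    cv::Mat rgb;
    cv::cvtColor(cropped, rgb, cv::COLOR_BGR2RGB);
    rgb.convertTo(rgb, CV_32FC3, 1.0 / 255.0);

    // Per-channel normalization with CLIP's published mean/std.
    const double mean[3] = {0.48145466, 0.4578275, 0.40821073};
    const double stdd[3] = {0.26862954, 0.26130258, 0.27577711};
    std::vector<cv::Mat> channels(3);
    cv::split(rgb, channels);
    for (int c = 0; c < 3; ++c)
        channels[c] = (channels[c] - mean[c]) / stdd[c];
    cv::Mat normalized;
    cv::merge(channels, normalized);

    return normalized;   // HWC float; reorder to NCHW before feeding the ONNX model
}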

For the text part, there is tokenization.
This is more complicated and requires additional libraries and configuration files.
A large number of models available on Hugging Face use https://huggingface.co/openai/clip-vit-large-patch14/tree/main, which requires vocab.json and merges.txt.

The M-CLIP models (https://huggingface.co/M-CLIP/XLM-Roberta-Large-Vit-B-16Plus/tree/main, multilingual) propose using the SentencePiece tokenizer and provide the sentencepiece.bpe.model.

In both cases, padding and truncation need to be managed.
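
To make the padding/truncation point concrete, here is a rough sketch of what tokenizing a query with the SentencePiece C++ library could look like; the context length of 77 is CLIP's usual value, and the BOS/EOS/pad ids are taken from the loaded model (whether they match what a given text encoder expects would need to be checked):

// Sketch: tokenize a text query with SentencePiece, then pad/truncate to a fixed
// context length as CLIP-style text encoders expect.
#include <sentencepiece_processor.h>
#include <string>
#include <vector>

std::vector<int> tokenizeQuery(const sentencepiece::SentencePieceProcessor& sp,
                               const std::string& query,
                               int contextLength /* e.g. 77 for CLIP */)
{
    std::vector<int> pieces = sp.EncodeAsIds(query);

    std::vector<int> ids;
    ids.push_back(sp.bos_id());                        // begin-of-sequence
    ids.insert(ids.end(), pieces.begin(), pieces.end());
    ids.push_back(sp.eos_id());                        // end-of-sequence

    // Truncate, keeping the EOS token at the end.
    if (static_cast<int>(ids.size()) > contextLength) {
        ids.resize(contextLength);
        ids.back() = sp.eos_id();
    }

    // Pad up to the fixed context length.
    int pad = (sp.pad_id() >= 0) ? sp.pad_id() : 0;    // some models define no pad id
    while (static_cast<int>(ids.size()) < contextLength)
        ids.push_back(pad);

    return ids;
}

// Usage (hypothetical file name):
// sentencepiece::SentencePieceProcessor sp;
// sp.Load("sentencepiece.bpe.model");
// std::vector<int> tokens = tokenizeQuery(sp, "cat on a couch", 77);
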
Model management:
I’ve tried to have one model for the image and another for the text to allow them to be used separately, and some models (M-CLIP) already use this approach.

This project (https://github.com/Lednik7/CLIP-ONNX) seems to split the model if the base one combines both.
This ONNX function seems to be able to extract part of a model: https://onnx.ai/onnx/api/utils.html#extract-model.

I haven’t tested either of these methods because I found this ModelZoo, which offers several pre-split models: https://github.com/jina-ai/clip-as-service/blob/main/server/clip_server/model/clip_onnx.py.

For preprocessing, I wanted to include the steps in the model itself, so as to have a single ONNX model that handles everything and avoid having to add libraries.

I tried with https://github.com/microsoft/onnxruntime-extensions/blob/main/onnxruntime_extensions/tools/Example%20usage%20of%20the%20PrePostProcessor.md, which seems to offer what’s necessary for image and text preprocessing and only requires including a library for custom operators when creating the ONNX session.
The tokenizers don’t seem to handle padding and truncation.

I haven’t managed to get a functional model.

For tokenization in C++, https://github.com/google/sentencepiece seems interesting because it would only require having the .bpe.model (and handling padding and truncation), which some models provide, but not all.
It explains how to train a BPE model, but I don’t know how to convert vocab.json + merges.txt into the model.

For the choice of the model, the M-CLIP models seem interesting because they offer BPE for the tokenizer, which seems simpler to use. They are multilingual and have good performance. However, they are large, and I’m not sure if they would work on a "lightweight" computer, or what the inference speed would be.
I also don’t have this information for other models.
Comment 9 Michael Miller 2025-01-25 15:28:31 UTC
I've been looking into your idea of adding NLP search for images.  The hardest part is proper classification of the images.  There is a new Autotagging engine in 8.6.0 that's much more accurate when using YOLOv11.  These tags plus user-entered tags will be the foundation of the search for now.

I think for a first try, I'll tokenize existing tags (including autotags).  That should at least get us to the point where we can test a simple NLP search model.  We can iterate from there by tokenizing other metadata, including dates.
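
Purely to illustrate the direction (none of this is existing digiKam code), such a first iteration could score images by the overlap between query tokens and their tag tokens, along these lines:

// Sketch: score images by how many query tokens appear among their (auto)tags.
// Tokenization here is just whitespace + lowercase; a real NLP model would replace it.
#include <algorithm>
#include <cctype>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

static std::set<std::string> tokenize(const std::string& text)
{
    std::set<std::string> tokens;
    std::istringstream stream(text);
    std::string word;
    while (stream >> word) {
        std::transform(word.begin(), word.end(), word.begin(),
                       [](unsigned char c) { return std::tolower(c); });
        tokens.insert(word);
    }
    return tokens;
}

// imageTags: image id -> concatenated tag names (user tags + autotags).
std::vector<std::pair<int, int>> scoreByTagOverlap(const std::string& query,
                                                   const std::map<int, std::string>& imageTags)
{
    const std::set<std::string> queryTokens = tokenize(query);

    std::vector<std::pair<int, int>> scored;   // (score, image id)
    for (const auto& entry : imageTags) {
        const std::set<std::string> tagTokens = tokenize(entry.second);
        int overlap = 0;
        for (const std::string& t : queryTokens)
            if (tagTokens.count(t))
                ++overlap;
        if (overlap > 0)
            scored.emplace_back(overlap, entry.first);
    }
    std::sort(scored.rbegin(), scored.rend());  // highest overlap first
    return scored;
}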

Preprocessing and classifying images isn't something that can be done at the time of the search.  For large libraries, this can take hours and hours.  All of the images would have to be pre-classified for the search to be usable.  This is why I'm staying away (at least for now) from some of the models you mentioned.

Cheers,
Mike
Comment 10 caulier.gilles 2025-01-25 16:26:25 UTC
Hi Michael,

+1 for the NLP search support in digiKam...

Gilles
Comment 11 Srisharan V S 2025-03-20 16:09:39 UTC
Hi everyone,

I am Srisharan, and I have been contributing to KDE since last December, primarily in KDE Games. I recently went through KDE's list of GSoC'25 ideas and found this project particularly interesting.

I have experience in C++ and am knowledgeable in machine learning and LLMs. Additionally, I have just completed a research paper on image processing. Given my background, I am eager to contribute to this project for GSoC'25.

Could anyone guide me on how to get started?
Comment 12 Michael Miller 2025-03-20 16:16:36 UTC
(In reply to Srisharan V S from comment #11)
> Hi everyone,
> 
> I am Srisharan, and I have been contributing to KDE since last December,
> primarily in KDE Games. I recently went through KDE's list of GSoC'25 ideas
> and found this project particularly interesting.
> 
> I have experience in C++ and am knowledgeable in machine learning and LLMs.
> Additionally, I have just completed a research paper on image processing.
> Given my background, I am eager to contribute to this project for GSoC'25.
> 
> Could anyone guide me on how to get started?

Hi Srisharan,
I'm the dev who focuses on the AI/ML aspects of digiKam. Send an email to digikam-users@kde.org with your request.  I'll start a private email chain with myself and the other digiKam devs.  

Cheers,
Mike
Comment 13 caulier.gilles 2025-03-20 16:26:38 UTC
Hi Srisharan,

As Michael said, please contact the developers by private mail. The email addresses are listed in the project ideas list on the KDE site.

To contribute to this topic, in short, you must write a proposal for this project, listing:

- Concepts to implement,
- Problems to solve,
- Technical solutions to apply,
- A release plan,
- Details of your skills,
- and more...

This proposal must be published, when ready, on the Google Summer of Code web site to be reviewed by all KDE mentors, with the goal of being selected for the event.

By private mail, we will respond to all your questions on the project.

Best

Gilles Caulier
Comment 14 Srisharan V S 2025-03-21 03:31:43 UTC
(In reply to Michael Miller from comment #12)
> (In reply to Srisharan V S from comment #11)
> > Hi everyone,
> > 
> > I am Srisharan, and I have been contributing to KDE since last December,
> > primarily in KDE Games. I recently went through KDE's list of GSoC'25 ideas
> > and found this project particularly interesting.
> > 
> > I have experience in C++ and am knowledgeable in machine learning and LLMs.
> > Additionally, I have just completed a research paper on image processing.
> > Given my background, I am eager to contribute to this project for GSoC'25.
> > 
> > Could anyone guide me on how to get started?
> 
> Hi Srisharan,
> I'm the dev who focuses on the AI/ML aspects of digiKam. Send an email to
> digikam-users@kde.org with your request.  I'll start a private email chain
> with myself and the other digiKam devs.  
> 
> Cheers,
> Mike

Hi Mike,
I have sent an email to digikam-users@kde.org. Apparently, a moderator needs to approve my mail request since I am not a member of the group. Hopefully, someone can approve it soon.
Cheers, 
Srisharan
Comment 15 caulier.gilles 2025-03-21 04:54:18 UTC
No, please don't waste time: send the mail in private, not to the mailing list. CC me, Michael, and Maik, as listed in the project idea:

https://community.kde.org/GSoC/2025/Ideas#Project:_Interface_the_database_search_engine_to_an_AI_based_LLM