| Summary: | Text-based Image Search | | |
| --- | --- | --- | --- |
| Product: | [Applications] digikam | Reporter: | chair-tweet-decal |
| Component: | Searches-Advanced | Assignee: | Digikam Developers <digikam-bugs-null> |
| Status: | REPORTED | | |
| Severity: | wishlist | CC: | caulier.gilles, chair-tweet-decal, michael_miller, srisharan.psgtech |
| Priority: | NOR | | |
| Version: | 8.6.0 | | |
| Target Milestone: | --- | | |
| Platform: | Other | | |
| OS: | Other | | |
| Latest Commit: | | Version Fixed In: | |
| Sentry Crash Report: | | | |
Description
chair-tweet-decal
2024-12-27 11:17:06 UTC
Hi,

This sounds like a ChatGPT search engine wish... In any case, we will _NEVER_ open the door to web search services and share end users' private content with the big-data companies. Photos are proprietary content, and web AI stuff can burn in the center of the earth. If we decide to write something like that, the DNN models and engines must be implemented locally, as with face management.

Best

Gilles Caulier

The CLIP model works locally, so there's no need for an external system. CLIP is used in Stable Diffusion, so I assume its license would allow its use in digiKam (though this needs to be confirmed, as I don't have experience with licensing).

(In reply to chair-tweet-decal from comment #2)
> The CLIP model works locally, so there's no need for an external system.
> CLIP is used in Stable Diffusion, so I assume its license would allow its
> use in digiKam (though this needs to be confirmed, as I don't have
> experience with licensing).

There are LLMs that can run locally. The main problem with implementing an LLM search is that the LLM itself is typically huge. digiKam is already quite large, and introducing an LLM will make it even bigger. Even if we download the trained LLM after digiKam is installed, the LLM code is large by itself.

CLIP is nice, but it's Python-based, and digiKam is C++. While it's possible to integrate the two, we don't want to introduce another dependency into digiKam.

As an AI/ML fan myself, I'm always looking for ways to assist the creative photography process through modern tools (but not generative AI). If it becomes feasible to introduce LLM searching in digiKam, I will definitely look into it.

Cheers,
Mike

Indeed, the fact that it's Python might complicate things. It would likely be possible to export the model to ONNX, but that would require hosting the converted model, as I don't think it's available in that format. Also, the ONNX runtime can range from 50 MB to 300 MB depending on the OS and whether there's GPU support.

Not to mention, using ONNX could add extra complexity.

There's also LibTorch, which could be used to run the model without ONNX, but you would still need to convert the model, which adds another dependency.

(In reply to chair-tweet-decal from comment #4)
> Indeed, the fact that it's Python might complicate things. It would likely
> be possible to export the model to ONNX, but that would require hosting the
> converted model, as I don't think it's available in that format. Also, the
> ONNX runtime can range from 50 MB to 300 MB depending on the OS and whether
> there's GPU support.
>
> Not to mention, using ONNX could add extra complexity.
>
> There's also LibTorch, which could be used to run the model without ONNX,
> but you would still need to convert the model, which adds another dependency.

Running an ONNX model isn't a problem. We already use several ONNX models in the face recognition engine, and in the soon-to-be-released image classification engine. Since OpenCV is built into digiKam, we use the OpenCV ONNX runtime (which also handles other model types, such as Caffe and Darknet).

Cheers,
Mike
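As a rough illustration of the OpenCV-based ONNX path described above (not actual digiKam code), here is a minimal sketch of running a CLIP image encoder exported to ONNX through cv::dnn. The model file name, the 224x224 input size, and the normalization constants are assumptions taken from the public CLIP preprocessing.

```cpp
// Hypothetical sketch: compute a CLIP image embedding with OpenCV's DNN
// module (the ONNX runtime digiKam already ships for face recognition).
// "clip_image_encoder.onnx" and the 224x224 input size are assumptions.
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/dnn.hpp>
#include <vector>

cv::Mat clipImageEmbedding(const cv::String& imagePath, cv::dnn::Net& net)
{
    cv::Mat img = cv::imread(imagePath);                  // 8-bit BGR
    cv::resize(img, img, cv::Size(224, 224));             // CLIP input size
    cv::cvtColor(img, img, cv::COLOR_BGR2RGB);             // CLIP expects RGB
    img.convertTo(img, CV_32F, 1.0 / 255.0);               // scale to [0, 1]

    // Per-channel normalization with the published CLIP mean/std values.
    const double mean[3]   = {0.48145466, 0.4578275, 0.40821073};
    const double stddev[3] = {0.26862954, 0.26130258, 0.27577711};
    std::vector<cv::Mat> channels;
    cv::split(img, channels);
    for (int c = 0; c < 3; ++c)
        channels[c] = (channels[c] - mean[c]) / stddev[c];
    cv::merge(channels, img);

    cv::Mat blob = cv::dnn::blobFromImage(img);             // 1x3x224x224, NCHW
    net.setInput(blob);
    return net.forward().clone();                           // e.g. a 1x512 vector
}

// Usage (the file name is hypothetical):
//   cv::dnn::Net net = cv::dnn::readNetFromONNX("clip_image_encoder.onnx");
//   cv::Mat emb     = clipImageEmbedding("photo.jpg", net);
```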
Hi Michael,

I'm not in favor of mixing different programming languages in digiKam. This will increase the entropy and make maintenance a hell.

One possible solution is to create a third-party plugin. Krita already does this, for example, but in digiKam the plugin interface (especially the database) is not yet ready to be queried by Python code. It's technically possible of course, but it's not the priority for the moment.

Gilles

(In reply to caulier.gilles from comment #6)
> Hi Michael,
>
> I'm not in favor of mixing different programming languages in digiKam. This
> will increase the entropy and make maintenance a hell.
>
> Gilles

Hi Gilles,

Yes, I agree we shouldn't mix programming languages. I think the OP is talking about converting the LLM into an ONNX model that we can use natively in digiKam without using a different programming language. It's an interesting thought.

Cheers,
Mike

I've tried some experiments on my side; here are the results. They are unfortunately inconclusive, but they may help with the next steps. For your information, I'm not very experienced in machine learning, so I'm probably missing some things.

What I hadn't anticipated: before using the models, there is the preprocessing of the inputs.

For the image part, no matter which model is used, the CLIP preprocessing seems to be the same: https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPImageProcessor. Handling images is fairly simple; the difficulty doesn't seem to be here.

For the text part, there is tokenization. This is more complicated and requires additional libraries and configuration files. A large number of models available on Hugging Face use https://huggingface.co/openai/clip-vit-large-patch14/tree/main, which requires vocab.json and merges.txt. The M-CLIP models (https://huggingface.co/M-CLIP/XLM-Roberta-Large-Vit-B-16Plus/tree/main, multilingual) propose using the SentencePiece tokenizer and provide sentencepiece.bpe.model. In both cases, padding and truncation need to be managed.

Model management: I've tried to have one model for the image and another for the text so they can be used separately, and some models (M-CLIP) already use this approach. This project (https://github.com/Lednik7/CLIP-ONNX) seems to split the model if the base one combines both, and this ONNX function seems to be able to extract part of a model: https://onnx.ai/onnx/api/utils.html#extract-model. I haven't tested either of these methods because I found this model zoo, which offers several pre-split models: https://github.com/jina-ai/clip-as-service/blob/main/server/clip_server/model/clip_onnx.py.

For preprocessing, I wanted to include it in the model, so that a single ONNX model handles everything and no extra libraries are needed. I tried https://github.com/microsoft/onnxruntime-extensions/blob/main/onnxruntime_extensions/tools/Example%20usage%20of%20the%20PrePostProcessor.md, which seems to offer what's necessary for image and text preprocessing and only requires loading a custom-operator library when creating the ONNX session. However, its tokenizers don't seem to handle padding and truncation, and I haven't managed to get a functional model.

For tokenization in C++, https://github.com/google/sentencepiece seems interesting because it would only require the .bpe.model (plus handling padding and truncation), which is available for some models, but not all. It explains how to train a BPE model, but I don't know how to convert vocab.json + merges.txt into such a model.

For the choice of model, the M-CLIP models seem interesting because they use BPE for the tokenizer, which seems simpler to use. They are multilingual and have good performance. However, they are large, and I'm not sure whether they would work on a "lightweight" computer, or what the inference speed would be. I also don't have this information for other models.
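Regarding the SentencePiece option mentioned above, here is a minimal, hypothetical C++ sketch of query tokenization with padding and truncation. The model file name, the 77-token context length, and the special-token ids are assumptions and differ between models.

```cpp
// Hypothetical sketch: tokenize a query for a CLIP-style text encoder with
// the SentencePiece C++ library, then pad/truncate to a fixed length.
// Context length and bos/eos/pad ids are assumptions; read them from the
// actual model's configuration.
#include <sentencepiece_processor.h>
#include <string>
#include <vector>

std::vector<int64_t> tokenizeQuery(const std::string& query,
                                   const sentencepiece::SentencePieceProcessor& sp)
{
    const int contextLength = 77;               // assumed; model-dependent
    const int bosId = 0, eosId = 2, padId = 1;  // assumed; model-dependent

    std::vector<int> pieces;
    sp.Encode(query, &pieces);                  // BPE ids for the raw text

    std::vector<int64_t> ids;
    ids.push_back(bosId);
    for (int id : pieces)
        ids.push_back(id);
    ids.push_back(eosId);

    if ((int)ids.size() > contextLength)        // truncate, keep eos last
    {
        ids.resize(contextLength);
        ids.back() = eosId;
    }
    ids.resize(contextLength, padId);           // pad to the fixed length

    return ids;                                 // feed as an int64 tensor
}

// Usage (file name hypothetical):
//   sentencepiece::SentencePieceProcessor sp;
//   sp.Load("sentencepiece.bpe.model");
//   std::vector<int64_t> ids = tokenizeQuery("sunset over the sea", sp);
```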
I've been looking into your idea of adding NLP search for images.

The hardest part is proper classification of the images. There is a new autotagging engine in 8.6.0 that's much more accurate when using YOLOv11. These tags, plus user-entered tags, will be the foundation of the search for now. For a first try, I'll tokenize the existing tags (including autotags). That should at least get us to the point where we can test a simple NLP search model. We can iterate from there by tokenizing other metadata, including dates.

Preprocessing and classifying images isn't something that can be done at search time. For large libraries, this can take hours and hours. All of the images would have to be pre-classified for the search to be usable. This is why I'm staying away (at least for now) from some of the models you mentioned.

Cheers,
Mike

Hi Michael,

+1 for NLP search support in digiKam...

Gilles
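As a purely hypothetical illustration of the tag-based first pass described above (none of these names are digiKam APIs): tokenize the query and rank images by how many query tokens appear among their tags and autotags.

```cpp
// Hypothetical sketch: score images by overlap between query tokens and
// their (auto)tags. Data structures and function names are illustrative.
#include <algorithm>
#include <cctype>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Lowercase, whitespace-split tokenizer (a real implementation would also
// strip punctuation and handle multi-word tags).
std::set<std::string> tokenize(const std::string& text)
{
    std::set<std::string> tokens;
    std::istringstream stream(text);
    std::string word;
    while (stream >> word)
    {
        std::transform(word.begin(), word.end(), word.begin(),
                       [](unsigned char ch) { return std::tolower(ch); });
        tokens.insert(word);
    }
    return tokens;
}

// Rank image ids by the number of query tokens found among their tags.
std::vector<std::pair<int, int>> rankByTagOverlap(
    const std::string& query,
    const std::map<int, std::vector<std::string>>& tagsPerImage)
{
    const std::set<std::string> queryTokens = tokenize(query);
    std::vector<std::pair<int, int>> scored;   // (score, imageId)

    for (const auto& [imageId, tags] : tagsPerImage)
    {
        std::set<std::string> tagTokens;
        for (const std::string& tag : tags)
            for (const std::string& token : tokenize(tag))
                tagTokens.insert(token);

        int score = 0;
        for (const std::string& q : queryTokens)
            if (tagTokens.count(q))
                ++score;

        if (score > 0)
            scored.emplace_back(score, imageId);
    }

    std::sort(scored.rbegin(), scored.rend());  // best matches first
    return scored;
}
```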
Hi everyone,

I am Srisharan, and I have been contributing to KDE since last December, primarily in KDE Games. I recently went through KDE's list of GSoC'25 ideas and found this project particularly interesting.

I have experience in C++ and am knowledgeable in machine learning and LLMs. Additionally, I have just completed a research paper on image processing. Given my background, I am eager to contribute to this project for GSoC'25.

Could anyone guide me on how to get started?

(In reply to Srisharan V S from comment #11)
> Hi everyone,
>
> I am Srisharan, and I have been contributing to KDE since last December,
> primarily in KDE Games. I recently went through KDE's list of GSoC'25 ideas
> and found this project particularly interesting.
>
> I have experience in C++ and am knowledgeable in machine learning and LLMs.
> Additionally, I have just completed a research paper on image processing.
> Given my background, I am eager to contribute to this project for GSoC'25.
>
> Could anyone guide me on how to get started?

Hi Srisharan,

I'm the dev who focuses on the AI/ML aspects of digiKam. Send an email to digikam-users@kde.org with your request. I'll start a private email chain with myself and the other digiKam devs.

Cheers,
Mike

Hi Srisharan,

As Michael said, please contact the developers by private mail. The email addresses are listed in the project ideas list on the KDE site.

To contribute on this topic, in short, you must write a proposal for this project, listing:

- the concepts to implement,
- the problems to solve,
- the technical solution to apply,
- a release plan,
- details of your skills,
- and more...

When ready, this proposal must be submitted to the Google Summer of Code web site, to be reviewed by all the KDE mentors, with the goal of being selected for the event.

By private mail, we will respond to all your questions about the project.

Best

Gilles Caulier

(In reply to Michael Miller from comment #12)
> (In reply to Srisharan V S from comment #11)
> > Hi everyone,
> >
> > I am Srisharan, and I have been contributing to KDE since last December,
> > primarily in KDE Games. I recently went through KDE's list of GSoC'25 ideas
> > and found this project particularly interesting.
> >
> > I have experience in C++ and am knowledgeable in machine learning and LLMs.
> > Additionally, I have just completed a research paper on image processing.
> > Given my background, I am eager to contribute to this project for GSoC'25.
> >
> > Could anyone guide me on how to get started?
>
> Hi Srisharan,
> I'm the dev who focuses on the AI/ML aspects of digiKam. Send an email to
> digikam-users@kde.org with your request. I'll start a private email chain
> with myself and the other digiKam devs.
>
> Cheers,
> Mike

Hi Mike,

I have sent an email to digikam-users@kde.org. Apparently, a moderator needs to approve my request, since I am not a member of the group. Hopefully someone can approve it soon.

Cheers,
Srisharan

No. Please don't waste time; send the mail in private, not to the mailing list. CC me, Michael, and Maik, as listed in the project idea: https://community.kde.org/GSoC/2025/Ideas#Project:_Interface_the_database_search_engine_to_an_AI_based_LLM