SUMMARY
I would like to request the integration of the CLIP-ViT-H-14 multimodal model into digiKam to enable advanced semantic search and automated image tagging.

RATIONALE
digiKam currently relies on metadata (EXIF/IPTC) and basic AI tools for face detection and image quality analysis. Adding a CLIP (Contrastive Language-Image Pre-training) backbone would allow users to:

- Search by natural language: find images using descriptive phrases (e.g. "sunset over mountains with a red car") without needing manual tags.
- Improve visual similarity search: locate "more images like this" with much higher accuracy than the current colour-histogram approach.
- Get automated keyword suggestions: use the ViT-H-14 model to generate high-quality semantic keywords for a collection (see the zero-shot tagging sketch at the end of this request).

TECHNICAL SUGGESTIONS
Model: CLIP-ViT-H-14-laion2B-s32B-b79K is widely regarded as the de-facto open-source standard for semantic image/text embeddings.
Implementation: The feature could be integrated into the existing "Maintenance" or "Search" sidebars. Since digiKam already ships OpenCV and deep-learning engines for face recognition, the CLIP model could reuse the same GPU acceleration infrastructure (a minimal embedding-and-search sketch follows at the end of this request).
Performance: Although ViT-H-14 is large, it offers significantly better zero-shot understanding than the smaller ViT-B variants, which makes it well suited to professional photography management.

ADDITIONAL CONTEXT
Other open-source photo managers (such as Immich, PhotoPrism, and Photochat AI) have successfully implemented CLIP-based search. Bringing this capability to digiKam would maintain its position as the premier advanced photo-management suite for the KDE community.
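
To make the natural-language search and visual-similarity points concrete, here is a minimal Python sketch of the embedding-and-ranking step, assuming the open_clip_torch package and the laion2b_s32b_b79k checkpoint for ViT-H-14. It is an illustration of the technique only, not a proposal for how the code should land inside digiKam; the image paths and the standalone script structure are assumptions for the example.

import torch
import open_clip
from PIL import Image

# Load the ViT-H-14 model; the 'laion2b_s32b_b79k' tag refers to the
# CLIP-ViT-H-14-laion2B-s32B-b79K weights mentioned above.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-H-14")
model.eval()

@torch.no_grad()
def embed_images(paths):
    # Return L2-normalised image embeddings (one row per image).
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    feats = model.encode_image(batch)
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def embed_text(query):
    # Return an L2-normalised embedding for a free-text query.
    tokens = tokenizer([query])
    feats = model.encode_text(tokens)
    return feats / feats.norm(dim=-1, keepdim=True)

def rank_by_query(paths, query):
    # Rank images by cosine similarity between image and text embeddings.
    image_feats = embed_images(paths)   # shape (N, 1024) for ViT-H-14
    text_feat = embed_text(query)       # shape (1, 1024)
    scores = (image_feats @ text_feat.T).squeeze(1)
    order = scores.argsort(descending=True)
    return [(paths[i], scores[i].item()) for i in order]

if __name__ == "__main__":
    # Hypothetical file names, purely for illustration.
    results = rank_by_query(
        ["holiday/IMG_0001.jpg", "holiday/IMG_0002.jpg"],
        "sunset over mountains with a red car",
    )
    for path, score in results:
        print(f"{score:.3f}  {path}")

The same image embeddings can serve "more images like this": embed the reference image instead of a text query and rank the rest of the collection by cosine similarity, so one stored embedding per photo covers both use cases.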
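
For the automated keyword suggestion point, here is a rough sketch of CLIP's standard zero-shot classification recipe: each candidate keyword from a vocabulary is wrapped in a prompt, embedded, and scored against the image embedding. The CANDIDATE_KEYWORDS list, the prompt template, and the file path are illustrative assumptions; in practice the vocabulary could come from the user's existing tag tree.

import torch
import open_clip
from PIL import Image

# Purely illustrative candidate vocabulary; a real integration would draw
# from the user's tag hierarchy or a configurable keyword list.
CANDIDATE_KEYWORDS = ["sunset", "mountains", "beach", "portrait", "car", "dog", "architecture"]

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-H-14")
model.eval()

@torch.no_grad()
def suggest_keywords(image_path, keywords=CANDIDATE_KEYWORDS, top_k=3):
    # Score each candidate keyword against the image and return the top matches.
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    prompts = tokenizer([f"a photo of {kw}" for kw in keywords])
    image_feat = model.encode_image(image)
    text_feats = model.encode_text(prompts)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    # Softmax over the candidate set gives relative confidence per keyword.
    probs = (100.0 * image_feat @ text_feats.T).softmax(dim=-1).squeeze(0)
    best = probs.argsort(descending=True)[:top_k]
    return [(keywords[i], probs[i].item()) for i in best]

if __name__ == "__main__":
    # "vacation/IMG_0042.jpg" is a hypothetical path.
    for keyword, prob in suggest_keywords("vacation/IMG_0042.jpg"):
        print(f"{keyword}: {prob:.2%}")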