Created attachment 188757 [details]
URL for the Github Repo

Feature Goal:
Integrate automated, high-quality image captioning and keyword generation using local Vision-Language Models (VLMs), similar to the functionality in the ImageIndexer tool by jabberjabberjabber.

Specific Features to Adopt:

- Local LLM Integration: Support backends such as KoboldCPP, Ollama, or a comparable local model server, so images are processed locally without privacy concerns (see the sketch at the end of this report).
- Automated Captioning: Use a VLM to generate natural-language descriptions of images (e.g., "A golden retriever playing with a blue ball in a sunny park").
- Advanced Tagging: Extract specific keywords from the AI-generated captions to populate the digiKam Tags hierarchy automatically.
- Batch Processing: Allow this "indexing" to run over a selection of images or an entire album as a background task.

Why this is needed:
Current AI tagging in digiKam is mostly limited to basic object detection (e.g., "dog", "car"). Modern VLMs can add context, mood, and detailed descriptions that significantly improve the searchability of large photo collections.
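
To illustrate the kind of backend call involved, below is a minimal Python sketch, assuming a local Ollama instance serving a vision model such as "llava". The endpoint and JSON fields follow Ollama's /api/generate API; the model name, prompt wording, and the caption_image helper are illustrative assumptions, not digiKam code.

    # Sketch: ask a local Ollama vision model for a caption plus keywords.
    import base64
    import json
    import urllib.request

    OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

    def caption_image(path, model="llava"):
        """Return (caption, keywords) for one image via a local VLM."""
        with open(path, "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode("ascii")

        payload = {
            "model": model,
            "prompt": ("Describe this photo in one sentence, then list 5-10 "
                       "comma-separated keywords on a second line."),
            "images": [image_b64],
            "stream": False,
        }
        req = urllib.request.Request(
            OLLAMA_URL,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            text = json.loads(resp.read())["response"]

        # Split the reply into the caption line and the keyword line.
        lines = [l.strip() for l in text.splitlines() if l.strip()]
        caption = lines[0] if lines else ""
        keywords = [k.strip() for k in lines[1].split(",")] if len(lines) > 1 else []
        return caption, keywords

    if __name__ == "__main__":
        cap, tags = caption_image("example.jpg")
        print("Caption:", cap)
        print("Tags:", tags)

In digiKam itself this would presumably be done from C++/Qt in a background job, with the caption written to the image description field and the extracted keywords mapped onto the Tags hierarchy.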