472692 – Tesseract OCR does not take language selection into account

Summary: Tesseract OCR does not take language selection into account

Status:	RESOLVED FIXED

Alias:	None

Product:	digikam
Classification:	Applications
Component:	Plugin-Generic-OcrTextConverter
Version First Reported In:	8.2.0
Platform:	Microsoft Windows

Importance:	NOR normal
Target Milestone:	---
Assignee:	Digikam Developers

URL:
Keywords:

Depends on:
Blocks:

Reported:	2023-07-27 08:40 UTC by claus.peja+kde
Modified:	2023-07-29 17:45 UTC
CC List:	1 user

See Also:
Latest Commit:	https://invent.kde.org/graphics/digikam/-/commit/21ef0f72c6af7be18c6f5ae57159e50a5b4c894f
Version Fixed/Implemented In:	8.2.0
Sentry Crash Report:

Attachments
One of the files I want to extract text from. (627.33 KB, image/jpeg) 2023-07-27 08:40 UTC, claus.peja+kde

Description claus.peja+kde 2023-07-27 08:40:56 UTC

Created attachment 160560
One of the files I want to extract text from.

SUMMARY
I use Tesseract OCR with digiKam 8.2.0 (20.07.2023) on Windows 10 Pro to extract the text from a JPEG. If I select 'Languages: Default', I get a result, but the German umlauts ä, ü, and ö are recognized incorrectly as a, u, and o, and, yes, that's a difference in German ;-) .
But when I select 'Languages: deu', I get no result: no text is found at all. Selecting another language, e.g. eng, gives no result either.
However, when I use Tesseract (v5.3.1.20230401) directly on the command line with switch -l deu, it works.
Tesseract command that works: tesseract /dir/pic1.jpg /text/pic1.ocr-result -l deu
I attach one of the pictures I use. I marked the last sentence and the umlauts in it.
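
For readers who want to reproduce the command-line comparison from a script, here is a minimal sketch of how the dialog settings could map onto Tesseract arguments. The helper `build_tesseract_args` and its parameters are hypothetical and are not digiKam's code; the sketch assumes the standard Tesseract 5 options `-l`, `--psm`, `--oem`, and `--dpi`, and it treats the UI label "Default" as "pass no `-l` at all".

```python
from typing import List, Optional


def build_tesseract_args(
    image_path: str,
    output_base: str,
    language: Optional[str] = None,
    psm: Optional[int] = None,
    oem: Optional[int] = None,
    dpi: Optional[int] = None,
) -> List[str]:
    """Build an argument list for the tesseract CLI (hypothetical helper)."""
    args = ["tesseract", image_path, output_base]
    # Assumption: an empty value or the UI label "Default" means
    # "no explicit language", so Tesseract uses its own default.
    if language and language.lower() != "default":
        args += ["-l", language]
    if psm is not None:
        args += ["--psm", str(psm)]  # page segmentation mode
    if oem is not None:
        args += ["--oem", str(oem)]  # OCR engine mode
    if dpi is not None:
        args += ["--dpi", str(dpi)]
    return args


german = build_tesseract_args(
    "/dir/pic1.jpg", "/text/pic1.ocr-result", language="deu", dpi=72
)
default = build_tesseract_args(
    "/dir/pic1.jpg", "/text/pic1.ocr-result", language="Default"
)
print(german)
print(default)
```

Passing the result as a list to `subprocess.run(german, check=True)` avoids shell quoting problems, which is one place where a language argument can get lost on Windows.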

STEPS TO REPRODUCE
1. Open the image attached in the 'OCR text converter...'
2. Select 'Languages: Default'. What you select for 'Segmentation mode' and 'Engine mode' makes no difference. DPI=72
3. Start OCR
4. Now you get the result without umlauts (ö ü)
5. Close OCR
6. Open the same image again in 'OCR text converter...'
7. Select 'Languages: deu'. What you select for 'Segmentation mode' and 'Engine mode' makes no difference. DPI=72
8. Start OCR
9. Now you get no result

OBSERVED RESULT
With default, the sentence is scanned as: Die Giebel und Traufen konnen durch Wind- bzw. Traufen- oder Tropf-bretter geschutzt werden.

EXPECTED RESULT
The correct sentence is: Die Giebel und Traufen können durch Wind- bzw. Traufen- oder Tropf-bretter geschützt werden.

SOFTWARE/OS VERSIONS
Windows 10, 22H2

ADDITIONAL INFORMATION

Comment 1 Maik Qualmann 2023-07-27 10:39:17 UTC

I tested it here under Linux, the Windows test will follow. If I select German as the language, I get a correct text with German umlauts.

Maik

Comment 2 Maik Qualmann 2023-07-27 18:50:14 UTC

Git commit 5918439aafb5b2f7387490cb2abc9178fe33f374 by Maik Qualmann.
Committed on 27/07/2023 at 20:49.
Pushed by mqualmann into branch 'master'.

fix language parameter for Tesseract OCR on Windows

M  +11   -0    core/dplugins/generic/tools/ocrtextconverter/tesseractbinary.cpp

https://invent.kde.org/graphics/digikam/-/commit/5918439aafb5b2f7387490cb2abc9178fe33f374

Comment 3 Maik Qualmann 2023-07-27 21:08:47 UTC

OK, we're a big step further: the language setting works and we get a text with German umlauts, but encoded in the Windows codepage and not in UTF-8. It looks correct when we view the text file in the Windows text editor, but not in our preview.
The question now is, do we want the codepage or UTF-8 on Windows?

Maik
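
The codepage versus UTF-8 mismatch described above can be illustrated with a short sketch. It assumes the "Windows codepage" is cp1252 (the usual Western European one); the comment does not name the codepage, so that choice is an assumption. The word is taken from the sample sentence in the report.

```python
text = "geschützt"

# The same word encoded two ways.
cp1252_bytes = text.encode("cp1252")  # assumed Windows codepage
utf8_bytes = text.encode("utf-8")

# A UTF-8 preview reading codepage bytes cannot decode the umlaut
# and shows the replacement character instead.
wrong_preview = cp1252_bytes.decode("utf-8", errors="replace")

# The opposite mistake: UTF-8 bytes read as cp1252 give mojibake.
mojibake = utf8_bytes.decode("cp1252")

# Decoding with the matching encoding restores the text.
round_trip = utf8_bytes.decode("utf-8")

print(cp1252_bytes, utf8_bytes)
print(wrong_preview, mojibake, round_trip)
```

Whichever encoding is chosen, the writer and the reader have to agree on it; the umlaut is a single byte in cp1252 and two bytes in UTF-8.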

Comment 4 Maik Qualmann 2023-07-28 06:09:47 UTC

Git commit cc42ef72e33356f66ec132e96cbb684d3c8d28bc by Maik Qualmann.
Committed on 28/07/2023 at 08:08.
Pushed by mqualmann into branch 'master'.

according to Tesseract doc the output encoding should be UTF8

M  +1    -1    core/dplugins/generic/tools/ocrtextconverter/ocrtesseractengine.cpp

https://invent.kde.org/graphics/digikam/-/commit/cc42ef72e33356f66ec132e96cbb684d3c8d28bc

Comment 5 Maik Qualmann 2023-07-28 11:54:27 UTC

OK, encoding is fine on Windows now. We still have to fix the writing of the OCR text to the metadata. At the moment it is only written to the DB, so a later rescan restores the original caption text. In addition, for an image like this sample, which already contains a caption text, we would overwrite that caption with the OCR text. Here we either have to merge or think of something else.

Maik
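
As an illustration of the "merge" option only, here is a minimal sketch of one way an existing caption and OCR text could be combined without losing either. The function `merge_caption` and its separator are hypothetical; the thread does not say which option was finally chosen, and this is not digiKam's implementation.

```python
def merge_caption(existing: str, ocr_text: str, separator: str = "\n\n") -> str:
    """Combine an existing caption with OCR text (hypothetical helper)."""
    existing = existing.strip()
    ocr_text = ocr_text.strip()
    if not existing:
        # Nothing to preserve, so the OCR text becomes the caption.
        return ocr_text
    if not ocr_text or ocr_text in existing:
        # No new text, or it was merged on an earlier run: keep the caption.
        return existing
    # Keep the original caption first and append the OCR text after it.
    return existing + separator + ocr_text


merged = merge_caption("Roof detail", "Die Giebel und Traufen können ...")
print(merged)
```

Checking whether the OCR text is already contained in the caption keeps repeated OCR runs from appending the same text twice.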

Comment 6 Maik Qualmann 2023-07-29 17:45:19 UTC

Git commit 21ef0f72c6af7be18c6f5ae57159e50a5b4c894f by Maik Qualmann.
Committed on 29/07/2023 at 19:44.
Pushed by mqualmann into branch 'master'.

the DBInfoIface must also write metadata via MetadataHub
FIXED-IN: 8.2.0

M  +1    -1    NEWS
M  +25   -4    core/libs/database/utils/ifaces/dbinfoiface.cpp

https://invent.kde.org/graphics/digikam/-/commit/21ef0f72c6af7be18c6f5ae57159e50a5b4c894f