Bug 472692 - Tesseract OCR does not take language selection into account
Summary: Tesseract OCR does not take language selection into account
Status: RESOLVED FIXED
Alias: None
Product: digikam
Classification: Applications
Component: Plugin-Generic-OcrTextConverter
Version: 8.2.0
Platform: Microsoft Windows
Importance: NOR normal
Target Milestone: ---
Assignee: Digikam Developers
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-07-27 08:40 UTC by claus.peja+kde
Modified: 2023-07-29 17:45 UTC

See Also:
Latest Commit:
Version Fixed In: 8.2.0
Sentry Crash Report:


Attachments
One of the files I want to extract text from. (627.33 KB, image/jpeg)
2023-07-27 08:40 UTC, claus.peja+kde

Description claus.peja+kde 2023-07-27 08:40:56 UTC
Created attachment 160560
One of the files I want to extract text from.

SUMMARY
I use Tesseract OCR with digiKam 8.2.0 (20.07.2023) on Windows 10 Pro. I try to get the text from a jpg. If I select 'Languages: Default', I get a result, but the German umlauts ä, ö, and ü are recognized incorrectly as a, o, and u, and, yes, that's a difference in German ;-) .
But when I select 'Languages: deu', I get no result. No text is found at all. Selecting another language, e.g. eng, also gives no result.
However, when I use Tesseract (v5.3.1.20230401) directly on the command line with switch -l deu, it works.
Tesseract command that works: tesseract /dir/pic1.jpg /text/pic1.ocr-result -l deu
I attach one of the pictures I use. I marked the last sentence and the umlauts in it.
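
For reference, a minimal sketch of how the working command line above can be assembled programmatically. The helper name build_tesseract_args is hypothetical, and treating 'Default' as "pass no -l switch" is an assumption about the plugin, not something confirmed in this report:

```python
from typing import List, Optional


def build_tesseract_args(image_path: str,
                         output_base: str,
                         language: Optional[str] = None) -> List[str]:
    """Build an argument list of the form: tesseract IMAGE OUTPUTBASE [-l LANG].

    Assumption: an empty language or 'Default' means the -l switch is
    omitted, so Tesseract falls back to its own default language.
    """
    args = ["tesseract", image_path, output_base]
    if language and language.lower() != "default":
        # Pass the switch and its value as two separate arguments.
        args += ["-l", language]
    return args


# Reproduces the command reported to work on the command line.
print(build_tesseract_args("/dir/pic1.jpg", "/text/pic1.ocr-result", "deu"))
```

Keeping the switch and its value as separate list items avoids quoting problems when the list is handed to a process launcher.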

STEPS TO REPRODUCE
1. Open the image attached in the 'OCR text converter...'
2. Select 'Languages: Default'. What you select for 'Segmentation mode' and 'Engine mode' makes no difference. DPI=72
3. Start OCR
4. Now you get the result without umlauts (ö ü)
5. Close OCR
6. Open the same image again in 'OCR text converter...'
7. Select 'Languages: deu'. What you select for 'Segmentation mode' and 'Engine mode' makes no difference. DPI=72
8. Start OCR
9. Now you get no result

OBSERVED RESULT
With default, the sentence is scanned as: Die Giebel und Traufen konnen durch Wind- bzw. Traufen- oder Tropf-bretter geschutzt werden.

EXPECTED RESULT
The correct sentence is: Die Giebel und Traufen können durch Wind- bzw. Traufen- oder Tropf-bretter geschützt werden.
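
A small sketch showing that the observed sentence is exactly the expected sentence with the diacritics dropped, which is consistent with a non-German language model being used. The helper name fold_diacritics is hypothetical and only serves this comparison:

```python
import unicodedata

OBSERVED = ("Die Giebel und Traufen konnen durch Wind- bzw. "
            "Traufen- oder Tropf-bretter geschutzt werden.")
EXPECTED = ("Die Giebel und Traufen können durch Wind- bzw. "
            "Traufen- oder Tropf-bretter geschützt werden.")


def fold_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The two sentences differ only in the umlauts.
print(fold_diacritics(EXPECTED) == OBSERVED)
```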

SOFTWARE/OS VERSIONS
Windows 10, 22H2

ADDITIONAL INFORMATION
Comment 1 Maik Qualmann 2023-07-27 10:39:17 UTC
I tested it here under Linux, the Windows test will follow. If I select German as the language, I get a correct text with German umlauts.

Maik
Comment 2 Maik Qualmann 2023-07-27 18:50:14 UTC
Git commit 5918439aafb5b2f7387490cb2abc9178fe33f374 by Maik Qualmann.
Committed on 27/07/2023 at 20:49.
Pushed by mqualmann into branch 'master'.

fix language parameter for Tesseract OCR on Windows

M  +11   -0    core/dplugins/generic/tools/ocrtextconverter/tesseractbinary.cpp

https://invent.kde.org/graphics/digikam/-/commit/5918439aafb5b2f7387490cb2abc9178fe33f374
Comment 3 Maik Qualmann 2023-07-27 21:08:47 UTC
Ok, we're a big step further, the language setting works, we get a text with German umlauts, but in the Windows codepage format and not UTF8. This is correct when we view the text file in the Windows text editor, but not in our preview.
The question now is, do we want codepage or UTF8 on Windows?

Maik
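
To illustrate the codepage versus UTF8 question, here is a minimal sketch of what happens when the same word is written in one encoding and read back in the other. It assumes the Windows codepage in question is cp1252 (typical for Western European systems); the report does not name the codepage:

```python
from typing import Optional

text = "können"

# Assumption: the "Windows codepage" is cp1252.
cp1252_bytes = text.encode("cp1252")   # b'k\xf6nnen'
utf8_bytes = text.encode("utf-8")      # b'k\xc3\xb6nnen'


def decode_utf8_or_none(data: bytes) -> Optional[str]:
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Codepage bytes read as UTF-8: the lone 0xF6 byte is invalid.
print(decode_utf8_or_none(cp1252_bytes))

# UTF-8 bytes read as cp1252: each umlaut turns into two wrong characters.
print(utf8_bytes.decode("cp1252"))
```

Either mismatch produces wrong text in a viewer, which is why writer and reader have to agree on one encoding.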
Comment 4 Maik Qualmann 2023-07-28 06:09:47 UTC
Git commit cc42ef72e33356f66ec132e96cbb684d3c8d28bc by Maik Qualmann.
Committed on 28/07/2023 at 08:08.
Pushed by mqualmann into branch 'master'.

according to Tesseract doc the output encoding should be UTF8

M  +1    -1    core/dplugins/generic/tools/ocrtextconverter/ocrtesseractengine.cpp

https://invent.kde.org/graphics/digikam/-/commit/cc42ef72e33356f66ec132e96cbb684d3c8d28bc
Comment 5 Maik Qualmann 2023-07-28 11:54:27 UTC
Ok, encoding is fine on Windows now. We still have to fix the writing of the OCR text to the metadata. At the moment it is only written to the DB, so a rescan ends up restoring the original caption text. In addition, for an image like this sample, which already contains a caption text, we would overwrite that caption with the OCR text. Here we either have to merge or think of something else.

Maik
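
One possible shape of the "merge" option, purely as an illustration. The function merge_caption and its separator are hypothetical and do not reflect how digiKam handles this:

```python
def merge_caption(existing: str, ocr_text: str, separator: str = "\n\n") -> str:
    """Append OCR text to an existing caption instead of overwriting it.

    Hypothetical policy: keep the existing caption first, and skip the
    OCR text if it is empty or already contained in the caption.
    """
    existing = existing.strip()
    ocr_text = ocr_text.strip()
    if not existing:
        return ocr_text
    if not ocr_text or ocr_text in existing:
        return existing
    return existing + separator + ocr_text


print(merge_caption("Original caption", "Text found by OCR"))
```

The containment check keeps a repeated OCR run from appending the same text twice.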
Comment 6 Maik Qualmann 2023-07-29 17:45:19 UTC
Git commit 21ef0f72c6af7be18c6f5ae57159e50a5b4c894f by Maik Qualmann.
Committed on 29/07/2023 at 19:44.
Pushed by mqualmann into branch 'master'.

the DBInfoIface must also write metadata via MetadataHub
FIXED-IN: 8.2.0

M  +1    -1    NEWS
M  +25   -4    core/libs/database/utils/ifaces/dbinfoiface.cpp

https://invent.kde.org/graphics/digikam/-/commit/21ef0f72c6af7be18c6f5ae57159e50a5b4c894f