Created attachment 160560
One of the files I want to extract text from.

SUMMARY
I use Tesseract OCR with digiKam 8.2.0 (20.07.2023) on Windows 10 Pro to extract the text from a JPG. If I select 'Languages: Default', I get a result, but the German umlauts ä, ö, and ü are recognized incorrectly as a, o, and u, and, yes, that's a difference in German ;-). When I select 'Languages: deu', I get no result; no text is found at all. Selecting another language, e.g. eng, also gives no result.

However, when I run Tesseract (v5.3.1.20230401) directly on the command line with the switch -l deu, it works. The Tesseract command that works:

tesseract /dir/pic1.jpg /text/pic1.ocr-result -l deu

I attach one of the pictures I use. I marked the last sentence and the umlauts in it.

STEPS TO REPRODUCE
1. Open the attached image in the 'OCR text converter...'
2. Select 'Languages: Default'. What you select for 'Segmentation mode' and 'Engine mode' makes no difference. DPI=72
3. Start OCR
4. Now you get the result without umlauts (ö ü)
5. Close OCR
6. Open the same image again in 'OCR text converter...'
7. Select 'Languages: deu'. What you select for 'Segmentation mode' and 'Engine mode' makes no difference. DPI=72
8. Start OCR
9. Now you get no result

OBSERVED RESULT
With 'Default', the sentence is scanned as:
Die Giebel und Traufen konnen durch Wind- bzw. Traufen- oder Tropf-bretter geschutzt werden.

EXPECTED RESULT
The correct sentence is:
Die Giebel und Traufen können durch Wind- bzw. Traufen- oder Tropf-bretter geschützt werden.

SOFTWARE/OS VERSIONS
Windows 10, 22H2
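For anyone who wants to reproduce the command-line behaviour from a script, here is a minimal sketch of how the working call above could be assembled. The helper names `build_tesseract_command` and `run_ocr` are made up for this illustration and are not part of digiKam or Tesseract. The sketch only assumes the `tesseract <image> <outputbase> -l <lang>` form shown in the report, and that the `-l` switch is simply left out when no language is chosen.

```python
import subprocess


def build_tesseract_command(image, output_base, lang=None):
    """Return the argument list for a Tesseract call (hypothetical helper).

    Mirrors the command from the report:
        tesseract /dir/pic1.jpg /text/pic1.ocr-result -l deu
    """
    cmd = ["tesseract", str(image), str(output_base)]
    if lang:
        # Each argument is its own list element, so no shell quoting is
        # involved and the language code reaches Tesseract unchanged.
        cmd += ["-l", lang]
    return cmd


def run_ocr(image, output_base, lang=None):
    """Run Tesseract; requires the tesseract binary to be on the PATH."""
    return subprocess.run(
        build_tesseract_command(image, output_base, lang),
        capture_output=True,
        check=False,
    )


cmd = build_tesseract_command("/dir/pic1.jpg", "/text/pic1.ocr-result", lang="deu")
print(" ".join(cmd))
```

Passing the arguments as a list rather than as one command string is a deliberate choice in this sketch: it keeps the language parameter from being mangled by platform-specific quoting.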
I tested it here under Linux; the Windows test will follow. If I select German as the language, I get the correct text with German umlauts.

Maik
Git commit 5918439aafb5b2f7387490cb2abc9178fe33f374 by Maik Qualmann.
Committed on 27/07/2023 at 20:49.
Pushed by mqualmann into branch 'master'.

fix language parameter for Tesseract OCR on Windows

M +11 -0 core/dplugins/generic/tools/ocrtextconverter/tesseractbinary.cpp

https://invent.kde.org/graphics/digikam/-/commit/5918439aafb5b2f7387490cb2abc9178fe33f374
Ok, we're a big step further: the language setting works, and we get a text with German umlauts, but encoded in the Windows codepage and not in UTF-8. The text looks correct when we view the file in the Windows text editor, but not in our preview. The question now is, do we want the codepage or UTF-8 on Windows?

Maik
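The symptom described here can be illustrated with a short sketch. It assumes, purely for illustration, that the Windows codepage in question is cp1252 and that the preview decodes the bytes as UTF-8; neither detail is confirmed in this thread, and `decode_preview` is a made-up helper, not digiKam code.

```python
# Two words from the sample sentence that contain umlauts.
text = "können geschützt"

# The same text encoded two ways. cp1252 is an assumed stand-in for
# "the Windows codepage" mentioned in the comment above.
cp_bytes = text.encode("cp1252")
utf8_bytes = text.encode("utf-8")


def decode_preview(raw):
    """Decode bytes the way a UTF-8-only preview would (hypothetical)."""
    # Bytes that are not valid UTF-8 become U+FFFD replacement characters.
    return raw.decode("utf-8", errors="replace")


garbled = decode_preview(cp_bytes)   # umlauts are lost
clean = decode_preview(utf8_bytes)   # umlauts survive

# ascii() keeps the output printable on any console encoding.
print(ascii(garbled))
print(ascii(clean))
```

In cp1252 each umlaut is a single byte that is not a valid UTF-8 sequence, so a UTF-8 decoder cannot recover it, while an editor that assumes the codepage shows the same file correctly.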
Git commit cc42ef72e33356f66ec132e96cbb684d3c8d28bc by Maik Qualmann.
Committed on 28/07/2023 at 08:08.
Pushed by mqualmann into branch 'master'.

according to Tesseract doc the output encoding should be UTF8

M +1 -1 core/dplugins/generic/tools/ocrtextconverter/ocrtesseractengine.cpp

https://invent.kde.org/graphics/digikam/-/commit/cc42ef72e33356f66ec132e96cbb684d3c8d28bc
Ok, the encoding is fine on Windows now. We still have to fix the writing of the OCR text to the metadata. At the moment it is only written to the DB, so a rescan ends up restoring the original caption text. And for an image like this sample, which already contains a caption text, we would overwrite that caption with the OCR text. Here we either have to merge or think of something else.

Maik
Git commit 21ef0f72c6af7be18c6f5ae57159e50a5b4c894f by Maik Qualmann.
Committed on 29/07/2023 at 19:44.
Pushed by mqualmann into branch 'master'.

the DBInfoIface must also write metadata via MetadataHub
FIXED-IN: 8.2.0

M +1 -1 NEWS
M +25 -4 core/libs/database/utils/ifaces/dbinfoiface.cpp

https://invent.kde.org/graphics/digikam/-/commit/21ef0f72c6af7be18c6f5ae57159e50a5b4c894f