488526 – Exporting PDF with OCR text recognition based on tesseract doesn't work anymore

Status:	RESOLVED FIXED

Alias:	None

Product:	Skanpage
Classification:	Applications
Component:	general
Version:	24.05.0
Platform:	Neon Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	Alexander Stippich

Reported:	2024-06-15 11:36 UTC by Nicola Jelmorini
Modified:	2024-11-24 15:07 UTC
CC List:	0 users

Attachments
The window of the functionality "Export PDF" (30.80 KB, image/png) 2024-06-26 12:32 UTC, Nicola Jelmorini

Description Nicola Jelmorini 2024-06-15 11:36:07 UTC

SUMMARY
The "Export PDF" function lets me create a PDF with text recognition in different languages. Clicking the "Export PDF" button opens a window that should list the languages available to the tesseract module for text recognition. However, since the latest update of skanpage (deb package from the repository), the list of languages is no longer shown and OCR no longer works.
With the previous version of skanpage, OCR worked perfectly. Maybe the "tesseract" dependency is missing from the latest deb package?

STEPS TO REPRODUCE
1. Scan a page with Skanpage
2. Click the "Export PDF" button
3. The list of languages for the OCR is missing

OBSERVED RESULT
The list of languages for the OCR functionality is missing, so the PDF output has no text recognized.

EXPECTED RESULT
It should be possible to select the languages that I want to recognize. The PDF output should contain the recognized text, and I should be able to copy it.

SOFTWARE/OS VERSIONS
Linux/KDE Plasma: KDE neon 6.0 (based on Ubuntu 22.04)
KDE Plasma Version:  6.0.5
KDE Frameworks Version:  6.2.0
Qt Version:  6.7.0

ADDITIONAL INFORMATION

Comment 1 Alexander Stippich 2024-06-23 18:01:32 UTC

Have you checked that you still have the corresponding tesseract language files installed?

Comment 2 Nicola Jelmorini 2024-06-24 08:21:03 UTC

(In reply to Alexander Stippich from comment #1)
> Have you checked that you still have the corresponding tesseract language
> files installed?

Hi,
I have made no changes to my system. The following tesseract packages have been installed on my system from the beginning:

=====================================================================================================
nicola@nicola-XPS-13-9360:~
➤ apt list --installed | grep tesseract

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libtesseract4/jammy,now 4.1.1-2.1build1 amd64 [installato, automatico]
tesseract-ocr-eng/jammy,jammy,now 1:4.00~git30-7274cfa-1.1 all [installato, automatico]
tesseract-ocr-ita/jammy,jammy,now 1:4.00~git30-7274cfa-1.1 all [installato]
tesseract-ocr-osd/jammy,jammy,now 1:4.00~git30-7274cfa-1.1 all [installato, automatico]
tesseract-ocr/jammy,now 4.1.1-2.1build1 amd64 [installato]
nicola@nicola-XPS-13-9360:~
=====================================================================================================

Comment 3 Alexander Stippich 2024-06-24 16:45:21 UTC

The tesseract dependency was bumped to 5 for 24.05. Is tesseract5 available in Ubuntu 22.04?
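
A minimal sketch, in Python, of the check asked about here: it scans the text printed by `apt list --installed` for `libtesseractN` packages and reports whether major version 5 is among them. The helper names (`libtesseract_majors`, `has_required_tesseract`) and the `REQUIRED_MAJOR` constant are illustrative only and are not part of Skanpage or apt; the required version is taken from the statement above that the dependency was bumped to 5 for 24.05.

```python
import re

# Assumption: Skanpage 24.05 needs Tesseract major version 5 (per the comment above).
REQUIRED_MAJOR = 5


def libtesseract_majors(apt_output: str) -> set[int]:
    """Return the major versions of libtesseract packages listed in
    the text output of `apt list --installed`."""
    majors = set()
    for line in apt_output.splitlines():
        # Package lines look like:
        #   libtesseract4/jammy,now 4.1.1-2.1build1 amd64 [...]
        match = re.match(r"libtesseract(\d+)/", line)
        if match:
            majors.add(int(match.group(1)))
    return majors


def has_required_tesseract(apt_output: str, required: int = REQUIRED_MAJOR) -> bool:
    """True if a libtesseract package with the required major version is listed."""
    return required in libtesseract_majors(apt_output)


# Sample lines in the format that apt prints on Ubuntu 22.04 (jammy).
sample = """\
libtesseract4/jammy,now 4.1.1-2.1build1 amd64 [installato, automatico]
tesseract-ocr-eng/jammy,jammy,now 1:4.00~git30-7274cfa-1.1 all [installato, automatico]
tesseract-ocr/jammy,now 4.1.1-2.1build1 amd64 [installato]
"""

print(has_required_tesseract(sample))  # False
```

This only reports which library packages are installed; it says nothing about which tesseract version the Skanpage package itself was built against.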

Comment 4 Nicola Jelmorini 2024-06-25 13:38:43 UTC

(In reply to Alexander Stippich from comment #3)
> The tesseract dependency was bumped to 5 for 24.05. Is tesseract5 available
> in Ubuntu 22.04?

Unfortunately no. In Ubuntu 22.04 there is the package "libtesseract4".

On the "Ubuntu packages" website I see that the package "libtesseract5" is included starting with Ubuntu 23.10 (mantic).
I'm using KDE Neon, which only upgrades between LTS releases, so I suppose tesseract5 will become available when KDE Neon is upgraded to Ubuntu 24.04 (noble).
Or do you know of other viable options?

Comment 5 Alexander Stippich 2024-06-25 15:30:46 UTC

There is a ppa that should work, but I have not tested it:
https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr5?field.series_filter=jammy

Comment 6 Nicola Jelmorini 2024-06-26 12:32:27 UTC

Created attachment 171010
The window of the functionality "Export PDF"

The screenshot shows that there is no language selection in the window for the OCR text recognition.

Comment 7 Nicola Jelmorini 2024-06-26 12:39:34 UTC

(In reply to Alexander Stippich from comment #5)
> There is a ppa that should work, but I have not tested it:
> https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr5?field.series_filter=jammy


The PPA does indeed install version 5 of tesseract and the languages I need, as you can see here:

=====================================================================================================
nicola@nicola-XPS-13-9360:~
➤ apt list --installed | grep tesseract
libtesseract5/jammy,now 5.4.1-1ppa1~jammy1 amd64 [installato, automatico]
tesseract-ocr-eng/jammy,jammy,now 1:5.0.0~git39-6572757-2ppa1~jammy1 all [installato, automatico]
tesseract-ocr-ita/jammy,jammy,now 1:5.0.0~git39-6572757-2ppa1~jammy1 all [installato]
tesseract-ocr-osd/jammy,jammy,now 1:5.0.0~git39-6572757-2ppa1~jammy1 all [installato, automatico]
tesseract-ocr/jammy,now 5.4.1-1ppa1~jammy1 amd64 [installato]
nicola@nicola-XPS-13-9360:~
➤ tesseract --list-langs
List of available languages in "/usr/share/tesseract-ocr/5/tessdata/" (3):
eng
ita
osd
nicola@nicola-XPS-13-9360:~
=====================================================================================================

Unfortunately the issue is still present: no language selection is available for the OCR text recognition.
The screenshot I uploaded, "The window of the functionality Export PDF", shows that the "Export PDF" window is missing the language selection list.

Comment 8 Alexander Stippich 2024-07-22 08:09:25 UTC

I'm afraid you will have to wait for KDE Neon to be rebased on 24.04.

Comment 9 Nicola Jelmorini 2024-07-22 10:54:42 UTC

(In reply to Alexander Stippich from comment #8)
> I'm afraid you will have to wait for KDE Neon to be rebased on 24.04.

OK, I understand and I can live with it.
The wait shouldn't be too long.

Thank you anyway for your support.

Comment 10 Alexander Stippich 2024-11-24 12:31:37 UTC

Is this still an issue with KDE neon based on 24.04?

Comment 11 Nicola Jelmorini 2024-11-24 14:40:56 UTC

(In reply to Alexander Stippich from comment #10)
> Is this still an issue with KDE neon based on 24.04?

I'm sorry, I completely forgot to give you feedback after my upgrade to 24.04.
Anyway, I have good news: the issue is gone after the upgrade 👍.
As far as I'm concerned, this bug report can be closed now.
Thank you.

Comment 12 Alexander Stippich 2024-11-24 15:07:49 UTC

Thanks for the feedback!