Bug 192302

Summary: Use Ocropus & Tesseract for OCR (goal: 'paperless office')
Product: [Applications] Skanlite
Reporter: Sputnik <sputnikshock>
Component: general
Assignee: Kåre Särs <kare.sars>
Status: RESOLVED NOT A BUG
Severity: wishlist
CC: aspotashev, brainstorm, trossi.dev
Priority: NOR    
Version: unspecified   
Target Milestone: ---   
Platform: Ubuntu   
OS: Linux   
Latest Commit:
Version Fixed In:
Sentry Crash Report:

Description Sputnik 2009-05-11 10:21:13 UTC
Version:           0.3-0ubuntu1 (using KDE 4.2.3)
OS:                Linux
Installed from:    Ubuntu Packages

Make Ocropus and Tesseract part of KDE for use in a "paperless office".

KDE is a well-known and widespread operating environment. What Linux/KDE has lacked so far is a good OCR engine.

With Ocropus, this seems to be improving.

Wish: Make Ocropus, together with Tesseract, part of KDE. Integrate it into Skanlite - or build another application that uses its power.

http://code.google.com/p/ocropus/

OCRopus(tm) is a state-of-the-art document analysis and Optical
 Character Recognition (OCR) system, featuring
 pluggable layout analysis, pluggable character recognition, statistical
 natural language modeling, and multi-lingual capabilities. ( License: Apache License 2.0 )

Ocropus is now in Debian and will be in Ubuntu too, starting with Karmic.

Ocropus makes use of tesseract: http://code.google.com/p/tesseract-ocr/

The Tesseract OCR engine was originally developed at HP between 1985 and 1995.
 It was open-sourced by HP and UNLV in 2005 and Google has led further
 development. ( License: Apache License 2.0 )


Personal statement:
Ocropus produces by far the best OCR results that I have ever seen on Linux! - This is really worth using for a KDE office!

The next step would be a GUI that supports spellchecking and correction by the user. - But the first step would be just to make use of the command-line power in KDE.
Comment 1 Kåre Särs 2009-08-03 21:49:30 UTC
I totally agree that KDE needs OCR, but Skanlite is not the right application for that. I would be more than happy to help somebody who wants to write an OCR application that uses libksane. There was somebody (I don't remember now who it was) working on an OCR app that would use Akonadi to save the documents. There was a short discussion on kde-imaging or kde-devel...
Comment 2 2wxsy58236r3 2021-01-16 09:55:23 UTC
*** Bug 426829 has been marked as a duplicate of this bug. ***
Comment 3 Andrea Ippolito 2021-07-29 13:09:25 UTC
Too bad to see that nothing has changed since 2009 :(

I guess I'll have to keep using my crappy self-made script that combines scanimage + imagemagick's img2pdf + ocrmypdf
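
For readers unfamiliar with that kind of workaround, here is a minimal sketch of how such a scan-to-searchable-PDF pipeline might be chained together. It is not the commenter's actual script: the function names, file names, language code and resolution are all hypothetical, and it assumes a scanimage version recent enough to support --output-file and a scanner backend that exposes --resolution.

```python
import subprocess
from pathlib import Path


def build_pipeline(workdir, language="eng", resolution=300):
    """Return the three command lines of a scan -> PDF -> OCR pipeline.

    All file names below are illustrative assumptions, not taken from
    the bug report.
    """
    work = Path(workdir)
    scan = work / "page.tiff"        # raw scan from the scanner
    raw_pdf = work / "scan.pdf"      # image-only PDF
    ocr_pdf = work / "scan-ocr.pdf"  # PDF with a searchable text layer
    return [
        # 1. Acquire one page from the default scanner (SANE).
        #    --resolution is backend-specific and may not exist on every device.
        ["scanimage", "--format=tiff", f"--resolution={resolution}",
         f"--output-file={scan}"],
        # 2. Wrap the image in a PDF without re-encoding it.
        ["img2pdf", str(scan), "-o", str(raw_pdf)],
        # 3. Add an OCR text layer (ocrmypdf drives Tesseract internally).
        ["ocrmypdf", "-l", language, str(raw_pdf), str(ocr_pdf)],
    ]


def run_pipeline(workdir, language="eng", resolution=300):
    """Run each step in order, stopping at the first failure."""
    for cmd in build_pipeline(workdir, language, resolution):
        subprocess.run(cmd, check=True)
```

Calling run_pipeline("/tmp/scans") would run the three tools in sequence; build_pipeline alone can be used to inspect the commands without touching a scanner.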
Comment 4 Kåre Särs 2021-07-29 13:19:40 UTC
You could always try https://invent.kde.org/utilities/skanpage

There has been quite some development there the last half year.

And why not scratch your itch and help with the OCR part ;)
Comment 5 Andrea Ippolito 2021-07-29 13:28:19 UTC
(In reply to Kåre Särs from comment #4)
> You could always try https://invent.kde.org/utilities/skanpage
> 
> There has been quite some development there the last half year.
> 
> And why not scratch your itch and help with the OCR part ;)

Thanks, that's excellent news, I'm definitely gonna keep an eye on it :)
Comment 6 Andrea Ippolito 2024-03-17 19:13:08 UTC
Now that Skanpage features OCR capabilities, is this (~15 years old!) bug report still relevant?
Comment 7 Kåre Särs 2024-03-18 19:24:37 UTC
Yep, I think we can point to Skanpage for the OCR parts