192302 – Use Ocropus & Tesseract for OCR (goal: 'paperless office')

Summary: Use Ocropus & Tesseract for OCR (goal: 'paperless office')

Status:	RESOLVED NOT A BUG

Alias:	None

Product:	Skanlite
Classification:	Applications
Component:	general (other bugs)
Version First Reported In:	unspecified
Platform:	Ubuntu Linux

Importance:	NOR wishlist
Target Milestone:	---
Assignee:	Kåre Särs

Duplicates (1):	426829

Reported:	2009-05-11 10:21 UTC by Sputnik
Modified:	2024-03-18 19:24 UTC
CC List:	3 users

Description Sputnik 2009-05-11 10:21:13 UTC

Version:           0.3-0ubuntu1 (using KDE 4.2.3)
OS:                Linux
Installed from:    Ubuntu Packages

Make Ocropus and Tesseract part of KDE for use in a "paperless office".

KDE is a well-known and widespread operating environment. What Linux/KDE has been missing so far is a good OCR engine.

With Ocropus, this situation seems to be improving.

Wish: Make Ocropus, together with Tesseract, part of KDE. Integrate it into Skanlite, or build another application that uses its power.

http://code.google.com/p/ocropus/

OCRopus(tm) is a state-of-the-art document analysis and Optical
 Character Recognition (OCR) system, featuring
 pluggable layout analysis, pluggable character recognition, statistical
 natural language modeling, and multi-lingual capabilities. ( License: Apache License 2.0 )

Ocropus is now in Debian and will be in Ubuntu too, starting with Karmic.

Ocropus makes use of tesseract: http://code.google.com/p/tesseract-ocr/

The Tesseract OCR engine was originally developed at HP between 1985 and 1995.
 It was open-sourced by HP and UNLV in 2005 and Google has led further
 development. ( License: Apache License 2.0 )


Personal statement:
Ocropus produces by far the best OCR results that I have ever seen on Linux! This is really worth using for a KDE office!

The next step would be a GUI that supports spellchecking and correction by the user. But the first step would be just to make use of the command-line power in KDE.

Comment 1 Kåre Särs 2009-08-03 21:49:30 UTC

I totally agree that KDE needs OCR, but Skanlite is not the right application for that. I would be more than happy to help somebody who wants to write an OCR application that uses libksane. There was somebody (I don't remember now who it was) working on an OCR app that would use Akonadi to save the documents. There was a short discussion on kde-imaging or kde-devel...

Comment 2 2wxsy58236r3 2021-01-16 09:55:23 UTC

*** Bug 426829 has been marked as a duplicate of this bug. ***

Comment 3 Andrea Ippolito 2021-07-29 13:09:25 UTC

Too bad to see that nothing has changed since 2009 :(

I guess I'll have to keep using my crappy self-made script that combines scanimage + imagemagick's img2pdf + ocrmypdf
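
For readers wondering what such a script might look like, here is a minimal Python sketch of a scan -> PDF -> OCR pipeline built from the tools named above. It is not the commenter's actual script: the function names, the file names (page.tiff, scan.pdf, scan_ocr.pdf), the TIFF format, and the default resolution and language are all assumptions made for illustration, and `--resolution` is a backend-specific scanimage option that may not exist for every scanner.

```python
import subprocess
from pathlib import Path


def build_pipeline(workdir: Path, lang: str = "eng", dpi: int = 300) -> list:
    """Build the three commands of a hypothetical scan -> PDF -> OCR pipeline.

    Returns a list of argument lists; nothing is executed here.
    """
    page = workdir / "page.tiff"        # hypothetical intermediate file names
    raw_pdf = workdir / "scan.pdf"
    ocr_pdf = workdir / "scan_ocr.pdf"
    return [
        # 1. Scan one page; scanimage writes the image data to stdout.
        ["scanimage", "--format=tiff", "--resolution", str(dpi)],
        # 2. Wrap the scanned image in a PDF without re-encoding it.
        ["img2pdf", str(page), "-o", str(raw_pdf)],
        # 3. Add a searchable text layer (ocrmypdf uses Tesseract internally).
        ["ocrmypdf", "-l", lang, str(raw_pdf), str(ocr_pdf)],
    ]


def run_pipeline(workdir: Path, lang: str = "eng", dpi: int = 300) -> Path:
    """Run the pipeline; requires a scanner and the three tools to be installed."""
    workdir.mkdir(parents=True, exist_ok=True)
    scan, to_pdf, ocr = build_pipeline(workdir, lang, dpi)
    # Redirect scanimage's stdout into the intermediate image file.
    with open(workdir / "page.tiff", "wb") as image_file:
        subprocess.run(scan, stdout=image_file, check=True)
    subprocess.run(to_pdf, check=True)
    subprocess.run(ocr, check=True)
    return workdir / "scan_ocr.pdf"
```

Calling `run_pipeline(Path("scans"))` would, under these assumptions, leave a searchable PDF at scans/scan_ocr.pdf. Keeping the command construction separate from execution makes it possible to inspect or adjust the commands without a scanner attached.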

Comment 4 Kåre Särs 2021-07-29 13:19:40 UTC

You could always try https://invent.kde.org/utilities/skanpage

There has been quite some development there the last half year.

And why not scratch your itch and help with the OCR part ;)

Comment 5 Andrea Ippolito 2021-07-29 13:28:19 UTC

(In reply to Kåre Särs from comment #4)
> You could always try https://invent.kde.org/utilities/skanpage
> 
> There has been quite some development there the last half year.
> 
> And why not scratch your itch and help with the OCR part ;)

Thanks, that's excellent news, I'm definitely gonna keep an eye on it :)

Comment 6 Andrea Ippolito 2024-03-17 19:13:08 UTC

Now that Skanpage features OCR capabilities, is this (~15 years old!) bug report still relevant?

Comment 7 Kåre Särs 2024-03-18 19:24:37 UTC

Yep, I think we can point to Skanpage for the OCR parts.