228469 – Okular: Incomplete export of PDF czech content to plain text

Status:	RESOLVED NOT A BUG

Alias:	None

Product:	okular
Classification:	Applications
Component:	general
Version:	unspecified
Platform:	Ubuntu Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	Okular developers

Reported:	2010-02-25 15:54 UTC by sobik2
Modified:	2010-02-26 09:50 UTC
CC List:	0 users

Attachments
source latex file (300 bytes, text/x-tex) 2010-02-26 00:26 UTC, sobik2
created pdf document with 'pdflatex document.tex' (10.31 KB, application/pdf) 2010-02-26 00:27 UTC, sobik2
plain text exported from pdf (270 bytes, text/plain) 2010-02-26 00:27 UTC, sobik2

Description sobik2 2010-02-25 15:54:52 UTC

Version:           Okular 0.10  (using KDE 4.4.0)
OS:                Linux
Installed from:    Ubuntu Packages

Okular: Incomplete export of PDF Czech content to plain text

How to reproduce (always reproducible on my system)
=====================================================

1) Create a LaTeX document with this content:

  \documentclass[a4paper,12pt,titlepage]{article} 
  \usepackage[utf8x]{inputenc}
  \usepackage[czech]{babel}
  \usepackage{fontenc}
  \begin{document}
  all special Czech characters:\\
  Příliš žluťoučký kůň úpěl ďábelské ódy.\\
  PŘÍLIŠ ŽLUŤOUČKÝ KŮŇ ÚPĚL ĎÁBELSKÉ ÓDY.
  \end{document}

2) This document includes all the special Czech characters with diacritical marks.
3) Compile it with 'pdflatex <document_name>' or 'pdfcslatex <document_name>'.
4) Open the resulting PDF in Okular and choose File -> Export As -> Plain Text.
5) Open the exported plain text file: its content does not match the PDF.


Some other info
===============
sobi@sobi-laptop:~/Dokumenty$ pdflatex --version
pdfTeX using libpoppler 3.141592-1.40.3-2.2 (Web2C 7.5.6)
kpathsea version 3.5.6                                   
Copyright 2007 Peter Breitenlohner (eTeX)/Han The Thanh (pdfTeX).
Kpathsea is copyright 2007 Karl Berry and Olaf Weber.
There is NO warranty.  Redistribution of this software is
covered by the terms of both the pdfTeX using libpoppler copyright and
the Lesser GNU General Public License.
For more information about these matters, see the file
named COPYING and the pdfTeX using libpoppler source.
Primary author of pdfTeX using libpoppler: Peter Breitenlohner (eTeX)/Han The Thanh (pdfTeX).
Kpathsea written by Karl Berry, Olaf Weber, and others.

Compiled with libpng 1.2.37; using libpng 1.2.37
Compiled with zlib 1.2.3.3; using zlib 1.2.3.3
Compiled with libpoppler version 0.12.0


sobi@sobi-laptop:~/Dokumenty$ locale
LANG=cs_CZ.UTF-8
LANGUAGE=
LC_CTYPE="cs_CZ.UTF-8"
LC_NUMERIC="cs_CZ.UTF-8"
LC_TIME="cs_CZ.UTF-8"
LC_COLLATE="cs_CZ.UTF-8"
LC_MONETARY="cs_CZ.UTF-8"
LC_MESSAGES="cs_CZ.UTF-8"
LC_PAPER="cs_CZ.UTF-8"
LC_NAME="cs_CZ.UTF-8"
LC_ADDRESS="cs_CZ.UTF-8"
LC_TELEPHONE="cs_CZ.UTF-8"
LC_MEASUREMENT="cs_CZ.UTF-8"
LC_IDENTIFICATION="cs_CZ.UTF-8"
LC_ALL=


sobi@sobi-laptop:~/Dokumenty$ uname -a
Linux sobi-laptop 2.6.31-20-generic-pae #57-Ubuntu SMP Mon Feb 8 10:23:59 UTC 2010 i686 GNU/Linux

Comment 1 sobik2 2010-02-25 15:56:28 UTC

I installed LaTeX with:
'sudo apt-get install texlive texlive-lang-czechslovak'

I am using Kubuntu 9.10 with KDE 4.4.0.

Comment 2 Albert Astals Cid 2010-02-25 21:39:16 UTC

Please attach the PDF file.

Comment 3 sobik2 2010-02-26 00:26:13 UTC

Created attachment 41118
source latex file

Comment 4 sobik2 2010-02-26 00:27:01 UTC

Created attachment 41119
created pdf document with 'pdflatex document.tex'

Comment 5 sobik2 2010-02-26 00:27:44 UTC

Created attachment 41120
plain text exported from pdf

Comment 6 Albert Astals Cid 2010-02-26 00:55:47 UTC

The PDF is not formed correctly enough for text extraction to be possible. Try extracting the text with Adobe Reader and you'll see that it fails there too. You might want to contact the pdflatex people about it.
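
A likely cause, not confirmed anywhere in this report, is the font encoding. The reproducer loads 'fontenc' without an option, so pdflatex presumably stays in the default OT1 encoding and builds each accented letter from a base glyph plus a separate accent glyph, which a text extractor cannot map back to a single character. Below is a minimal sketch of a preamble that usually gives extractable text. It assumes the 'cmap' and 'lmodern' packages are installed; neither appears in the original report.

```latex
\documentclass[a4paper,12pt,titlepage]{article}
\usepackage{cmap}            % assumption: adds ToUnicode maps; load before fontenc
\usepackage[utf8x]{inputenc}
\usepackage[T1]{fontenc}     % T1 provides precomposed glyphs for the Czech letters
\usepackage{lmodern}         % assumption: scalable T1 fonts instead of bitmap fonts
\usepackage[czech]{babel}
\begin{document}
all special Czech characters:\\
Příliš žluťoučký kůň úpěl ďábelské ódy.\\
PŘÍLIŠ ŽLUŤOUČKÝ KŮŇ ÚPĚL ĎÁBELSKÉ ÓDY.
\end{document}
```

If this guess is right, compiling the file with 'pdflatex' and repeating the export in Okular should give the two pangram lines with their diacritics intact. The only reliable check is still to read the exported text.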

Comment 7 sobik2 2010-02-26 01:37:48 UTC

I tried it in Adobe Acrobat 9, and it really does seem to be an invalid PDF,

because
1) when I create an ODT document in OpenOffice 3.1,
2) export it as PDF, and
3) open that PDF in Okular -> Export As -> Plain Text,
the content comes out correctly.

The question is how a user can know that their PDF is not valid
(I didn't find any Linux tool for this).

SUGGESTION
==========
So I suggest implementing some kind of validity check that shows a warning before an invalid PDF is exported to plain text, so that users don't need to report it as a bug (as I did).

Comment 8 Albert Astals Cid 2010-02-26 09:50:23 UTC

There is no way to differentiate a broken PDF from a non-broken one other than reading the extracted text, so what you ask is impossible.