Bug 228469

Summary:	Okular: Incomplete export of Czech PDF content to plain text
Product:	[Applications] okular	Reporter:	sobik2 <sobik2>
Component:	general	Assignee:	Okular developers <okular-devel>
Status:	RESOLVED NOT A BUG
Severity:	normal
Priority:	NOR
Version First Reported In:	unspecified
Target Milestone:	---
Platform:	Ubuntu
OS:	Linux
Attachments:	source latex file; created pdf document with 'pdflatex document.tex'; plain text exported from pdf

Description sobik2 2010-02-25 15:54:52 UTC

Version:           Okular 0.10  (using KDE 4.4.0)
OS:                Linux
Installed from:    Ubuntu Packages

Okular: Incomplete export of Czech PDF content to plain text

how to reproduce (always reproducible on my system)
=====================================================

1) create a LaTeX document with this content:

  \documentclass[a4paper,12pt,titlepage]{article} 
  \usepackage[utf8x]{inputenc}
  \usepackage[czech]{babel}
  \usepackage{fontenc}
  \begin{document}
  all special Czech characters:\\
  Příliš žluťoučký kůň úpěl ďábelské ódy.\\
  PŘÍLIŠ ŽLUŤOUČKÝ KŮŇ ÚPĚL ĎÁBELSKÉ ÓDY.
  \end{document}

2) this document includes all special Czech characters with diacritical marks
3) compile it with 'pdflatex <document_name>' or with 'pdfcslatex <document_name>'
4) open the created PDF in Okular -> click File -> Export As -> Plain Text
5) look at the exported plain text document: its content does not match the PDF
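
The export in step 4 can probably be approximated from the command line. Okular's PDF backend is built on poppler, and poppler-utils ships a pdftotext tool, so running it on the same file should show broadly the same extraction result (this is an assumption; the output need not be identical to Okular's export). The sketch below is in Python. The helper names pdftotext_command and extract_text are made up for illustration, and it assumes pdftotext is installed and on PATH.

```python
import shutil
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path: str, txt_path: str) -> list:
    """Build the pdftotext invocation; -enc selects the output text encoding."""
    return ["pdftotext", "-enc", "UTF-8", pdf_path, txt_path]


def extract_text(pdf_path: str) -> str:
    """Run pdftotext on pdf_path and return the extracted text.

    Hypothetical helper: writes <name>.txt next to the PDF.
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; it is part of poppler-utils")
    txt_path = str(Path(pdf_path).with_suffix(".txt"))
    # check=True raises CalledProcessError if pdftotext exits with an error
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    return Path(txt_path).read_text(encoding="utf-8")
```

Calling extract_text("document.pdf") would write document.txt next to the PDF and return its contents, which can then be compared by eye with the two sentences from step 1.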


some other info
===============
sobi@sobi-laptop:~/Dokumenty$ pdflatex --version
pdfTeX using libpoppler 3.141592-1.40.3-2.2 (Web2C 7.5.6)
kpathsea version 3.5.6                                   
Copyright 2007 Peter Breitenlohner (eTeX)/Han The Thanh (pdfTeX).
Kpathsea is copyright 2007 Karl Berry and Olaf Weber.
There is NO warranty.  Redistribution of this software is
covered by the terms of both the pdfTeX using libpoppler copyright and
the Lesser GNU General Public License.
For more information about these matters, see the file
named COPYING and the pdfTeX using libpoppler source.
Primary author of pdfTeX using libpoppler: Peter Breitenlohner (eTeX)/Han The Thanh (pdfTeX).
Kpathsea written by Karl Berry, Olaf Weber, and others.

Compiled with libpng 1.2.37; using libpng 1.2.37
Compiled with zlib 1.2.3.3; using zlib 1.2.3.3
Compiled with libpoppler version 0.12.0


sobi@sobi-laptop:~/Dokumenty$ locale
LANG=cs_CZ.UTF-8
LANGUAGE=
LC_CTYPE="cs_CZ.UTF-8"
LC_NUMERIC="cs_CZ.UTF-8"
LC_TIME="cs_CZ.UTF-8"
LC_COLLATE="cs_CZ.UTF-8"
LC_MONETARY="cs_CZ.UTF-8"
LC_MESSAGES="cs_CZ.UTF-8"
LC_PAPER="cs_CZ.UTF-8"
LC_NAME="cs_CZ.UTF-8"
LC_ADDRESS="cs_CZ.UTF-8"
LC_TELEPHONE="cs_CZ.UTF-8"
LC_MEASUREMENT="cs_CZ.UTF-8"
LC_IDENTIFICATION="cs_CZ.UTF-8"
LC_ALL=


sobi@sobi-laptop:~/Dokumenty$ uname -a
Linux sobi-laptop 2.6.31-20-generic-pae #57-Ubuntu SMP Mon Feb 8 10:23:59 UTC 2010 i686 GNU/Linux

Comment 1 sobik2 2010-02-25 15:56:28 UTC

I installed LaTeX with:
'sudo apt-get install texlive texlive-lang-czechslovak'

Using Kubuntu 9.10, KDE 4.4.0.

Comment 2 Albert Astals Cid 2010-02-25 21:39:16 UTC

Please attach the pdf file

Comment 3 sobik2 2010-02-26 00:26:13 UTC

Created attachment 41118
source latex file

Comment 4 sobik2 2010-02-26 00:27:01 UTC

Created attachment 41119
created pdf document with 'pdflatex document.tex'

Comment 5 sobik2 2010-02-26 00:27:44 UTC

Created attachment 41120
plain text exported from pdf

Comment 6 Albert Astals Cid 2010-02-26 00:55:47 UTC

The PDF is not formed in a way that makes text extraction possible. Try extracting the text with Adobe Reader and you'll see that it fails there too. You might want to contact the pdflatex people about it.

Comment 7 sobik2 2010-02-26 01:37:48 UTC

I tried it in Adobe Acrobat 9, and it really does seem to be an invalid PDF.

Because when I
1) create an ODT document in OpenOffice 3.1
2) export it as PDF
3) open this PDF in Okular -> export as plain text
the content is spelled correctly.

The question is how a user can know that their PDF is not valid
(I didn't find any Linux tool for this).

SUGGESTION
==========
So I suggest implementing a validity check that shows a warning before exporting a non-valid PDF to plain text, so that users don't need to report it as a bug (as I did).

Comment 8 Albert Astals Cid 2010-02-26 09:50:23 UTC

There is no way to differentiate a broken PDF from a non-broken one other than reading the extracted text, so what you ask is impossible.
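
In the narrow case where the expected content is known in advance, as with the test sentence in this report, the "reading the extracted text" step can be scripted. The following is only a sketch of that idea, not something Okular does; the function name missing_characters is hypothetical. It does not help for arbitrary PDFs, where nobody knows what the text should be.

```python
import unicodedata

# Test sentence from the LaTeX source in the description
PANGRAM = "Příliš žluťoučký kůň úpěl ďábelské ódy."


def missing_characters(expected: str, extracted: str) -> list:
    """Return the non-space characters of expected that never occur in extracted.

    Both strings are normalized to NFC first, so a base letter followed by a
    combining accent counts as the precomposed letter where one exists.
    """
    expected_nfc = unicodedata.normalize("NFC", expected)
    present = set(unicodedata.normalize("NFC", extracted))
    missing = []
    for ch in expected_nfc:
        if not ch.isspace() and ch not in present and ch not in missing:
            missing.append(ch)
    return missing
```

An empty result only means that every expected character shows up somewhere in the extracted text; it says nothing about word order or spacing.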