Bug 228469 - Okular: Incomplete export of PDF Czech content to plain text
Summary: Okular: Incomplete export of PDF Czech content to plain text
Status: RESOLVED NOT A BUG
Alias: None
Product: okular
Classification: Applications
Component: general
Version: unspecified
Platform: Ubuntu Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Okular developers
Reported: 2010-02-25 15:54 UTC by sobik2
Modified: 2010-02-26 09:50 UTC (History)

Attachments
source latex file (300 bytes, text/x-tex)
2010-02-26 00:26 UTC, sobik2
created pdf document with 'pdflatex document.tex' (10.31 KB, application/pdf)
2010-02-26 00:27 UTC, sobik2
plain text exported from pdf (270 bytes, text/plain)
2010-02-26 00:27 UTC, sobik2

Description sobik2 2010-02-25 15:54:52 UTC
Version:           Okular 0.10  (using KDE 4.4.0)
OS:                Linux
Installed from:    Ubuntu Packages

Okular: Incomplete export of PDF Czech content to plain text

how to reproduce (always reproducible on my system)
===================================================

1) create a LaTeX document with this content:

  \documentclass[a4paper,12pt,titlepage]{article} 
  \usepackage[utf8x]{inputenc}
  \usepackage[czech]{babel}
  \usepackage{fontenc}
  \begin{document}
  all special Czech characters:\\
  Příliš žluťoučký kůň úpěl ďábelské ódy.\\
  PŘÍLIŠ ŽLUŤOUČKÝ KŮŇ ÚPĚL ĎÁBELSKÉ ÓDY.
  \end{document}

2) this document includes all special Czech characters with diacritical marks
3) compile it with 'pdflatex <document_name>' or with 'pdfcslatex <document_name>'
4) open the created PDF in Okular -> click File -> Export As -> Plain Text
5) look at the exported plain text document; its content does not match the PDF
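
The comparison in step 5 can also be done mechanically. The following is only a minimal sketch: the function name `missing_characters` and the constant `EXPECTED` are hypothetical, and it assumes the exported file has already been read into a string. It reports which characters of the source sentence never appear in the exported text.

```python
import unicodedata

# The lower-case test sentence from the LaTeX source above.
EXPECTED = "Příliš žluťoučký kůň úpěl ďábelské ódy."


def missing_characters(expected: str, extracted: str) -> set[str]:
    """Return the non-space characters of `expected` that never occur in `extracted`."""
    # Normalise both sides to NFC so that a base letter followed by a
    # combining mark compares equal to the precomposed letter.
    expected_nfc = unicodedata.normalize("NFC", expected)
    extracted_nfc = unicodedata.normalize("NFC", extracted)
    return {
        ch
        for ch in expected_nfc
        if not ch.isspace() and ch not in extracted_nfc
    }
```

An empty result means every expected character was found somewhere in the export; it does not prove that the characters are in the right order.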


some other info
===============
sobi@sobi-laptop:~/Dokumenty$ pdflatex --version
pdfTeX using libpoppler 3.141592-1.40.3-2.2 (Web2C 7.5.6)
kpathsea version 3.5.6                                   
Copyright 2007 Peter Breitenlohner (eTeX)/Han The Thanh (pdfTeX).
Kpathsea is copyright 2007 Karl Berry and Olaf Weber.
There is NO warranty.  Redistribution of this software is
covered by the terms of both the pdfTeX using libpoppler copyright and
the Lesser GNU General Public License.
For more information about these matters, see the file
named COPYING and the pdfTeX using libpoppler source.
Primary author of pdfTeX using libpoppler: Peter Breitenlohner (eTeX)/Han The Thanh (pdfTeX).
Kpathsea written by Karl Berry, Olaf Weber, and others.

Compiled with libpng 1.2.37; using libpng 1.2.37
Compiled with zlib 1.2.3.3; using zlib 1.2.3.3
Compiled with libpoppler version 0.12.0


sobi@sobi-laptop:~/Dokumenty$ locale
LANG=cs_CZ.UTF-8
LANGUAGE=
LC_CTYPE="cs_CZ.UTF-8"
LC_NUMERIC="cs_CZ.UTF-8"
LC_TIME="cs_CZ.UTF-8"
LC_COLLATE="cs_CZ.UTF-8"
LC_MONETARY="cs_CZ.UTF-8"
LC_MESSAGES="cs_CZ.UTF-8"
LC_PAPER="cs_CZ.UTF-8"
LC_NAME="cs_CZ.UTF-8"
LC_ADDRESS="cs_CZ.UTF-8"
LC_TELEPHONE="cs_CZ.UTF-8"
LC_MEASUREMENT="cs_CZ.UTF-8"
LC_IDENTIFICATION="cs_CZ.UTF-8"
LC_ALL=


sobi@sobi-laptop:~/Dokumenty$ uname -a
Linux sobi-laptop 2.6.31-20-generic-pae #57-Ubuntu SMP Mon Feb 8 10:23:59 UTC 2010 i686 GNU/Linux
Comment 1 sobik2 2010-02-25 15:56:28 UTC
I installed LaTeX with:
'sudo apt-get install texlive texlive-lang-czechslovak'

I am using Kubuntu 9.10 with KDE 4.4.0.
Comment 2 Albert Astals Cid 2010-02-25 21:39:16 UTC
Please attach the pdf file
Comment 3 sobik2 2010-02-26 00:26:13 UTC
Created attachment 41118
source latex file
Comment 4 sobik2 2010-02-26 00:27:01 UTC
Created attachment 41119
created pdf document with 'pdflatex document.tex'
Comment 5 sobik2 2010-02-26 00:27:44 UTC
Created attachment 41120
plain text exported from pdf
Comment 6 Albert Astals Cid 2010-02-26 00:55:47 UTC
The PDF is not formed correctly enough for text extraction to work. Try extracting the text with Adobe Reader and you'll see that it fails there too. You might want to contact the pdflatex people about it.
Comment 7 sobik2 2010-02-26 01:37:48 UTC
I tried it in Adobe Acrobat 9, and it really does seem to be an invalid PDF,

because
1) when I create an ODT document in OpenOffice 3.1
2) export it as PDF
3) open this PDF in Okular -> Export As -> Plain Text
the content comes out correctly.

The question is how a user can know that their PDF is not valid
(I didn't find any Linux tool).
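
On the question of a Linux tool: `pdftotext` from poppler-utils prints a PDF's text layer from a terminal. Okular's PDF support is built on Poppler, so the output is likely, though not guaranteed, to resemble Okular's plain text export. It does not validate the file; it only makes the extracted text easy to read and compare. The sketch below wraps it in Python. The helper names `pdftotext_command` and `extract_text` are hypothetical, and it assumes `pdftotext` is installed and on the PATH.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path: str) -> list[str]:
    """Build the pdftotext invocation for `pdf_path`."""
    # "-enc UTF-8" selects the output encoding; the trailing "-" makes
    # pdftotext write to standard output instead of a .txt file.
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def extract_text(pdf_path: str) -> str:
    """Return the text layer of `pdf_path` as reported by pdftotext."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; it ships with poppler-utils")
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

For example, `print(extract_text("document.pdf"))` would show what can be extracted from the file produced by 'pdflatex document.tex', assuming it is in the current directory.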

SUGGESTION
==========
So I suggest implementing some validity check that shows a warning before exporting a non-valid PDF to plain text, so that users don't need to report it as a bug (exactly as I did).
Comment 8 Albert Astals Cid 2010-02-26 09:50:23 UTC
There is no way to differentiate a broken PDF from a non-broken one other than by reading the extracted text, so what you ask is impossible.