Version: Okular 0.10 (using KDE 4.4.0)
OS: Linux
Installed from: Ubuntu Packages

Okular: Incomplete export of PDF Czech content to plain text

How to reproduce (always reproducible on my system)
=====================================================
1) Create a LaTeX document with this content:

    \documentclass[a4paper,12pt,titlepage]{article}
    \usepackage[utf8x]{inputenc}
    \usepackage[czech]{babel}
    \usepackage{fontenc}
    \begin{document}
    all special Czech characters:\\
    Příliš žluťoučký kůň úpěl ďábelské ódy.\\
    PŘÍLIŠ ŽLUŤOUČKÝ KŮŇ ÚPĚL ĎÁBELSKÉ ÓDY.
    \end{document}

2) This document includes all special Czech characters with diacritical marks.
3) Compile it with 'pdflatex <document_name>' or with 'pdfcslatex <document_name>'.
4) Open the created PDF in Okular -> click File -> export as -> plain text.
5) Look at the exported plain text document: the content does not match the text in the PDF.

Some other info
================
sobi@sobi-laptop:~/Dokumenty$ pdflatex --version
pdfTeX using libpoppler 3.141592-1.40.3-2.2 (Web2C 7.5.6)
kpathsea version 3.5.6
Copyright 2007 Peter Breitenlohner (eTeX)/Han The Thanh (pdfTeX).
Kpathsea is copyright 2007 Karl Berry and Olaf Weber.
There is NO warranty. Redistribution of this software is
covered by the terms of both the pdfTeX using libpoppler copyright and
the Lesser GNU General Public License.
For more information about these matters, see the file
named COPYING and the pdfTeX using libpoppler source.
Primary author of pdfTeX using libpoppler: Peter Breitenlohner (eTeX)/Han The Thanh (pdfTeX).
Kpathsea written by Karl Berry, Olaf Weber, and others.

Compiled with libpng 1.2.37; using libpng 1.2.37
Compiled with zlib 1.2.3.3; using zlib 1.2.3.3
Compiled with libpoppler version 0.12.0

sobi@sobi-laptop:~/Dokumenty$ locale
LANG=cs_CZ.UTF-8
LANGUAGE=
LC_CTYPE="cs_CZ.UTF-8"
LC_NUMERIC="cs_CZ.UTF-8"
LC_TIME="cs_CZ.UTF-8"
LC_COLLATE="cs_CZ.UTF-8"
LC_MONETARY="cs_CZ.UTF-8"
LC_MESSAGES="cs_CZ.UTF-8"
LC_PAPER="cs_CZ.UTF-8"
LC_NAME="cs_CZ.UTF-8"
LC_ADDRESS="cs_CZ.UTF-8"
LC_TELEPHONE="cs_CZ.UTF-8"
LC_MEASUREMENT="cs_CZ.UTF-8"
LC_IDENTIFICATION="cs_CZ.UTF-8"
LC_ALL=

sobi@sobi-laptop:~/Dokumenty$ uname -a
Linux sobi-laptop 2.6.31-20-generic-pae #57-Ubuntu SMP Mon Feb 8 10:23:59 UTC 2010 i686 GNU/Linux
I installed LaTeX with 'sudo apt-get install texlive texlive-lang-czechslovak' on Kubuntu 9.10, KDE 4.4.0.
Please attach the pdf file
Created attachment 41118: source latex file
Created attachment 41119: created pdf document with 'pdflatex document.tex'
Created attachment 41120: plain text exported from pdf
The PDF is not formed in a way that makes text extraction possible. Try extracting the text with Adobe Reader and you'll see that it fails there too. You might want to contact the pdflatex people about it.
I tried it in Adobe Acrobat 9, and it really does seem to be an invalid PDF, because when I
1) create an ODT document in OpenOffice 3.1,
2) export it as PDF, and
3) open that PDF in Okular -> export as plain text,
the content is spelled correctly.

The question is how a user can know that a PDF is not valid (I didn't find any Linux tool for this).

Suggestion
==========
I suggest implementing some validity check that shows a warning before a non-valid PDF is exported to plain text, so that users don't need to report it as a bug (exactly as I did).
There is no way to differentiate a broken PDF from a non-broken one other than reading the extracted text, so what you ask is impossible.
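For anyone who wants to automate the "reading the extracted text" part after an export, here is a minimal Python sketch of a heuristic. It assumes that a failed extraction leaves accents behind as separate characters (standalone caron, acute, ring, dotless i), which is a common symptom with such PDFs but is not confirmed from the attachment in this report. The names `STRAY_MARKS`, `count_suspicious`, `looks_broken` and the 1% threshold are made up for illustration; nothing here is part of Okular or poppler, and it only makes sense for Czech or similar Latin-script text.

```python
import unicodedata

# Spacing (non-combining) characters that rarely stand alone in Czech prose.
# Assumption: a broken extraction emits accents as separate characters like these.
STRAY_MARKS = {
    "\u02c7",  # caron
    "\u00b4",  # acute accent
    "\u02da",  # ring above
    "\u0131",  # dotless i
}


def count_suspicious(text):
    """Count stray accent characters that survive NFC normalization."""
    # NFC recomposes base letter + combining mark into one letter where possible,
    # so correctly decomposed text (e.g. "e" + combining caron) is not flagged.
    normalized = unicodedata.normalize("NFC", text)
    count = 0
    for ch in normalized:
        if ch in STRAY_MARKS or unicodedata.combining(ch):
            count += 1
    return count


def looks_broken(text, threshold=0.01):
    """Return True if stray accents exceed `threshold` of the non-space characters."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return False
    return count_suspicious(text) / len(visible) > threshold
```

Usage would be to read the exported .txt file as UTF-8 and pass its content to `looks_broken`. A clean result does not prove the PDF is fine (wrong letters with no stray accents would pass), so this can only raise a warning, not validate a file.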