Bug 217914

Summary: mixed languages: cannot copy text (garbage instead of proper letters)
Product: [Applications] okular
Reporter: Maciej Pilichowski <bluedzins>
Component: PDF backend
Assignee: Okular developers <okular-devel>
Status: RESOLVED NOT A BUG    
Severity: normal    
Priority: NOR    
Version: unspecified   
Target Milestone: ---   
Platform: openSUSE   
OS: Unspecified   
Latest Commit:
Version Fixed In:

Description Maciej Pilichowski 2009-12-08 20:34:54 UTC
Version:            (using KDE 4.3.3)
Installed from:    SuSE RPMs

Let's say you have a PDF with mixed languages (for example, Polish and Russian). Here you can download one for testing:
http://www.jezykiobce.net/jezykiobce_pdf/rosyjski_kurs_podst.pdf

When you copy the Polish part or the Russian part (or both), you end up with garbage, like:
êàíèêóëû

Only basic Latin letters plus the letter ó are copied correctly. It is not a matter of the font used, because the same font can display the intended letters:
ęóąśł
or
яерсидоф
Comment 1 Pino Toscano 2009-12-08 20:50:03 UTC
Given the problem is reproducible (in the very same way) with:

- acroread 9.2
- okular 0.9.2 + poppler 0.12.2
- evince 2.28.1 + poppler 0.12.2

I'm rather inclined to conclude the document might be badly encoded.
(Note: what you see in a PDF is not what you copy as text.)
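
To illustrate the note above, here is a minimal sketch of how glyphs can look right while the copied text is wrong. It rests on an assumption that the thread does not confirm: that the Russian text in the PDF is stored as Windows-1251 byte codes without a proper Unicode mapping, and that the viewer treats those codes as Latin-1 when copying. The helper name `undo_mojibake` is hypothetical. The garbled sample from the report is at least consistent with that assumption.

```python
# Assumption (not confirmed in this report): the PDF stores text as legacy
# single-byte codes (Windows-1251 for Russian, Windows-1250 for Polish) and
# the viewer interprets those byte values as Latin-1 when copying.

def undo_mojibake(copied: str, real_encoding: str = "cp1251") -> str:
    """Reinterpret text that was wrongly decoded as Latin-1.

    Hypothetical helper: encode the copied characters back to the raw
    byte values, then decode those bytes with the presumed real encoding.
    """
    return copied.encode("latin-1").decode(real_encoding)


garbled = "êàíèêóëû"  # the pasted sample quoted in the description
print(undo_mojibake(garbled))  # prints "каникулы"

# The letter ó has the same byte value (0xF3) in Latin-1 and Windows-1250,
# which would explain why it alone survives copying from the Polish text.
print(undo_mojibake("ó", "cp1250"))  # prints "ó"
```

If this assumption holds, the document itself lacks the information a viewer needs to recover the real characters, which matches the behaviour being identical in all three viewers.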
Comment 2 Maciej Pilichowski 2009-12-08 22:37:32 UTC
Pino, thank you for the explanation. I assumed that, by definition, a PDF has to be properly encoded (the text, I mean).