Bug 407133

Summary: Copy text from rotated pdf gives rubbish
Product: [Applications] okular Reporter: Axel Braun <axel.braun>
Component: general Assignee: Okular developers <okular-devel>
Status: CONFIRMED ---    
Severity: normal CC: aacid, eam67, langec, martin.marmsoler, postix, vapier, yury.tarasievich, zbwu1996
Priority: HI    
Version: 1.7.0   
Target Milestone: ---   
Platform: Other   
OS: Linux   
See Also: https://bugs.kde.org/show_bug.cgi?id=207748
https://bugs.kde.org/show_bug.cgi?id=361538
https://bugs.kde.org/show_bug.cgi?id=445851
https://bugs.kde.org/show_bug.cgi?id=401044
Latest Commit: Version Fixed In:
Attachments: example for rotated page
Vertical texts are used for diagrams, but Okular can’t search for them
Diagonal text is not recognized as line
Diagonal watermark text breaks text entity reordering

Description Axel Braun 2019-05-01 19:05:13 UTC
Created attachment 119778
example for rotated page

SUMMARY
You have a PDF in landscape orientation and rotate it in Okular so that it displays properly on screen. If you then copy text, the clipboard contains rubbish.

SOFTWARE/OS VERSIONS
Linux/KDE Plasma: 5.15.4
KDE Plasma Version: 
KDE Frameworks Version: 5.57.0
Qt Version: 5.12.3

ADDITIONAL INFORMATION
See the attached example (you don't need to rotate it if it is already shown in portrait).
Comment 1 Laura David Hurka 2019-05-12 12:38:16 UTC
Created attachment 120007
Vertical texts are used for diagrams, but Okular can’t search for them

You can fix the clipboard content with the following command ;)
perl -e 'print reverse split //, <>;'

It seems that the TextPage, which is used for search and text copying, is filled this way: while the generator adds horizontal words as whole words, vertical words are split into letters. Okular then assumes that the uppermost letter is the first letter.

Letters or words are stored in TextEntity objects in the TextPage. Each TextEntity stores the letter or word as a string, together with its bounding rectangle.
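
The reversed clipboard text follows from that ordering. Here is a minimal sketch in Python, not Okular's code: `Entity`, `reading_order` and `vertical_word` are hypothetical names, and the top-to-bottom sort is an assumption based on the observation that the uppermost letter is treated as the first one. It also assumes the vertical word reads bottom to top, as axis labels usually do.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """Hypothetical stand-in for a TextEntity: a string plus its bounding rectangle."""
    text: str
    left: float
    top: float
    right: float
    bottom: float


def reading_order(entities):
    # Assumed ordering rule: top to bottom first, then left to right.
    return sorted(entities, key=lambda e: (e.top, e.left))


def vertical_word(word, x=0.1, y_bottom=0.9, step=0.05):
    # Text rotated 90 degrees counterclockwise reads from bottom to top,
    # so the first letter gets the largest top coordinate.
    entities = []
    for i, ch in enumerate(word):
        bottom = y_bottom - i * step
        entities.append(Entity(ch, x, bottom - step, x + 0.02, bottom))
    return entities


letters = vertical_word("Voltage")
copied = "".join(e.text for e in reading_order(letters))
print(copied)  # prints egatloV
```

Reversing `copied` gives back the original word, which is what the perl one-liner does to the clipboard content.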

The problem is one of these two (pick whichever you prefer):
1. TextPage and TextEntity can’t store transformations, or even simple rotation. So, the generator splits vertical words into single letters. *1
2. The generator, which uses poppler to read the pdf, gets vertical words already split into letters.

*1) Possible reason: this way, one can (theoretically *2) use the Text Selection tool to select the word.
*2) In practice this does not work, because Okular also adds every other letter at the same height to the selection.

I have attached a screenshot which illustrates the practical relevance of this problem: in many datasheets (not only TI), vertical text is used to label the vertical axes of diagrams. Splitting these labels into letters prevents searching for a specific diagram.
Comment 2 Laura David Hurka 2019-05-12 17:34:10 UTC
Created attachment 120017
Diagonal text is not recognized as line

Looking into core/textpage.cpp tells me that the generators just output characters with their bounding rectangles. (This information is stored as TinyTextEntity objects.) There seems to be no information about orientation.

There are some functions in core/textpage.cpp whose code I haven’t read yet:


removeSpace()
Claims to remove space, to make output from different generators uniform.

makeWordFromCharacters()
Claims to rearrange characters into words, using spaces to distinguish between adjacent words. (But spaces are removed?)

makeAndSortLines()
Claims to look for adjacent words to make a line of them, and to sort the lines.

calculateStatisticalInformation()
Claims to be able to distinguish between character spacing, word spacing, and column spacing. Needed for multi-column layouts.

XYCutForBoundingBoxes()
Claims to apply the XY-cut algorithm, to separate... something

addNecessarySpace()
Inserts the space that was probably removed by removeSpace(), so that selecting text does not result in words that are squashed together.

TextPagePrivate::correctTextOrder()
Calls the statically declared functions above.


Unfortunately, these functions don’t seem to be designed for vertical text. Even slightly diagonal text causes problems, see screenshot. (Possible reasons: XY-cut can’t “see” diagonal texts, makeAndSortLines() collects characters in a bad order)
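
To illustrate why an XY-cut could miss diagonal text, here is a minimal sketch in Python of the projection step that XY-cut algorithms generally rely on: project all bounding boxes onto one axis and look for empty gaps to cut at. This is a generic illustration, not the code in core/textpage.cpp; `projection_gaps` and the sample boxes are made up for the example.

```python
def projection_gaps(boxes, axis):
    """Return the empty intervals between boxes projected onto one axis.

    boxes: list of (left, top, right, bottom) tuples; axis: "x" or "y".
    """
    lo, hi = (0, 2) if axis == "x" else (1, 3)
    intervals = sorted((b[lo], b[hi]) for b in boxes)
    gaps = []
    reach = intervals[0][1]  # farthest coordinate covered so far
    for start, end in intervals[1:]:
        if start > reach:
            gaps.append((reach, start))  # nothing covers this stretch
        reach = max(reach, end)
    return gaps


# Two horizontal lines of three characters each.
lines = [(x, 0, x + 10, 10) for x in (0, 12, 24)] + \
        [(x, 20, x + 10, 30) for x in (0, 12, 24)]

# Six characters of diagonal text: each box is shifted right and down.
diagonal = [(8 * i, 4 * i, 8 * i + 10, 4 * i + 10) for i in range(6)]

print(projection_gaps(lines, "y"))             # [(10, 20)]
print(projection_gaps(diagonal, "x"))          # []
print(projection_gaps(diagonal, "y"))          # []
print(projection_gaps(lines + diagonal, "y"))  # []
```

The horizontal lines leave a gap on the y axis where a cut can separate them. The diagonal characters overlap in both projections, so no cut isolates them, and once they are laid over the horizontal lines the gap between those lines disappears as well.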

There are many commits on these functions, mainly done in 2011 by Albert Astals Cid and Mohammad Mahfuzur Rahman Mamun. The beginning was probably this commit?

> commit 2eb5f270fd4befb6a84ff2e9bdd921271930e046
> Author: Mohammad Mahfuzur Rahman Mamun <mamun.nightcrawelr@gmail.com>
> Date:   Mon Jun 27 19:58:24 2011 +0600
> 
>     three functions added in textpage
> 
> [snip a lot]

Maybe these two people can give more information on how vertical text is supposed to be handled.
Comment 3 Laura David Hurka 2020-09-05 10:52:56 UTC
*** Bug 318768 has been marked as a duplicate of this bug. ***
Comment 4 Laura David Hurka 2020-09-05 10:53:05 UTC
*** Bug 338563 has been marked as a duplicate of this bug. ***
Comment 5 Laura David Hurka 2020-09-05 10:53:13 UTC
*** Bug 426171 has been marked as a duplicate of this bug. ***
Comment 6 Laura David Hurka 2020-09-05 10:53:22 UTC
*** Bug 300400 has been marked as a duplicate of this bug. ***
Comment 7 Laura David Hurka 2020-12-20 12:42:34 UTC
*** Bug 181559 has been marked as a duplicate of this bug. ***
Comment 8 Laura David Hurka 2021-07-17 12:05:10 UTC
Created attachment 140133
Diagonal watermark text breaks text entity reordering

I just got this link: http://files.pine64.org/doc/datasheet/pine64/AXP803_Datasheet_V1.0.pdf

Text selection doesn’t work because of that “conf i dent i al” watermark.
Comment 9 Laura David Hurka 2022-09-21 16:27:06 UTC
*** Bug 459447 has been marked as a duplicate of this bug. ***