Bug 407133 - Copy text from rotated pdf gives rubbish
Status: CONFIRMED
Alias: None
Product: okular
Classification: Applications
Component: general
Version: 1.7.0
Platform: Other Linux
Importance: HI normal
Target Milestone: ---
Assignee: Okular developers
URL:
Keywords:
Duplicates: 181559 300400 318768 338563 426171 459447
Depends on:
Blocks:
 
Reported: 2019-05-01 19:05 UTC by Axel Braun
Modified: 2023-04-12 14:30 UTC
CC List: 8 users

See Also:
Latest Commit:
Version Fixed In:


Attachments
example for rotated page (14.38 KB, application/pdf)
2019-05-01 19:05 UTC, Axel Braun
Vertical texts are used for diagrams, but Okular can’t search for them (47.70 KB, image/png)
2019-05-12 12:38 UTC, David Hurka
Diagonal text is not recognized as line (11.57 KB, image/png)
2019-05-12 17:34 UTC, David Hurka
Diagonal watermark text breaks text entity reordering (190.01 KB, image/png)
2021-07-17 12:05 UTC, David Hurka

Description Axel Braun 2019-05-01 19:05:13 UTC
Created attachment 119778
example for rotated page

SUMMARY
You have a PDF in landscape orientation, and you rotate it to see it properly on the screen. If you then copy text, the clipboard contains rubbish.

SOFTWARE/OS VERSIONS
Linux/KDE Plasma: 5.15.4
KDE Plasma Version: 
KDE Frameworks Version: 5.57.0
Qt Version: 5.12.3

ADDITIONAL INFORMATION
See the attached example (you don't need to rotate it if it is already shown in portrait).
Comment 1 David Hurka 2019-05-12 12:38:16 UTC
Created attachment 120007
Vertical texts are used for diagrams, but Okular can’t search for them

You can fix the clipboard content with the following command ;)
perl -e 'print reverse split //, <>;'

It seems that the TextPage, which is used for search and text copying, is filled this way: the Generator adds horizontal words as whole words, but vertical words are split into letters. Okular then assumes that the uppermost letter is the first letter.

Letters or words are stored in TextEntity objects in the TextPage. The TextEntity stores the letter/word as string and the bounding rectangle.
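
To illustrate the ordering problem, here is a minimal Python sketch. The `Entity` class is a hypothetical stand-in for a TextEntity (a string plus a bounding rectangle), not Okular's actual class, and the layout helper is made up for the example. It assumes a word rotated so that it reads bottom to top, delivered as one entity per letter. Reading the letters top to bottom then gives the reversed word, which matches what the perl one-liner undoes.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """Hypothetical stand-in for a TextEntity: text plus bounding rectangle."""
    text: str
    left: float
    top: float
    right: float
    bottom: float


def vertical_word(word, x=0.1, y_bottom=0.9, letter_height=0.05, width=0.02):
    """Lay out `word` bottom to top, one entity per letter (assumed layout)."""
    entities = []
    for i, ch in enumerate(word):
        bottom = y_bottom - i * letter_height
        entities.append(Entity(ch, x, bottom - letter_height, x + width, bottom))
    return entities


def read_top_to_bottom(entities):
    """Join letters by increasing top coordinate (y grows downwards)."""
    return "".join(e.text for e in sorted(entities, key=lambda e: e.top))


letters = vertical_word("Voltage")
copied = read_top_to_bottom(letters)
print(copied)        # the word comes out reversed
print(copied[::-1])  # reversing the string restores it
```

If the entity could carry a rotation, the reading direction could be chosen per word instead of always assuming top to bottom.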

The cause is one of these two (pick whichever you prefer):
1. TextPage and TextEntity can’t store transformations, or even simple rotation. So, the generator splits vertical words into single letters. *1
2. The generator, which uses poppler to read the pdf, gets vertical words already split into letters.

*1) Possible reason: this way, one can (theoretically *2) use the Text Selection tool to select the word.
*2) Practically not, because Okular adds any other letter on the same height to the selection.

I have attached a screenshot which illustrates the practical relevance of this problem: In many datasheets (not only TI), vertical text is used to label the vertical axes of diagrams. Splitting these labels into letters prevents searching for a specific diagram.
Comment 2 David Hurka 2019-05-12 17:34:10 UTC
Created attachment 120017
Diagonal text is not recognized as line

Looking into core/textpage.cpp tells me that the generators just output characters with their bounding rectangles. (This information becomes TinyTextEntity objects.) There seems to be no information about orientation.

There are some functions in core/textpage.cpp whose code I haven't read yet:


removeSpace()
Claims to remove space, to make output from different generators uniform.

makeWordFromCharacters()
Claims to rearrange characters to words, using spaces to distinguish between adjacent words. (But spaces are removed?)

makeAndSortLines()
Claims to look for adjacent words to make a line of them, and to sort the lines.

calculateStatisticalInformation()
Claims to be able to distinguish between character spacing, word spacing, and column spacing. Needed for multi-column layouts.

XYCutForBoundingBoxes()
Claims to apply the XY-cut algorithm, to separate... something

addNecessarySpace()
Inserts the space that was probably removed by removeSpace(), so selecting text does not result in words that are squashed together.

TextPagePrivate::correctTextOrder()
Calls the above, statically declared functions.
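
As a rough idea of what a step like makeAndSortLines() has to do, here is a small Python sketch. It is not Okular's code: the `Word` class, the `group_into_lines` function, and the overlap rule are assumptions made for illustration. Words whose boxes overlap vertically are put into the same line, lines are sorted top to bottom, and words within a line left to right.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical word with its bounding rectangle (y grows downwards)."""
    text: str
    left: float
    top: float
    right: float
    bottom: float


def overlaps_vertically(a, b):
    """True if the vertical extents of two boxes intersect."""
    return a.top < b.bottom and b.top < a.bottom


def group_into_lines(words):
    """Group words into lines, then sort lines and the words inside them."""
    lines = []
    for word in sorted(words, key=lambda w: w.top):
        for line in lines:
            # Simplification: compare only against the first word of a line.
            if overlaps_vertically(line[0], word):
                line.append(word)
                break
        else:
            lines.append([word])
    lines.sort(key=lambda line: min(w.top for w in line))
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.left))
            for line in lines]


words = [
    Word("world", 0.30, 0.10, 0.45, 0.13),
    Word("line", 0.25, 0.20, 0.35, 0.23),
    Word("Hello", 0.10, 0.10, 0.25, 0.13),
    Word("Second", 0.10, 0.20, 0.22, 0.23),
]
result = group_into_lines(words)
print(result)
```

A rule like this only looks at vertical overlap and left-to-right order, so it has no notion of a line that runs in any other direction.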


Unfortunately, these functions don’t seem to be designed for vertical text. Even slightly diagonal text causes problems, see screenshot. (Possible reasons: XY-cut can’t “see” diagonal texts, makeAndSortLines() collects characters in a bad order)
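
To show why XY-cut may not "see" diagonal text, here is a minimal Python sketch of the classic recursive XY-cut on bounding boxes. It is a textbook-style version under my own assumptions (boxes as `(left, top, right, bottom)` tuples, a fixed `min_gap` threshold), not the implementation in core/textpage.cpp. The algorithm projects all boxes onto one axis and splits at a gap that is wide enough; a diagonal run of small boxes can fill every such gap.

```python
def find_gap(intervals, min_gap):
    """Midpoint of the first gap wider than min_gap between 1-D intervals."""
    intervals = sorted(intervals)
    reach = intervals[0][1]
    for start, end in intervals[1:]:
        if start - reach > min_gap:
            return (reach + start) / 2
        reach = max(reach, end)
    return None


def xy_cut(boxes, min_gap=0.03):
    """Recursively split boxes (left, top, right, bottom) into regions."""
    if len(boxes) <= 1:
        return [boxes]
    # Try a cut in the x projection first, then in the y projection.
    for lo, hi in ((0, 2), (1, 3)):
        cut = find_gap([(b[lo], b[hi]) for b in boxes], min_gap)
        if cut is not None:
            first = [b for b in boxes if b[hi] <= cut]
            second = [b for b in boxes if b[lo] >= cut]
            return xy_cut(first, min_gap) + xy_cut(second, min_gap)
    return [boxes]


def column(left, right, tops=(0.10, 0.15, 0.20), height=0.03):
    """Three text lines stacked in one column (made-up coordinates)."""
    return [(left, t, right, t + height) for t in tops]


page = column(0.10, 0.45) + column(0.55, 0.90)
# Diagonal letters crossing the gutter between the two columns.
watermark = [(0.40, 0.12, 0.46, 0.16),
             (0.45, 0.15, 0.51, 0.19),
             (0.50, 0.18, 0.56, 0.22)]

print(len(xy_cut(page)))              # the two columns are separated
print(len(xy_cut(page + watermark)))  # the watermark bridges the gutter
```

In this toy page the two columns are found as separate regions, but as soon as the diagonal letters are added, no gap remains in either projection and everything ends up in one region.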

There are many commits on these functions, mainly done in 2011 by Albert Astals Cid and Mohammad Mahfuzur Rahman Mamun. The beginning was probably this commit?

> commit 2eb5f270fd4befb6a84ff2e9bdd921271930e046
> Author: Mohammad Mahfuzur Rahman Mamun <mamun.nightcrawelr@gmail.com>
> Date:   Mon Jun 27 19:58:24 2011 +0600
> 
>     three functions added in textpage
> 
> [snip a lot]

Maybe these two people can give more information on how vertical text is supposed to be handled.
Comment 3 David Hurka 2020-09-05 10:52:56 UTC
*** Bug 318768 has been marked as a duplicate of this bug. ***
Comment 4 David Hurka 2020-09-05 10:53:05 UTC
*** Bug 338563 has been marked as a duplicate of this bug. ***
Comment 5 David Hurka 2020-09-05 10:53:13 UTC
*** Bug 426171 has been marked as a duplicate of this bug. ***
Comment 6 David Hurka 2020-09-05 10:53:22 UTC
*** Bug 300400 has been marked as a duplicate of this bug. ***
Comment 7 David Hurka 2020-12-20 12:42:34 UTC
*** Bug 181559 has been marked as a duplicate of this bug. ***
Comment 8 David Hurka 2021-07-17 12:05:10 UTC
Created attachment 140133
Diagonal watermark text breaks text entity reordering

I just got this link: http://files.pine64.org/doc/datasheet/pine64/AXP803_Datasheet_V1.0.pdf

Text selection doesn’t work because of that “conf i dent i al” watermark.
Comment 9 David Hurka 2022-09-21 16:27:06 UTC
*** Bug 459447 has been marked as a duplicate of this bug. ***