161324 – recognise columns in the text of a page

Summary: recognise columns in the text of a page

Status:	RESOLVED FIXED

Alias:	None

Product:	okular
Classification:	Applications
Component:	general (other bugs)
Version First Reported In:	0.6.3
Platform:	unspecified Linux

Importance:	HI wishlist
Target Milestone:	---
Assignee:	Okular developers

Duplicates (8):	162957 170102 175377 194120 225267 235531 268334 276580
Depends on:
Blocks:	190433

Reported:	2008-04-27 13:02 UTC by Alvaro Aguilera
Modified:	2024-02-26 21:04 UTC
CC List:	22 users

Version Fixed/Implemented In:	4.8.0

Description Alvaro Aguilera 2008-04-27 13:02:52 UTC

Version:           0.6.3 (using 4.0.3 (KDE 4.0.3) "release 19.2", compiled sources)
Compiler:          gcc
OS:                Linux (x86_64) release 2.6.22.17-0.1-default

I find Okular's yellow highlighter very useful. One thing that is missing, however, is the ability to recognize a layout with columns. Almost every PDF I read has such a format, and Okular forces me to highlight line by line instead of allowing me to mark the whole paragraph.

Comment 1 Pino Toscano 2008-04-27 13:55:33 UTC

Give a better title, as it's a general "problem".

Comment 2 kde2eran 2008-04-28 18:16:14 UTC

In the general case this seems to require a layout analysis, such as OCRopus.

Comment 3 Pino Toscano 2008-05-31 21:33:55 UTC

*** Bug 162957 has been marked as a duplicate of this bug. ***

Comment 4 Bui Arantsson 2008-06-24 11:46:37 UTC

This feature indeed needs to be fixed if okular's highlighter tool is to become useful for scientific work, seeing as almost all journals use text layouts with columns. However, I am not a programmer, and thus have no idea how it should be implemented, or whether analysis of PDF layouts is easy or not. If not, another possibility might be to allow the user to subdivide documents himself, i.e. to allow the user to "draw" borders to which the highlighter will limit itself. Almost like setting margins in a word processor, although of course purely for internal use.

Comment 5 jos poortvliet 2008-07-16 21:34:13 UTC

I just bumped into this when trying to do a screencast about Okular (for an upcoming KDE promo site). I must say it is rather unfortunate, and I don't think I will demo this feature as it is: it will only make people feel betrayed if they find out it doesn't work as it should. This is no stab at you guys developing this app. Okular is way cool. It's just that this issue somehow has to be solved. I have no idea if this even works properly in, for example, Adobe Acrobat Reader or any other app. I suspect this is pretty hard to do, given the little I know about layout stuff in PDFs. Pity...

Anyway. I hope this can be solved someday - somehow. Meanwhile, keep up the good work. Okular is really nice, but still has many small issues...

I do wanna say the selection mechanism you guys made (right mouseclick - select an area - copy text/picture) works SOO GOOD :D

Comment 6 Pino Toscano 2008-08-30 21:35:01 UTC

*** Bug 170102 has been marked as a duplicate of this bug. ***

Comment 7 Pino Toscano 2008-11-17 09:38:06 UTC

*** Bug 175377 has been marked as a duplicate of this bug. ***

Comment 8 James Rivett-Carnac 2008-11-17 09:52:44 UTC

*** This bug has been confirmed by popular vote. ***

Comment 9 Michal Witkowski 2009-01-09 20:18:07 UTC

The same can be said about the text select tool. It highlights text from both columns.

Is there any hope that this might get resolved any time soon?

Comment 10 Albert Astals Cid 2009-01-10 15:27:04 UTC

Pattern recognition of what is a column and what is not, based on the coordinates of each character, is something your brain can do very easily, but programming an algorithm that does that is not trivial by far, so I guess the answer is no.

Comment 11 Michal Witkowski 2009-01-10 15:46:59 UTC

Well, the thing is that both Adobe Reader and Foxit Reader are able to detect columns just fine (text selection, text highlight) so it's possible for sure. Maybe Okular's PDF backend is limited and doesn't provide text-layout information and that's what makes it hard. But saying that it's a hard problem solvable to a computer is just not true.

Comment 12 Michal Witkowski 2009-01-10 15:51:21 UTC

Just as I thought, it's a poppler bug. A similar problem is seen in evince (gnome pdf viewer)

https://bugs.launchpad.net/poppler/+bug/33288

Comment 13 Robert Knight 2009-01-10 16:11:51 UTC

> Pattern recognition of what is a column and what is not based on
> coordinates of each character is something your brain can do very easily
> but programming an algorithm that does that is not trivial by far,
> so I guess the answer is no

It is certainly possible but not trivial - Ocropus provides a free software C++ implementation of algorithms to do this if you're interested.  The basic approach is to try to the largest columns of whitespace in the page and divide the text into columns based on that.

Comment 14 Robert Knight 2009-01-10 16:12:59 UTC

> The basic approach is to try to the largest columns of whitespace
> in the page and divide the text into columns based on that. 

Sorry, that should read:

The basic approach seems to be finding the largest columns of whitespace in the page and dividing the text into columns based on that.
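The approach described in comment 14 can be sketched roughly as follows. This is only a minimal illustration, not the algorithm used by OCRopus, Poppler or Okular. It assumes each word is given as a hypothetical bounding box `(x0, y0, x1, y1)`, that the page has at most two columns, and that no line (such as a title) spans the gutter; `widest_interior_gap`, `split_into_columns` and `min_gap` are names invented for this example.

```python
from typing import List, Optional, Tuple

# Hypothetical word bounding box: (x0, y0, x1, y1) in page units.
Box = Tuple[float, float, float, float]


def widest_interior_gap(boxes: List[Box]) -> Optional[Tuple[float, float]]:
    """Return the widest x-interval between the leftmost and rightmost
    word that no word box overlaps, or None if there is no such gap."""
    spans = sorted((x0, x1) for x0, _, x1, _ in boxes)
    if not spans:
        return None
    best = None
    covered_until = spans[0][1]  # right edge of the area covered so far
    for x0, x1 in spans[1:]:
        if x0 > covered_until:
            # Nothing covers (covered_until, x0): a candidate gutter.
            if best is None or x0 - covered_until > best[1] - best[0]:
                best = (covered_until, x0)
        covered_until = max(covered_until, x1)
    return best


def split_into_columns(boxes: List[Box], min_gap: float = 10.0) -> List[List[Box]]:
    """Split boxes into a left and a right column at the widest gap,
    or return a single column if the gap is narrower than min_gap."""
    gap = widest_interior_gap(boxes)
    if gap is None or gap[1] - gap[0] < min_gap:
        return [list(boxes)]
    middle = (gap[0] + gap[1]) / 2
    left = [b for b in boxes if b[2] <= middle]
    right = [b for b in boxes if b[0] >= middle]
    return [left, right]


# Two lines of text in each of two columns, with a gutter from x=90 to x=110.
sample = [
    (0, 0, 40, 10), (45, 0, 90, 10), (110, 0, 150, 10), (155, 0, 200, 10),
    (0, 12, 60, 22), (65, 12, 90, 22), (110, 12, 170, 22), (175, 12, 200, 22),
]
columns = split_into_columns(sample)
```

Spaces between words do not count as gutters here because words on other lines overlap them horizontally; only a strip that is empty on every line survives. Real pages with full-width headings or figures would need the page to be cut into horizontal bands first, which this sketch does not attempt.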

Comment 15 Albert Astals Cid 2009-01-10 16:50:59 UTC

<quote>
Maybe Okular's PDF backend is limited and doesn't provide text-layout
information and that's what makes it hard.
</quote>

I like when people speak as if they knew how PDF works. Please, if you know that PDF provides text-layout information, go to the poppler project (which by coincidence I am the maintainer of) and send a patch.


<quote>
But saying that it's a hard problem solvable to a computer is just not true.
</quote>

I also like when people decide that something is not hard because someone else is able to do it. What about painting the Mona Lisa? It should not be that difficult, someone did it 500 years ago! How dare you say that painting it is difficult!

Comment 16 Isaac Puch Rojo 2009-02-27 10:17:11 UTC

(In reply to comment #15)

OK, the discussion could be more diplomatic, and the comments are not very constructive. But if you want to work with scientific papers, this bug is very important.

I only want to ask whether the Okular team wants to work on this problem or not. I would respect it if they don't want to.

Thanks for the great Program!

Comment 17 Pino Toscano 2009-05-26 09:53:41 UTC

*** Bug 194120 has been marked as a duplicate of this bug. ***

Comment 18 dani.valverde 2009-09-10 12:01:23 UTC

What about adding a text highlight tool (just as the selection tool, but to highlight instead of selecting)? It may be an easier solution while looking for a fancier way ...

Comment 19 Isaac Puch Rojo 2009-09-10 12:12:26 UTC

The solution from Dani Valverde is not perfect, but it will work.
I give my virtual vote ;-)

Bye, Isaac

Comment 20 chosunsk 2009-09-13 12:31:41 UTC

This bug is over three years old :(

https://launchpad.net/ubuntu/+source/poppler/+bug/33288

https://bugs.freedesktop.org/show_bug.cgi?id=3188

Comment 21 Albert Astals Cid 2009-09-13 21:58:52 UTC

Three years don't make it easier to solve, we still welcome people with knowledge on how to fix it.

Comment 22 Pino Toscano 2010-02-02 14:34:37 UTC

*** Bug 225267 has been marked as a duplicate of this bug. ***

Comment 23 Ekin Akoglu 2010-02-02 15:24:24 UTC

I agree that three years do not make it easier to solve, but it definitely makes it a must-have feature that needs to be implemented. By the way, what do developers do between the releases apart from fixing bugs?

Comment 24 Michal Witkowski 2010-02-02 16:03:55 UTC

From: https://bugs.freedesktop.org/show_bug.cgi?id=3188

"Comment  #45 From Praveen Thirukonda  2009-12-27 00:42:00 PST  -------

it seems this bug now has a working patch and yet there has not been any
activity for the past few weeks.
It would really be great if this is committed soon as this is a really annoying
bug for many. "

It seems that there's hope :)

Comment 25 Albert Astals Cid 2010-02-02 23:50:07 UTC

#23: What do we do? Well, personally I sleep 7 hours a day, work 8 hours a day, spend 2 eating and preparing things to eat, 1 travelling to and from work, 1 shopping for things to eat, and in the remaining 3 hours I try to code things for KDE. But then some user demands to know what I do with my life and those 3 hours become 2.5 hours. You should be happy I have no friends, otherwise those 2.5 hours would be 0.

Comment 26 Albert Astals Cid 2010-02-02 23:51:38 UTC

#24 This is not going to help okular at all since we do not use poppler text algorithms since we support text selection for more formats than just PDF

Comment 27 Ekin Akoglu 2010-02-04 14:08:22 UTC

I did not mean to be rude when making above statement. I really appreciate KDE and its applications in terms of the approach they have taken, i.e. abundant configurability and capability of the application. If only Okular had this feature.

Comment 28 Luigi Toscano 2010-04-27 20:39:55 UTC

*** Bug 235531 has been marked as a duplicate of this bug. ***

Comment 29 Yuval Aviel 2010-08-05 11:29:20 UTC

(In reply to comment #26)
> #24 This is not going to help okular at all since we do not use poppler text
> algorithms since we support text selection for more formats than just PDF

I guess that 90% of Okular users that also use annotation, use it for reading PDF files.

Maybe solving this issue with Poppler solution is not such a bad way to go.

Comment 30 chosunsk 2010-08-20 03:22:39 UTC

(In reply to comment #29)
> (In reply to comment #26)
> > #24 This is not going to help okular at all since we do not use poppler text
> > algorithms since we support text selection for more formats than just PDF
> 
> I guess that 90% of Okular users that also use annotation, use it for reading
> PDF files.
> 
> Maybe solving this issue with Poppler solution is not such a bad way to go.

Indeed, evince, which uses poppler algorithms, supports column selection.

Comment 31 Albert Astals Cid 2010-08-20 20:06:14 UTC

Indeed, evince does not support text selection in the horde of document formats that Okular does; our selection might be better or worse, but it is [mostly] consistent among document formats.

But you don't really care, you like bashing developers because you think that will make them realize that you are right.

Comment 32 Peter Hedlund 2010-09-17 00:18:03 UTC

(In reply to comment #31)
> Indeed, evince does not support text selection in the horde of document formats
> that Okular does, our selection might be better or worse but it is [mostly]
> consistent among document formats.
> 
> But you don't really care, you like bashing developers because you think that
> will make them realize that you are right.

Albert, relax. But still, for many users Okular = pdf and for many users pdf = two-column scientific papers. Okular uses the poppler backend for pdf and if the backend now supports column selection, so should Okular.

I am sure there are already some if... then to handle all the formats you say Okular supports. Please consider making it a priority to add use of the poppler backend if the format where selection is happening is pdf.

Thanks,
Peter

Comment 33 Albert Astals Cid 2010-09-17 00:38:35 UTC

Let me tell you a secret: i don't need advanced text selection in okular, so obviously it's not my priority.

Now let me tell you another secret: Okular is free software! So all you that need advanced text selection are very welcome to improve okular text selection algorithm send a patch and then not only pdf text selection would be better but all the other formats too! For free!

Comment 34 Peter Hedlund 2010-09-17 00:49:11 UTC

(In reply to comment #33)
> Let me tell you a secret: i don't need advanced text selection in okular, so
> obviously it's not my priority.
> 
> Now let me tell you another secret: Okular is free software! So all you that
> need advanced text selection are very welcome to improve okular text selection
> algorithm send a patch and then not only pdf text selection would be better but
> all the other formats too! For free!

Here is my secret: I have never needed anything in the program I maintain (KWordQuiz), but I think it is fun when people show interest and tell me about features they would like. If they are reasonable, I see it as a challenge to my limited self-taught programming skills to try to implement them. That's why I use free software.

I have actually looked into PDF development, as I had some interest in page manipulation features like adding and removing pages, so I know it is no walk in the park. Still, it seems someone has already done a significant part of the work in this (selection) case. Now is the time to step up to the final challenge, or is programming not fun anymore?

Well, back to Adobe Reader...

Comment 35 kde2eran 2010-09-17 07:10:19 UTC

Does poppler guess the text layout using some generic heuristic algorithm, or use some explicit information on text ordering embedded in the PDF format? If it's the latter, then Okular ought to use that embedded information, via poppler, instead of discarding it and taking a wild guess instead.

Comment 36 Alvaro Aguilera 2010-09-17 10:46:54 UTC

I like the idea of supporting multiple file formats, but I guess that 99% of the people (myself included) use Okular exclusively as a PDF reader. It's a pity that the format independence gets in the way of implementing features that would actually be useful for the majority of its users. I'd bet that if someone revamped KPDF, people would make the switch from one day to the next.

Comment 37 Robert Knight 2010-09-17 11:21:53 UTC

> Does poppler guess the text layout using some generic heuristic algorithm, or
> use some explicit information on text ordering embedded in the PDF format?

PDFs do not contain layout information about how text is structured into paragraphs and columns.  As I understand it, what PDF provides is essentially a list of commands that say "draw string S at position P with font F".

I haven't looked into recent versions of Poppler but older versions had some fairly complex heuristic algorithms to try to piece together the layout given the input.  These algorithms had some interesting flaws.  If I remember correctly, due to numerical instability the order of paragraphs in the output text could differ significantly depending on the processor on which you ran the code.
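To illustrate why positioned strings alone are not enough, here is a small sketch. It is not Poppler's or Okular's code: `TextRun`, `naive_order`, `column_order` and the `gutter_x` parameter are hypothetical, and `y` is assumed to be measured from the top of the page (which differs from PDF's native coordinate system). It models a page as a bag of "draw string S at position P" records and compares a plain top-to-bottom sort with an ordering that knows where the column boundary is.

```python
from typing import List, NamedTuple


class TextRun(NamedTuple):
    """Hypothetical record for one 'draw string at position' command."""
    text: str
    x: float  # left edge of the string
    y: float  # distance from the top of the page (assumption)


def naive_order(runs: List[TextRun]) -> List[str]:
    """Read the page line by line across its full width.
    On a two-column page this interleaves the columns."""
    return [r.text for r in sorted(runs, key=lambda r: (r.y, r.x))]


def column_order(runs: List[TextRun], gutter_x: float) -> List[str]:
    """Read everything left of gutter_x first, then everything right of it.
    gutter_x has to come from some layout analysis; it is not in the file."""
    left = sorted((r for r in runs if r.x < gutter_x), key=lambda r: (r.y, r.x))
    right = sorted((r for r in runs if r.x >= gutter_x), key=lambda r: (r.y, r.x))
    return [r.text for r in left + right]


# Two lines in each of two columns, listed in arbitrary drawing order.
page = [
    TextRun("B2", 110, 12),
    TextRun("A1", 0, 0),
    TextRun("B1", 110, 0),
    TextRun("A2", 0, 12),
]
interleaved = naive_order(page)
by_column = column_order(page, gutter_x=100)
```

The naive ordering yields the line-by-line selection behaviour complained about in this report, while the column-aware ordering needs a gutter position that must be inferred heuristically.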

Comment 38 uetsah 2010-09-17 13:03:37 UTC

(In reply to comment #31)
> Indeed, evince does not support text selection in the horde of document formats
> that Okular does, our selection might be better or worse but it is [mostly]
> consistent among document formats.

Supporting multiple document formats consistently is great, but won't it be possible to still allow certain features to only be supported by some document formats and not others? Or to be implemented differently for each backend, where it makes sense?

The text selection user interface could still stay the same for every format, but in the background it could use whatever algorithms each respective backend provides for reading or guessing text layout structure.

So in case of PDF documents, the backend would use Poppler's heuristic algorithms. In case of OpenDocument documents, the backend would use the structural information already available in the document file. And so on...

Of course there could also be a generic algorithm that guesses the text structure independently of the document format, but as I understand it, that would be much more work...
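The design proposed in comment 38 could look roughly like the sketch below. It does not reflect Okular's actual architecture; `SelectionEngine`, `register`, `layout` and the two analysis functions are hypothetical names, and the "PDF" analysis is a stand-in that simply splits words into two halves so that the dispatch is visible.

```python
from typing import Callable, Dict, List

# A layout analysis takes the words of a page and returns them grouped
# into blocks in reading order (hypothetical signature).
LayoutAnalysis = Callable[[List[str]], List[List[str]]]


def generic_layout(words: List[str]) -> List[List[str]]:
    """Format-independent fallback: treat the page as a single block."""
    return [list(words)]


def fake_pdf_layout(words: List[str]) -> List[List[str]]:
    """Stand-in for a backend-specific heuristic: two equal halves."""
    half = len(words) // 2
    return [words[:half], words[half:]]


class SelectionEngine:
    """One selection interface for every format, with an optional
    backend-specific analysis per format and a generic fallback."""

    def __init__(self, fallback: LayoutAnalysis) -> None:
        self._fallback = fallback
        self._by_format: Dict[str, LayoutAnalysis] = {}

    def register(self, fmt: str, analysis: LayoutAnalysis) -> None:
        self._by_format[fmt] = analysis

    def layout(self, fmt: str, words: List[str]) -> List[List[str]]:
        # Use the backend's own analysis if it provides one.
        return self._by_format.get(fmt, self._fallback)(words)


engine = SelectionEngine(fallback=generic_layout)
engine.register("pdf", fake_pdf_layout)
pdf_blocks = engine.layout("pdf", ["a", "b", "c", "d"])
djvu_blocks = engine.layout("djvu", ["a", "b", "c", "d"])
```

The trade-off raised in comments 26 and 31 still applies: with per-backend analyses, selection may behave differently from one format to another.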

Btw, I personally think that even with this feature missing, Okular is still best PDF viewer out there, so thanks for the great work and for giving it away for free... :-)
If this feature is not on your priority list, that's of course totally fine, but please maybe still consider it for the future, just in case one day you're bored and don't know anything else to implement... ;-)

Comment 39 yves 2010-11-01 17:16:46 UTC

hi
sorry not sure this is the right place for me to comment..
As a user of Okular I would also benefit from the double column recognition for annotations, etc...
My workaround is to cover text with an inline comment without text and lower the opacity, or put an ellipse and change it into a rectangle.
With this in mind I would then also be happy if I could set the parameters of the annotations (opacity, colors etc...) once and for all, so that when I place a new one, it already has the look I want.
I have no idea how difficult that is to do...
cheers
y.

Comment 40 Albert Astals Cid 2010-11-01 20:39:21 UTC

yves, you are asking for something (I would be happy if I could set the parameters of the annotations) that has nothing to do with this bug. Please open a separate issue.

Comment 41 Pino Toscano 2011-03-13 10:59:47 UTC

*** Bug 268334 has been marked as a duplicate of this bug. ***

Comment 42 Albert Astals Cid 2011-06-27 07:36:50 UTC

*** Bug 276580 has been marked as a duplicate of this bug. ***

Comment 43 Albert Astals Cid 2011-08-28 20:33:57 UTC

This has been implemented in this year's GSoC and will be available in Okular as of KDE 4.8.

You can find more info at http://tsdgeos.blogspot.com/2011/08/okular-selection-gsoc-in-depth-analysis.html

You are all encouraged to give a try to the current git master code (if you know how to compile) and give back constructive feedback.

Comment 44 champignoom 2024-02-26 07:18:15 UTC

I'm reading a two-column pdf (https://dl.acm.org/doi/pdf/10.1145/3477113.3487272), for which the selection still doesn't work properly.

Okular Version: 23.08.5
KDE Plasma Version: 5.27.10
KDE Frameworks Version: 5.115.0
Qt Version: 5.15.12

Is there any chance to further improve the column recognition algorithm?

Comment 45 Albert Astals Cid 2024-02-26 21:04:04 UTC

I am going to close this, please open a new bug.

This has been marked as fixed for 13 years and has more than 20 users who get notified when things change here. My guess is that they really don't want to be bothered about this particular PDF that fails, because for them it works; if it didn't, they would have reopened this bug at some point in the 13 years that it has been marked as fixed.