334733 – Okular txt backend chokes on unicode text

Summary: Okular txt backend chokes on unicode text

Status:	RESOLVED FIXED

Alias:	None

Product:	okular
Classification:	Applications
Component:	general
Version:	21.04.3
Platform:	Ubuntu Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	Okular developers

URL:
Keywords:

Duplicates (2):	353302 416997
Depends on:
Blocks:

Reported:	2014-05-14 07:47 UTC by Sergio
Modified:	2021-07-17 02:45 UTC
CC List:	10 users

See Also:
Latest Commit:	https://invent.kde.org/graphics/okular/commit/1047fd1df77a3e70ebf76c26bd821d268063592c
Version Fixed In:	21.08
Sentry Crash Report:

Attachments
A file with two lines (second line is Unicode Cyrillic) (47 bytes, text/plain) 2014-05-14 08:11 UTC, Yuri Chornoivan

Description Sergio 2014-05-14 07:47:00 UTC

This bug has been tagged for the general component of okular, but it actually concerns the txt backend, which is not listed in the bug tracker's drop-down menu.

To reproduce

Make a text file with UTF-8 encoding. Make sure that it contains at least one character with a two-byte representation, say 'è'.

When okular displays the file, at best it shows some weird glyphs in place of the two-byte character. At worst it shows a blank page.
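
The reproduction step above can be scripted. The following is a minimal sketch in Python, not part of the original report; the file name and the sample word are arbitrary choices made for illustration.

```python
import os
import tempfile


def write_utf8_sample(path: str, text: str = "caffè\n") -> bytes:
    """Write `text` to `path` as UTF-8 and return the raw bytes written."""
    data = text.encode("utf-8")
    with open(path, "wb") as f:
        f.write(data)
    return data


# Hypothetical file name; any location works.
sample_path = os.path.join(tempfile.gettempdir(), "okular-utf8-sample.txt")
raw = write_utf8_sample(sample_path)

# 'è' (U+00E8) is encoded as the two bytes 0xC3 0xA8 in UTF-8.
print(raw)
```

Opening the resulting file in an affected okular version should then show the symptoms described above.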

In order to support txt files, I think that okular needs to be able to guess the encoding first. Even better, when the txt backend is active, there should be a way to explicitly instruct okular about the encoding to use (e.g. an extra entry in the view menu), like all programs that need to deal with text files (e.g. kate) typically do.

Reproducible: Always

Comment 1 Albert Astals Cid 2014-05-14 07:48:39 UTC

For confirmation, could you please attach such a file?

Comment 2 Yuri Chornoivan 2014-05-14 08:11:41 UTC

Created attachment 86625
A file with two lines (second line is Unicode Cyrillic)

Comment 3 Albert Astals Cid 2014-05-14 08:26:32 UTC

Correct

Comment 4 Christoph Feck 2014-05-14 21:33:12 UTC

I never got either KEncodingProber or KEncodingDetector to work correctly (in other words, to detect UTF-8). The workaround was to simply assume UTF-8, and if conversion fails, because the file is not UTF-8, then try locale encoding. See bug 228172.
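
The workaround described in this comment (assume UTF-8, then fall back to the locale encoding if the conversion fails) can be sketched as follows. This is an illustrative Python sketch, not the code from bug 228172 or from okular; the function name `decode_text` and its `fallback` parameter are hypothetical.

```python
import locale
from typing import Optional, Tuple


def decode_text(data: bytes, fallback: Optional[str] = None) -> Tuple[str, str]:
    """Decode `data` as strict UTF-8, else with a fallback encoding.

    Returns the decoded text and the name of the encoding that was used.
    """
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumption: the locale's preferred encoding is a reasonable second guess.
        encoding = fallback or locale.getpreferredencoding(False)
        return data.decode(encoding, errors="replace"), encoding


print(decode_text("è".encode("utf-8")))
print(decode_text(b"\xe8", fallback="latin-1"))
```

The approach relies on the fact that text in most legacy 8-bit encodings rarely happens to be valid UTF-8 once it contains non-ASCII characters, so a strict UTF-8 attempt is a cheap first test.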

Comment 5 Albert Astals Cid 2020-02-07 18:28:32 UTC

*** Bug 416997 has been marked as a duplicate of this bug. ***

Comment 6 Alexander Kernozhitsky 2020-12-22 12:51:19 UTC

Just tried on Okular 20.12.0, the bug is still reproducible for me.

Comment 7 Ilya Bizyaev 2021-07-12 11:25:15 UTC

*** Bug 353302 has been marked as a duplicate of this bug. ***

Comment 8 Yaroslav Sidlovsky 2021-07-12 11:33:01 UTC

The problem lies here: https://invent.kde.org/graphics/okular/-/blob/5447aa1021a2313c4e4cfddbd3a0abb86270ee13/generators/txt/document.cpp#L52.

For small texts, confidence() always returns small values.
For the example from the attachment, confidence() == 0.2, so no encoding is selected at all.
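
The failure mode can be illustrated with a small Python sketch. It is not okular's code: the function names, the 0.5 threshold, and the size-based fallback are assumptions made for illustration. Only the 0.2 confidence value, the 47-byte attachment size, and the 3000-byte figure (from the title of the later commit) come from this report.

```python
from typing import Optional


def select_encoding(guess: str, confidence: float,
                    threshold: float = 0.5) -> Optional[str]:
    """Accept the prober's guess only when its confidence reaches the threshold."""
    return guess if confidence >= threshold else None


def select_encoding_lenient(guess: str, confidence: float, size: int,
                            threshold: float = 0.5,
                            small_text_limit: int = 3000) -> Optional[str]:
    """Same check, but trust the guess for inputs too small to score well.

    This is one conceivable mitigation, not necessarily what the merged fix does.
    """
    if confidence >= threshold or size <= small_text_limit:
        return guess
    return None


# The 47-byte attachment scores 0.2, so the strict check selects nothing.
print(select_encoding("UTF-8", 0.2))
print(select_encoding_lenient("UTF-8", 0.2, size=47))
```

With a strict threshold alone, any short file is left without an encoding regardless of how plausible the guess is, which matches the behaviour described above.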

Comment 9 Yaroslav Sidlovsky 2021-07-12 11:35:29 UTC

I can also confirm that the bug still exists in okular-21.04.3.

Comment 10 Yaroslav Sidlovsky 2021-07-12 12:16:06 UTC

https://invent.kde.org/graphics/okular/-/merge_requests/454

Comment 11 Albert Astals Cid 2021-07-14 12:36:27 UTC

Git commit 929c94e09d6b44b7b26c2f43e9d0b8451ee0e4ca by Albert Astals Cid, on behalf of Yaroslav Sidlovsky.
Committed on 14/07/2021 at 08:23.
Pushed by aacid into branch 'master'.

Fixed encoding detection for small texts (up to 3000 bytes)

M  +5    -0    generators/txt/document.cpp

https://invent.kde.org/graphics/okular/commit/929c94e09d6b44b7b26c2f43e9d0b8451ee0e4ca

Comment 12 Albert Astals Cid 2021-07-14 19:58:34 UTC

Git commit 1047fd1df77a3e70ebf76c26bd821d268063592c by Albert Astals Cid, on behalf of Yaroslav Sidlovsky.
Committed on 14/07/2021 at 19:58.
Pushed by aacid into branch 'release/21.08'.

Fixed encoding detection for small texts (up to 3000 bytes)
(cherry picked from commit 929c94e09d6b44b7b26c2f43e9d0b8451ee0e4ca)

M  +5    -0    generators/txt/document.cpp

https://invent.kde.org/graphics/okular/commit/1047fd1df77a3e70ebf76c26bd821d268063592c