This bug is filed under the general component of Okular, but it actually concerns the txt backend, which is not listed in the bug tracker's drop-down menu. To reproduce: create a text file with UTF-8 encoding that contains at least one character with a two-byte representation, say 'è', and open it in Okular. In the best case Okular displays the file with weird glyphs in place of the two-byte character. In the worst case it displays a blank page. To support txt files, I think Okular first needs to be able to guess the encoding. Even better, when the txt backend is active, there should be a way to tell Okular explicitly which encoding to use (e.g. an extra entry in the View menu), as programs that deal with text files (e.g. Kate) typically do. Reproducible: Always
For confirmation, could you please attach such a file?
Created attachment 86625: A file with two lines (second line is Unicode Cyrillic)
Correct
I never got either KEncodingProber or KEncodingDetector to work correctly (in other words, to detect UTF-8). The workaround was to simply assume UTF-8, and if conversion fails, because the file is not UTF-8, then try locale encoding. See bug 228172.
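The workaround described above (assume UTF-8, and fall back to the locale encoding if the bytes are not valid UTF-8) could be sketched roughly as follows. This is a minimal standard-library illustration, not the code used in Okular or in bug 228172. The names `isValidUtf8` and `chooseEncoding` are hypothetical, and the locale encoding is simply passed in as a string.

```cpp
#include <cstddef>
#include <string>

// Returns true if `bytes` is well-formed UTF-8. Rejects truncated sequences,
// overlong forms, UTF-16 surrogates, and code points above U+10FFFF.
bool isValidUtf8(const std::string &bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (c < 0x80) {
            len = 1; cp = c;
        } else if ((c & 0xE0) == 0xC0) {
            len = 2; cp = c & 0x1F;
        } else if ((c & 0xF0) == 0xE0) {
            len = 3; cp = c & 0x0F;
        } else if ((c & 0xF8) == 0xF0) {
            len = 4; cp = c & 0x07;
        } else {
            return false; // stray continuation byte or invalid lead byte
        }
        if (i + len > n) {
            return false; // sequence is cut off at the end of the input
        }
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(bytes[i + k]);
            if ((cc & 0xC0) != 0x80) {
                return false; // expected a continuation byte (10xxxxxx)
            }
            cp = (cp << 6) | (cc & 0x3F);
        }
        // Reject overlong encodings, surrogates, and out-of-range values.
        if ((len == 2 && cp < 0x80) || (len == 3 && cp < 0x800) || (len == 4 && cp < 0x10000)) {
            return false;
        }
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
            return false;
        }
        i += len;
    }
    return true;
}

// Hypothetical helper: pick "UTF-8" when the bytes validate,
// otherwise the caller-supplied locale encoding name.
std::string chooseEncoding(const std::string &bytes, const std::string &localeEncoding)
{
    return isValidUtf8(bytes) ? std::string("UTF-8") : localeEncoding;
}
```

Strict validation works as a first check because non-ASCII text in a legacy 8-bit encoding rarely happens to form valid UTF-8 sequences, whereas a statistical prober needs enough input to become confident.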
*** Bug 416997 has been marked as a duplicate of this bug. ***
Just tried on Okular 20.12.0, the bug is still reproducible for me.
*** Bug 353302 has been marked as a duplicate of this bug. ***
The problem lies here: https://invent.kde.org/graphics/okular/-/blob/5447aa1021a2313c4e4cfddbd3a0abb86270ee13/generators/txt/document.cpp#L52. For small texts, confidence() always returns small values. For the example from the attachment, confidence() == 0.2, so no encoding is selected at all.
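To illustrate why a low confidence leaves the document without any encoding, here is a minimal sketch of a confidence gate. It is not the code at the linked line: `ProbeResult`, `selectEncoding`, and the cutoff `kMinConfidence` are hypothetical, and the real threshold is not stated in this thread.

```cpp
#include <string>

// Hypothetical stand-in for what a statistical encoding prober reports.
struct ProbeResult {
    std::string encoding; // best guess, e.g. "UTF-8"
    float confidence;     // 0.0 .. 1.0
};

// Hypothetical cutoff, chosen only for illustration.
constexpr float kMinConfidence = 0.5f;

// Returns the guessed encoding only if the prober is confident enough,
// otherwise an empty string meaning "no encoding selected".
std::string selectEncoding(const ProbeResult &r)
{
    return r.confidence > kMinConfidence ? r.encoding : std::string();
}
```

With a gate like this, a short file that yields a confidence of 0.2 gets an empty result even when the prober's guess is right, which matches the behaviour reported for the attachment.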
Plus, I can confirm that the bug still exists in okular-21.04.3.
https://invent.kde.org/graphics/okular/-/merge_requests/454
Git commit 929c94e09d6b44b7b26c2f43e9d0b8451ee0e4ca by Albert Astals Cid, on behalf of Yaroslav Sidlovsky. Committed on 14/07/2021 at 08:23. Pushed by aacid into branch 'master'. Fixed encoding detection for small texts (up to 3000 bytes) M +5 -0 generators/txt/document.cpp https://invent.kde.org/graphics/okular/commit/929c94e09d6b44b7b26c2f43e9d0b8451ee0e4ca
Git commit 1047fd1df77a3e70ebf76c26bd821d268063592c by Albert Astals Cid, on behalf of Yaroslav Sidlovsky. Committed on 14/07/2021 at 19:58. Pushed by aacid into branch 'release/21.08'. Fixed encoding detection for small texts (up to 3000 bytes) (cherry picked from commit 929c94e09d6b44b7b26c2f43e9d0b8451ee0e4ca) M +5 -0 generators/txt/document.cpp https://invent.kde.org/graphics/okular/commit/1047fd1df77a3e70ebf76c26bd821d268063592c