Summary: | Okular txt backend chokes on unicode text | ||
---|---|---|---|
Product: | [Applications] okular | Reporter: | Sergio <sergio.callegari> |
Component: | general | Assignee: | Okular developers <okular-devel> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | aacid, bizyaev, cfeck, Fahad.alsaidi, m.weghorn, motherlode.muwa, nate, sh200105, simonandric5, zawertun |
Priority: | NOR | ||
Version: | 21.04.3 | ||
Target Milestone: | --- | ||
Platform: | Ubuntu | ||
OS: | Linux | ||
Latest Commit: | https://invent.kde.org/graphics/okular/commit/1047fd1df77a3e70ebf76c26bd821d268063592c | Version Fixed In: | 21.08 |
Attachments: | A file with two lines (second line is Unicode Cyrillic) |
Description
Sergio
2014-05-14 07:47:00 UTC
For confirmation, could you please attach such a file?

Created attachment 86625
A file with two lines (second line is Unicode Cyrillic)
Correct

I never got either KEncodingProber or KEncodingDetector to work correctly (in other words, to detect UTF-8). The workaround was to simply assume UTF-8 and, if the conversion fails because the file is not UTF-8, try the locale encoding. See bug 228172.

*** Bug 416997 has been marked as a duplicate of this bug. ***

Just tried on Okular 20.12.0; the bug is still reproducible for me.

*** Bug 353302 has been marked as a duplicate of this bug. ***

The problem lies here: https://invent.kde.org/graphics/okular/-/blob/5447aa1021a2313c4e4cfddbd3a0abb86270ee13/generators/txt/document.cpp#L52. For small texts, confidence() will always return small values. For the example from the attachment, confidence() == 0.2, so no encoding is selected at all. I can also confirm that the bug still exists in okular-21.04.3.

Git commit 929c94e09d6b44b7b26c2f43e9d0b8451ee0e4ca by Albert Astals Cid, on behalf of Yaroslav Sidlovsky.
Committed on 14/07/2021 at 08:23.
Pushed by aacid into branch 'master'.

Fixed encoding detection for small texts (up to 3000 bytes)

M +5 -0 generators/txt/document.cpp

https://invent.kde.org/graphics/okular/commit/929c94e09d6b44b7b26c2f43e9d0b8451ee0e4ca

Git commit 1047fd1df77a3e70ebf76c26bd821d268063592c by Albert Astals Cid, on behalf of Yaroslav Sidlovsky.
Committed on 14/07/2021 at 19:58.
Pushed by aacid into branch 'release/21.08'.

Fixed encoding detection for small texts (up to 3000 bytes)
(cherry picked from commit 929c94e09d6b44b7b26c2f43e9d0b8451ee0e4ca)

M +5 -0 generators/txt/document.cpp

https://invent.kde.org/graphics/okular/commit/1047fd1df77a3e70ebf76c26bd821d268063592c
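
For readers following the thread, here is a minimal Python sketch of the two ideas discussed in the comments: a confidence cutoff that rejects a detector's guess for short inputs, and the "assume UTF-8, then fall back to the locale encoding" workaround. It is not Okular's code and not the committed fix. The names `accept_guess`, `decode_text` and `MIN_CONFIDENCE`, and the cutoff value 0.5, are assumptions made for illustration only.

```python
import locale
from typing import Optional, Tuple

# Hypothetical cutoff for illustration; the value Okular uses is not stated in this report.
MIN_CONFIDENCE = 0.5


def accept_guess(encoding: str, confidence: float,
                 min_confidence: float = MIN_CONFIDENCE) -> Optional[str]:
    """Return the detector's guess only if its confidence reaches the cutoff."""
    return encoding if confidence >= min_confidence else None


def decode_text(data: bytes, fallback: Optional[str] = None) -> Tuple[str, str]:
    """Decode as UTF-8 when the bytes are valid UTF-8, else use a fallback.

    Returns (text, encoding_used). The fallback defaults to the locale's
    preferred encoding, mirroring the workaround described in the comments.
    """
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    encoding = fallback or locale.getpreferredencoding(False)
    # errors="replace" keeps the text viewable even if the fallback is wrong.
    return data.decode(encoding, errors="replace"), encoding
```

With a confidence of 0.2, as reported for the attached file, `accept_guess` returns `None`, which corresponds to the "no encoding selected" symptom. `decode_text` sidesteps the detector entirely for valid UTF-8 input, at the cost of misreading a legacy-encoded file that happens to be valid UTF-8.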