Bug 321074

Summary: wrong encoding of registered mark
Product: [Unmaintained] kio
Reporter: Christopher Yeleighton <giecrilj>
Component: man
Assignee: Unassigned bugs mailing-list <unassigned-bugs>
Status: RESOLVED FIXED
Severity: normal
CC: fpilee, kollix, mail
Priority: NOR    
Version: 4.10.3   
Target Milestone: ---   
Platform: openSUSE   
OS: Linux   
URL: man:/selinux
Latest Commit:
Version Fixed In: 14.12.1
Sentry Crash Report:

Description Christopher Yeleighton 2013-06-12 19:47:19 UTC
The registered mark is converted by the manual page slave to a Chinese ideograph.

Reproducible: Always

Steps to Reproduce:
  1. { man selinux; }
  2. { kioclient cat man:/selinux; }
Actual Results:  
  1. Type Enforcement®
  2. Type Enforcement速

Expected Results:  
  2. Type Enforcement®


LC_CTYPE="pl_PL.UTF-8"
Comment 1 Martin Koller 2015-01-06 18:50:43 UTC
See bug 141340

In brief: man page files do not declare the encoding they are written in,
so we try to auto-detect it, which fails in this case.
(Here the guessed encoding is "EUC-JP" ...)

Not sure how we could fix this.
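
For illustration, the symptom can be reproduced outside of KIO. This is a minimal Python sketch, assuming only what this report states: the page bytes are UTF-8 and the guessed encoding is EUC-JP. The variable names are illustrative and have nothing to do with the kio_man sources.

```python
# The registered mark as it is stored in a UTF-8 man page.
registered = "\u00ae"             # ®
raw = registered.encode("utf-8")  # two bytes: 0xC2 0xAE

# Decoding the same two bytes with the wrongly guessed encoding
# yields a single CJK ideograph instead of the registered mark.
misread = raw.decode("euc_jp")

# Decoding with the real encoding round-trips cleanly.
correct = raw.decode("utf-8")

print(repr(raw), misread, correct)
```

The two bytes happen to form one valid EUC-JP double-byte character, so no decoding error is raised and the wrong guess goes unnoticed.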
Comment 2 Martin Koller 2015-01-06 18:51:55 UTC
*** Bug 329966 has been marked as a duplicate of this bug. ***
Comment 3 Martin Koller 2015-01-06 18:52:51 UTC
*** Bug 337479 has been marked as a duplicate of this bug. ***
Comment 4 Martin Koller 2015-01-08 18:24:41 UTC
Git commit 3208955e66c48b07281271933c2be5e49328720f by Martin Koller.
Committed on 08/01/2015 at 18:18.
Pushed by mkoller into branch 'Applications/14.12'.

Do not use KEncodingProber - it gives false results; Try dirname or UTF8

The auto-detection of the man page file content with KEncodingProber
was not successful - there are some bug reports showing it does not work
reliably - often giving EUC-JP or gb18030 as encoding, which is wrong.

I now try to find the encoding inside the man page file
(according to manconv) or from the name of the directory in which the
file resides. However, on my openSUSE system, neither the definition
inside nor the directory name tells me it's UTF-8, but all pages are in
UTF-8. Therefore I now use UTF-8 as default, which can be overridden
with the env-var MAN_ICONV_INPUT_CHARSET.
FIXED-IN: 14.12.1

M  +9    -18   kioslave/man/kio_man.cpp
M  +92   -20   kioslave/man/man2html.cpp
M  +6    -0    kioslave/man/man2html.h

http://commits.kde.org/kde-runtime/3208955e66c48b07281271933c2be5e49328720f
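
The lookup described in the commit message above can be sketched roughly as follows. This is not the code from man2html.cpp. It is a Python approximation under stated assumptions: the in-file declaration is the first-line `'\" -*- coding: NAME -*-` comment that manconv documents, the directory form is a language directory such as `ja.EUC-JP`, and the order of precedence (file, then directory, then environment variable, then UTF-8) is a guess, since the commit message does not spell it out. The function names are hypothetical.

```python
import os
import re

# Assumed manconv-style declaration on the first line, e.g.:
#   '\" -*- coding: ISO-8859-2 -*-
CODING_LINE = re.compile(rb"""^'\\"\s+-\*-\s*coding:\s*([\w.-]+)\s*-\*-""")


def encoding_from_dirname(path):
    """Return the charset suffix of the language directory, if any.

    /usr/share/man/ja.EUC-JP/man1/ls.1 -> "EUC-JP"
    /usr/share/man/man8/selinux.8      -> None
    """
    lang_dir = os.path.basename(os.path.dirname(os.path.dirname(path)))
    _, dot, charset = lang_dir.partition(".")
    return charset if dot and charset else None


def guess_man_encoding(path, first_line, environ=None):
    """Pick an input encoding without probing the file content."""
    environ = os.environ if environ is None else environ

    # 1. Declaration inside the page itself.
    match = CODING_LINE.match(first_line)
    if match:
        return match.group(1).decode("ascii")

    # 2. Charset encoded in the directory name.
    from_dir = encoding_from_dirname(path)
    if from_dir:
        return from_dir

    # 3. Environment override, otherwise the UTF-8 default.
    return environ.get("MAN_ICONV_INPUT_CHARSET") or "UTF-8"


print(guess_man_encoding("/usr/share/man/man8/selinux.8", b'.TH "selinux" "8"\n', {}))
```

With neither a declaration nor a charset in the directory name, which is the situation described for the openSUSE pages, the sketch falls through to UTF-8 unless MAN_ICONV_INPUT_CHARSET is set.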
Comment 5 Christopher Yeleighton 2018-04-15 21:26:03 UTC
Still present in kio-extras 17.04.2
Comment 6 Julian Steinmann 2018-06-24 13:23:46 UTC
I cannot reproduce this issue with kio-extras 18.04.2: when I execute e.g. kioclient5 cat man:/usr/share/man/es/man1/ark.1.gz and open the output in a browser, all special characters are displayed correctly. Can anybody confirm that this is no longer an issue?
Comment 7 Martin Koller 2018-08-24 11:41:23 UTC
I can reproduce it (kio_man version 18.8.0) with
man:/selinux(8)
Comment 8 Martin Koller 2018-08-24 16:15:01 UTC
Git commit 1c45ddbe94c3fdfedf35f801ddfeeab6d17f2cc4 by Martin Koller.
Committed on 24/08/2018 at 15:19.
Pushed by mkoller into branch 'master'.

Fwd port: Do not use KEncodingProber - it gives false results

forward port of 3208955e66c48b07281271933c2be5e49328720f
from old kde-runtime repo

Original commit text:
Do not use KEncodingProber - it gives false results; Try dirname or UTF8

The auto-detection of the man page file content with KEncodingProber
was not successful - there are some bug reports showing it does not work
reliably - often giving EUC-JP or gb18030 as encoding, which is wrong.

I now try to find the encoding inside the man page file
(according to manconv) or from the name of the directory in which the
file resides. However, on my openSUSE system, neither the definition
inside nor the directory name tells me it's UTF-8, but all pages are in
UTF-8. Therefore I now use UTF-8 as default, which can be overridden
with the env-var MAN_ICONV_INPUT_CHARSET.

M  +9    -18   man/kio_man.cpp
M  +89   -25   man/man2html.cpp
M  +6    -0    man/man2html.h
M  +1    -1    man/tests/CMakeLists.txt

https://commits.kde.org/kio-extras/1c45ddbe94c3fdfedf35f801ddfeeab6d17f2cc4
Comment 9 Christopher Yeleighton 2019-11-03 03:12:25 UTC
Note: the encoding prober in kcodecs-5.55.0 gives me UTF-8 at 99% as expected when fed with the file hunspell.1 (unzipped).  The unzipped file does not start with a BOM.
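
The two facts in this note (no BOM, content consistent with UTF-8) can be checked without any prober. This is a small Python sketch with hypothetical helper names. Strict UTF-8 validity is a simpler test than the statistical one the prober performs, so it only cross-checks the result and does not replace the prober.

```python
import codecs


def starts_with_bom(data: bytes) -> bool:
    """True if the data begins with the UTF-8 byte order mark (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def is_valid_utf8(data: bytes) -> bool:
    """True if the data decodes as UTF-8 without errors."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Made-up sample standing in for the content of an unzipped man page.
sample = "Type Enforcement\u00ae\n".encode("utf-8")
print(starts_with_bom(sample), is_valid_utf8(sample))
```

To run it on a real page, read the unzipped file in binary mode and pass the bytes to both helpers.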