Version: (using KDE Devel)
Installed from: Compiled sources
Compiler: gcc 3.4.3
OS: Linux

KWord doesn't recognize language/charset settings. Documents with Polish letters (encoded as cp1250) are not rendered correctly on screen, even when the file declares the settings properly:

{\rtf1\ansi {\fonttbl{\f0\fcharset238 Arial;}}{\colortbl;\red255\green0\blue0;\red0\green0\blue0;}{\stylesheet{\fs20\lang1045 \snext0 Normal;}{\s15\ql\f2\fs20\lang1045 \snext0 Tabela;}}\pard\plain \fs20\lang1045 \f0

The fragment declares the encoding with \fcharset238 (Eastern European) and the language with \lang1045 (Polish).

Full document in attachment.
Created attachment 15136: RTF document with Polish letters displayed incorrectly in KWord
\ansi without further precision means CP 1252, however. \fcharset means only "font charset", not the encoding of the file.

So I would consider this a buggy file (as unfortunately too many RTF files are).

Have a nice day!
> So I would consider this a buggy file (unfortunately as too many RTF
> files).

The problem is that, according to your definition, there are no proper RTF files, which makes KWord unusable for Polish users. The only source of "proper" RTF files in the MS world is MS products. I have not seen an RTF file produced by any other tool (OCR programs, specialized law applications, etc.) displayed properly in KWord.
On Thursday 16 March 2006 01:02, Mikolaj Machowski wrote:
(...)
> > So I would consider this a buggy file (unfortunately as too many RTF
> > files).
>
> Problem is that according to your definition there are no proper rtf
> files making KWord unusable for Polish users. The only source of
> "proper" RTF files in MS world are MS-products. Didn't spot properly
> displayed in KWord RTF file produced by any other tool (OCRs,
> specialized law applications, etc.).

So I suppose that somebody will really have to implement a way to override the encoding of an RTF file.

Have a nice day!
Created attachment 15149: Patch to fix the KWord RTF import charset.

The attached patch implements \fcharset. OK to commit?
Patch committed. Please feel free to revert, mail, or reply if the solution doesn't solve the problem or introduces a regression. Thanks :)
On Thursday 16 March 2006 20:22, Sebastian Sauer wrote:
[bugs.kde.org quoted mail]

Please revert. The patch is wrong, even as a hack. The parameter of \fcharset is not a codepage and cannot be used directly as a codepage.

Have a nice day!
Confirming, it doesn't work (I waited for KOffice to recompile).

In RTF the charset isn't simply related to the codepage, yet the patch passes the \fcharset value straight through:

+void RTFImport::setCharset( RTFProperty *property )
+{
+    if(token.value >= 0)
+        setCodepage(property);
+}
+

For example, \fcharset238 is codepage cp-1250.

KWord shows MS Word documents properly only because the letters there are encoded with Unicode and special entities; it doesn't use the native encoding.

Aaahhh. I checked the RTF 1.6 doc and it looks like, by design, RTF doesn't support native encodings?! They are supported only through Unicode entities.

So you are right, the document is invalid, but that type of document is really popular in Poland. I wonder about other countries with non-latin1 charsets/encodings...

Table of charsets from the RTF 1.6 docs:

\fcharsetN
    Specifies the character set of a font in the font table. Values for N are defined by Windows header files:

      0 -- ANSI                <- cp-1252 (MM)
      1 -- Default
      2 -- Symbol
      3 -- Invalid
     77 -- Mac
    128 -- Shift Jis
    129 -- Hangul              <- cp-949 (?)
    130 -- Johab               <- cp-1361
    134 -- GB2312              <- I always mix these two: cp-936 and cp-950
    136 -- Big5                <- as above
    161 -- Greek               <- cp-1253
    162 -- Turkish             <- cp-1254
    163 -- Vietnamese          <- cp-1258
    177 -- Hebrew              <- cp-1255
    178 -- Arabic
    179 -- Arabic Traditional
    180 -- Arabic user
    181 -- Hebrew user
    186 -- Baltic              <- cp-1257
    204 -- Russian             <- cp-1251 (? - not sure, there are several cyrillics)
    222 -- Thai                <- cp-874
    238 -- Eastern European    <- cp-1250
    254 -- PC 437
    255 -- OEM
On Friday 17 March 2006 12:06, Mikolaj Machowski wrote:
(...)
> KWord shows MS-Word documents properly only because letters there are
> encoded by Unicode and special entities - it doesn't use native
> encoding.

KWord's RTF import filter uses the native encoding when it is correctly defined:

\pc      codepage 850 (an approximation, as it should be codepage 437)
\pca     codepage 850
\mac     Apple Roman encoding
\ansi    codepage 1252

Then there is the \ansicpg keyword to set a codepage.

> Aaahhh. Checked RTF 1.6 doc and it looks like by design RTF doesn't
> support native encodings?!

It does, as it defines the keywords that I have listed above.

> They are supported only through Unicode
> entities.

Using Unicode (especially the \u keyword) is only an option, even if perhaps a recommended one, when backward compatibility is not needed.

> So you are right - document is invalid but that type of
> documents is really popular in Poland. I wonder about other countries
> with non-latin1 charsets/encodings...

Yes, I am starting to wonder too.

It worries me that \pc and \ansi would perhaps not mean a particular codepage but just the locale's MS-DOS and Windows codepages respectively. If that is the case, then the RTF filter needs quite an improvement. (And a hack will probably not be enough for documents using multiple kinds of fonts with different \fcharset declarations.)

> Table of charsets from RTF 1.6 docs:
>
> \fcharsetN
> Specifies the character set of a font in the font table. Values for
> N are defined by Windows header files:

I suppose that it would be best if such a table were more central in KOffice, as at least the other KWord filters would need it too, since their formats come from Windows as well.
(...)

Have a nice day!
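The keyword list above can be sketched as a small lookup. This is only an illustration, not the code of KWord's filter: the names codepageForKeyword, documentCodepage, and kMacRoman are hypothetical, the number 10000 is the Windows codepage number for Mac Roman (the comment itself only says "Apple Roman"), and both the rule that \ansicpg wins over the keyword default and the fallback to 1252 are assumptions.

```cpp
#include <optional>
#include <string_view>

// Hypothetical constant: Windows numbers the Mac Roman encoding as 10000.
constexpr int kMacRoman = 10000;

// Default codepage implied by an RTF character-set keyword, given without
// its leading backslash. Unknown keywords yield std::nullopt.
constexpr std::optional<int> codepageForKeyword(std::string_view keyword) {
    if (keyword == "ansi") return 1252;
    if (keyword == "pc")   return 850;  // approximation used by the filter
    if (keyword == "pca")  return 850;
    if (keyword == "mac")  return kMacRoman;
    return std::nullopt;
}

// Assumption: an explicit \ansicpgN overrides the keyword default, and a
// document with no recognized keyword falls back to 1252.
constexpr int documentCodepage(std::string_view keyword,
                               std::optional<int> ansicpg) {
    if (ansicpg) return *ansicpg;
    return codepageForKeyword(keyword).value_or(1252);
}
```

With these assumptions, documentCodepage("ansi", std::nullopt) gives 1252, while documentCodepage("ansi", 1250) gives 1250, which is the situation the Polish documents would need.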
Commit reverted (that is, the functionality is disabled again). I'll look at it after the release.
From http://www.df.lth.se/~triad/krad/recode/rtf-cvs.c ;

    {0,   1, 1252, "CP1252"},    /* ANSI_CHARSET (wingdi.h) */
    {1,   2, 0,    "UCS2"},      /* DEFAULT_CHARSET (Mozilla) */
    {2,   1, 0,    ""},          /* SYMBOL_CHARSET (wingdi.h) */
    {77,  1, 0,    "macintosh"}, /* MAC_CHARSET (wingdi.h) */
    {128, 2, 932,  "CP932"},     /* SHIFTJIS_CHARSET (Wine) */
    {129, 2, 949,  "CP949"},     /* HANGEUL_CHARSET (Wine) */
    {130, 2, 1361, "CP1361"},    /* JOHAB_CHARSET (Wine) */
    {134, 2, 936,  "CP936"},     /* GB2312_CHARSET (Wine) */
    {136, 2, 950,  "CP950"},     /* CHINESEBIG5_CHARSET (Wine) */
    {161, 1, 1253, "CP1253"},    /* GREEK_CHARSET (wingdi.h) */
    {162, 1, 1254, "CP1254"},    /* TURKISH_CHARSET (wingdi.h) */
    {163, 2, 1258, "CP1258"},    /* VIETNAMESE_CHARSET (Mozilla) */
    {177, 1, 1255, "CP1255"},    /* HEBREW_CHARSET (wingdi.h) */
    {178, 1, 1256, "CP1256"},    /* ARABIC_CHARSET former ARABICSIMPLIFIED_CHARSET (RTF 1.3) */
    {179, 1, 0,    ""},          /* ARABICTRADITIONAL_CHARSET - obsolete? (RTF 1.3) */
    {180, 1, 0,    ""},          /* ARABICUSER_CHARSET - obsolete? (RTF 1.3) */
    {181, 1, 0,    ""},          /* HEBREWUSER_CHARSET - obsolete? (RTF 1.3) */
    {186, 1, 1257, "CP1257"},    /* BALTIC_CHARSET (wingdi.h) */
    {204, 1, 1251, "CP1251"},    /* RUSSIAN_CHARSET former CYRILLIC_CHARSET (RTF 1.3) */
    {222, 1, 874,  "CP874"},     /* THAI_CHARSET (Wine) */
    {238, 1, 1250, "CP1250"},    /* EASTEUROPE_CHARSET former EASTERNEUROPE_CHARSET (RTF 1.3) */
    {254, 1, 437,  "IBM437"},    /* PC437_CHARSET - obsolete? (RTF 1.3) */
    {255, 1, 0,    ""},          /* OEM_CHARSET (wingdi.h) */
    {0,   0, 0,    NULL}
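To illustrate why the \fcharset parameter has to go through a table rather than being handed to the codepage setter directly, here is a minimal sketch of such a lookup. It is not the attached patch: CharsetEntry, kCharsetTable, and codepageForCharset are hypothetical names, and only the rows of the table above that carry a non-zero codepage are included. Everything else (DEFAULT, SYMBOL, MAC, the obsolete Arabic/Hebrew entries, OEM, unknown values) is reported as "no codepage" so the caller can decide what to do.

```cpp
#include <optional>

// Hypothetical table entry: \fcharset value -> Windows codepage number.
struct CharsetEntry {
    int charset;   // parameter N of \fcharsetN
    int codepage;  // Windows codepage from the table above
};

// Rows of the table above that name a non-zero codepage.
constexpr CharsetEntry kCharsetTable[] = {
    {0, 1252},   {128, 932},  {129, 949},  {130, 1361},
    {134, 936},  {136, 950},  {161, 1253}, {162, 1254},
    {163, 1258}, {177, 1255}, {178, 1256}, {186, 1257},
    {204, 1251}, {222, 874},  {238, 1250}, {254, 437},
};

// Returns the codepage for a \fcharset value, or std::nullopt when the
// table offers no usable codepage for it.
constexpr std::optional<int> codepageForCharset(int charset) {
    for (const CharsetEntry &entry : kCharsetTable) {
        if (entry.charset == charset) return entry.codepage;
    }
    return std::nullopt;
}
```

For the document in this report, codepageForCharset(238) yields 1250, whereas using 238 itself as a codepage number would select nothing meaningful.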
> Then there is the \ansicpg keyword to set a codepage.

After adding just \ansicpg1250 immediately after \ansi, KWord displays the previously attached document as it was intended (all Polish characters visible).

> > Aaahhh. Checked RTF 1.6 doc and it looks like by design RTF doesn't
> > support native encodings?!
> It does, as it defines the keywords that I have listed above.

Sorry, I misunderstood; pre-1.6 versions officially didn't support them. Even now \ansicpg is rather for proper translation of UTF than for native encodings. 8-bit characters are supported only as a side effect:

    (Converters that communicate with Microsoft Word for Windows or Microsoft Word for the Macintosh should expect 8-bit characters.)

> It worries me that \pc and \ansi would perhaps not mean a particular
> codepage but just the locale MS-DOS respectively Windows codepages. If
> that is the case, then the RTF filter need quite an improvement.

I am afraid this is the case. It is also possible that MS programs just guess the encoding depending on the locale, or perform additional tests to display documents properly. You could check the OO.o code: oowriter displays the document without problems.
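The manual workaround described above (adding \ansicpg1250 right after \ansi) can be sketched as a small string transformation. This is only an illustration under simplifying assumptions: addAnsiCodepage is a hypothetical helper, it looks at the raw text rather than parsing RTF, and it does not handle a space or an escaped backslash between the keywords.

```cpp
#include <cctype>
#include <string>

// Returns a copy of the RTF text with "\ansicpg<codepage>" inserted directly
// after the first "\ansi" keyword. The text is returned unchanged when there
// is no "\ansi" keyword or when "\ansicpg" already follows it.
std::string addAnsiCodepage(const std::string &rtf, int codepage) {
    const std::string keyword = "\\ansi";
    std::size_t pos = rtf.find(keyword);
    while (pos != std::string::npos) {
        const std::size_t end = pos + keyword.size();
        // A keyword ends at the first non-letter, so a match that is really
        // the start of "\ansicpg" is skipped here.
        const bool isWholeKeyword =
            end >= rtf.size() ||
            !std::isalpha(static_cast<unsigned char>(rtf[end]));
        if (isWholeKeyword) {
            if (rtf.compare(end, 8, "\\ansicpg") == 0) return rtf;
            std::string result = rtf;
            result.insert(end, "\\ansicpg" + std::to_string(codepage));
            return result;
        }
        pos = rtf.find(keyword, end);
    }
    return rtf;
}
```

Applied to a header that starts with {\rtf1\ansi {\fonttbl..., the call addAnsiCodepage(text, 1250) produces {\rtf1\ansi\ansicpg1250 {\fonttbl..., which is the edit that made the attached document display correctly.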
On Friday 17 March 2006 16:45, Sebastian Sauer wrote:
(...)
> From http://www.df.lth.se/~triad/krad/recode/rtf-cvs.c ;

seems to be LGPL too. :-)

> {0, 1, 1252, "CP1252"}, /* ANSI_CHARSET (wingdi.h) */
> {1, 2, 0, "UCS2"}, /* DEFAULT_CHARSET (Mozilla) */

UTF-16 is perhaps good as a font encoding, but surely it should not be used as the encoding for the RTF stream.

> {2, 1, 0, ""}, /* SYMBOL_CHARSET (wingdi.h) */
> {77, 1, 0, "macintosh"}, /* MAC_CHARSET (wingdi.h) */

Qt names it "Apple Roman".

> {128, 2, 932, "CP932"}, /* SHIFTJIS_CHARSET (Wine) */
> {129, 2, 949, "CP949"}, /* HANGEUL_CHARSET (Wine) */
> {130, 2, 1361, "CP1361"}, /* JOHAB_CHARSET (Wine) */
> {134, 2, 936, "CP936"}, /* GB2312_CHARSET (Wine) */
> {136, 2, 950, "CP950"}, /* CHINESEBIG5_CHARSET (Wine) */

I am not sure if we have correct support from Qt for these. (Qt probably has only the non-Microsoft variants.)

> {161, 1, 1253, "CP1253"}, /* GREEK_CHARSET (wingdi.h) */
> {162, 1, 1254, "CP1254"}, /* TURKISH_CHARSET (wingdi.h) */
> {163, 2, 1258, "CP1258"}, /* VIETNAMESE_CHARSET (Mozilla) */
> {177, 1, 1255, "CP1255"}, /* HEBREW_CHARSET (wingdi.h) */
> {178, 1, 1256, "CP1256"}, /* ARABIC_CHARSET former ARABICSIMPLIFIED_CHARSET (RTF 1.3) */
> {179, 1, 0, ""}, /* ARABICTRADITIONAL_CHARSET - obsolete? (RTF 1.3) */
> {180, 1, 0, ""}, /* ARABICUSER_CHARSET - obsolete? (RTF 1.3) */
> {181, 1, 0, ""}, /* HEBREWUSER_CHARSET - obsolete? (RTF 1.3) */
> {186, 1, 1257, "CP1257"}, /* BALTIC_CHARSET (wingdi.h) */
> {204, 1, 1251, "CP1251"}, /* RUSSIAN_CHARSET former CYRILLIC_CHARSET (RTF 1.3) */
> {222, 1, 874, "CP874"}, /* THAI_CHARSET (Wine) */
> {238, 1, 1250, "CP1250"}, /* EASTEUROPE_CHARSET former EASTERNEUROPE_CHARSET (RTF 1.3) */
> {254, 1, 437, "IBM437"}, /* PC437_CHARSET - obsolete? (RTF 1.3) */

Qt does not offer 437, only 850 (which I have found to be a good enough approximation for the RTF import filter).

> {255, 1, 0, ""}, /* OEM_CHARSET (wingdi.h) */
> {0, 0, 0, NULL}

Have a nice day!
On Friday 17 March 2006 16:49, Mikolaj Machowski wrote:
(...)
> > Then there is the \ansicpg keyword to set a codepage.
>
> After adding just \ansicpg1250 immediately after \ansi KWord displays
> previously attached document as it was intended (all Polish characters
> visible).

That is good. (The document could be even more "wrong".)

> > > Aaahhh. Checked RTF 1.6 doc and it looks like by design RTF doesn't
> > > support native encodings?!
> >
> > It does, as it defines the keywords that I have listed above.
>
> Sorry, misunderstood, pre-1.6 versions officially didn't support them.

It depends. \pc, \pca, \mac and \ansi have existed since WinWord 1.x (so probably RTF 1.2). Only \ansicpg is relatively recent.

> Even now \ansicpg is rather for proper translation of UTF than native
> encodings.

Why? The \u keyword does not need to know the encoding of the file.

> 8-bit characters are only as a side effect:

On the contrary, I think that it is the primary goal.

> (Converters that
> communicate with Microsoft Word for Windows or Microsoft Word for the
> Macintosh should expect 8-bit characters.)

The problem is that RTF is basically a 7-bit file format, because at the time RTF 1.0 was defined the major U.S. networks were not 8-bit clean. Until RTF 1.2 it was made a little less U.S.-centric, but you had to encode characters with \' if they were not 7-bit clean. Nowadays it should be 8-bit clean.

> > It worries me that \pc and \ansi would perhaps not mean a particular
> > codepage but just the locale MS-DOS respectively Windows codepages. If
> > that is the case, then the RTF filter need quite an improvement.
>
> I am afraid this is the case. Also possible is that MS-programs are
> just guessing encoding depending on locale or perform additional tests
> to display properly.
> You could check OO.o code - oowriter displays
> document without problems.

It is rather difficult to read OOo's code.

Have a nice day!
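The \' escaping mentioned above, which keeps an RTF stream 7-bit clean, can be sketched as follows. This is an illustrative helper only (escapeHighBytes is a hypothetical name, and it does not escape backslashes or braces); note that the escape carries just the byte value, so a reader still needs to know the codepage to interpret it.

```cpp
#include <string>

// Rewrites every byte outside 7-bit ASCII as the RTF escape \'hh (two
// lowercase hex digits) and leaves 7-bit bytes untouched.
std::string escapeHighBytes(const std::string &bytes) {
    static const char kHex[] = "0123456789abcdef";
    std::string out;
    out.reserve(bytes.size());
    for (unsigned char byte : bytes) {
        if (byte < 0x80) {
            out += static_cast<char>(byte);
        } else {
            out += "\\'";
            out += kHex[byte >> 4];    // high nibble
            out += kHex[byte & 0x0F];  // low nibble
        }
    }
    return out;
}
```

For instance, the cp1250 byte 0xB3 (the Polish letter ł) becomes \'b3. Read as cp1252, the same byte is a superscript three, which is the kind of wrong output this report is about.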
Created attachment 15163: Second try to get a working patch done.

The attached patch just translates the table above to the matching codepage and sets it. It's absolutely not perfect, and a few charsets are not handled.
Created attachment 15164: Third try to get a working patch done.

Changed the patch to handle comment #14, except for this note:

>> {1, 2, 0, "UCS2"}, /* DEFAULT_CHARSET (Mozilla) */
> UTF-16 is perhaps good as font encoding, but surely it should not be used as encoding for the RTF stream.

What would be the right codepage in that case?
Created attachment 15166: Fourth try to get a working patch done.

In OpenOffice.org, the function rtl_getTextEncodingFromWindowsCharset in http://rpms.alerque.com/BUILD/ooo-build-1.9.78.2/build/src680-m78/sal/textenc/tencinfo.c (LGPL too) is responsible for translating the Windows charsets. I changed the patch to behave more like OO.org does. So, except for charset==1 and charset==2, it should now behave the same way as OO.org.
Sorry, I am lost in all those patches and cannot test them. Attaching two screenshots:

badpar.png: previous KWord display
correctpar.png: how it should look

This is the first paragraph of orzeczenie.rtf.

Thanks for your work.

Created an attachment (id=15170) correctpar.png
Created an attachment (id=15171) badpar.png
Thanks for the shots, Mikolaj. With the patch it looks just like correctpar.png. So I changed charset==1 to use CP1252, because that seems to be the default codepage for the RTF import filter, and committed the patch ( http://lists.kde.org/?l=kde-commits&m=114270225126906&w=2 ).

So, I assume this bug report is closed now?
> Thanks for the shoots, Mikolaj. With the patch it just looks like at
> correctpar.png. So, I changed charset==1 to use CP1252 cause that seems
> to be the default codepage for the RTF import-filter and committed the
> patch ( http://lists.kde.org/?l=kde-commits&m=114270225126906&w=2 ).
>
> So, I assume this bugreport is closed now?

I just borked my system. :/ I cannot recompile KOffice for a few days, so I cannot confirm it yet. Also, I'd like to test with other RTF docs I just found that also don't display properly in KWord.
On Friday 17 March 2006 17:22, Sebastian Sauer wrote:
(...)
> ------- Additional Comments From mail dipe org 2006-03-17 17:22 -------
> Created an attachment (id=15164)
> --> (http://bugs.kde.org/attachment.cgi?id=15164&action=view)
> Theird try to get a working patch done.
>
> Changed patch to handle Comment #14 except the
>
> > > {1, 2, 0, "UCS2"}, /* DEFAULT_CHARSET (Mozilla) */
> >
> > UTF-16 is perhaps good as font encoding, but surely it should not be used
> > as encoding for the RTF stream.
>
> note. What would be the right codepage in that case?

Well, I suppose that "DEFAULT" means that it cannot be used as a hint for the file encoding, since this bug is precisely about not being able to trust the default encoding given by the RTF encoding keywords.

Have a nice day!
Let's mark the report as fixed now, because the patch committed a few months ago still solves the reported issue (at least for me). Please feel free to reopen and provide a test case if the bug is still valid. Thanks :)