123672 – RTF - kword doesn't recognize lang/charset settings


Status:	RESOLVED FIXED

Alias:	None

Product:	kword
Classification:	Miscellaneous
Component:	filters
Version:	unspecified
Platform:	Compiled Sources Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	KOffice Bug Wranglers

URL:
Keywords:

Depends on:
Blocks:

Reported:	2006-03-15 17:49 UTC by Mikolaj Machowski
Modified:	2006-07-06 15:23 UTC
CC List:	2 users

See Also:
Latest Commit:
Version Fixed In:

Attachments
rtf document with Polish letters displayed incorrectly in KWord (10.00 KB, text/rtf) 2006-03-15 17:50 UTC, Mikolaj Machowski
Patch to fix kword import RTF charset. (1.76 KB, patch) 2006-03-16 19:01 UTC, Sebastian Sauer
Second try to get a working patch done. (2.08 KB, patch) 2006-03-17 17:14 UTC, Sebastian Sauer
Third try to get a working patch done. (2.09 KB, patch) 2006-03-17 17:22 UTC, Sebastian Sauer
Fourth try to get a working patch done. (1.87 KB, patch) 2006-03-17 18:26 UTC, Sebastian Sauer
correctpar.png (35.51 KB, image/png) 2006-03-17 21:41 UTC, Mikolaj Machowski
badpar.png (35.75 KB, image/png) 2006-03-17 21:41 UTC, Mikolaj Machowski

Description Mikolaj Machowski 2006-03-15 17:49:39 UTC

Version:            (using KDE Devel)
Installed from:    Compiled sources
Compiler:          gcc3.4.3 
OS:                Linux

KWord doesn't recognize language/charset settings. Documents with Polish
letters (encoded as cp1250) aren't rendered properly on screen, even when
the settings are properly declared in the file:

{\rtf1\ansi {\fonttbl{\f0\fcharset238 Arial;}}{\colortbl;\red255\green0\blue0;\red0\green0\blue0;}{\stylesheet{\fs20\lang1045 \snext0 Normal;}{\s15\ql\f2\fs20\lang1045 \snext0 Tabela;}}\pard\plain \fs20\lang1045 \f0

The fragment declares the encodings: \fcharset238 is Eastern European; \lang1045 is Polish.

Full document in attachment.
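
To see which settings a header like the fragment above actually declares, one can scan it for control words that carry a numeric parameter. The following is only a minimal sketch and not code from KWord: the function name `firstControlValue` is hypothetical, and the scan ignores escaped backslashes and group nesting.

```cpp
#include <cctype>
#include <optional>
#include <string>

// Returns the numeric parameter of the first occurrence of the control word
// `word` (given without its leading backslash), or std::nullopt if the word
// does not occur with a numeric parameter.
// Hypothetical helper for illustration; not part of the KWord RTF filter.
std::optional<int> firstControlValue(const std::string& rtf, const std::string& word)
{
    const std::string needle = "\\" + word;
    std::size_t pos = 0;
    while ((pos = rtf.find(needle, pos)) != std::string::npos) {
        std::size_t i = pos + needle.size();
        pos = i; // the next search continues after this match
        if (i < rtf.size() && std::isalpha(static_cast<unsigned char>(rtf[i])))
            continue; // a longer control word, e.g. \fonttbl when looking for \f
        bool negative = false;
        if (i < rtf.size() && rtf[i] == '-') {
            negative = true;
            ++i;
        }
        if (i >= rtf.size() || !std::isdigit(static_cast<unsigned char>(rtf[i])))
            continue; // the control word has no numeric parameter here
        int value = 0;
        while (i < rtf.size() && std::isdigit(static_cast<unsigned char>(rtf[i]))) {
            value = value * 10 + (rtf[i] - '0');
            ++i;
        }
        return negative ? -value : value;
    }
    return std::nullopt;
}
```

Applied to the fragment above, `firstControlValue(header, "fcharset")` yields 238 and `firstControlValue(header, "lang")` yields 1045, while a lookup of `"ansicpg"` finds nothing, which is the missing piece discussed in the comments.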

Comment 1 Mikolaj Machowski 2006-03-15 17:50:35 UTC

Created attachment 15136 [details]
rtf document with Polish letters displayed incorrectly in KWord

Comment 2 Nicolas Goutte 2006-03-15 17:59:09 UTC

However, \ansi without further precision means CP 1252.

\fcharset means only "font charset", not the encoding of the file.

So I would consider this a buggy file (unfortunately, like too many RTF files).

Have a nice day!

Comment 3 Mikolaj Machowski 2006-03-16 00:59:52 UTC

> So I would consider this a buggy file (unfortunately as too many RTF
> files).

The problem is that, by your definition, there are no proper RTF
files, which makes KWord unusable for Polish users. In the MS world, the
only source of "proper" RTF files is MS products. I haven't seen a single
RTF file produced by any other tool (OCRs, specialized law applications,
etc.) displayed properly in KWord.

Comment 4 Nicolas Goutte 2006-03-16 01:16:35 UTC

On Thursday 16 March 2006 01:02, Mikolaj Machowski wrote:
(...)
> > So I would consider this a buggy file (unfortunately as too many RTF
> > files).
>
> Problem is that according to your definition there are no proper rtf
> files making KWord unusable for Polish users.  The only source of
> "proper" RTF files in MS world are MS-products. Didn't spot properly
> displayed in KWord RTF file produced by any other tool (OCRs,
> specialized law applications, etc.).


So I suppose that somebody will really have to implement a way to override the 
encoding of an RTF file.

Have a nice day!

Comment 5 Sebastian Sauer 2006-03-16 19:01:53 UTC

Created attachment 15149 [details]
Patch to fix kword import RTF charset.

The attached patch implements \fcharset.
Ok to commit?

Comment 6 Sebastian Sauer 2006-03-16 20:22:19 UTC

Patch committed. Please feel free to revert, mail, or reply if the solution doesn't solve the problem or introduces a regression. Thanks :)

Comment 7 Nicolas Goutte 2006-03-17 09:52:24 UTC

On Thursday 16 March 2006 20:22, Sebastian Sauer wrote:
[bugs.kde.org quoted mail]

Please revert. The patch is wrong, even as a hack.

The parameter of \fcharset is not a codepage and cannot be used directly as 
codepage.

Have a nice day!

Comment 8 Mikolaj Machowski 2006-03-17 12:06:25 UTC

Confirming, it doesn't work (I waited for KOffice to recompile).

In RTF, a charset isn't simply a codepage::

    +void RTFImport::setCharset( RTFProperty *property )
    +{
    +    if(token.value >= 0)
    +        setCodepage(property);
    +}
    +

For example \fcharset238 is codepage cp-1250

KWord shows MS-Word documents properly only because letters there are
encoded by Unicode and special entities - it doesn't use native
encoding.

Aaahhh.  Checked RTF 1.6 doc and it looks like by design RTF doesn't
support native encodings?! They are supported only through Unicode
entities. So you are right - document is invalid but that type of
documents is really popular in Poland. I wonder about other countries
with non-latin1 charsets/encodings...

Table of charsets from RTF 1.6 docs:

\fcharsetN
Specifies the character set of a font in the font table. Values for
N are defined by Windows header files:

0 -- ANSI       <- cp-1252 (MM)
1 -- Default
2 -- Symbol
3 -- Invalid
77 -- Mac
128 -- Shift Jis
129 -- Hangul <- cp-949 (?)
130 -- Johab <- cp-1361
134 -- GB2312 <- I always mix these two: cp-936 and cp-950
136 -- Big5 <- as above
161 -- Greek  <- cp-1253
162 -- Turkish <- cp-1254
163 -- Vietnamese <- cp-1258
177 -- Hebrew <- cp-1255
178 -- Arabic
179 -- Arabic Traditional
180 -- Arabic user
181 -- Hebrew user
186 -- Baltic <- cp-1257
204 -- Russian <- cp-1251 (? - not sure, there are several cyrillics)
222 -- Thai <- cp-874
238 -- Eastern European  <- cp-1250
254 -- PC 437
255 -- OEM

Comment 9 Nicolas Goutte 2006-03-17 12:59:56 UTC

On Friday 17 March 2006 12:06, Mikolaj Machowski wrote:
(...)
>
> KWord shows MS-Word documents properly only because letters there are
> encoded by Unicode and special entities - it doesn't use native
> encoding.

KWord's RTF import filter uses the native encoding when it is correctly defined:
\pc codepage 850 (an approximation, as it should be codepage 437)
\pca codepage 850
\mac Apple Roman encoding
\ansi codepage 1252

Then there is the \ansicpg keyword to set a codepage.
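
The keyword handling described above could be sketched roughly as follows. This is an illustration under assumptions, not the filter's real code: the function names are hypothetical, the number 10000 for Apple Roman is the Windows codepage number for Mac Roman (the comment itself only names the encoding), and falling back to 1252 for unknown keywords is an assumption of this sketch.

```cpp
#include <optional>
#include <string>

// Default codepage implied by the character-set keyword in the RTF header,
// following the list in the comment above. Hypothetical helper.
std::optional<int> defaultCodepage(const std::string& keyword)
{
    if (keyword == "ansi") return 1252;
    if (keyword == "mac")  return 10000; // Apple Roman (assumed Windows numbering)
    if (keyword == "pc")   return 850;   // approximation of codepage 437
    if (keyword == "pca")  return 850;
    return std::nullopt;                 // unknown keyword
}

// An explicit \ansicpgN, when present, takes precedence over the keyword default.
int effectiveCodepage(const std::string& keyword, std::optional<int> ansicpg)
{
    if (ansicpg)
        return *ansicpg;
    return defaultCodepage(keyword).value_or(1252); // assumed fallback
}
```

For the attached document this gives 1252 (only \ansi is declared); with \ansicpg1250 added it would give 1250.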

>
> Aaahhh.  Checked RTF 1.6 doc and it looks like by design RTF doesn't
> support native encodings?! 

It does, as it defines the keywords that I have listed above.

>They are supported only through Unicode
> entities.

Using Unicode (especially the \u keyword) is only an option, even if perhaps a 
recommended one, if backward compatibility is not needed.

> So you are right - document is invalid but that type of
> documents is really popular in Poland. I wonder about other countries
> with non-latin1 charsets/encodings...

Yes, I am starting to wonder too. 

It worries me that \pc and \ansi would perhaps not mean a particular codepage 
but just the locale's MS-DOS or Windows codepage, respectively. If that is the 
case, then the RTF filter needs quite an improvement.

(And a hack will probably not be enough for documents using multiple kinds of 
fonts with different \fcharset declarations.)
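
One way to picture what a non-hack solution would need: the codepage has to be tracked per font rather than once per document. A minimal sketch, assuming a hypothetical `FontCodepages` class (nothing like it is claimed to exist in KWord) and assuming each \fcharset value has already been translated to a codepage:

```cpp
#include <map>

// Hypothetical helper: remembers which codepage each font of the \fonttbl uses.
class FontCodepages
{
public:
    explicit FontCodepages(int documentCodepage)
        : m_documentCodepage(documentCodepage) {}

    // Called while reading a font table entry such as {\f0\fcharset238 Arial;},
    // once \fcharset has been translated to a codepage.
    void setFontCodepage(int fontNumber, int codepage)
    {
        m_byFont[fontNumber] = codepage;
    }

    // Called when the text switches font with \fN. Fonts without a known
    // codepage fall back to the document-wide codepage.
    int codepageForFont(int fontNumber) const
    {
        const auto it = m_byFont.find(fontNumber);
        return it != m_byFont.end() ? it->second : m_documentCodepage;
    }

private:
    int m_documentCodepage;
    std::map<int, int> m_byFont;
};
```

With such a table, 8-bit text following \f0 would be decoded with one codepage and text following \f1 with another, instead of a single global setting.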

>
> Table of charsets from RTF 1.6 docs:
>
> \fcharsetN
> Specifies the character set of a font in the font table. Values for
> N are defined by Windows header files:

I suppose that it would be best if such a table were more central in 
KOffice, as at least the other KWord filters would need it too, since they 
come from Windows as well.

(...)

Have a nice day!

Comment 10 Sebastian Sauer 2006-03-17 16:20:09 UTC

Commit reverted (as in functionality disabled again). I'll look at it after release.

Comment 11 Sebastian Sauer 2006-03-17 16:45:46 UTC

From http://www.df.lth.se/~triad/krad/recode/rtf-cvs.c ;

{0, 1, 1252, "CP1252"},          /* ANSI_CHARSET (wingdi.h) */
{1, 2, 0, "UCS2"},               /* DEFAULT_CHARSET (Mozilla) */
{2, 1, 0, ""},                   /* SYMBOL_CHARSET (wingdi.h) */
{77, 1, 0, "macintosh"},         /* MAC_CHARSET (wingdi.h) */
{128, 2, 932, "CP932"},          /* SHIFTJIS_CHARSET (Wine) */
{129, 2, 949, "CP949"},          /* HANGEUL_CHARSET (Wine) */
{130, 2, 1361, "CP1361"},        /* JOHAB_CHARSET (Wine) */
{134, 2, 936, "CP936"},          /* GB2312_CHARSET (Wine) */
{136, 2, 950, "CP950"},          /* CHINESEBIG5_CHARSET (Wine) */
{161, 1, 1253, "CP1253"},        /* GREEK_CHARSET (wingdi.h) */
{162, 1, 1254, "CP1254"},        /* TURKISH_CHARSET (wingdi.h) */
{163, 2, 1258, "CP1258"},        /* VIETNAMESE_CHARSET (Mozilla) */
{177, 1, 1255, "CP1255"},        /* HEBREW_CHARSET (wingdi.h) */
{178, 1, 1256, "CP1256"},        /* ARABIC_CHARSET former ARABICSIMPLIFIED_CHARSET (RTF 1.3) */
{179, 1, 0, ""},                 /* ARABICTRADITIONAL_CHARSET - obsolete? (RTF 1.3) */
{180, 1, 0, ""},                 /* ARABICUSER_CHARSET - obsolete? (RTF 1.3) */
{181, 1, 0, ""},                 /* HEBREWUSER_CHARSET - obsolete? (RTF 1.3) */
{186, 1, 1257, "CP1257"},        /* BALTIC_CHARSET (wingdi.h) */
{204, 1, 1251, "CP1251"},        /* RUSSIAN_CHARSET former CYRILLIC_CHARSET (RTF 1.3) */
{222, 1, 874, "CP874"},          /* THAI_CHARSET (Wine) */
{238, 1, 1250, "CP1250"},        /* EASTEUROPE_CHARSET former EASTERNEUROPE_CHARSET (RTF 1.3) */
{254, 1, 437, "IBM437"},         /* PC437_CHARSET - obsolete? (RTF 1.3) */
{255, 1, 0, ""},                 /* OEM_CHARSET (wingdi.h) */
{0, 0, 0, NULL}
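
A table like the one above can be consulted through a small lookup function. The sketch below copies only the rows that carry a non-zero codepage; the function name `codepageForFontCharset` is hypothetical and this is not the code of any of the attached patches.

```cpp
#include <optional>

// Windows codepage for an RTF \fcharset value, or std::nullopt when the
// charset carries no usable codepage (e.g. DEFAULT, SYMBOL, MAC, OEM).
// Values are taken from the table quoted above.
std::optional<int> codepageForFontCharset(int fcharset)
{
    switch (fcharset) {
    case 0:   return 1252; // ANSI_CHARSET
    case 128: return 932;  // SHIFTJIS_CHARSET
    case 129: return 949;  // HANGEUL_CHARSET
    case 130: return 1361; // JOHAB_CHARSET
    case 134: return 936;  // GB2312_CHARSET
    case 136: return 950;  // CHINESEBIG5_CHARSET
    case 161: return 1253; // GREEK_CHARSET
    case 162: return 1254; // TURKISH_CHARSET
    case 163: return 1258; // VIETNAMESE_CHARSET
    case 177: return 1255; // HEBREW_CHARSET
    case 178: return 1256; // ARABIC_CHARSET
    case 186: return 1257; // BALTIC_CHARSET
    case 204: return 1251; // RUSSIAN_CHARSET
    case 222: return 874;  // THAI_CHARSET
    case 238: return 1250; // EASTEUROPE_CHARSET
    case 254: return 437;  // PC437_CHARSET
    default:  return std::nullopt;
    }
}
```

Returning an empty optional rather than 0 forces the caller to decide explicitly what to do with charsets such as 1 (DEFAULT), which is exactly the open question later in this report.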

Comment 12 Mikolaj Machowski 2006-03-17 16:49:04 UTC

> Then there is the \ansicpg keyword to set a codepage.

After adding just \ansicpg1250 immediately after \ansi, KWord displays
the previously attached document as it was intended (all Polish characters
visible).
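
The manual workaround described here (adding \ansicpg1250 right after \ansi) could be automated for a batch of files with something like the sketch below. It is only an illustration: the function name `addAnsiCodepage` is hypothetical, and the code does a plain text search without really parsing the RTF.

```cpp
#include <cctype>
#include <string>

// Returns a copy of `rtf` with \ansicpg<codepage> inserted directly after the
// first \ansi control word. The input is returned unchanged if it already
// declares \ansicpg or contains no \ansi. Hypothetical helper.
std::string addAnsiCodepage(const std::string& rtf, int codepage)
{
    if (rtf.find("\\ansicpg") != std::string::npos)
        return rtf; // already declared, leave the file alone

    const std::string needle = "\\ansi";
    std::size_t pos = 0;
    while ((pos = rtf.find(needle, pos)) != std::string::npos) {
        const std::size_t end = pos + needle.size();
        const bool longerWord =
            end < rtf.size() && std::isalpha(static_cast<unsigned char>(rtf[end]));
        if (!longerWord) {
            std::string patched = rtf;
            patched.insert(end, "\\ansicpg" + std::to_string(codepage));
            return patched;
        }
        pos = end; // skip control words that merely start with \ansi
    }
    return rtf;
}
```

For example, a header starting with `{\rtf1\ansi {\fonttbl` becomes `{\rtf1\ansi\ansicpg1250 {\fonttbl`.
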
> > Aaahhh.  Checked RTF 1.6 doc and it looks like by design RTF doesn't
> > support native encodings?!
> It does, as it defines the keywords that I have listed above.

Sorry, I misunderstood; pre-1.6 versions officially didn't support them.
Even now, \ansicpg is meant more for proper translation of Unicode than
for native encodings. 8-bit characters are supported only as a side effect:
(Converters that communicate with Microsoft Word for Windows or Microsoft
Word for the Macintosh should expect 8-bit characters.)
> It worries me that \pc and \ansi would perhaps not mean a particular
> codepage but just the locale MS-DOS respectively Windows codepages. If
> that is the case, then the RTF filter need quite an improvement.

I am afraid this is the case. It is also possible that MS programs
just guess the encoding depending on the locale, or perform additional
tests, to display documents properly. You could check the OO.o code:
oowriter displays the document without problems.

Comment 13 Nicolas Goutte 2006-03-17 16:55:19 UTC

On Friday 17 March 2006 16:45, Sebastian Sauer wrote:
(...)
> From http://www.df.lth.se/~triad/krad/recode/rtf-cvs.c ;


seems to be LGPL too. :-)

>
> {0, 1, 1252, "CP1252"},          /* ANSI_CHARSET (wingdi.h) */


> {1, 2, 0, "UCS2"},               /* DEFAULT_CHARSET (Mozilla) */


UTF-16 is perhaps good as font encoding, but surely it should not be used as 
encoding for the RTF stream.

> {2, 1, 0, ""},                   /* SYMBOL_CHARSET (wingdi.h) */


> {77, 1, 0, "macintosh"},         /* MAC_CHARSET (wingdi.h) */


Qt names it "Apple Roman"

> {128, 2, 932, "CP932"},          /* SHIFTJIS_CHARSET (Wine) */
> {129, 2, 949, "CP949"},          /* HANGEUL_CHARSET (Wine) */
> {130, 2, 1361, "CP1361"},        /* JOHAB_CHARSET (Wine) */
> {134, 2, 936, "CP936"},          /* GB2312_CHARSET (Wine) */
> {136, 2, 950, "CP950"},          /* CHINESEBIG5_CHARSET (Wine) */


I am not sure if we have correct support from Qt for these. (Qt has probably 
only the non-Microsoft variants.)

> {161, 1, 1253, "CP1253"},        /* GREEK_CHARSET (wingdi.h) */
> {162, 1, 1254, "CP1254"},        /* TURKISH_CHARSET (wingdi.h) */
> {163, 2, 1258, "CP1258"},        /* VIETNAMESE_CHARSET (Mozilla) */
> {177, 1, 1255, "CP1255"},        /* HEBREW_CHARSET (wingdi.h) */
> {178, 1, 1256, "CP1256"},        /* ARABIC_CHARSET former ARABICSIMPLIFIED_CHARSET (RTF 1.3) */
> {179, 1, 0, ""},                 /* ARABICTRADITIONAL_CHARSET - obsolete? (RTF 1.3) */
> {180, 1, 0, ""},                 /* ARABICUSER_CHARSET - obsolete? (RTF 1.3) */
> {181, 1, 0, ""},                 /* HEBREWUSER_CHARSET - obsolete? (RTF 1.3) */
> {186, 1, 1257, "CP1257"},        /* BALTIC_CHARSET (wingdi.h) */
> {204, 1, 1251, "CP1251"},        /* RUSSIAN_CHARSET former CYRILLIC_CHARSET (RTF 1.3) */
> {222, 1, 874, "CP874"},          /* THAI_CHARSET (Wine) */
> {238, 1, 1250, "CP1250"},        /* EASTEUROPE_CHARSET former EASTERNEUROPE_CHARSET (RTF 1.3) */


> {254, 1, 437, "IBM437"},         /* PC437_CHARSET - obsolete? (RTF 1.3) */


Qt does not offer 437, only 850 (which I have found to be a good enough 
approximation for the RTF import filter).

> {255, 1, 0, ""},                 /* OEM_CHARSET (wingdi.h) */
> {0, 0, 0, NULL}


Have a nice day!

Comment 14 Nicolas Goutte 2006-03-17 17:06:45 UTC

On Friday 17 March 2006 16:49, Mikolaj Machowski wrote:
(...)
> > Then there is the \ansicpg keyword to set a codepage.
>
> After adding just \ansicpg1250 immediately after \ansi KWord displays
> previously attached document as it was indented (all Polish characters
> visible).

That is good. (The document could be even more "wrong".)

>
> > > Aaahhh.  Checked RTF 1.6 doc and it looks like by design RTF doesn't
> > > support native encodings?!
> >
> > It does, as it defines the keywords that I have listed above.
>
> Sorry, misunderstood, pre-1.6 versions officially didn't support them.

It depends. \pc, \pca, \mac and \ansi have existed since WinWord 1.x (so 
probably RTF 1.2).

Only \ansicpg is relatively recent.

> Even now \ansicpg is rather for proper translation of UTF than native
> encodings.

Why? The \u keyword does not need to know the encoding of the file.

> 8-bit characters are only as a side effect: 

On the contrary, I think that it is the primary goal.

> (Converters that
> communicate with Microsoft Word for Windows or Microsoft Word for the
> Macintosh should expect 8-bit characters.)

The problem is that RTF is basically a 7-bit file format, as at the time RTF 
1.0 was defined, major U.S. networks were not 8-bit clean.

Up to RTF 1.2, it was made a little less U.S.-centric, but you still had to 
encode characters with \' if they were not 7-bit clean.
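
The \' escape mentioned here is a backslash, an apostrophe and two hexadecimal digits that together stand for one 8-bit character. A minimal sketch of turning such escapes back into raw bytes (the function names are hypothetical, and real RTF parsing involves much more than this):

```cpp
#include <string>

// Value of a hexadecimal digit, or -1 if `c` is not one.
int hexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Replaces every \'hh escape in `text` by the single byte 0xhh and copies
// everything else unchanged. The result is still in the document's 8-bit
// encoding; it is not yet Unicode.
std::string decodeHexEscapes(const std::string& text)
{
    std::string bytes;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == '\\' && i + 3 < text.size() && text[i + 1] == '\'') {
            const int hi = hexValue(text[i + 2]);
            const int lo = hexValue(text[i + 3]);
            if (hi >= 0 && lo >= 0) {
                bytes.push_back(static_cast<char>(hi * 16 + lo));
                i += 3; // skip the rest of the escape
                continue;
            }
        }
        bytes.push_back(text[i]);
    }
    return bytes;
}
```

The decoded byte still has to be interpreted in some codepage, which is where this bug bites: the byte 0xB3 from `\'b3` is "ł" in CP1250 but "³" in CP1252.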

Nowadays it should be 8 bit clean.

>
> > It worries me that \pc and \ansi would perhaps not mean a particular
> > codepage but just the locale MS-DOS respectively Windows codepages. If
> > that is the case, then the RTF filter need quite an improvement.
>
> I am afraid this is the case. Also possible is that MS-programs are
> just guessing encoding depending on locale
> or perform additional tests
> to display properly.

> You could check OO.o code - oowriter displays
> document without problems.

It is rather difficult to read OOo's code.

Have a nice day!

Comment 15 Sebastian Sauer 2006-03-17 17:14:08 UTC

Created attachment 15163 [details]
Second try to get a working patch done.

The attached patch just translates the table above to the matching codepage and
sets it. It's absolutely not perfect, and a few charsets are not handled.

Comment 16 Sebastian Sauer 2006-03-17 17:22:02 UTC

Created attachment 15164 [details]
Third try to get a working patch done.

Changed the patch to handle Comment #14, except for this note:

>> {1, 2, 0, "UCS2"},		    /* DEFAULT_CHARSET (Mozilla) */ 
> UTF-16 is perhaps good as font encoding, but surely it should not be used as 
> encoding for the RTF stream. 

What would be the right codepage in that case?

Comment 17 Sebastian Sauer 2006-03-17 18:26:52 UTC

Created attachment 15166 [details]
Fourth try to get a working patch done.

In OpenOffice.org, the function rtl_getTextEncodingFromWindowsCharset in
http://rpms.alerque.com/BUILD/ooo-build-1.9.78.2/build/src680-m78/sal/textenc/tencinfo.c
(LGPL too) is responsible for translating the Windows charsets.
I changed the patch to behave more like oo.org does. So, except for charset==1
and charset==2, it should now behave the same way as oo.org.

Comment 18 Mikolaj Machowski 2006-03-17 21:41:53 UTC

Sorry, I am lost in all those patches and cannot test them.
Attaching two screenshots:

badpar.png: previous KWord display
correctpar.png: how it should look

This is the first paragraph of orzeczenie.rtf

Thanks for your work.


Created an attachment (id=15170)
correctpar.png

Created an attachment (id=15171)
badpar.png

Comment 19 Sebastian Sauer 2006-03-18 18:22:32 UTC

Thanks for the screenshots, Mikolaj. With the patch it looks just like correctpar.png. So I changed charset==1 to use CP1252, because that seems to be the default codepage for the RTF import filter, and committed the patch ( http://lists.kde.org/?l=kde-commits&m=114270225126906&w=2 ).

So, I assume this bug report can be closed now?

Comment 20 Mikolaj Machowski 2006-03-19 16:14:52 UTC

> Thanks for the shoots, Mikolaj. With the patch it just looks like at
> correctpar.png. So, I changed charset==1 to use CP1252 cause that seems
> to be the default codepage for the RTF import-filter and committed the
> patch ( http://lists.kde.org/?l=kde-commits&m=114270225126906&w=2 ).
>
> So, I assume this bugreport is closed now?


I just borked my system. :/ I cannot recompile koffice for a few days to
confirm it. I'd also like to test with other rtf docs I just found that
also don't display properly in KWord.

Comment 21 Nicolas Goutte 2006-04-10 18:42:45 UTC

On Friday 17 March 2006 17:22, Sebastian Sauer wrote:
(...)
> ------- Additional Comments From mail dipe org  2006-03-17 17:22 -------
> Created an attachment (id=15164)
>  --> (http://bugs.kde.org/attachment.cgi?id=15164&action=view)
> Theird try to get a working patch done.
>
> Changed patch to handle Comment #14 except the
>
> >> {1, 2, 0, "UCS2"},		    /* DEFAULT_CHARSET (Mozilla) */
> >
> > UTF-16 is perhaps good as font encoding, but surely it should not be used
> > as
>
> encoding for the RTF stream.
>
> note. What would be the right codepage in that case?


Well, I suppose that "DEFAULT" means that it cannot be used as a hint for the 
file encoding, since this bug is about the fact that we cannot trust the 
default encoding implied by the RTF encoding keywords.

Have a nice day!

Comment 22 Sebastian Sauer 2006-07-06 15:23:11 UTC

Let's mark the report as fixed now, because the patch committed a few months ago still solves the reported issue (at least for me). Please feel free to reopen and provide a test case if the bug is still valid. Thanks :)