Bug 72917 - UTF8 and other cause XML parsing errors, only in IRC conversations
Status: RESOLVED NOT A BUG
Alias: None
Product: kopete
Classification: Applications
Component: IRC Plugin
Version: unspecified
Platform: unspecified Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Kopete Developers
Duplicates: 73362 73625 73877
Reported: 2004-01-18 23:39 UTC by Iván Sánchez Ortega
Modified: 2004-02-04 19:48 UTC

Attachments
Ethereal dump from connecting to #selinux (se.linux.org). (2.57 KB, text/plain)
2004-01-23 10:36 UTC, Robin Rosenberg

Description Iván Sánchez Ortega 2004-01-18 23:39:10 UTC
Version:           0.7.94 (using KDE 3.1.94 (CVS >= 20031206),  (testing/unstable))
Compiler:          gcc version 3.3.3 20031229 (prerelease) (Debian)
OS:          Linux (i686) release 2.4.22-xfs

Here is an excerpt from a random piece of chatting in IRC:

[23:31:56] <_ytrio_>pero sacan discos
An internal Kopete error occurred while parsing a message:
XML document could not be parsed!
[23:32:01] <_ytrio_>y hacen conciertos

Viewing the history, I can work out what it was about:

[23:31:56] <_ytrio_>pero sacan discos
[23:31:57] <_ytrio_>v�eos
[23:32:01] <_ytrio_>y hacen conciertos

Instead of the character "í", I see a ? sign inside a black square.

My guess: the IRC plugin only uses UTF-8 encoding, while the rest of the chatters use ISO-8859-15. That is causing the XML parser to choke. Oddly enough, this only seems to happen in IRC chats (MSN and Jabber have proven to be OK).
My suggestion: let the user specify the local encoding in the preferences window, and/or find a way to parse the IRC traffic regardless.
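
To illustrate the suspected mismatch, here is a small sketch in plain standard C++ (not Kopete or Qt code; the names latin1ToUtf8 and latin1Example are made up for this example). It transcodes ISO-8859-1 bytes to UTF-8 and shows that "í" is the single byte 0xED in Latin-1 but the two bytes 0xC3 0xAD in UTF-8. A Latin-1 "vídeos" handed to a UTF-8 parser therefore contains 0xED followed by an ASCII letter, which is not a valid UTF-8 sequence. ISO-8859-15 differs from ISO-8859-1 in eight code positions, which this sketch does not handle.

```cpp
#include <cassert>
#include <string>

// Transcode ISO-8859-1 bytes to UTF-8. Each Latin-1 byte is the Unicode code
// point with the same value, so bytes >= 0x80 become a two-byte sequence.
// Assumption: the input really is ISO-8859-1 (not ISO-8859-15).
std::string latin1ToUtf8(const std::string &latin1)
{
    std::string utf8;
    utf8.reserve(latin1.size() * 2);
    for (unsigned char byte : latin1) {
        if (byte < 0x80) {
            utf8 += static_cast<char>(byte);                  // plain ASCII
        } else {
            utf8 += static_cast<char>(0xC0 | (byte >> 6));    // lead byte
            utf8 += static_cast<char>(0x80 | (byte & 0x3F));  // continuation
        }
    }
    return utf8;
}

// The word from the report: 0xED in Latin-1 becomes 0xC3 0xAD in UTF-8.
void latin1Example()
{
    assert(latin1ToUtf8("v\xED" "deos") == "v\xC3\xAD" "deos");
}
```

ASCII-only lines such as "pero sacan discos" are byte-for-byte identical in both encodings, which would explain why only some messages trigger the error.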

BTW, I'm using AS's Debian packages, as of CVS20040103 snapshot, with kdelibs CVS20031215 or so.
Comment 1 Jason Keirstead 2004-01-19 01:43:53 UTC
You have to change the text codec for the channel you are in to whatever the people there are using.

Right click on the channel contact, or use the IRC menu in the chat window.

Comment 2 Iván Sánchez Ortega 2004-01-19 14:27:57 UTC
Subject: Re:  UTF8 and other cause XML parsing errors, only in IRC conversations

> ------- You have to change the text codec for the chanel you are in to
> whatever the people are using.
>
> Right click on the channel contact, or use the IRC menu in the chat window.

The fact is that I can't do that: there is no option to change the text codec
in the context menu of the channel contact.
Maybe that option was added just a few days/weeks ago?
Will that option be included in the per-protocol settings window?

Regards,
Comment 3 Jason Keirstead 2004-01-19 15:12:24 UTC
Subject: Re: [Kopete-devel]  UTF8 and other cause XML parsing errors, only in IRC conversations[Scanned]

On January 19, 2004 9:27 am, Iván Sánchez Ortega wrote:
> The fact is that I can't do that - there is no option to change the text
> codec in the context menu of the channel contact.
> Maybe that option was added just a few days/weeks ago?

Yes it was added recently.

Update your CVS checkout, or wait until 0.8 final / KDE 3.2 is released on Feb. 2.

Comment 4 Robin Rosenberg 2004-01-23 10:34:36 UTC
Changing the character set does not work. I get the error as soon as I connect.

Furthermore the user should be given a hint. Especially since XML is used in the IRC protocol. The error message says "internal error", a phrase typically meaning "oops, programmer forgot to check something".

Kopete could use the error message from the XML parser as a way of detecting the character set by trying another.
1. UTF-8, easy to detect usually as we have seen
2. The user's desktop setting
3. ISOLatin1.

In any case, an error in handling the characters should be displayed to the user in a friendly way.
Comment 5 Robin Rosenberg 2004-01-23 10:36:29 UTC
Created attachment 4309 [details]
Ethereal dump from connecting to #selinux (se.linux.org).
Comment 6 Robin Rosenberg 2004-01-23 10:51:41 UTC
One more reason for reopening this bug.

It is not reasonable to have to change the character set for every contact in IRC in a forum with 200 connected users, especially since I cannot see who wrote what I couldn't see. I've just tried in a chat with 30 users, and got exhausted after setting the character set for 10 users. The only hint as to who sent a message is that some users' messages pass through because they happen to contain only ASCII.

I have version kopete 0.8.0
Comment 7 Robin Rosenberg 2004-01-23 11:06:24 UTC
This message gave the XML parse error despite the fact that I had changed the character set for the user.
---------------------------ethereal-----------------------
:sensei!~sensei@as2-6-3.lde.g.bonet.se PRIVMSG #Selinux :Man måste naturligtvis ställa in x så att den känner till de nya koderna.
---------------------------ethereal-----------------------

Noticed a funny thing though. "sensei" uses the channel name #Selinux with a capital S. #Selinux instead of #selinux.
Comment 8 Alexandre Pereira 2004-01-23 12:45:14 UTC
Subject: Re: [Kopete-devel]  UTF8 and other cause XML parsing errors, only in IRC conversations

Also, I would like to see the message that gave the error.

So what if, in a long line, there is a character that shows up as a square or is even
missing? At least I could try to guess what someone said, or who said it,
etc.




Comment 9 Robin Rosenberg 2004-01-23 13:02:41 UTC
Oops, I was following the "this is a duplicate" links and found this one to be "similar". I was getting the XML parser error. I cannot see the message; rather, I see this:

"An internal error occurred in kopete while parsing a message: XML document 
could not be parsed" 
Comment 10 Iván Sánchez Ortega 2004-01-23 13:16:43 UTC
Subject: Re:  UTF8 and other cause XML parsing errors, only in IRC conversations

On Friday, 23 January 2004 12:45, Iori Yagami wrote:
Comment 11 Jason Keirstead 2004-01-23 13:36:40 UTC
A few points:

> Furthermore the user should be given a hint. Especially since XML is used in the IRC protocol. The error message says "internal error", a phrase typically meaning "oops, programmer forgot to check something"

Kopete uses XML for all messages, on all protocols. A string can either be parsed as UTF-8, or it can't. If the string is not parseable (because there is no way to figure out the right codec), it can't be displayed in the XML document at all.

You can see them in network debug output and the like because those tools don't have this restriction.

> It is not reasonable to have to change the character set for every contact in IRC in forum with 200 connected users, especially since I cannot see who wrote what I couldn't see. I've just tried in a chat with 30 users, and got exhausted after setting the character set for 10 users. The only hint with

Heh, this might be your largest problem. Setting the character set for all these people won't do anything anyways.

You need to set the character set for the CHANNEL you are in. Right click on the CHANNEL. Or use the IRC menu in the window.

You only need to set the character set for a user if you are in a one on one chat with them and they are not using UTF-8.

>Well, right now I'm switching to History mode to see the "error-giving" 
>messages. It seems that the logs are already parsed, but somehow distorted.

This is the same problem. The history plugin will store messages correctly only if the codec was set correctly originally.

-------------

The biggest problem you are having is that *all* communication on this server is in something other than UTF-8. You need an "account default" codec that applies to all contacts on the server, and to communication with the server itself.

This was a requested feature, and it will be in the next release. It did not go into this release due to the string and feature freeze.

Comment 12 Robin Rosenberg 2004-01-23 13:55:18 UTC
Subject: Re:  UTF8 and other cause XML parsing errors, only in IRC conversations

On Friday, 23 January 2004 13:36, Jason Keirstead wrote:
> > Furthermore the user should be given a hint. Especially since XML is used in the IRC protocol. The error message says "internal error", a phrase typically meaning "oops, programmer forgot to check something"
>
> Kopete uses XML for all messages, on all protocols.
I ran Ethereal on the connection (see attached file). It contains no XML whatsoever. The conclusion is that Kopete internally creates invalid XML documents that were not there to start with.
> > It is not reasonable to have to change the character set for every contact in

> You need to set the character set for the CHANNEL you are in. Right click on the CHANNEL. Or use the IRC menu in the window.
I did that first, so I expected the channel topic to be visible. Not!
>
> Heh, this might be your largest problem. Setting the character set for all these people won't do anything anyways.
I had to do that. Setting the character set on the channel did nothing.

> The biggest problem you are having is that *all* communications on this server is not in UTF-8. You need an "account default" codec that applies to all contacts on the server, and communication with the server itself.
>
> This was a requested feature, and it will be in the next release. It did not go into this release due to string and feature freeze.

So I have to look for another client. A pity, since I liked Kopete. In its current state it is U-N-U-S-A-B-L-E.

-- robin

Comment 13 Alexandre Pereira 2004-01-23 14:13:23 UTC
Subject: Re: [Kopete-devel]  UTF8 and other cause XML parsing errors, only in IRC conversations

> So I have to look for another client. A pity since I liked kopete. In its
> current state it is U-N-U-S-A-B-L-E.
>
> -- robin

Well, you do get the correct messages if you start Kopete in a Konsole.
When an XML parsing error occurs, the correct message is printed on the
console.

Comment 14 Jason Keirstead 2004-01-23 14:17:10 UTC
> I ran Ethereal on the connection (see attached file). It contains no XML whatsoever. The conclusion is that kopete internally creates invalid XML documents that were not there to start with.

As I said above, Ethereal and tools like it don't use XML. They don't have to care. When you have an XML document, you *have* to specify the encoding it is in. It's a simple fact. If you don't specify the right encoding, a parsing error occurs.
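
As an illustration of what "specifying the encoding" means: an XML document without an encoding declaration is read as UTF-8 (or UTF-16), so any other byte encoding has to be named in the XML declaration. A minimal sketch in standard C++ follows. The <message> wrapper element and the helper names escapeXmlText and wrapMessageAsXml are assumptions made for this example and are not taken from Kopete.

```cpp
#include <cassert>
#include <string>

// Escape the characters that would otherwise be read as markup.
std::string escapeXmlText(const std::string &text)
{
    std::string out;
    for (char c : text) {
        switch (c) {
        case '&': out += "&amp;"; break;
        case '<': out += "&lt;";  break;
        case '>': out += "&gt;";  break;
        default:  out += c;       break;
        }
    }
    return out;
}

// Hypothetical helper: wrap raw message bytes in a document whose XML
// declaration names the encoding those bytes are in. The bytes themselves
// are passed through untouched; only the declaration changes.
std::string wrapMessageAsXml(const std::string &rawBody,
                             const std::string &encodingName)
{
    return "<?xml version=\"1.0\" encoding=\"" + encodingName + "\"?>\n"
           "<message>" + escapeXmlText(rawBody) + "</message>\n";
}

void xmlExample()
{
    assert(wrapMessageAsXml("a<b", "ISO-8859-1") ==
           "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n"
           "<message>a&lt;b</message>\n");
}
```

If the body bytes are Latin-1 but the declaration says UTF-8 (or is missing), a conforming parser has to reject the document, which would match the error reported here.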

We used to be able to auto-detect the encoding somewhat reliably, but Qt releases since 3.1 have made this impossible.

> I did that first, so I expected the channel topic to be visible. Not!

Yes, if what you are referring to is things like channel topics etc, these will be unreadable until we have an encoding setting for the server itself. I hope to have the code to do this done by next week, and it will be in CVS as soon as it's opened.

Comment 15 Alexandre Pereira 2004-01-23 14:26:30 UTC
Subject: Re: [Kopete-devel]  UTF8 and other cause XML parsing errors, only in IRC conversations

On Friday, 23 January 2004 13:17, Jason Keirstead wrote:
> Yes, if what you are referring to is things like channel topics etc, these
> will be unreadable until we have an encoding setting for the server itself.
> I hope to have the code to do this done by next week, and it will be in CVS
> as soon as it's opened.


When you have the code, please post it as a patch against KDE 3.2 rc1 and so on, so
that distro packagers can apply it automatically.

Since I am using Gentoo, I would like Gentoo to automatically patch
kdenetwork so that Kopete in KDE 3.2 rc1 and KDE 3.2 has that codec setting.

Comment 16 Jason Keirstead 2004-01-23 15:15:32 UTC
> when you have the code , post it as a patch against kde 3.2 rc1 and so on , so
> that distro packagers can apply it automatically.

I can't. The code will have lots of new labels and thus cannot go into the 3.2 branch, so packagers could not apply it because it would all be untranslated.

PS: You do not need to CC kopete-devel on replies to BKO, they are all sent there anyways.
Comment 17 Alexandre Pereira 2004-01-23 15:30:44 UTC
Subject: Re: [Kopete-devel]  UTF8 and other cause XML parsing errors, only in IRC conversations

On Friday, 23 January 2004 14:15, Jason Keirstead wrote:
> I can't. The code will have lots of new labels and thus cannot go into the
> 3.2 branch, and thus packages could not apply it because it would all be
> untranslated.

Well, I don't know to what extent a patch can't be created, but if it is only
because of an untranslated option, I would prefer to have that
untranslated option rather than not being able to chat on some IRC servers.

Gentoo would be very smart about this and would probably use a USE variable for people to
set it up.

I know that Kopete 3.3 or the next one will have that feature. But until then
(probably 6 months), people either use CVS or use your patch. It's a very long
time.

I don't speak for myself, since I have a patched Kopete already; I would not
miss out on the feature. But for the masses, it could be very useful.

> PS: You do not need to CC kopete-devel on replies to BKO, they are all sent
> there anyways.

Hehe, I just asked that on IRC, last time I replied.

Comment 18 Robin Rosenberg 2004-01-23 15:51:39 UTC
Subject: Re:  UTF8 and other cause XML parsing errors, only in IRC conversations

On Friday, 23 January 2004 15:15, Jason Keirstead wrote:
> I can't. The code will have lots of new labels and thus cannot go into the 3.2 branch, and thus packages could not apply it because it would all be untranslated.

It is infinitely better if it works in English than not at all.

Comment 19 Jason Keirstead 2004-01-23 16:15:07 UTC
It is not up to me if the code I create will be able to be backported. It is up to the i18n team.

I will try and make a backport patch and send it to them if I can.
Comment 20 Martijn Klingens 2004-01-23 16:42:14 UTC
Subject: Re: [Kopete-devel]  UTF8 and other cause XML parsing errors, only in IRC conversations

On Friday 23 January 2004 15:51, Robin Rosenberg wrote:
> It is infinitely better if it works in English than not at all.

Please read the KDE guidelines for the stable branches. Cases where string 
changes are allowed there are *very* rare. I doubt the KDE release 
coordinators and translation teams would approve such a change.

Comment 21 Robin Rosenberg 2004-01-23 16:59:37 UTC
Subject: Re:  UTF8 and other cause XML parsing errors, only in IRC conversations

On Friday, 23 January 2004 16:42, Martijn Klingens wrote:
> Please read the KDE guidelines for the stable branches. Cases where string 
> changes are allowed there are *very* rare. I doubt the KDE release 
> coordinators and translation teams would approve such a change.

Some other "solution" would do, like making a best effort to display the text instead
of the XML parser error message. If I see something like

	Meddelande från mamma vid 23:59.60:
	Maten är klar

or
	Meddelande från mamma vid 23:59.60:
	Maten ?r klar

is legible but

	Ett internt fel uppstod i Kopete vid tolkning av ett meddelande:
	XML-dokument kunde inte tydas.

is not

-- robin

Comment 22 Jason Keirstead 2004-01-23 17:18:52 UTC
Martijn, and others, I have an idea how we can display these messages better.

Since these errors are infrequent, a little inefficiency on them is OK. So, what we could do is this: when the XML parsing fails, we run the string through this function, which takes the string we *thought* was UTF-8 and *makes sure* it is UTF-8:

// Replace every character the UTF-8 codec cannot decode with '?'.
QCString decodeAttempt( QCString &utf )
{
    QTextCodec *utfCodec = QTextCodec::codecForName( "utf8" );
    QString resultString;

    for( uint i = 0; i < utf.length(); i++ )
    {
        QChar thisChar = utf[i];
        if( utfCodec->canDecode( thisChar ) )
            resultString += thisChar;
        else
            resultString += QChar('?');
    }

    // Re-encode the cleaned-up string as UTF-8.
    return resultString.utf8();
}

.. Thus replacing all the un-encodable characters with ?

Then instead of

An internal Kopete error occurred while parsing a message:
XML document could not be parsed! 

We can display:

WARNING: Kopete could not properly determine the encoding of the following message:
Foobar message ?? foo.
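
A standalone sketch of the same replace-with-'?' idea, written against the C++ standard library only so that it makes no assumptions about the Qt codec API. Unlike the per-character loop above, it checks whole UTF-8 sequences (lead byte plus continuation bytes), since a single byte above 0x7F is never valid UTF-8 on its own. The names utf8SequenceLength and replaceInvalidUtf8 are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Length of the well-formed UTF-8 sequence starting at s[i], or 0 if the
// bytes there do not form one (bad lead byte, missing continuation byte,
// overlong form, surrogate, or value above U+10FFFF).
static std::size_t utf8SequenceLength(const std::string &s, std::size_t i)
{
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    std::size_t len = 0;
    unsigned long cp = 0;
    if (lead < 0x80) return 1;
    else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
    else return 0;                            // stray continuation or 0xF8+

    if (i + len > s.size()) return 0;         // sequence cut off at the end
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char c = static_cast<unsigned char>(s[i + k]);
        if ((c & 0xC0) != 0x80) return 0;     // not a continuation byte
        cp = (cp << 6) | (c & 0x3F);
    }
    static const unsigned long minForLen[] = {0, 0, 0x80, 0x800, 0x10000};
    if (cp < minForLen[len]) return 0;        // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp > 0x10FFFF) return 0;
    return len;
}

// Copy valid sequences through unchanged; replace each offending byte by '?'.
std::string replaceInvalidUtf8(const std::string &raw)
{
    std::string out;
    for (std::size_t i = 0; i < raw.size(); ) {
        const std::size_t len = utf8SequenceLength(raw, i);
        if (len == 0) { out += '?'; ++i; }
        else          { out.append(raw, i, len); i += len; }
    }
    return out;
}

void replaceExample()
{
    // Latin-1 "ä" (0xE4) is not valid UTF-8 here, so it becomes '?'.
    assert(replaceInvalidUtf8("Maten \xE4r klar") == "Maten ?r klar");
}
```

The result is always valid UTF-8, so it can be handed to an XML parser without triggering the "could not be parsed" error, at the cost of losing the original character.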
Comment 23 Martijn Klingens 2004-01-23 17:25:35 UTC
Subject: Re: [Kopete-devel]  UTF8 and other cause XML parsing errors, only in IRC conversations

On Friday 23 January 2004 17:18, Jason Keirstead wrote:
> We can display:
> 
> WARNING: Kopete could not properly determine the encoding of the following
> message:
> Foobar message ?? foo.

Ah, seems like a reasonable workaround.

As for the encoding DETECTION, would using KStringHandler::isUtf8() before 
using QString::fromUtf8() help?

Comment 24 Jason Keirstead 2004-01-23 17:27:51 UTC
Yes, it would be a reasonable first check. If that fails, there is no point attempting the XML transform.

Note that the reason we haven't used it in Kopete yet is that it is an @since 3.2 method.
Comment 25 Martijn Klingens 2004-01-23 17:39:43 UTC
Subject: Re: [Kopete-devel]  UTF8 and other cause XML parsing errors, only in IRC conversations

On Friday 23 January 2004 17:27, Jason Keirstead wrote:
> Yes it would be a reasonable first check. If that fails there is no point
> trying to attempt the XML transform.

That, too. But I was thinking of a much earlier stage: when you are parsing 
incoming IRC data and when Oscar is parsing incoming ICQ data.

If isUtf8() fails it can try ::fromLatin1 because that one AFAIK can be 
reliably autodetected (unlike utf8() it doesn't accept invalid chars AFAIK), 
followed by your fallback.

A simple static in libkopete (QString
KopeteMessage::detectEncoding( char * ) ?) could handle it, and avoid the 
problem altogether.

> Note that the reason we didn't use that in Kopete yet is it is an
> @Since 3.2 method

True, it wouldn't help KDE 3.1 users. But #ifdef'd out it would tremendously 
help those who will be upgrading to 3.2. And it allows us to tell users to 
upgrade KDE rather than telling them we can't fix the bug. Lastly, we could 
even duplicate the call in compat/ though I'm not too much in favour of that.

Comment 26 Alexandre Pereira 2004-01-23 17:44:34 UTC
Subject: Re: [Kopete-devel]  UTF8 and other cause XML parsing errors, only in IRC conversations

On Friday, 23 January 2004 16:25, Martijn Klingens wrote:
> As for the encoding DETECTION, would using KStringHandler::isUtf8() before
> using QString::fromUtf8() help?

If you can do that, then you can probably do an ISO-8859-15 to UTF-8
conversion and then try to put the result into the XML, right?

Even if this isn't 100% foolproof, fixing even 25% of the cases would
already be good.

Also, could the:
" WARNING: Kopete could not properly determine the encoding of the following
message:
 Foobar message ?? foo."

be more like: WARNING ?? : "message"

so as not to take up too much space on the screen (I see this happening
a lot)?

Comment 27 Jason Keirstead 2004-01-23 18:48:45 UTC
Subject: Re: [Kopete-devel]  UTF8 and other cause XML parsing errors, only in IRC conversations

On January 23, 2004 12:39 pm, Martijn Klingens wrote:
> That, too. But I was thinking of a much earlier stage: when you are parsing
> incoming IRC data and when Oscar is parsing incoming ICQ data.

It would be useless here. The incoming IRC data is almost never going to be
UTF-8; no one uses it (I wish they did, but they don't).

> If isUtf8() fails it can try ::fromLatin1 because that one AFAIK can be
> reliably autodetected (unlike utf8() it doesn't accept invalid chars
> AFAIK), followed by your fallback.

The IRC engine just uses the contact codec, which if not defined is UTF. 
Basically what you are saying is use latin1 if it is undefined, and if that 
craps out then try latin1, and if that fails just send the incorrect utf data 
to the XML parser, where it will fail and print the warning.

> A simple static in libkopete (QString
> KopeteMessage::detectEncoding( char * ) ?) could handle it, and avoid the
> problem altogether.

Only if you can pass a preferred codec to this static. I think this would be 
the best, a combination of your suggestion and my previous function:

// Decode with the preferred codec (Latin-1 by default). If that codec cannot
// decode the string, fall back to UTF-8 and replace undecodable characters
// with '?'.
QString KopeteMessage::decodeString( QCString string, QTextCodec *preferredCodec = 0L )
{
    if( !preferredCodec )
        preferredCodec = QTextCodec::codecForName("latin1");

    if( !preferredCodec->canDecode( string ) )
    {
        QTextCodec *utfCodec = QTextCodec::codecForName( "utf8" );
        QString resultString;

        for( uint i = 0; i < string.length(); i++ )
        {
            QChar thisChar = string[i];
            if( utfCodec->canDecode( thisChar ) )
                resultString += utfCodec->toUnicode( thisChar );
            else
                resultString += QChar('?');
        }

        return resultString;
    }
    else
    {
        return preferredCodec->toUnicode( string );
    }
}

Comment 28 Martijn Klingens 2004-01-23 19:55:57 UTC
Subject: Re: [Kopete-devel]  UTF8 and other cause XML parsing errors, only in IRC conversations

On Friday 23 January 2004 18:48, Jason Keirstead wrote:
> The IRC engine just uses the contact codec, which if not defined is UTF.
> Basically what you are saying is use latin1 if it is undefined, and if that
> craps out then try latin1, and if that fails just send the incorrect utf
> data to the XML parser, where it will fail and print the warning.

Nope.

What I was saying is to try (in this order)

In the plugin (IRC or ICQ):
- Decode as utf8. If isUtf8() is available, use it and continue if it fails.
  Otherwise we have to assume it's utf8 and continue at the XSLT part below.

- Decode as latin1.

- If both utf8 and latin1 fail, try local8Bit IF AND ONLY IF the local
  encoding is neither utf8 nor latin1.

- When all these failed, use your code that replaces invalid chars with
  question marks. Since we're doing it *here* that means the whole dreaded
  'should never happen' XML error indeed no longer happens at all.

In the XML/XSLT code:
- Use the code that we have now

- If the decoding fails, use a more verbose error. With the above changes
  this should however become an almost unused code path.
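
The ordering above could be sketched as a small fallback chain. This is only an outline under assumptions: DecodeAttempt, LastResort, decodeWithFallbacks and the two toy decoders are hypothetical names, written in standard C++ rather than against Qt. Real attempts would wrap the user's codec, a strict UTF-8 check, Latin-1 and the local 8-bit codec.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// One decoding attempt: returns true and fills utf8Out on success, or false
// if the bytes are not valid in that encoding.
typedef std::function<bool(const std::string &, std::string &)> DecodeAttempt;
// Final fallback: must always return something displayable.
typedef std::function<std::string(const std::string &)> LastResort;

// Try each attempt in order; the first one that succeeds wins. If none
// succeeds, the last resort produces the displayed text.
std::string decodeWithFallbacks(const std::string &raw,
                                const std::vector<DecodeAttempt> &attempts,
                                const LastResort &lastResort)
{
    for (const DecodeAttempt &attempt : attempts) {
        std::string decoded;
        if (attempt && attempt(raw, decoded))
            return decoded;
    }
    return lastResort(raw);
}

// Toy decoder for the example: accepts pure ASCII only.
bool decodeAsciiOnly(const std::string &raw, std::string &utf8Out)
{
    for (unsigned char c : raw)
        if (c >= 0x80)
            return false;
    utf8Out = raw;
    return true;
}

// Toy last resort: replace every non-ASCII byte with '?'.
std::string maskHighBytes(const std::string &raw)
{
    std::string out;
    for (unsigned char c : raw)
        out += (c < 0x80) ? static_cast<char>(c) : '?';
    return out;
}

void fallbackExample()
{
    std::vector<DecodeAttempt> attempts;
    attempts.push_back(decodeAsciiOnly);
    assert(decodeWithFallbacks("abc", attempts, maskHighBytes) == "abc");
    assert(decodeWithFallbacks("Maten \xE4r klar", attempts, maskHighBytes)
           == "Maten ?r klar");
}
```

Keeping the order in a plain list makes it cheap to experiment with where UTF-8 and the user's codec go. One caveat: in practice a Latin-1 decoder accepts any byte sequence, so attempts placed after it would rarely run.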

> Only if you can pass a preferred codec to this static. I think this would
> be the best, a combination of your suggestion and my previous function:
>
> (snip)

Your code has the tremendous advantage that it allows a custom codec 
selection, moving even more code duplication from the plugins. I like that. 
Some things I miss in your code though:

- if preferredCodec is UTF-8 we're back at square one, because canDecode() will
  always return true. Therefore UTF-8 should be special cased and use
  KStringHandler when available.

- You use fewer fallbacks than I had in mind. See the above heuristics.

(The encoding discussion is finally getting interesting again BTW after months 
of frustration :)

Comment 29 Jason Keirstead 2004-01-23 21:08:59 UTC
Subject: Re: [Kopete-devel]  UTF8 and other cause XML parsing errors, only in IRC conversations

On January 23, 2004 2:55 pm, Martijn Klingens wrote:
> What I was saying is to try (in this order)
>
> In the plugin (IRC or ICQ):
> - Decode as utf8. If isUtf8() is available, use it and continue if it
> fails. Otherwise we have to assume it's utf8 and continue at the XSLT part
> below.

This will only work in something like 0.01% of the cases on IRC, which is why I was saying
it's a waste of time. Not sure about ICQ, but I suspect it is the same thing
there: not many people use UTF-8. That's why I want to leave that check
until last.

> - If both utf8 and latin1 fail, try local8Bit IF AND ONLY IF the local
>   encoding is neither utf8 nor latin1.
>
> - When all these failed, use your code that replaces invalid chars with
>   question marks. Since we're doing it *here* that means the whole dreaded
>   'should never happen' XML error indeed no longer happens at all.

But in all this, where is the user's chosen codec? The user's selected codec
should *always* be tried first.

> Your code has the tremendous advantage that it allows a custom codec
> selection, moving even more code duplication from the plugins. I like that.
> Some things I miss in your code though:
>
> - if preferredCode is UTF-8 we're at square one, because canDecode() will
>   always return true. 

I don't really see this as much of a problem. If the default codec for all 
contacts is Latin1, then the user has to manually change to UTF-8. If they 
manually do this I don't have a problem with it mis-detecting and failing 
with an error / warning; they are the ones who chose that.

>   Therefore UTF-8 should be special cased and use 
>   KStringHandler when available.

Use it for what? As I said, isUtf8() is pretty much useless, since the data is
hardly *ever* UTF-8.

That's why my code tries everything else *first*, then falls back on UTF-8 
with the ? replacement if needed.

> - You use less fallbacks than I had in mind. See the above heuristics.

Other than the local8bit() fallback (which is also useless... what does my 
local codec have to do with the sender's? There's really no correlation, it'd 
just be random luck to work), the only difference is that I move the UTF 
check to the end.

Comment 30 Martijn Klingens 2004-01-23 21:20:30 UTC
Subject: Re: [Kopete-devel]  UTF8 and other cause XML parsing errors, only in IRC conversations

On Friday 23 January 2004 21:09, Jason Keirstead wrote:
> I don't really see this as much of a problem. If the default codec for all
> contacts is Latin1, then the user has to manually change to UTF-8. If they
> manually do this I don't have a problem with it mis-detecting and failing
> with an error / warning; they are the ones who chose that.

Heh, you have a different goal than I have here :)

I want first and foremost to have accurate and autodetected conversion. 
Second, I want it to be failsafe regardless of a user's setting.

The user's setting should be TRIED first, but not FORCED. If the data is broken UTF-8
we know it will break the parser, so it makes no sense to obey the user in that case.

Arguably latin and utf could be switched, but whatever we do we should

- make sure that whenever Utf8 is being used isUtf8() is called first and if
  it fails forget about using Utf8

- try Utf8 somewhere along the lines.

I agree that the user's codec should be tried first, as I already said in
the previous mail. But, again, tried != forced.

See also Thiago's mails to the list BTW. I suggest to continue in that thread, 
because we're now talking in two branches.

> Other than the local8bit() fallback (which is also useless... what does my
> local codec have to do with the sender's? There's really no correlation,
> it'd just be random luck to work),

Not really. Contact lists generally tend to consist of people mostly from the
same country. In western countries it is a bit useless, also because
UTF-8 and Latin1, by far the most used, are both tried anyway, so the local encoding is a
duplicate. But especially in countries like Russia and Greece the local
encoding is a VERY important one to try.

Comment 31 Stefan Gehn 2004-01-23 21:26:31 UTC
ICQ uses UTF-16BE or UTF-8 depending on the message type and the client in use; otherwise it just sends in the local encoding (at least the official Windows clients did that until they learned how to spell UTF).
Comment 32 Robin Rosenberg 2004-01-23 22:19:45 UTC
Subject: Re:  UTF8 and other cause XML parsing errors, only in IRC conversations

On Friday, 23 January 2004 17:18, Jason Keirstead wrote:
> Since these errors are infrequent, a little inefficiency on them is OK. ...

They are AFAIC not uncommon. 80% of the messages I see are affected. A further 10% are
not, because I have set the character set to ISO Latin 1 for those users. Thus the chat looks as if
Kopete has flipped. It is better to say nothing than to litter the chat log with warnings. The risk
of misinterpreting a message due to the wrong charset is very small. Alternatively, the
warning could be shown the first time a user's message causes a problem, or it could be on the
same line as the "message from".

Although I suggested it, I think "smart" heuristics here can outsmart themselves. On the other hand,
a per-topic or per-server setting is the best approach. People tend to use the same character
set in IRC chats and they will get it right after one or two tries.

-- robin

Comment 33 Matt Rogers 2004-01-23 23:09:58 UTC
*** Bug 73362 has been marked as a duplicate of this bug. ***
Comment 34 Jason Keirstead 2004-01-24 01:15:29 UTC
Subject: Re: [Kopete-devel]  UTF8 and other cause XML parsing errors, only in IRC conversations

> They are AFAIC not uncommon. 80% of the messages I see are affected. A further 10% are

As I said, the only reason you are seeing so many errors is that you are on
an all-SE server, which will be rectified with account encodings.

Under *normal* situations you don't see many encoding errors.

Comment 35 Robin Rosenberg 2004-01-24 18:03:43 UTC
Subject: Re:  UTF8 and other cause XML parsing errors, only in IRC conversations

On Saturday, 24 January 2004 01:15, Jason Keirstead wrote:
> As I said..the only reason you are seeing so many errors is you are on 
> an all SE server. Which
> will be rectified with account encodings.
> 
> Under *normal* situations you don't see many encoding errors.

I fail to see how chatting in Swedish is considered abnormal.

There should not be /any/ error so severe that all I see is an error message.

Could you reopen the bug (or any of the other reports referring to this problem)
until the "account setting" gets there and Kopete becomes usable again?

!Du har gått med i kanal #selinux
!
!Ett internt fel uppstod i Kopete vid tolkning av ett meddelande:
!XML-dokument kunde inte tydas.
!
!Ett internt fel uppstod i Kopete vid tolkning av ett meddelande:
!XML-dokument kunde inte tydas.

Comment 36 Robin Rosenberg 2004-01-24 18:08:08 UTC
Subject: Re:  UTF8 and other cause XML parsing errors, only in IRC conversations

On Friday, 23 January 2004 15:30, Iori Yagami wrote:
> i dont speak for myself , since i have a patched kopete already. i would not 
> miss installing the feature. but to the masses , it could be very usefull.

How do I get the patches?

Comment 37 Alexandre Pereira 2004-01-24 18:56:24 UTC
Subject: RE: [Kopete-devel]  UTF8 and other cause XML parsing errors,only in IRC conversations 

> On Friday 23 January 2004 15.30, Iori Yagami wrote:
> > i dont speak for myself, since i have a patched kopete already. i would not
> > miss installing the feature. but to the masses, it could be very useful.
>
> How do I get the patches?

They aren't made yet, but I think brunes will make some in the next few weeks.

If he makes them, I will try to test them. You should try to test them too,
and see if they are stable. If you need any help with patching, just ask on IRC.

Comment 38 Jason Keirstead 2004-01-24 20:39:55 UTC
Subject: Re: [Kopete-devel]  UTF8 and other cause XML parsing errors,
 only in IRC conversations

>I fail to see how chatting in Swedish is considered abnormal?
>
Arg. It isn't. A normal situation means one where your encoding matches the
server's. Most IRC servers use latin1 encoding, even if the channels are not
latin1. I would indeed characterise a server that communicates its protocol
information in a Swedish encoding as abnormal.

This will all be fixed when you can choose the encoding for the server.

>Could you reopen the bug (or any of the other reports referring to this problem)
>until the "account setting" gets there and Kopete becomes usable again?
>
It is not the same bug. It just looks the same to you; in reality
they're not related.

If you want to open a bug for server encodings, feel free.

Comment 39 Robin Rosenberg 2004-01-25 11:58:53 UTC
Subject: Re:  UTF8 and other cause XML parsing errors, only in IRC conversations

On Saturday 24 January 2004 20.39, Jason Keirstead wrote:
> Arg. It isn't. Normal situations means when your encoding matches the
> server's. Most IRC servers use latin1 encoding, even if the channels are
> not latin1. I would indeed characterise a server that communicates its
> protocol information in Swedish encoding to be abnormal.

"Swedish" encoding *is* Latin1. Both the server and the channels
on the server I referred to earlier are Latin1. My computer is also Latin1.

Comment 40 Jason Keirstead 2004-01-26 15:08:50 UTC
Subject: Re: [Kopete-devel]  UTF8 and other cause XML parsing errors, only in IRC conversations

On January 23, 2004 4:20 pm, Martijn Klingens wrote:
> I want first and foremost to have accurate and autodetected conversion.

This is impossible :P

> The user's setting should be TRIED first, but not FORCED. If it is broken
> utf8 we know it will break the parser, it makes no sense to obey the user
> at all.

So you mean, if the user chose UTF8 then we check isUTF8,  and if it is
not, then replace with ? characters wherever needed?

I would go for that I guess.

> - make sure that whenever Utf8 is being used isUtf8() is called first and
> if it fails forget about using Utf8

No. See, this is the problem. You are assuming that you should try UTF and then, if
UTF fails, you'll be able to guess something.

This is backwards. UTF is the only codec that gives no failure; it's also the
only one we have to scan over *twice* (isUTF8() and then conversion), so it's
the most expensive. And on top of all this, hardly anyone uses it. So it's the
most error prone, most expensive, and no one uses it. It *definitely* should
be the last check.

*UNLESS* the user chose it. If the user chose UTF, then attempt isUTF8; if
that fails, then *maybe* try latin1; if that fails, just clean up wherever
possible. There's no point trying local8bit, it's bound to fail.

> Not really. Generally contact lists tend to consist of people from mostly
> the same country. 

Eh huh? Not from my experience... I have people from here, from Europe, from 
Asia. Anyways, contact lists don't really have much to do with it, especially 
on IRC. Anyone could message you from anywhere out of the blue.

My new proposed ordering in pseudo code:

if( userCodec == QTextCodec::codecForName("utf") )
{
	// The user explicitly chose UTF-8: verify before decoding.
	if( isUTF8( string ) )
		return userCodec->decode( string )
	else
	{
		// Not valid UTF-8: fall back to latin1, then to cleanup.
		try QTextCodec::codecForName("latin1")->decode( string )
		if( success )
		{
			return
		}
		else
		{
			return cleanString( string );
		}
	}
}
else
{
	// Any other (or no) user codec: try it first.
	if( userCodec && userCodec->decode( string ) )
		return;
	else
	{
		try QTextCodec::codecForName("latin1")->decode( string )
		if( success )
		{
			return
		}
		else
		{
			return cleanString( string );
		}
	}
}

.. where cleanString strips all non-UTF-8 decodable characters from the string 
somehow.
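
A minimal sketch of what such a cleanString could look like, written in plain C++ without Qt. The names utf8SequenceLength and cleanString are illustrative only and are not taken from the Kopete source; the sketch assumes that "non-decodable" means any byte that is not part of a well-formed UTF-8 sequence, and it simply drops those bytes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the length (1-4) of the well-formed UTF-8
// sequence starting at s[i], or 0 if the bytes there are not valid UTF-8.
// The caller must make sure that i < s.size().
std::size_t utf8SequenceLength(const std::string& s, std::size_t i) {
    const auto byte = [&](std::size_t k) {
        return static_cast<unsigned char>(s[k]);
    };
    const unsigned char lead = byte(i);
    std::size_t len = 0;
    unsigned long minValue = 0;  // smallest code point allowed for this length
    unsigned long cp = 0;        // code point accumulated so far
    if (lead < 0x80)                { return 1; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; minValue = 0x80;    cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; minValue = 0x800;   cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; minValue = 0x10000; cp = lead & 0x07; }
    else                            { return 0; }  // stray continuation or invalid lead byte

    if (i + len > s.size()) return 0;               // sequence is cut off at the end
    for (std::size_t k = 1; k < len; ++k) {
        if ((byte(i + k) & 0xC0) != 0x80) return 0; // not a continuation byte
        cp = (cp << 6) | (byte(i + k) & 0x3F);
    }
    // Reject overlong forms, UTF-16 surrogates and values beyond U+10FFFF.
    if (cp < minValue || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    return len;
}

// Hypothetical cleanString(): copies well-formed sequences unchanged and
// drops every byte that cannot start one.
std::string cleanString(const std::string& input) {
    std::string out;
    for (std::size_t i = 0; i < input.size();) {
        const std::size_t len = utf8SequenceLength(input, i);
        if (len == 0) {
            ++i;                        // skip the offending byte
        } else {
            out.append(input, i, len);  // keep the valid sequence
            i += len;
        }
    }
    return out;
}
```

With this sketch the Latin1 bytes of "vídeos" from the original report would come out with the accented letter removed, while the same word sent as UTF-8 would pass through untouched.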

Comment 41 Robin Rosenberg 2004-01-26 16:25:33 UTC
Subject: Re: [Kopete-devel]  UTF8 and other cause XML parsing errors, only in IRC conversations

On Friday 23 January 2004 21.40, Jason Keirstead wrote:
> On January 23, 2004 4:20 pm, Martijn Klingens wrote:
[....]
> > - make sure that whenever Utf8 is being used isUtf8() is called first and
> > if it fails forget about using Utf8
> No. See, this is the problem. You are assuming that you should try UTF then if
> UTF fails then you'll be able to guess something.
> This is backwards. UTF is the only codec that gives no failure, also it's the
> only one we have to scan over *twice* (isUTF8() and then conversion), so it's
> the most expensive. And on top of all this, hardly anyone uses it. So it's
> most error prone, most expensive, and no one uses it. It *definitely* should
> be the last check.

I don't know KDE/Qt that well. How come UTF cannot fail? UTF-8 is designed so that
it is unlikely that a non-UTF-8 string can be recognized as UTF-8. If the UTF decoder cannot
fail, then what does it do when it encounters an illegal sequence?

On the other hand, how could an attempt to decode a string of bytes as ISO Latin1 fail? A
human user can say that something isn't Latin1, but the computer cannot unless we
add a user-specified blacklist, which is IMHO overkill.

> > Not really. Generally contact lists tend to consist of people from mostly
> > the same country. 
> 
> Eh huh? Not from my experience... I have people from here, from Europe, from 
> Asia. Anyways, contact lists don't really have much to do with it, especially 
> on IRC. Anyone could message you from anywhere out of the blue.

I suppose experience can vary here. To me it's either ISO Latin1 or ASCII that comes
over the wire. With ISO Latin1 it's usually the same country, or countries that use the
same character set. Nevertheless, the future will become more and more UTF-8-ized.

> My new proposed ordering in pseudo code:

Sounds reasonable.

Perhaps Latin9 (ISO-8859-15) should be attempted instead of Latin1. The difference is that a few
characters that were "never" used were replaced by some that actually are used.

-- robin

Comment 42 Martijn Klingens 2004-01-26 20:56:16 UTC
Subject: Re: [Kopete-devel]  UTF8 and other cause XML parsing errors, only in IRC conversations

On Monday 26 January 2004 15:08, Jason Keirstead wrote:
> > I want first and foremost to have accurate and autodetected conversion.
>
> This is impossible :P

True. But when using the right order (User Pref, Local Encoding, UTF-8, 
Latin1) at least you can make sure the chances of it failing are minimized.

> > The user's setting should be TRIED first, but not FORCED. If it is broken
> > utf8 we know it will break the parser, it makes no sense to obey the user
> > at all.
>
> So you mean, if the user chose UTF8 then we check isUTF8,  and if it is
> not, then replace with ? characters wherever needed?

Close. First, I would use QChar::replacement like Thiago mentioned instead of 
'?'.

Second, instead of doing the replacement if isUtf8 fails I would use Thiago's 
order, which would mean that after a Utf-8 failure latin1 is used.

Arguably it would be better to try Utf-8 BEFORE the local encoding, because a utf8
failure can be detected, while a local-encoding failure cannot be detected in all
cases (like when local is Latin1).

> No. See, this is the problem. You are assuming that you should try UTF then
> if UTF fails then you'll be able to guess something.

Exactly.

> This is backwards. UTF is the only codec that gives no failure,

Yes. HOWEVER, Latin1 is even worse, because it CANNOT FAIL. Whatever you feed 
as Latin1, it is BY DEFINITION LEGAL. Thus, you can't do Utf-8 after Latin1, 
it _HAS_ to be done before Latin1.

> also it's the only one we have to scan over *twice (isUTF8() and then
> conversion ) so its the most expensive.

Like Thiago said, isUtf8() doesn't copy data and should be fairly inexpensive.
Also, I would like to see figures for the additional load; I think it is in
fact pretty much negligible for most uses. After all, QString is one of the
most heavily optimized Qt classes. Do you have any KCacheGrind logs proving
me wrong?

> And on top of all this, hardly anyone uses it.

More and more people start using it, especially with ICQ, which also needs 
this code. And, again, Utf-8 HAS to be checked before Latin1, because after 
trying Latin1 you cannot POSSIBLY get a failure.

So whether it "should" be the last check for performance reasons or not, it 
CANNOT be the last check, no matter how much you'd want it.

> There's no point trying local8bit, it's bound to fail.

This too is wrong for most non-western locales. In fact, with ICQ in Russia it 
would be VERY IMPORTANT to have.

> Eh huh? Not from my experience... I have people from here, from Europe,
> from Asia. Anyways, contact lists don't really have much to do with it,
> especially on IRC. Anyone could message you from anywhere out of the blue.

Try thinking outside the IRC box :) (With IRC I tend to agree with the people 
on channels being diverse, although many people I know are only on Dutch 
language IRC channels, and almost all people I know have exclusively Dutch 
people on their contact list. We open source people are quite a different 
breed from the average user base.)
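
The ordering argument above (Latin1 can never fail, so a strict UTF-8 check has to come first) can be sketched in plain C++ without Qt. This is only an illustration under stated assumptions: decodeUtf8Strict, decodeLatin1 and decodeWithFallback are made-up names, the user preference and local encoding steps are left out, and the UTF-8 check is a strict validator rather than Qt's decoder or KStringHandler::isUtf8.

```cpp
#include <cstddef>
#include <string>

// Strict UTF-8 decoder (hypothetical helper): fills `out` with code points
// and returns false on the first malformed sequence.
bool decodeUtf8Strict(const std::string& bytes, std::u32string& out) {
    out.clear();
    for (std::size_t i = 0; i < bytes.size();) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        char32_t minValue = 0;  // smallest code point allowed for this length
        if (lead < 0x80)                { len = 1; cp = lead;        minValue = 0; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; minValue = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; minValue = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; minValue = 0x10000; }
        else return false;                             // invalid lead byte
        if (i + len > bytes.size()) return false;      // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false;      // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates and values beyond U+10FFFF.
        if (cp < minValue || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
        out.push_back(cp);
        i += len;
    }
    return true;
}

// Latin1 decoding cannot fail: every byte value is its own code point.
std::u32string decodeLatin1(const std::string& bytes) {
    std::u32string out;
    for (const char c : bytes) out.push_back(static_cast<unsigned char>(c));
    return out;
}

// UTF-8 first (its failure is detectable), Latin1 last (it accepts anything).
std::u32string decodeWithFallback(const std::string& bytes) {
    std::u32string text;
    if (decodeUtf8Strict(bytes, text)) return text;
    return decodeLatin1(bytes);
}
```

Swapping the two steps in decodeWithFallback would make the UTF-8 branch unreachable, which is the point being made: whatever is tried after Latin1 never runs.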

Comment 43 Jason Keirstead 2004-01-26 21:07:08 UTC
Subject: Re: [Kopete-devel]  UTF8 and other cause XML parsing errors, only in IRC conversations

> > There's no point trying local8bit, it's bound to fail.
>
> This too is wrong for most non-western locales. In fact, with ICQ in Russia
> it would be VERY IMPORTANT to have.

Why? It would only work in the situation where you're talking Russia -> 
Russia. And in that case, the sender should have the russian codec set anyway 
in the message so it would not be a problem?

> Try thinking outside the IRC box :) (With IRC I tend to agree with the
> people on channels being diverse, although many people I know are only on
> Dutch language IRC channels, and almost all people I know have exclusively
> Dutch people on their contact list. We open source people are quite a
> different breed from the average user base.)

While I do agree in a North American sense, since we all speak the same
language, I would think that your assumption that most people in European
countries have contacts using the same codec on their list is pretty
incorrect...


Comment 44 Martijn Klingens 2004-01-26 21:13:26 UTC
Subject: Re: [Kopete-devel]  UTF8 and other cause XML parsing errors, only in IRC conversations

On Monday 26 January 2004 21:07, Jason Keirstead wrote:
> Why? It would only work in the situation where you're talking Russia ->
> Russia.

Yes, by far the most common case is to be talking to fellow countrymen :)

> And in that case, the sender should have the russian codec set 
> anyway in the message so it would not be a problem?

Because with ICQ quite a lot of clients don't provide an encoding at all, in 
which case ICQ has to do exactly the same guesswork as IRC?

Don't you read any of the OSCAR/ICQ-related mails on this list, nor my several 
mentions of ICQ in this very bug report? ;)

> While I do agree in a North American sense, since we all speak the same
> language, I would think that your assumption that most people in European
> countries have contacts using the same codec on their list is pretty
> incorrect...

It's pretty correct actually in almost all eastern European and lots of Asian 
countries.

In Western Europe it's a bit of a mixture of Latin1, the default Windows 
encoding of which I don't know the name and a very strong rise of Utf-8.

Comment 45 Jason Keirstead 2004-01-26 21:27:49 UTC
Subject: Re: [Kopete-devel]  UTF8 and other cause XML parsing errors, only in IRC conversations

On January 26, 2004 4:13 pm, Martijn Klingens wrote:
> > Why? It would only work in the situation where you're talking Russia ->
> > Russia.
>
> Yes, by far the most common case is to be talking to fellow countrymen :)

I know, but who is to guarantee that they're even using the same codec? You
could be using Cyrillic and they could be using UTF?

I just have a very strong opinion that local8bit has absolutely no realistic 
relationship to the sender's codec, and it shouldn't influence our guessing, 
especially when you consider a string could be valid in multiple encodings.

I would rather display a warning message saying "we don't know the codec" than 
try to guess the codec and be right 90% of the time, but 10% of the time end 
up displaying total garbage. Displaying garbage even once in a while is not an 
option, it makes Kopete look horrible; it's better to admit that we just 
don't know for sure than make a possibly incorrect guess.


Comment 46 Thiago Macieira 2004-01-26 21:29:09 UTC
> I don't know KDE/Qt that well. How come UTF cannot fail? UTF-8 is designed
> so that it is unlikely that a non-UTF-8 string can be recognized as UTF-8. If
> the UTF decoder cannot fail, then what does it do when it encounters an
> illegal sequence? 
 
That's actually the very reason we're getting this problem, Robin.

In Qt 3.2.x, TrollTech introduced a modification to its UTF-8 decoder in response to a bug report from us. The original problem was that files whose names or paths could not be encoded in the user's selected locale could not be opened by KDE applications or renamed in Konqueror. We had proposed a solution, but TrollTech chose instead to accept any input as valid UTF-8: when it sees an invalid sequence, it encodes the bytes as a pair of UTF-16 surrogates. The decoder then restores the original byte.

This renders the operation ToUTF8(FromUTF8(any_string)) == any_string true in every case. The side-effect: Latin1 and other kinds of strings are accepted in Qt as valid UTF-8, but other programs don't accept them (our XML parser being one of those).

> Perhaps Latin9 (ISO-8859-15) should be attempted instead of Latin1. The
> difference is that a few characters that were "never" used were replaced by
> some that actually are used. 

That's ok in principle, but not so from the technical point of view. The Latin1-to-Unicode conversion is very simple and fast, since all Latin1's 256 codepoints map 1:1 to Unicode's first 256 codepoints. For Latin9 and any other encoding, a non-trivial conversion through table lookups must be performed.
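
To make the difference concrete, here is a small sketch in plain C++ of the two byte-to-Unicode mappings. The function names are invented for illustration, and the switch statement only shows where ISO-8859-15 departs from ISO-8859-1 (eight positions); it says nothing about how Qt's codecs are actually implemented.

```cpp
// Latin1 (ISO-8859-1): the byte value is the Unicode code point.
char32_t latin1ToUnicode(unsigned char byte) {
    return byte;
}

// Latin9 (ISO-8859-15): identical to Latin1 except for eight positions,
// so some form of lookup is needed for every byte.
char32_t latin9ToUnicode(unsigned char byte) {
    switch (byte) {
        case 0xA4: return 0x20AC;  // euro sign
        case 0xA6: return 0x0160;  // capital S with caron
        case 0xA8: return 0x0161;  // small s with caron
        case 0xB4: return 0x017D;  // capital Z with caron
        case 0xB8: return 0x017E;  // small z with caron
        case 0xBC: return 0x0152;  // capital ligature OE
        case 0xBD: return 0x0153;  // small ligature oe
        case 0xBE: return 0x0178;  // capital Y with diaeresis
        default:   return byte;    // everything else matches Latin1
    }
}
```

For the accented letters discussed in this report (for example 0xED, "í") both functions return the same code point, so the choice between the two only matters for the eight bytes listed.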

> if( userCodec == QTextCodec::codecForName("utf") ) 
 
Please don't write that. That requires a codec lookup internally by QTextCodec. Instead, use userCodec->mibEnum() == 106 to detect the UTF-8 encoder.

A couple more opinions from me:
- trying UTF-8 before the user's locale:
Makes sense, since we may catch UTF-8 being used. The probability of someone writing valid text in another encoding and it being valid UTF-8 is very low.

- the user-selected codec fails decoding:
Decode as Latin1, but let the user know about this fact (a non-intrusive warning or a "bug" icon like Konqueror's for JavaScript errors).

Important: KStringHandler::isUtf8 rejects control characters, including ASCII 3 used by mIRC-colouring in IRC.
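
One conceivable way around the control-character issue, which nobody in this thread has proposed, would be to remove colour codes from a line before running any UTF-8 heuristic on it. The sketch below is plain C++ and rests on an assumption about the mIRC colour syntax: byte 0x03, optionally followed by one or two foreground digits, optionally followed by a comma and one or two background digits. stripMircColours is a made-up name, not a Kopete or KDE function.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: removes mIRC colour codes (0x03 plus optional
// "fg" or "fg,bg" digits) and leaves all other bytes untouched.
std::string stripMircColours(const std::string& line) {
    std::string out;
    const auto isDigit = [&](std::size_t k) {
        return k < line.size() &&
               std::isdigit(static_cast<unsigned char>(line[k])) != 0;
    };
    std::size_t i = 0;
    while (i < line.size()) {
        if (line[i] != '\x03') {       // ordinary byte: copy it
            out += line[i++];
            continue;
        }
        ++i;                           // skip the control byte itself
        if (!isDigit(i)) continue;     // bare 0x03 carries no colour digits
        ++i;                           // first foreground digit
        if (isDigit(i)) ++i;           // optional second foreground digit
        if (i < line.size() && line[i] == ',' && isDigit(i + 1)) {
            i += 2;                    // comma plus first background digit
            if (isDigit(i)) ++i;       // optional second background digit
        }
    }
    return out;
}
```

The remaining bytes, including any Latin1 or UTF-8 text, are passed through unchanged, so the encoding detection would still have to run on the result.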
Comment 47 Jason Keirstead 2004-01-31 16:10:50 UTC
*** Bug 73877 has been marked as a duplicate of this bug. ***
Comment 48 Jason Keirstead 2004-02-02 13:42:59 UTC
*** Bug 73625 has been marked as a duplicate of this bug. ***
Comment 49 Sami Nieminen 2004-02-04 19:48:25 UTC
For me, setting the encoding for the channel to iso8859-15 fixes the problem with characters like 'öä' in normal messages, but when those characters appear in a quit message, I still get the XML parsing error.

Another problem is that Kopete doesn't store the encoding setting, so I need to set it each time I restart Kopete.