Summary: | UTF8 and other cause XML parsing errors, only in IRC conversations | ||
---|---|---|---|
Product: | [Unmaintained] kopete | Reporter: | Iván Sánchez Ortega <ivansanchez> |
Component: | IRC Plugin | Assignee: | Kopete Developers <kopete-bugs-null> |
Status: | RESOLVED NOT A BUG | ||
Severity: | normal | CC: | hasso, swf001, thiago, volker.assmann |
Priority: | NOR | ||
Version: | unspecified | ||
Target Milestone: | --- | ||
Platform: | unspecified | ||
OS: | Linux | ||
Latest Commit: | Version Fixed In: | ||
Sentry Crash Report: | |||
Attachments: | Ethereal dump from connecting to #selinux (se.linux.org). |
Description
Iván Sánchez Ortega
2004-01-18 23:39:10 UTC
You have to change the text codec for the channel you are in to whatever the people are using.

Right click on the channel contact, or use the IRC menu in the chat window.

Subject: Re: UTF8 and other cause XML parsing errors, only in IRC conversations

> ------- You have to change the text codec for the channel you are in to
> whatever the people are using.
>
> Right click on the channel contact, or use the IRC menu in the chat window.

The fact is that I can't do that - there is no option to change the text codec in the context menu of the channel contact.

Maybe that option was added just a few days/weeks ago? Is that option to be included in the per-protocol settings window?

Regards,

Subject: Re: [Kopete-devel] UTF8 and other cause XML parsing errors, only in IRC conversations
On January 19, 2004 9:27 am, Iván Sánchez Ortega wrote:
> The fact is that I can't do that - there is no option to change the text
> codec in the context menu of the channel contact.
> Maybe that option was added just a few days/weeks ago?
Yes it was added recently.
Update your CVS or wait until 0.8 final / KDE 3.2 is released on Feb. 2
Changing character set does not work. I get the error as soon as I connect.

Furthermore the user should be given a hint. Especially since XML is used in the IRC protocol. The error message says "internal error", a phrase typically meaning "oops, programmer forgot to check something".

Kopete could use the error message from the XML parser as a way of detecting the character set by trying another:
1. UTF-8, easy to detect usually as we have seen
2. The user's desktop setting
3. ISOLatin1.

In any case, an error in handling the characters should be displayed to the user in a friendly way.

Created attachment 4309 [details]
Ethereal dump from connecting to #selinux (se.linux.org).
One more reason for reopening this bug.

It is not reasonable to have to change the character set for every contact in IRC in a forum with 200 connected users, especially since I cannot see who wrote what I couldn't see. I've just tried in a chat with 30 users, and got exhausted after setting the character set for 10 users. The only hint with regard to who sent the message is that some users' messages pass through as they happen to contain only ASCII.

I have version kopete 0.8.0

This message gave the XML parse error despite the fact that I had changed the character set for the user.

---------------------------ethereal-----------------------
:sensei!~sensei@as2-6-3.lde.g.bonet.se PRIVMSG #Selinux :Man måste naturligtvis ställa in x så att den känner till de nya koderna.
---------------------------ethereal-----------------------

Noticed a funny thing though. "sensei" uses the channel name #Selinux with a capital S. #Selinux instead of #selinux.

Subject: Re: [Kopete-devel] UTF8 and other cause XML parsing errors, only in IRC conversations

also, i would like to see the message that gave the error .... so what if, in a long line, there is a character that would be a square or even nonexistent?!! at least i can try to guess what someone said, or who, etc...

On Friday, 23 January 2004 09:34, Robin Rosenberg wrote:
> http://bugs.kde.org/show_bug.cgi?id=72917
>
> ------- Additional Comments From robin.rosenberg@dewire.com 2004-01-23 10:34 -------
> Changing character set does not work. I get the error as soon as I connect.
>
> Furthermore the user should be given a hint. Especially since XML is used
> in the IRC protocol. The error message says "internal error", a phrase
> typically meaning "oops, programmer forgot to check something".
>
> Kopete could use the error message from the XML parser as a way of
> detecting the character set by trying another.
> 1. UTF-8, easy to detect usually as we have seen
> 2. The user's desktop setting
> 3. ISOLatin1.
>
> In any case, an error in handling the characters should be displayed to the
> user in a friendly way.

Oops, I was following the "this is a duplicate" etc and found this to be "similar". I was getting the XML parser error. I cannot see the message, rather I see this:

"An internal error occurred in kopete while parsing a message: XML document could not be parsed"

Subject: Re: UTF8 and other cause XML parsing errors, only in IRC conversations

On Friday, 23 January 2004 12:45, Iori Yagami wrote:

A few points:

>Furthermore the user should be given a hint. Especially since XML is used in
>the IRC protocol. The error message says "internal error", a phrase typically
>meaning "oops, programmer forgot to check something"

Kopete uses XML for all messages, on all protocols. A string can either be parsed as UTF-8, or it can't. If the string is not parseable (because there is no way to figure out the right codec), it can't be displayed in the XML document at all. You can see them in network debug and stuff like that because they don't have this restriction.

>It is not reasonable to have to change the character set for every contact in
>IRC in forum with 200 connected users, especially since I cannot see who wrote
>what I couldn't see. I've just tried in a chat with 30 users, and got
>exhausted after setting the character set for 10 users. The only hint with

Heh, this might be your largest problem. Setting the character set for all these people won't do anything anyways. You need to set the character set for the CHANNEL you are in. Right click on the CHANNEL. Or use the IRC menu in the window.

You only need to set the character set for a user if you are in a one on one chat with them and they are not using UTF-8.

>Well, right now I'm switching to History mode to see the "error-giving"
>messages. It seems that the logs are already parsed, but somehow distorted.

This is the same problem. The history plugin will store messages correctly only if the codec was set correctly originally.

-------------

The biggest problem you are having is that *all* communication on this server is not in UTF-8. You need an "account default" codec that applies to all contacts on the server, and to communication with the server itself.

This was a requested feature, and it will be in the next release. It did not go into this release due to the string and feature freeze.

Subject: Re: UTF8 and other cause XML parsing errors, only in IRC conversations

On Friday, 23 January 2004 13:36, Jason Keirstead wrote:
> >Furthermore the user should be given a hint. Especially since XML is used in
> >the IRC protocol. The error message says "internal error", a phrase typically
> >meaning "oops, programmer forgot to check something"
>
> Kopete uses XML for all messages, on all protocols.

I ran etherape on the connection (see attached file). It contains no XML whatsoever. The conclusion is that kopete internally creates invalid XML documents that were not there to start with.

> >It is not reasonable to have to change the character set for every contact in
> You need to set the character set for the CHANNEL you are in. Right click on the CHANNEL. Or use the IRC menu in the window.

I did that first, so I expected the channel topic to be visible. Not!

> Heh, this might be your largest problem. Setting the character set for all these people won't do anything anyways.

I had to do that. Setting the character set on the channel did nothing.

> The biggest problem you are having is that *all* communication on this server is not in UTF-8. You need an "account default" codec that applies to all contacts on the server, and to communication with the server itself.
>
> This was a requested feature, and it will be in the next release. It did not go into this release due to the string and feature freeze.

So I have to look for another client. A pity since I liked kopete. In its current state it is U-N-U-S-A-B-L-E.

-- robin

Subject: Re: [Kopete-devel] UTF8 and other cause XML parsing errors, only in IRC conversations

> So I have to look for another client. A pity since I liked kopete. In its
> current state it is U-N-U-S-A-B-L-E.
>
> -- robin

well, you do get the correct messages if you start kopete in a konsole. when an xml parsing error occurs, the correct message will be displayed on the konsole.

>I ran etherape on the connection (see attached file). It contains no XML
>whatsoever. The conclusion is that kopete internally creates invalid XML
>documents that were not there to start with.

As I said above. Etherape and these other things don't use XML. They don't have to care.

When you have an XML document, you *have* to specify the encoding it is in. It's a simple fact. If you don't specify the right encoding a parsing error occurs. We used to be able to somewhat reliably auto-detect the encoding, however Qt releases since 3.1 have made this not possible any longer.

>I did that first, so I expected the channel topic to be visible. Not!

Yes, if what you are referring to is things like channel topics etc, these will be unreadable until we have an encoding setting for the server itself. I hope to have the code to do this done by next week, and it will be in CVS as soon as it's opened.

Subject: Re: [Kopete-devel] UTF8 and other cause XML parsing errors, only in IRC conversations
On Friday, 23 January 2004 13:17, Jason Keirstead wrote:
> Yes, if what you are referring to is things like channel topics etc, these
> will be unreadable until we have an encoding setting for the server itself.
> I hope to have the code to do this done by next week, and it will be in CVS
> as soon as it's opened.
when you have the code, post it as a patch against kde 3.2 rc1 and so on, so
that distro packagers can apply it automatically.
since i am using gentoo, i would like gentoo to automatically patch
kdenetwork so that kopete in kde 3.2 rc1 and kde 3.2 has that codec.
> when you have the code, post it as a patch against kde 3.2 rc1 and so on, so
> that distro packagers can apply it automatically.
I can't. The code will have lots of new labels and thus cannot go into the 3.2 branch, and thus packagers could not apply it because it would all be untranslated.
PS: You do not need to CC kopete-devel on replies to BKO, they are all sent there anyways.
Subject: Re: [Kopete-devel] UTF8 and other cause XML parsing errors, only in IRC conversations

On Friday, 23 January 2004 14:15, Jason Keirstead wrote:
> I can't. The code will have lots of new labels and thus cannot go into the
> 3.2 branch, and thus packagers could not apply it because it would all be
> untranslated.

well, i don't know to what extent a patch can't be created, but if it is only because of an untranslated option, i would prefer to have that untranslated option than not being able to chat on some irc servers. gentoo would be very smart about this and probably use a USE variable for people to set it up.

i know that kopete 3.3 or the next one will have that feature. but until then (probably 6 months), people either use cvs or use your patch. it's a very long time.

i don't speak for myself, since i have a patched kopete already. i would not miss installing the feature. but to the masses, it could be very useful.

>PS: You do not need to CC kopete-devel on replies to BKO, they are all sent
>there anyways.

hehe, just asked that on irc, last time i replied

Subject: Re: UTF8 and other cause XML parsing errors, only in IRC conversations

On Friday, 23 January 2004 15:15, Jason Keirstead wrote:
> I can't. The code will have lots of new labels and thus cannot go into the 3.2 branch, and thus packagers could not apply it because it would all be untranslated.

It is infinitely better if it works in English than not at all.

It is not up to me if the code I create will be able to be backported. It is up to the i18n team. I will try and make a backport patch and send it to them if I can.

Subject: Re: [Kopete-devel] UTF8 and other cause XML parsing errors, only in IRC conversations
On Friday 23 January 2004 15:51, Robin Rosenberg wrote:
> It is infinitely better if it works in English than not at all.
Please read the KDE guidelines for the stable branches. Cases where string
changes are allowed there are *very* rare. I doubt the KDE release
coordinators and translation teams would approve such a change.
Subject: Re: UTF8 and other cause XML parsing errors, only in IRC conversations
On Friday, 23 January 2004 16:42, Martijn Klingens wrote:
> Please read the KDE guidelines for the stable branches. Cases where string
> changes are allowed there are *very* rare. I doubt the KDE release
> coordinators and translation teams would approve such a change.
Some other "solution" would do, like making a best effort to display the text instead
of the XML parser error message. If I see something like

Meddelande från mamma vid 23:59.60:
Maten är klar
[Message from mum at 23:59.60: Dinner is ready]

or

Meddelande från mamma vid 23:59.60:
Maten ?r klar

it is legible, but

Ett internt fel uppstod i Kopete vid tolkning av ett meddelande:
XML-dokument kunde inte tydas.
[An internal error occurred in Kopete while parsing a message: XML document could not be parsed.]

is not.

-- robin
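
The crudest form of the best-effort display asked for above can be sketched in plain standard C++. This is not Kopete or Qt code, and `asciiBestEffort` is a made-up name for illustration only. It assumes the incoming message is an opaque byte string of unknown encoding and replaces every byte outside 7-bit ASCII with '?', so the result can always be shown.

```cpp
#include <string>

// Hypothetical helper, not part of Kopete: keep 7-bit ASCII bytes and
// replace every other byte with '?'. The output is plain ASCII, so it is
// valid in UTF-8, Latin-1 and every other ASCII-compatible encoding.
std::string asciiBestEffort(const std::string &raw)
{
    std::string out;
    out.reserve(raw.size());
    for (std::string::size_type i = 0; i < raw.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(raw[i]);
        out += (byte < 0x80) ? raw[i] : '?';
    }
    return out;
}
```

The trade-off is that a multi-byte UTF-8 character turns into several question marks, so this only makes sense as a last resort after the real decoding attempts have failed.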
Martijn, and others, I have an idea how we can display these messages better.

Since these errors are infrequent, a little inefficiency on them is OK. So, what we could do is this... when the XML parsing fails, we throw the string through this jobbie, that takes the string we *thought* was UTF-8 and *makes sure* it is UTF-8:

    QCString decodeAttempt( QCString &utf )
    {
        QTextCodec *utfCodec = QTextCodec::codecForName( "utf8" );
        QString resultString;
        for( uint i = 0; i < utf.length(); i++ )
        {
            QChar thisChar = utf[i];
            // keep what the UTF-8 codec accepts, replace the rest
            if( utfCodec->canDecode( thisChar ) )
                resultString += thisChar;
            else
                resultString += QChar('?');
        }
        return resultString.utf8();
    }

.. Thus replacing all the un-encodable characters with ?

Then instead of

    An internal Kopete error occurred while parsing a message:
    XML document could not be parsed!

We can display:

    WARNING: Kopete could not properly determine the encoding of the following message:
    Foobar message ?? foo.

Subject: Re: [Kopete-devel] UTF8 and other cause XML parsing errors, only in IRC conversations
On Friday 23 January 2004 17:18, Jason Keirstead wrote:
> We can display:
>
> WARNING: Kopete could not properly determine the encoding of the following
> message:
> Foobar message ?? foo.
Ah, seems like a reasonable workaround.
As for the encoding DETECTION, would using KStringHandler::isUtf8() before
using QString::fromUtf8() help?
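
For illustration, here is roughly what a check in the spirit of KStringHandler::isUtf8() has to do, written in standard C++ without Qt or KDE. `looksLikeUtf8` is a hypothetical stand-in and not the KDE implementation; it follows the well-formedness rules of RFC 3629 (no overlong forms, no UTF-16 surrogates, nothing above U+10FFFF).

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for an isUtf8()-style check: returns true only if
// the whole byte string is well-formed UTF-8 according to RFC 3629.
bool looksLikeUtf8(const std::string &bytes)
{
    std::size_t pos = 0;
    while (pos < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[pos]);
        std::size_t length = 0;
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the 2nd byte

        if (lead <= 0x7F) {                  // plain ASCII
            ++pos;
            continue;
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            length = 2;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            length = 3;
            if (lead == 0xE0) lo = 0xA0;     // rejects overlong 3-byte forms
            if (lead == 0xED) hi = 0x9F;     // rejects UTF-16 surrogates
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            length = 4;
            if (lead == 0xF0) lo = 0x90;     // rejects overlong 4-byte forms
            if (lead == 0xF4) hi = 0x8F;     // rejects code points > U+10FFFF
        } else {
            return false;                    // stray continuation or bad lead
        }

        if (pos + length > bytes.size())
            return false;                    // sequence is cut off
        for (std::size_t i = 1; i < length; ++i) {
            const unsigned char c = static_cast<unsigned char>(bytes[pos + i]);
            if (c < lo || c > hi)
                return false;
            lo = 0x80;                       // later bytes use the full range
            hi = 0xBF;
        }
        pos += length;
    }
    return true;
}
```

A Latin-1 line with a lone 0xE4 ("ä") fails this check, because 0xE4 announces a three-byte sequence that never arrives. That is why a message in Latin-1 can usually be told apart from UTF-8 before it ever reaches the XML parser.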
Yes it would be a reasonable first check. If that fails there is no point trying to attempt the XML transform.

Note that the reason we didn't use that in Kopete yet is it is an @Since 3.2 method

Subject: Re: [Kopete-devel] UTF8 and other cause XML parsing errors, only in IRC conversations

On Friday 23 January 2004 17:27, Jason Keirstead wrote:
> Yes it would be a reasonable first check. If that fails there is no point
> trying to attempt the XML transform.

That, too. But I was thinking of a much earlier stage: when you are parsing incoming IRC data and when Oscar is parsing incoming ICQ data.

If isUtf8() fails it can try ::fromLatin1 because that one AFAIK can be reliably autodetected (unlike utf8() it doesn't accept invalid chars AFAIK), followed by your fallback. A simple static in libkopete (QString KopeteMessage::detectEncoding( char * ) ?) could handle it, and avoid the problem altogether.

> Note that the reason we didn't use that in Kopete yet is it is an
> @Since 3.2 method

True, it wouldn't help KDE 3.1 users. But #ifdef'd out it would tremendously help those who will be upgrading to 3.2. And it allows us to tell users to upgrade KDE rather than telling them we can't fix the bug.

Lastly, we could even duplicate the call in compat/ though I'm not too much in favour of that.

Subject: Re: [Kopete-devel] UTF8 and other cause XML parsing errors, only in IRC conversations

On Friday, 23 January 2004 16:25, Martijn Klingens wrote:
> ------- Additional Comments From klingens@kde.org 2004-01-23 17:25 -------
> Subject: Re: [Kopete-devel] UTF8 and other cause XML parsing errors, only
> in IRC conversations
>
> On Friday 23 January 2004 17:18, Jason Keirstead wrote:
> > We can display:
> >
> > WARNING: Kopete could not properly determine the encoding of the
> > following message:
> > Foobar message ?? foo.
>
> Ah, seems like a reasonable workaround.
>
> As for the encoding DETECTION, would using KStringHandler::isUtf8() before
> using QString::fromUtf8() help?

if you can do that, then you can probably make an iso8859-15 to utf8 conversion and then try to put it in xml, right?! even if this isn't 100% foolproof, if it fixed 25% of cases that would already be good.

also, could the "WARNING: Kopete could not properly determine the encoding of the following message: Foobar message ?? foo." be like:

WARNING ?? : "message"

so as not to take too much space on the screen (i see this going to happen a lot)

Subject: Re: [Kopete-devel] UTF8 and other cause XML parsing errors, only in IRC conversations

On January 23, 2004 12:39 pm, Martijn Klingens wrote:
> That, too. But I was thinking of a much earlier stage: when you are parsing
> incoming IRC data and when Oscar is parsing incoming ICQ data.

It would be useless here. The incoming IRC data is almost never going to be UTF-8, no one uses it (I wish they did, but they don't).

> If isUtf8() fails it can try ::fromLatin1 because that one AFAIK can be
> reliably autodetected (unlike utf8() it doesn't accept invalid chars
> AFAIK), followed by your fallback.

The IRC engine just uses the contact codec, which if not defined is UTF. Basically what you are saying is use latin1 if it is undefined, and if that craps out then try latin1, and if that fails just send the incorrect utf data to the XML parser, where it will fail and print the warning.

> A simple static in libkopete (QString
> KopeteMessage::detectEncoding( char * ) ?) could handle it, and avoid the
> problem altogether.

Only if you can pass a preferred codec to this static. I think this would be the best, a combination of your suggestion and my previous function:

    QString KopeteMessage::decodeString( QCString string, QTextCodec *preferredCodec = 0L )
    {
        if( !preferredCodec )
            preferredCodec = QTextCodec::codecForName("latin1");

        if( !preferredCodec->canDecode( string ) )
        {
            // fall back to UTF-8, replacing whatever cannot be decoded
            QTextCodec *utfCodec = QTextCodec::codecForName( "utf8" );
            QString resultString;
            for( uint i = 0; i < string.length(); i++ )
            {
                QChar thisChar = string[i];
                if( utfCodec->canDecode( thisChar ) )
                    resultString += utfCodec->toUnicode( thisChar );
                else
                    resultString += QChar('?');
            }
            return resultString;
        }
        else
        {
            return preferredCodec->toUnicode( string );
        }
    }

Subject: Re: [Kopete-devel] UTF8 and other cause XML parsing errors, only in IRC conversations

On Friday 23 January 2004 18:48, Jason Keirstead wrote:
> The IRC engine just uses the contact codec, which if not defined is UTF.
> Basically what you are saying is use latin1 if it is undefined, and if that
> craps out then try latin1, and if that fails just send the incorrect utf
> data to the XML parser, where it will fail and print the warning.

Nope. What I was saying is to try (in this order)

In the plugin (IRC or ICQ):
- Decode as utf8. If isUtf8() is available, use it and continue if it fails. Otherwise we have to assume it's utf8 and continue at the XSLT part below.
- Decode as latin1.
- If both utf8 and latin1 fail, try local8Bit IF AND ONLY IF the local encoding is neither utf8 nor latin1.
- When all these failed, use your code that replaces invalid chars with question marks. Since we're doing it *here* that means the whole dreaded 'should never happen' XML error indeed no longer happens at all.

In the XML/XSLT code:
- Use the code that we have now
- If the decoding fails, use a more verbose error. With the above changes this should however become an almost unused code path.

> Only if you can pass a preferred codec to this static. I think this would
> be the best, a combination of your suggestion and my previous function:
>
> (snip)

Your code has the tremendous advantage that it allows a custom codec selection, moving even more code duplication from the plugins. I like that. Some things I miss in your code though:

- if preferredCodec is UTF-8 we're at square one, because canDecode() will always return true. Therefore UTF-8 should be special cased and use KStringHandler when available.
- You use fewer fallbacks than I had in mind. See the above heuristics.

(The encoding discussion is finally getting interesting again BTW after months of frustration :)

Subject: Re: [Kopete-devel] UTF8 and other cause XML parsing errors, only in IRC conversations

On January 23, 2004 2:55 pm, Martijn Klingens wrote:
> What I was saying is to try (in this order)
>
> In the plugin (IRC or ICQ):
> - Decode as utf8. If isUtf8() is available, use it and continue if it
> fails. Otherwise we have to assume it's utf8 and continue at the XSLT part
> below.

This will only work in like 0.01% of the cases in IRC, that's why I was saying it's a waste of time. Not sure about ICQ but I suspect it is the same thing there.. not many people use UTF-8. That's why I want to leave that check until last.

> - If both utf8 and latin1 fail, try local8Bit IF AND ONLY IF the local
> encoding is neither utf8 nor latin1.
>
> - When all these failed, use your code that replaces invalid chars with
> question marks. Since we're doing it *here* that means the whole dreaded
> 'should never happen' XML error indeed no longer happens at all.

But in all this, where is the user's chosen codec? The user's selected codec should *always* be tried first.

> Your code has the tremendous advantage that it allows a custom codec
> selection, moving even more code duplication from the plugins. I like that.
> Some things I miss in your code though:
>
> - if preferredCodec is UTF-8 we're at square one, because canDecode() will
> always return true.

I don't really see this as much of a problem. If the default codec for all contacts is Latin1, then the user has to manually change to UTF-8. If they manually do this I don't have a problem with it mis-detecting and failing with an error / warning; they are the ones who chose that.

> Therefore UTF-8 should be special cased and use
> KStringHandler when available.

Use it for what? As I said, isUTF8() is pretty much useless, since it is hardly *ever* UTF-8. That's why my code tries everything else *first*, then falls back on UTF-8 with the ? replacement if needed.

> - You use fewer fallbacks than I had in mind. See the above heuristics.

Other than the local8bit() fallback (which is also useless... what does my local codec have to do with the sender's? There's really no correlation, it'd just be random luck to work), the only difference is that I move the UTF check to the end.

Subject: Re: [Kopete-devel] UTF8 and other cause XML parsing errors, only in IRC conversations

On Friday 23 January 2004 21:09, Jason Keirstead wrote:
> I don't really see this as much of a problem. If the default codec for all
> contacts is Latin1, then the user has to manually change to UTF-8. If they
> manually do this I don't have a problem with it mis-detecting and failing
> with an error / warning; they are the ones who chose that.

Heh, you have a different goal than I have here :)

I want first and foremost to have accurate and autodetected conversion. Second, I want it to be failsafe regardless of a user's setting.

The user's setting should be TRIED first, but not FORCED. If it is broken utf8 we know it will break the parser, it makes no sense to obey the user at all.

Arguably latin and utf could be switched, but whatever we do we should
- make sure that whenever Utf8 is being used isUtf8() is called first and if it fails forget about using Utf8
- try Utf8 somewhere along the lines.

I agree that the user's codec should be tried first, like I already said in the previous mail. But, again, tried != forced.

See also Thiago's mails to the list BTW. I suggest to continue in that thread, because we're now talking in two branches.

> Other than the local8bit() fallback (which is also useless... what does my
> local codec have to do with the sender's? There's really no correlation,
> it'd just be random luck to work),

Not really. Generally contact lists tend to consist of people from mostly the same country. In western countries it is a bit useless, also because the by far most used Utf and Latin1 are both tried anyway, so the local is a duplicate, but especially in the Russian and Greek countries the local encoding is a VERY important one to try.

ICQ uses UTF-16BE or UTF-8 depending on the message type and the client in use, otherwise it just sends in the local encoding (at least official winblows clients did that until they knew how to spell UTF)

Subject: Re: UTF8 and other cause XML parsing errors, only in IRC conversations
On Friday, 23 January 2004 17:18, Jason Keirstead wrote:
> ------- Additional Comments From jason@keirstead.org 2004-01-23 17:18 -------
> Since these errors are infrequent, a little inefficiency on them is OK. ...
They are AFAIC not uncommon. 80% of the messages I see are affected. A further 10% are
not, because I have set the character set to isolatin1 for these users. Thus the chat looks like
kopete has flipped. It is better to say nothing than to litter the chat log with warnings. The risk
of misinterpreting the message due to the wrong charset is very small. Alternatively, the
warning could be shown the first time a user's message causes a problem, or it could be on the
same line as the "message from".
Although I suggested it, I think "smart" heuristics here can outsmart themselves. On the other hand
a per-topic or per-server setting is the best approach. People tend to use the same character
set in IRC chats and they will get it right after one or two tries.
-- robin
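
The per-channel and per-server ("account default") settings discussed in this thread amount to a simple lookup order. The sketch below is only a guess at that order in standard C++; it is not Kopete code, and `CodecSettings`, `perTarget`, `accountDefault` and `resolveCodec` are all hypothetical names. It assumes an explicit setting on the channel or contact wins, then the account-wide default, then UTF-8, which an earlier comment names as the IRC plugin's fallback when no codec is defined.

```cpp
#include <map>
#include <string>

// Hypothetical settings container, for illustration only.
struct CodecSettings {
    // Explicit choices, keyed by channel name or contact nick.
    std::map<std::string, std::string> perTarget;
    // Account-wide (server) default; empty means "not set".
    std::string accountDefault;
};

// Pick the codec name for a message in `target` (a channel or a contact):
// explicit per-target choice first, then the account default, then UTF-8.
std::string resolveCodec(const CodecSettings &settings, const std::string &target)
{
    const std::map<std::string, std::string>::const_iterator it =
        settings.perTarget.find(target);
    if (it != settings.perTarget.end())
        return it->second;
    if (!settings.accountDefault.empty())
        return settings.accountDefault;
    return "UTF-8";
}
```

With an account default of ISO-8859-1 on a server like the one described here, every channel and the server's own messages would be decoded as Latin-1 without touching a single contact, and a single UTF-8 channel could still be overridden on its own.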
*** Bug 73362 has been marked as a duplicate of this bug. ***

Subject: Re: [Kopete-devel] UTF8 and other cause XML parsing errors, only in IRC conversations

>They are AFAIC not uncommon. 80% of the messages I see are affected. A further 10% are

As I said.. the only reason you are seeing so many errors is that you are on
an all-SE server, which will be rectified with account encodings.
Under *normal* situations you don't see many encoding errors.
Subject: Re: UTF8 and other cause XML parsing errors, only in IRC conversations
On Saturday, 24 January 2004 01:15, Jason Keirstead wrote:
> As I said..the only reason you are seeing so many errors is you are on
> an all SE server. Which
> will be rectified with account encodings.
>
> Under *normal* situations you don't see many encoding errors.
I fail to see how chatting in Swedish is considered abnormal?
There should not be /any/ error so severe that all I see is an error message.
Could you reopen the bug (or any of the other reports referring to this problem)
until the "account setting" gets there and kopete becomes usable again?
!Du har gått med i kanal #selinux
!
!Ett internt fel uppstod i Kopete vid tolkning av ett meddelande:
!XML-dokument kunde inte tydas.
!
!Ett internt fel uppstod i Kopete vid tolkning av ett meddelande:
!XML-dokument kunde inte tydas.
Subject: Re: UTF8 and other cause XML parsing errors, only in IRC conversations
On Friday, 23 January 2004 15:30, Iori Yagami wrote:
> i don't speak for myself, since i have a patched kopete already. i would not
> miss installing the feature. but to the masses, it could be very useful.
How do I get the patches?
Subject: RE: [Kopete-devel] UTF8 and other cause XML parsing errors, only in IRC conversations

On Friday, 23 January 2004 15:30, Iori Yagami wrote:
> i don't speak for myself, since i have a patched kopete already. i would not
> miss installing the feature. but to the masses, it could be very useful.

How do I get the patches?

They aren't made yet, but I think brunes will make some in the next few weeks. If he makes them, I will try to test them. You should try to test them too, and see if they are stable. If you need any help with patching, just ask on irc

Subject: Re: [Kopete-devel] UTF8 and other cause XML parsing errors, only in IRC conversations

>I fail to see how chatting in Swedish is considered abnormal?

Arg. It isn't. Normal situations means when your encoding matches the server's. Most IRC servers use latin1 encoding, even if the channels are not latin1. I would indeed characterise a server that communicates its protocol information in Swedish encoding to be abnormal.

This will all be fixed when you can choose encoding for the server.

>Could you reopen the bug (or any of the other reports referring to this problem)
>until the "account setting" gets there and kopete becomes usable again?

It is not the same bug. It just looks the same to you, in reality they're not related. If you want to open a bug for server encodings, feel free.

Subject: Re: UTF8 and other cause XML parsing errors, only in IRC conversations
On Saturday, 24 January 2004 20:39, Jason Keirstead wrote:
> Arg. It isn't. Normal situations means when your encoding matches the
> server's. Most IRC
> servers use latin1 encoding, even if the channels are not latin1. I
> would indeed characterise a
> server that communicates its protocol information in Swedish encoding
> to be abnormal.
"Swedish" encoding *is* Latin1. Both the server and the channels
on the server I referred to earlier are Latin1. My computer is also latin1.
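
To make the Latin1 point concrete: each Latin-1 byte is, by definition, the Unicode code point with the same value, so turning such a line into the UTF-8 that an XML parser expects is purely mechanical. The sketch below is plain standard C++ and not Kopete or Qt code; `latin1ToUtf8` is a made-up name for illustration.

```cpp
#include <string>

// Hypothetical helper: re-encode an ISO-8859-1 (Latin-1) byte string as UTF-8.
// Bytes below 0x80 are ASCII and are copied unchanged; bytes 0x80..0xFF map
// to U+0080..U+00FF, which UTF-8 encodes as two bytes (110000xx 10xxxxxx).
std::string latin1ToUtf8(const std::string &latin1)
{
    std::string utf8;
    utf8.reserve(latin1.size() * 2);
    for (std::string::size_type i = 0; i < latin1.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(latin1[i]);
        if (byte < 0x80) {
            utf8 += static_cast<char>(byte);
        } else {
            utf8 += static_cast<char>(0xC0 | (byte >> 6));
            utf8 += static_cast<char>(0x80 | (byte & 0x3F));
        }
    }
    return utf8;
}
```

Because every possible byte has a mapping, this conversion never reports an error. That is convenient here, but it also means a successful Latin-1 decode says nothing about whether the sender really used Latin-1.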
Subject: Re: [Kopete-devel] UTF8 and other cause XML parsing errors, only in IRC conversations On January 23, 2004 4:20 pm, Martijn Klingens wrote: > I want first and foremost to have accurate and autodetected conversion. This is impossible :P > The user's setting should be TRIED first, but not FORCED. If it is broken > utf8 we know it will break the parser, it makes no sense to obey the user > at all. So you mean, if the user chose UTF8 then we check isUTF8, and if it is not, then replace with ? characters wherever needed? I would go for that I guess. > - make sure that whenever Utf8 is being used isUtf8() is called first and > if it fails forget about using Utf8 No. See, this is the problem. You are assuming that you should try UTF then if UTF fails then you'll be able to guess something. This is backwards. UTF is the only codec that gives no failure, also it's the only one we have to scan over *twice (isUTF8() and then conversion ) so its the most expensive. And on top of all this, hardly anyone uses it. So it's most error prone, most expensive, and no one uses it. It *definitly* should be the last check. *UNLESS* the user chose it. If the user chose UTF, then attempt isUTF8, if that fails, then *maybe* try latin1, if that fails, just clean up wherever possible. There's no point trying local8bit, it's bound to fail. > Not really. Generally contact lists tend to consist of people from mostly > the same country. Eh huh? Not from my experience... I have people from here, from Europe, from Asia. Anyways, contact lists don't really have much to do with it, especially on IRC. Anyone could message you from anywhere out of the blue. 
My new proposed ordering in pseudo code: if( userCodec == QTextCodec::codecForName("utf") ) { if( isUTF8( string ) ) return tryCodec->decode( string ) else { try QTextCodec::codecForName("latin1")->decode( string ) if( success ) { return } else { return cleanString( string ); } } } else { if( userCodec && tryCodec->decode( string ) return; else { try QTextCodec::codecForName("latin1")->decode( string ) if( success ) { return } else { return cleanString( string ); } } } .. where cleanString strips all non-UTF-8 decodable characters from the string somehow. Subject: Re: [Kopete-devel] UTF8 and other cause XML parsing errors, only in IRC conversations fredagen den 23 januari 2004 21.40 skrev Jason Keirstead: > On January 23, 2004 4:20 pm, Martijn Klingens wrote: [....] > > - make sure that whenever Utf8 is being used isUtf8() is called first and > > if it fails forget about using Utf8 > No. See, this is the problem. You are assuming that you should try UTF then if > UTF fails then you'll be able to guess something. > This is backwards. UTF is the only codec that gives no failure, also it's the > only one we have to scan over *twice (isUTF8() and then conversion ) so its > the most expensive. And on top of all this, hardly anyone uses it. So it's > most error prone, most expensive, and no one uses it. It *definitly* should > be the last check. I don't know KDE/QT that well. How come utf cannot fail. Utf-8 is designed so that it is unlikely that a non-utf-8 string can recognized as utf-8. If the UTF-decoder cannot fail, then what does it do when it encounters an illegal sequence? On the other hand. How could an attempt to decode a string byes as IsoLatin1 fail? A human user can say that something isn't latin1, but the computer cannot unless we add a user specified blacklist, IMHO overkill. > > Not really. Generally contact lists tend to consist of people from mostly > > the same country. > > Eh huh? Not from my experience... 
> I have people from here, from Europe, from Asia. Anyways, contact lists
> don't really have much to do with it, especially on IRC. Anyone could
> message you from anywhere out of the blue.

I suppose experience can vary here. To me it's either ISO Latin1 or ASCII that comes over the wire. With ISO Latin1 it's usually the same country, or countries that use the same character set. Nevertheless, the future will become more and more utf8:ized.

> My new proposed ordering in pseudo code:

Sounds reasonable. Perhaps Latin9 (ISO-8859-15) should be attempted instead of Latin1. The difference is that a few characters that were "never" used were replaced by some that actually are used.

-- robin

Subject: Re: [Kopete-devel] UTF8 and other cause XML parsing errors, only in IRC conversations

On Monday 26 January 2004 15:08, Jason Keirstead wrote:
> > I want first and foremost to have accurate and autodetected conversion.
>
> This is impossible :P

True. But when using the right order (User Pref, Local Encoding, UTF-8, Latin1) at least you can make sure the chances of it failing are minimized.

> > The user's setting should be TRIED first, but not FORCED. If it is broken
> > utf8 we know it will break the parser, it makes no sense to obey the user
> > at all.
>
> So you mean, if the user chose UTF8 then we check isUTF8, and if it is
> not, then replace with ? characters wherever needed?

Close. First, I would use QChar::replacement, like Thiago mentioned, instead of '?'. Second, instead of doing the replacement if isUtf8 fails, I would use Thiago's order, which would mean that after a Utf-8 failure latin1 is used.

Arguably we would do better to try Utf-8 BEFORE the local encoding, because a utf8 failure can be detected, while a local-encoding failure cannot be detected in all cases (like when local is Latin1).

> No. See, this is the problem. You are assuming that you should try UTF then
> if UTF fails then you'll be able to guess something.

Exactly.

> This is backwards. UTF is the only codec that gives no failure,

Yes.
HOWEVER, Latin1 is even worse, because it CANNOT FAIL. Whatever you feed in as Latin1, it is BY DEFINITION LEGAL. Thus, you can't do Utf-8 after Latin1, it _HAS_ to be done before Latin1.

> also it's the only one we have to scan over *twice (isUTF8() and then
> conversion ) so its the most expensive.

Like Thiago said, isUtf8() doesn't copy data and should be fairly inexpensive. Also, I would like to see figures for the additional load; I think it is in fact pretty much negligible for most uses. After all, QString is one of the most heavily optimized Qt classes. Do you have any KCacheGrind logs proving me wrong?

> And on top of all this, hardly anyone uses it.

More and more people are starting to use it, especially with ICQ, which also needs this code.

And, again, Utf-8 HAS to be checked before Latin1, because after trying Latin1 you cannot POSSIBLY get a failure. So whether it "should" be the last check for performance reasons or not, it CANNOT be the last check, no matter how much you'd want it.

> There's no point trying local8bit, it's bound to fail.

This too is wrong for most non-western locales. In fact, with ICQ in Russia it would be VERY IMPORTANT to have.

> Eh huh? Not from my experience... I have people from here, from Europe,
> from Asia. Anyways, contact lists don't really have much to do with it,
> especially on IRC. Anyone could message you from anywhere out of the blue.

Try thinking outside the IRC box :)

(With IRC I tend to agree with the people on channels being diverse, although many people I know are only on Dutch-language IRC channels, and almost all people I know have exclusively Dutch people on their contact list. We open source people are quite a different breed from the average user base.)

Subject: Re: [Kopete-devel] UTF8 and other cause XML parsing errors, only in IRC conversations

> > There's no point trying local8bit, it's bound to fail.
>
> This too is wrong for most non-western locales.
> In fact, with ICQ in Russia it would be VERY IMPORTANT to have.

Why? It would only work in the situation where you're talking Russia -> Russia. And in that case, the sender should have the Russian codec set in the message anyway, so it would not be a problem?

> Try thinking outside the IRC box :) (With IRC I tend to agree with the
> people on channels being diverse, although many people I know are only on
> Dutch language IRC channels, and almost all people I know have exclusively
> Dutch people on their contact list. We open source people are quite a
> different breed from the average user base.)

While I do agree in a North American sense, since we all speak the same language, I would think that your assumption that most people in European countries have contacts who all use the same codec is pretty incorrect...

Subject: Re: [Kopete-devel] UTF8 and other cause XML parsing errors, only in IRC conversations

On Monday 26 January 2004 21:07, Jason Keirstead wrote:
> Why? It would only work in the situation where you're talking Russia ->
> Russia.

Yes, by far the most common case is talking to fellow countrymen :)

> And in that case, the sender should have the russian codec set
> anyway in the message so it would not be a problem?

Because with ICQ quite a lot of clients don't provide an encoding at all, in which case ICQ has to do exactly the same guesswork as IRC? Don't you read any of the OSCAR/ICQ-related mails on this list, nor my several mentions of ICQ in this very bug report? ;)

> While I do agree in a North American sense, since we all speak the same
> language, I would think that you assuming that most people in European
> countries having people using the same codec on their list would be pretty
> incorrect...

It's pretty correct, actually, in almost all eastern European and lots of Asian countries. In Western Europe it's a bit of a mixture of Latin1, the default Windows encoding (whose name I don't know), and a very strong rise of Utf-8.
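The ordering argued for in this thread (a strict UTF-8 check first, because its failure is detectable, and Latin1 last, because it can never fail) can be sketched in plain C++ without Qt. This is only an illustration under stated assumptions: `isStrictUtf8`, `latin1ToUtf8` and `toParserSafeUtf8` are hypothetical names, not Kopete, KDE or Qt functions, and the validator deliberately accepts control characters such as the ones IRC colour codes use.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not a Qt/KDE API): strict UTF-8 validation.
// Rejects stray continuation bytes, truncated sequences, overlong forms,
// UTF-16 surrogates and code points above U+10FFFF. Control characters
// below 0x80 are accepted.
bool isStrictUtf8(const std::string& bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead < 0x80) { ++i; continue; }      // plain ASCII byte
        std::size_t extra = 0;                   // continuation bytes expected
        std::uint32_t cp = 0;                    // decoded code point
        std::uint32_t minCp = 0;                 // smallest value for this length
        if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; minCp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; minCp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; minCp = 0x10000; }
        else return false;                       // invalid lead byte
        if (extra > n - i - 1) return false;     // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < minCp) return false;                    // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate range
        if (cp > 0x10FFFF) return false;                 // outside Unicode
        i += extra + 1;
    }
    return true;
}

// Latin1 cannot fail: every byte maps to the code point of the same value,
// which needs at most two bytes in UTF-8.
std::string latin1ToUtf8(const std::string& bytes) {
    std::string out;
    out.reserve(bytes.size() * 2);
    for (char ch : bytes) {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}

// Detectable check first, never-failing fallback last.
std::string toParserSafeUtf8(const std::string& bytes) {
    return isStrictUtf8(bytes) ? bytes : latin1ToUtf8(bytes);
}
```

With this sketch, the Latin1 bytes for 'öä' (0xF6 0xE4) fail the strict check and are re-encoded, so whatever reaches a strict XML parser is always well-formed UTF-8. A user-selected codec or the local encoding would slot in between the two steps; where exactly is the point being debated here.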
Subject: Re: [Kopete-devel] UTF8 and other cause XML parsing errors, only in IRC conversations

On January 26, 2004 4:13 pm, Martijn Klingens wrote:
> > Why? It would only work in the situation where you're talking Russia ->
> > Russia.
>
> Yes, by far the most common case is talking to fellow countrymen :)

I know, but who is to guarantee that they're even using the same codec? You could be using Cyrillic and they could be using UTF?

I just have a very strong opinion that local8bit has absolutely no realistic relationship to the sender's codec, and it shouldn't influence our guessing, especially when you consider that a string could be valid in multiple encodings.

I would rather display a warning message saying "we don't know the codec" than try to guess the codec and be right 90% of the time, but 10% of the time end up displaying total garbage. Displaying garbage even once in a while is not an option, it makes Kopete look horrible; it's better to admit that we just don't know for sure than to make a possibly incorrect guess.

> I don't know KDE/Qt that well. How come UTF cannot fail? UTF-8 is designed
> so that it is unlikely that a non-UTF-8 string can be recognized as UTF-8. If
> the UTF decoder cannot fail, then what does it do when it encounters an
> illegal sequence?

That's actually the very reason we're getting this problem, Robin. In Qt 3.2.x, TrollTech introduced a modification to its UTF-8 decoder in response to a bug report from us. The original problem was that files whose names or paths were not encodable in the user's selected locale could not be opened by KDE applications nor renamed in Konqueror.

We had proposed a solution, but TrollTech chose instead to accept any input as valid UTF-8: when it sees an invalid sequence, it encodes the bytes as a pair of UTF-16 surrogates. The decoder then restores the original byte. This renders the operation ToUTF8(FromUTF8(any_string)) == any_string true in every case.
The side-effect: Latin1 and other kinds of strings are accepted in Qt as valid UTF-8, but other programs don't accept them (our XML parser being one of those).

> Perhaps Latin9 (ISO-8859-15) should be attempted instead of Latin1. The
> difference is that a few characters that were "never" used were replaced by
> some that actually are used.

That's ok in principle, but not so from the technical point of view. The Latin1-to-Unicode conversion is very simple and fast, since all of Latin1's 256 codepoints map 1:1 to Unicode's first 256 codepoints. For Latin9 and any other encoding, a non-trivial conversion through table lookups must be performed.

> if( userCodec == QTextCodec::codecForName("utf") )

Please don't write that. That requires a codec lookup internally by QTextCodec. Instead, use userCodec->mibEnum() == 106 to detect the UTF-8 encoder.

A couple more opinions from me:

- trying UTF-8 before the user's locale: Makes sense, since we may catch UTF-8 being used. The probability of someone writing valid text in another encoding and it being valid UTF-8 is very low.
- the user-selected codec fails decoding: Decode as Latin1, but let the user know about this fact (a non-intrusive warning or a "bug" icon like Konqueror's for JavaScript errors).

Important: KStringHandler::isUtf8 rejects control characters, including ASCII 3 used by mIRC-colouring in IRC.

*** Bug 73877 has been marked as a duplicate of this bug. ***

*** Bug 73625 has been marked as a duplicate of this bug. ***

For me, setting the encoding for the channel to iso8859-15 fixes the problem with characters like 'öä' in normal messages, but when those characters appear in a quit message, I still get the XML parsing error.

Another problem is that Kopete doesn't store the encoding setting, so I need to set it each time I restart Kopete.
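For readers comparing the Latin1 and iso8859-15 (Latin9) settings mentioned in this thread, here is a small standalone C++ sketch of the difference. It is an illustration only, not Qt's codec implementation: `decodeLatin1`, `latin9ToCodePoint` and `decodeLatin9` are hypothetical names. Latin1 bytes are their own Unicode code points, while Latin9 replaces eight byte values, so it needs a lookup step.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Latin1 (ISO-8859-1): each byte value is the Unicode code point itself.
std::vector<std::uint32_t> decodeLatin1(const std::string& bytes) {
    std::vector<std::uint32_t> out;
    out.reserve(bytes.size());
    for (char ch : bytes) {
        out.push_back(static_cast<unsigned char>(ch));
    }
    return out;
}

// Latin9 (ISO-8859-15): identical to Latin1 except for eight byte values.
std::uint32_t latin9ToCodePoint(unsigned char c) {
    switch (c) {
        case 0xA4: return 0x20AC;  // EURO SIGN
        case 0xA6: return 0x0160;  // LATIN CAPITAL LETTER S WITH CARON
        case 0xA8: return 0x0161;  // LATIN SMALL LETTER S WITH CARON
        case 0xB4: return 0x017D;  // LATIN CAPITAL LETTER Z WITH CARON
        case 0xB8: return 0x017E;  // LATIN SMALL LETTER Z WITH CARON
        case 0xBC: return 0x0152;  // LATIN CAPITAL LIGATURE OE
        case 0xBD: return 0x0153;  // LATIN SMALL LIGATURE OE
        case 0xBE: return 0x0178;  // LATIN CAPITAL LETTER Y WITH DIAERESIS
        default:   return c;       // everything else matches Latin1
    }
}

std::vector<std::uint32_t> decodeLatin9(const std::string& bytes) {
    std::vector<std::uint32_t> out;
    out.reserve(bytes.size());
    for (char ch : bytes) {
        out.push_back(latin9ToCodePoint(static_cast<unsigned char>(ch)));
    }
    return out;
}
```

Characters such as 'ö' (0xF6) and 'ä' (0xE4) decode identically under both, which is consistent with either setting displaying them correctly; only bytes like 0xA4 (the euro sign in Latin9) differ. Neither decoder can report a failure, so neither can be used to detect a wrong guess.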