55355 – Implement a working encoding detection

Status:	RESOLVED FIXED

Alias:	None

Product:	kate
Classification:	Applications
Component:	general
Version:	unspecified
Platform:	Debian testing Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	KWrite Developers

Duplicates (7):	64139 85686 93466 96149 97660 110707 113654

Reported:	2003-02-28 22:17 UTC by Tom Schüler
Modified:	2010-06-03 22:51 UTC
CC List:	17 users

Description Tom Schüler 2003-02-28 22:17:57 UTC

Version:            (using KDE 3.1)
Installed from:    Debian testing/unstable Packages

Whenever I open a file that contains UTF-8 encoded characters, all the non-ASCII characters are displayed wrong because Kate assumes that the text is encoded in ISO-8859-15.

So it would be great if Kate could automatically detect if a text is encoded in UTF-8 or ISO-8859-1.

I don't know how the Windows guys solved it but in MS-Notepad it works.

Comment 1 Nicolas Goutte 2003-03-01 02:39:35 UTC

Are you really sure that MS Notepad can handle UTF-8 automatically? 
 
As far as I know, MS used UTF-16, not UTF-8. Finding UTF-16 is easier as mostly there 
is a Byte Order Mark at the start. But this does not exist in UTF-8. 
 
As you have probably seen, typical (western European) UTF-8 files are also valid 
ISO-8859-15 files from the point of view of the data. Of course, a user sees that it is 
wrong, but how could a computer see it? 
 
Have a nice day/evening/night!

Comment 2 Tom Schüler 2003-03-01 13:44:47 UTC

I did some further investigation:

When I compared a UTF-8 file produced by MS Notepad to a Kate UTF-8 file with a
hex editor I realized that at the beginning of the Notepad file there is a "EF
BB BF" before *my* letters begin. Maybe that makes Notepad know that this is a
UTF-8 file. I suppose this would be non-standard behaviour.

So if there's no proper way to detect UTF-8 files, forget about my wish.

Comment 3 Nicolas Goutte 2003-03-01 14:18:40 UTC

Subject: Re:  UTF-8 encoding should be detected automatically

EF BB BF is the Byte Order Mark coded in UTF-8.

As there is no big-endian-versus-little-endian problem in UTF-8, it is barely 
needed in UTF-8 and so nobody uses it.

Have a nice day/evening/night!

On Saturday 01 March 2003 13:44, Tom "Schüler" wrote:
(...)
> ------- Additional Comments From tom@lasbo.de  2003-03-01 13:44 -------
> I did some further investigation:
>
> When I compared a UTF-8 file produced by MS Notepad to a Kate UTF-8 file
> with a hex editor I realized that at the beginning of the Notepad file
> there is a "EF BB BF" before *my* letters begin. Maybe that makes Notepad
> know that this is a UTF-8 file. I suppose this would be non-standard
> behaviour.
>
> So when there's no proper way to dectect UTF-8 files forget about my wish.

Comment 4 Nicolas Goutte 2003-03-01 23:28:14 UTC

On the Unicode site, there is a FAQ : UTF & Byte Order Mark 
http://www.unicode.org/faq/utf_bom.html 
 
Have a nice day/evening/night!

Comment 5 Christoph Cullmann 2003-03-14 23:17:07 UTC

In the end a sane autodetection is not possible, or am I wrong?

Comment 6 Nicolas Goutte 2003-03-14 23:52:55 UTC

Subject: Re:  UTF-8 encoding should be detected automatically

Personally, without a Byte Order Mark, I do not see any method to detect 
UTF-8, as most bytes would be valid for other encodings too. 

However if the text starts with a Byte Order Mark, it could be indeed done. So 
perhaps this wish could be changed to: find a typical Byte Order Mark (UTF-8, 
UTF-16 little endian, UTF-16 big endian) and if found, load the file in the 
corresponding UTF encoding (perhaps after having asked the user).
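
A minimal Python sketch of the Byte Order Mark check proposed here, covering only the three marks named above. The function name `sniff_bom` is hypothetical, and this is an illustration, not code from Kate or Qt.

```python
import codecs

# The three Byte Order Marks named in the comment, with the codec each implies.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),         # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"), # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"), # FE FF
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no known BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


if __name__ == "__main__":
    raw = b"\xef\xbb\xbfSch\xc3\xbcler"
    encoding, skip = sniff_bom(raw)
    if encoding is not None:
        # Drop the BOM before decoding so it does not show up as a character.
        print(encoding, raw[skip:].decode(encoding))
```

A file without a BOM yields `(None, 0)`, in which case the caller would fall back to whatever encoding the user selected.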

Have a nice day/evening/night!

On Friday 14 March 2003 23:17, Christoph Cullmann wrote:
> ------- Additional Comments From crossfire@babylon2k.de  2003-03-14 23:17 -------
> in the end a sane autodetection is not possible, or I am wrong ?

Comment 7 Jim Dabell 2003-06-04 18:42:11 UTC

kate appears to take notice of your locale, so export LC_CTYPE=en_GB.UTF-8
before starting kate from a command-line does the trick, assuming you are using
British English (substitute your own locale in the obvious place otherwise). 
The methods of setting this permanently vary amongst operating systems; I
just added it to ~/.bashrc

Still, there should be a way of setting a default from within kate.

Comment 8 Christoph Cullmann 2003-07-21 21:34:02 UTC

There is no possibility to set a Kate-wide default independent of the KDE default. No 
autodetection is planned, it would only be confusing.

Comment 9 Nicolas Goutte 2003-08-21 11:00:15 UTC

By chance, I have found out that, unlike what I have written, RFC2279 claims   
that UTF-8 can be detected by a "simple algorithm". However, it does not give   
any reference to such an algorithm and I have still no idea how to do it.  
  
I am re-opening this bug to renew the discussion on whether this should be added to 
Kate or not. 
 
(If there is really such an algorithm, I am interested in putting it in 
KWord's plain text import filter.) 
   
RFC2279: http://www.ietf.org/rfc/rfc2279   
   
Have a nice day!

Comment 10 Allan Sandfeld 2003-08-21 11:34:12 UTC

RFC2279 is pretty clear on how to make the algorithm, but it is not entirely 
safe.  
 
You simply scan the file for non-ASCII bytes (>127), and check if 
they form acceptable UTF-8 (a lead byte in one of the ranges 192-223, 224-239, 
240-247, 248-251 or 252-253, followed by continuation bytes in the range 
128-191). If the entire file _can_ be interpreted as UTF-8 we then guess that 
it _is_ UTF-8.
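
The scan described above could look roughly like the following Python sketch. It follows the RFC 2279 layout (sequences of up to six bytes) and checks structure only: overlong forms and surrogates are not rejected, so it is more permissive than a strict modern decoder. The name `looks_like_utf8` is hypothetical.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Structural check: each lead byte must be followed by the right number
    of continuation bytes (128-191). Pure ASCII input also returns True."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:              # plain ASCII byte
            need = 0
        elif 0xC0 <= b <= 0xDF:   # 192-223: two-byte sequence
            need = 1
        elif 0xE0 <= b <= 0xEF:   # 224-239: three-byte sequence
            need = 2
        elif 0xF0 <= b <= 0xF7:   # 240-247: four-byte sequence
            need = 3
        elif 0xF8 <= b <= 0xFB:   # 248-251: five-byte sequence (RFC 2279 only)
            need = 4
        elif 0xFC <= b <= 0xFD:   # 252-253: six-byte sequence (RFC 2279 only)
            need = 5
        else:                     # stray continuation byte, or 254/255
            return False
        tail = data[i + 1:i + 1 + need]
        if len(tail) != need or any(not 0x80 <= c <= 0xBF for c in tail):
            return False
        i += 1 + need
    return True


if __name__ == "__main__":
    print(looks_like_utf8("Schüler".encode("utf-8")))
    print(looks_like_utf8("Schüler".encode("iso-8859-1")))
```

Because a pure ASCII file also passes, a caller would still need to decide separately what to do when no byte above 127 occurs at all.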

Comment 11 Christian Loose 2003-08-21 13:43:51 UTC

We have a similar problem in Cervisia when we save a merged file in our resolve 
dialog (s. kdesdk/cervisia/resolvedlg.cpp). We also need to know, if the 
resulting file should be UTF-8. 

Somebody worked around this problem with a more than ugly hack 
(s. DetectCodec() in same source file). I would love to see a general solution 
to this problem.

Christian

Comment 12 Allan Sandfeld 2003-08-21 18:22:55 UTC

I just wrote a small program that can test this, but out of curiosity I checked 
the Qt-source and they already have something like it (Doh!). 
 
Just use QTextCodec::codecForContent(const char*, int) or call  
QUtf8Codec::heuristicContentMatch(const char*, int) directly and see if 
you get a positive return value.

Comment 13 Christoph Cullmann 2003-09-04 14:45:20 UTC

I dislike that guessing, and won't use such a heuristic which may fail for such a 
critical task. Atm at least the user can be sure that the normal default 
encoding is taken by kate or the encoding he has selected, but just guessing 
that a perhaps just english plaintext file is utf-8 is no good idea (the 
codecForContent is nice, but not safe, I have tested that too before)

Comment 14 Nicolas Goutte 2003-09-04 15:05:27 UTC

For KWord's plain text import filter, I just plan to change the default in the 
dialog box, so that the user can override it. 
 
Of course for Kate it is not possible, as the encoding selection is in the 
file dialog box. But perhaps something like a dialog warning the user: "The 
file seems to be in UTF-8 but you have chosen <encoding>. Do you wish to load 
in UTF-8?" Yes would mean UTF-8, no the previously chosen encoding, cancel 
would return to the file dialog or to Kate. 
 
(Something similar could be done if a Byte Order Mark is found. But be careful 
that Qt overrides anyway with a UTF-16 BOM.) 
 
Have a nice day!

Comment 15 Ismail Donmez 2003-09-04 16:31:45 UTC

kgpg has a nice checkIfUnicode function that may be worth looking at.

P.S: Function name might be checkIsUnicode , I am not sure :)

Comment 16 Allan Sandfeld 2003-09-04 17:37:57 UTC

Subject: Re: 

On Thursday 04 September 2003 14:45, you wrote:
> ------- Additional Comments From cullmann@kde.org  2003-09-04 14:45 -------
> I dislike that guessing, won't use such heuristic which may fail for such
> an critical task, atm at least the user can be sure that the normal default
> encoding is taken by kate or the encoding he has selected, but just
> guessing that a perhaps just english plaintext file is utf-8 is no good
> idea (the codecForContent is nice, but not save, have tested that too
> before)

Using UTF-8 for a completely English text is totally safe, since the ASCII part 
is the same.

Comment 17 Jim Dabell 2003-09-04 17:52:08 UTC

Using UTF-8 for US-ASCII documents is *not* safe.  Scenario:

1.  Somebody has their environment set up to use ISO-8859-1 or ISO-8859-15, as I
would guess many English people do.

2.  They open a document containing only US-ASCII characters.

3.  This is treated as UTF-8.

4.  They input a character outside of US-ASCII (£ is a common one).

5.  They save their document as UTF-8.

This means that, even though they set up their environment to use something
else, kate would go ahead and ignore their wishes, saving as UTF-8.

Probably the easiest, safest, but least efficient way of doing things properly
would be to convert documents to UTF-8 or UTF-16 upon opening, and when saving,
check to see if the document can be represented with the user's chosen character
encoding.  If so, then kate should convert to that and save.  If not, then it
should save as UTF-8 or UTF-16.
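
The save-time rule in the previous paragraph can be sketched in a few lines of Python. The function name `encode_for_save` and its parameters are hypothetical; a real editor would presumably also want to tell the user when the fallback is taken.

```python
def encode_for_save(text: str, preferred: str, fallback: str = "utf-8"):
    """Encode with the user's chosen encoding if every character fits,
    otherwise fall back to a Unicode encoding. Returns (bytes, encoding_used)."""
    try:
        return text.encode(preferred), preferred
    except UnicodeEncodeError:
        # At least one character cannot be represented in the preferred encoding.
        return text.encode(fallback), fallback


if __name__ == "__main__":
    # The pound sign exists in ISO-8859-1, the euro sign does not.
    print(encode_for_save("price: £5", "iso-8859-1")[1])
    print(encode_for_save("price: €5", "iso-8859-1")[1])
```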

The only problem I see with that is when you are opening documents encoded in
something other than your chosen character encoding.  For example:

ISO-8859-1 document.
UTF-8 environment
Open
Save
Result: UTF-8 document.

I don't think that's particularly bad behaviour, but I'm sure some people would
disagree.

Comment 18 Nicolas Goutte 2003-09-04 18:56:54 UTC

But Kate is converting documents to UTF-16 (QString) and this is exactly the problem. Users protest that their files are not loaded correctly.

As for such automatism at save, it is not worth anything if Kate cannot find that the document is UTF-8 at load, which is exactly the content of this bug.

Have a nice day!

Comment 19 Christoph Cullmann 2003-09-04 22:10:20 UTC

Subject: Re:  missing encoding detection

> ------- Additional Comments From nicolasg@snafu.de  2003-09-04 18:56
> ------- But Kate is converting documents to UTF-16 (QString) and this is
> exactly the problem. Users protest that their files are not loaded
> correctly.  As for such automatism at save they are not worth anything if
> Kate cannot find that the document is UTF-8 at load, which is exactly the
> content of this bug.  Have a nice day!
Then please provide me with a non-disturbing and safe way to detect it. As 
there is normally no Unicode byte order mark, and you can't be sure that a 
file detected as UTF-8 really is UTF-8 (that the current file only contains 
chars < 127 means nothing: it can be Latin-1 English rather than UTF-8 and, 
as someone said before, if you then type some non-English chars and save the 
file you will break it), I see no way to do that in a useful way. People who 
use UTF-8 should set Kate's default encoding to UTF-8 or choose it in the 
file dialog, as any guessing is just messy, and having popups over and over 
again for each source file asking "uhh, that could be utf-8, but I am not 
sure" is not that nice either.

cu
Christoph

Comment 20 Nicolas Goutte 2003-09-04 23:25:45 UTC

I do not think that a file with ASCII-like characters should trigger the 
dialog. (Of course, this probably means not to use Qt's function but to 
program a new one.) 
 
But even if your environment is not UTF-8, you get confronted with UTF-8, like 
the .desktop files. Of course, .desktop files get automatically corrected by 
scripty if the translation has been corrupted. But that is the sort of 
case where one should tell the user: this is probably a UTF-8 file! 
 
So for this problem an UTF-8 file is a file having at least one correct UTF-8 
multi-byte sequence. If there is not any high-order bit set, there will be no 
dialog. Incorrect UTF-8 sequence, no dialog either. The user has chosen UTF-8 
in the file dialog, no further dialog either. 
 
Of course having a valid UTF-8 file (with high bits set) does not always mean 
that it is UTF-8. There is always a residual risk. Therefore the dialog. 
 
But it is not a dialog that will happen often. (Just test: take any desktop 
file and read it as ISO-8859-15. Those are sequences of characters that you 
will probably never find in a real-life text.) 
 
Of course, the biggest drawback is that you have to test the whole file. That is 
indeed pretty constraining.

Comment 21 Allan Sandfeld 2003-09-05 02:49:31 UTC

Subject: Re:  missing encoding detection

On Thursday 04 September 2003 22:10, Christoph Cullmann wrote:
>
> than provide me please with an non-disturbing and save way to detect it,
> but as there are normally no unicode byte order mark and you can't be sure
> that an as utf-8 detected file is really utf-8 (just because the current
> file only has < 127 chars inside means really nothing, it can be latin1
> english, but not utf-8, like someone said before, if you type then some
> non-english chars and save the file you will break it), I see no way to do
> that in a usefull way. People which use utf-8 should set kate's default
> encoding to utf-8 or chose it in the filedialog, as each guessing around is
> just messy and having popups over and over again for each source file
> asking "uhh, that could be utf-8, but I am not sure" are not that nice,
> too.
>
That's easy. I wrote an algorithm to detect UTF-8 before I discovered that Qt 
already had one. It is really easy to divide cases in three outcomes:
1. ASCII only, no problem = stick with default
2. >127 values, and not UTF-8 like, choose non UTF-8 locale
3. >127 values, and matches UTF-8, ask user to accept UTF-8 conversion

Especially when it comes to latin-1 or latin-0 there is an exactly 0% chance 
of misdetecting UTF-8; the only problems arise in languages using only 
non-Latin (non-English) characters. (Because a UTF-8 latin1 character is always 
represented as first a value from 192-223 and then one from 128-191, and that 
just doesn't happen every time otherwise.)

I can post my small test-function with three outcomes if anyone likes it.
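
For illustration only (this is not the test function offered above), the three outcomes could be sketched in Python like this. It leans on Python's strict built-in UTF-8 decoder instead of a hand-written scan; the name `classify` and the returned labels are hypothetical.

```python
def classify(data: bytes) -> str:
    """Sort a file's bytes into the three outcomes listed above."""
    if all(b < 0x80 for b in data):
        return "ascii"        # outcome 1: stick with the default encoding
    try:
        data.decode("utf-8")  # strict decoding, raises on malformed input
    except UnicodeDecodeError:
        return "not-utf8"     # outcome 2: use the non-UTF-8 locale encoding
    return "maybe-utf8"       # outcome 3: ask the user to accept UTF-8


if __name__ == "__main__":
    for sample in (b"hello", "Schüler".encode("utf-8"), "Schüler".encode("iso-8859-1")):
        print(sample, classify(sample))
```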

Allan

Comment 22 Joseph Reagle 2003-12-05 15:49:24 UTC

If it's an XML file, one could at least look to the encoding declaration and these heuristics:
  http://www.w3.org/TR/xml11/#sec-guessing

Comment 23 Christoph Cullmann 2003-12-17 23:55:30 UTC

just relabel the bug

Comment 24 Christoph Cullmann 2003-12-17 23:55:46 UTC

*** Bug 64139 has been marked as a duplicate of this bug. ***

Comment 25 Thiago Macieira 2003-12-18 16:42:36 UTC

There's a function in libkdecore that can be used to detect UTF-8 and another to load text "from locale or UTF-8". I don't remember in what class/namespace they are, but the function is called isUtf8.

Also note that Qt's UTF-8 codec loads ANYTHING without errors. So, for it, any content is valid UTF-8.

There's also an encoding guesser in khtml. You may want to reuse that.

As for English text, all ISO-2022-compatible encodings would match, since they all have ASCII as the lower part of the table.

Comment 26 Joao S . O . Bueno 2004-06-28 15:41:58 UTC

At least for Python files, the encoding could be read from the declaration on the first or second line of the file, if it is present. It is now deprecated practice not to have the encoding explicitly declared there if the file contains any characters outside the 32-127 range. It will become illegal as of Python 2.4.
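
As a rough sketch of how such a declaration could be read, the following Python snippet applies a regular expression in the style of PEP 263 to the first two lines of the file. The function name `declared_python_encoding` is hypothetical, and the declared name is returned as written, without checking that a codec by that name exists.

```python
import re

# Matches declarations such as "# -*- coding: utf-8 -*-" or
# "# vim: set fileencoding=latin-1 :" on a comment line.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_python_encoding(data: bytes):
    """Return the encoding named on line 1 or 2, or None if there is none."""
    for line in data.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return None


if __name__ == "__main__":
    source = b"#!/usr/bin/python\n# -*- coding: iso-8859-1 -*-\nprint('hi')\n"
    print(declared_python_encoding(source))
```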

Comment 27 Kde 2004-09-08 21:40:57 UTC

Kate should at least check the first node of an XML file, since it usually contains the encoding used.
E.g. of the first line of a file:
<?xml version="1.0" encoding="ISO-8859-1"?>

This is currently not honored, and it is very annoying to have to reset the encoding every time I open a file.
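
A small Python sketch of reading that declaration, assuming the file is in an ASCII-compatible encoding with no Byte Order Mark in front (UTF-16 XML files would need a separate check first). The name `declared_xml_encoding` and the 200-byte window are arbitrary choices for the example.

```python
import re

# The XML declaration must be the very first thing in the file.
XML_DECL_RE = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_xml_encoding(data: bytes):
    """Return the encoding named in the XML declaration, or None."""
    match = XML_DECL_RE.match(data[:200])
    return match.group(1).decode("ascii") if match else None


if __name__ == "__main__":
    print(declared_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?>\n<doc/>'))
```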

Comment 28 Wilbert Berendsen 2004-09-09 07:15:02 UTC

Yes that would be very nice, also for HTML files, which often carry a meta tag with the correct encoding.

Comment 29 Christoph Cullmann 2005-03-24 12:18:08 UTC

*** Bug 97660 has been marked as a duplicate of this bug. ***

Comment 30 Ismail Donmez 2005-03-24 12:20:24 UTC

For a working unicode detection check kdeextragear-2/konversation/konversation/unicode.cpp

Comment 31 Christoph Cullmann 2005-03-24 12:20:41 UTC

*** Bug 85686 has been marked as a duplicate of this bug. ***

Comment 32 Christoph Cullmann 2005-03-24 12:43:12 UTC

*** Bug 96149 has been marked as a duplicate of this bug. ***

Comment 33 Pavel Simerda 2005-03-24 17:11:35 UTC

I believe a document that *could be* utf-8 is 99.9% either (7bit) ascii or real utf-8.
You can examine my example of detection code at http://www.gyarab.cz/users/pavel.simerda/detectencoding.c

It is much simpler than detection of utf-16 without BOM.

I'd be very happy to see this feature in Kate. It could then be extended to other unicode and non-unicode encodings.

Comment 34 Gilles Schintgen 2005-03-25 20:51:20 UTC

If this is implemented, please don't forget LaTeX users ;-)
In LaTeX the encoding is specified with a line like 
 \usepackage[utf8]{inputenc} 
 or 
 \usepackage[latin1]{inputenc} 
 etc. 
Please have a look at http://home.imf.au.dk/burner/Manualer/TeX/inputenc.pdf for a complete list.
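
A hedged Python sketch of how such a line could be picked up. The mapping covers only a handful of inputenc option names and is not the complete list from the document linked above; `declared_latex_encoding` is a hypothetical name, and lines where a `%` comment character comes first are skipped.

```python
import re

# \usepackage[<option>]{inputenc}, not preceded by a % on the same line.
INPUTENC_RE = re.compile(
    rb"^[^%\n]*\\usepackage\[([^\]]+)\]\{inputenc\}", re.MULTILINE
)

# Partial mapping from inputenc option names to Python codec names.
INPUTENC_TO_CODEC = {
    "utf8": "utf-8",
    "latin1": "iso-8859-1",
    "latin2": "iso-8859-2",
    "latin9": "iso-8859-15",
    "cp1252": "cp1252",
}


def declared_latex_encoding(data: bytes):
    """Return a codec name for the inputenc option, or None if absent or unknown."""
    match = INPUTENC_RE.search(data)
    if not match:
        return None
    option = match.group(1).decode("ascii", "replace").strip()
    return INPUTENC_TO_CODEC.get(option)


if __name__ == "__main__":
    tex = b"\\documentclass{article}\n\\usepackage[latin1]{inputenc}\n"
    print(declared_latex_encoding(tex))
```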

Comment 35 Anders Lund 2005-03-25 21:39:43 UTC

In my opinion we should implement some sort of events in the highlight parser, allowing us to emit a signal when a rule matches.

Comments Jowenn, Cullmann, Dominik?

Comment 36 Hasso Tepper 2005-05-08 14:02:55 UTC

I would extend it to not just UTF-8 vs. some other 8-bit charset detection, but general charset detection. There have been complaints from users - "why doesn't Konqueror show me the text files with the correct charset? Mozilla and IE do." Both Mozilla and IE have general charset detection implemented. Info about Mozilla charset detection can be found here - http://www.mozilla.org/projects/intl/chardet.html

Comment 37 Wilbert Berendsen 2005-05-14 19:47:00 UTC

Anders wrote:
> In my opinion we should implement some sort of events in the
> highlight parser, allowing us to emit a signal when a rule matches. 

It would be a good way to do it, but the encoding detection would also need to work if the user has highlighting switched off.

This feature would be soo nice... today I again fully mangled a .po file by just editing one message, but (unknowingly) saving it in the wrong encoding...

Comment 38 Anders Lund 2005-05-14 20:05:50 UTC

Saturday 14 May 2005 19:47 skrev Wilbert Berendsen:
> It would be a good way to do it, but the encoding detection would also need
> to work if the user has highlighting switched off.
>
> This feature would be soo nice... today I again fully mangled a .po file by
> just editing one message, but (unknowingly) saving it in the wrong
> encoding...


... meaning one more class that needs to know a lot of rules and read 
potentially many lines of the document.

Comment 39 Wilbert Berendsen 2005-05-14 22:18:19 UTC

Anders wrote:
> ... meaning one more class that needs to know a lot of rules and
> read potentially many lines of the document.

Yes, but maybe we can combine things in a good design:

Syntax highlighting files are already becoming more like language files. So a syntax file could contain different kinds of information, like how to fold, how to highlight, how to indent and how to detect encoding.

A little drawback would be that when turning syntax highlighting off, the language files would still be read for the other relevant information.

Comment 40 Anders Lund 2005-05-14 22:27:40 UTC

Saturday 14 May 2005 22:18 skrev Wilbert Berendsen:
> Syntax highlighting files are already becoming more like language files. So
> a syntax file could contain different kinds of information, like how to
> fold, how to highlight, how to indent and how to detect encoding.
>
> A little drawback would be that when turning syntax highlighting off, the
> language files would still be read for the other relevant information.


My main idea was that during parsing we'd match the exact information, and set 
the encoding in that event. Just keeping the information in the file is not 
the same.

Comment 41 Pavel Simerda 2005-10-09 18:57:53 UTC

It seems you are still not convinced to use the utf-8 detection which is pretty simple and as many said before also quite safe.

Comment 42 Christoph Cullmann 2005-10-31 10:04:51 UTC

*** Bug 113654 has been marked as a duplicate of this bug. ***

Comment 43 Christoph Cullmann 2005-11-07 20:34:05 UTC

*** Bug 93466 has been marked as a duplicate of this bug. ***

Comment 44 Ferdinand Gassauer 2005-12-31 13:59:31 UTC

utrac -p <filename> reports the encoding
http://utrac.sourceforge.net/index.html
kate could (optionally)
* use this information to convert the file to UTF-8
* open it 
* and convert it back for saving

Comment 45 Dominik Haumann 2006-01-01 19:14:58 UTC

Kat, the desktop search engine, has encoding detection as well. An interesting read might be http://robertocappuccio.blogspot.com/2005/10/work-in-progress.html
Maybe such a lib can be shared between kate and kat.

Comment 46 Pavel Simerda 2006-02-19 23:22:10 UTC

I am now playing with encoding detection... It doesn't seem so difficult now. I'd just need much more testing data in different languages...

I'm going to try some foreign websites.

For now I have working detection for unicode's utf8/16/32 based on zero bytes (works well for european /and some other/ languages) and validation.
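
The script itself is not shown here, so the following is only a guess at what a zero-byte heuristic might look like, written as a Python sketch. It assumes text dominated by characters below U+0100, for which the high bytes of each UTF-16 or UTF-32 code unit are zero; the name `guess_utf_by_zero_bytes` and both thresholds are arbitrary.

```python
def guess_utf_by_zero_bytes(data: bytes, threshold: float = 0.9):
    """Guess a BOM-less UTF-16/32 variant from where the zero bytes fall.
    Returns a codec name, or None if no pattern is clear enough."""
    if len(data) < 4:
        return None

    def zeros(offset: int, step: int) -> float:
        # Fraction of zero bytes among the bytes at offset, offset+step, ...
        chunk = data[offset::step]
        return chunk.count(0) / len(chunk)

    # UTF-32 first: its zero pattern partly overlaps the UTF-16 one.
    q = [zeros(k, 4) for k in range(4)]
    if q[0] < 0.1 and min(q[1], q[2], q[3]) >= threshold:
        return "utf-32-le"
    if q[3] < 0.1 and min(q[0], q[1], q[2]) >= threshold:
        return "utf-32-be"

    even, odd = zeros(0, 2), zeros(1, 2)
    if even < 0.1 and odd >= threshold:
        return "utf-16-le"
    if odd < 0.1 and even >= threshold:
        return "utf-16-be"
    return None


if __name__ == "__main__":
    text = "Tom Schüler öffnet die Datei"
    for codec in ("utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be", "utf-8"):
        print(codec, "->", guess_utf_by_zero_bytes(text.encode(codec)))
```

Text with many characters at or above U+0100 lowers the zero ratios, so the thresholds would need tuning against real test data of the kind asked for above.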

And I have a general mechanism for 8-bit encodings... and data only for czech.

-------

I'm testing it in python because it's quite simple - the script now counts 189 lines.

All tested on a czech text translated by iconv into several encodings:

pavlix@tiger:~/svn/editor$ python encoding.py testdata/*
testdata/text.ascii: ascii
testdata/text.ibm852: ibm852
testdata/text.iso88592: iso-8859-2
testdata/text.utf16: utf-16le
testdata/text.utf16be: utf-16be
testdata/text.utf16le: utf-16le
testdata/text.utf32: utf-32le
testdata/text.utf32be: utf-32be
testdata/text.utf32le: utf-32le
testdata/text.utf8: utf-8
testdata/text.windows1250: windows-1250

-------------------------------------------------------

Will write more, 

Pavel Simerda

Comment 47 Pierre Renié 2006-05-24 16:57:55 UTC

I have a method: detect the encoding in this order:

1) With the metas, like
<?xml version="1.0" encoding="UTF-8"?> or <meta content="text/html; charset=ISO-8859-1" http-equiv="content-type"> or for python :
# -*- coding: utf-8 -*-

2) Use utrac ( http://utrac.sourceforge.net/index.html )
Or if utrac cannot be used, for any reason :

3) Try UTF-8. In UTF-8, all the special characters like é, è, à... are composed of 2 bytes between 128 and 255. So if a byte in this range is found alone, this is not UTF-8, so the charset decoder will return an error.

4) Try ISO-8859-1. The decoder will return an error if some characters are not allowed.

5) Try CP-1252. This is an extended version of ISO-8859-1

6) Try UTF-16. This test should be done after all other tests because in this charset all alphanumeric characters have a byte 0, which is not allowed in the previous encodings.

Of course there may be some optimization, because if a 0 is found it's UTF-16.
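
One caveat for steps 3 to 5, shown here with Python's built-in codecs (other decoders may behave differently): strict UTF-8 decoding does reject a lone high byte, but the ISO-8859-1 codec maps all 256 byte values and so never reports an error, and the CP-1252 codec rejects only a few undefined bytes. The helper names below are hypothetical.

```python
def accepts(data: bytes, encoding: str) -> bool:
    """True if the strict decoder for `encoding` decodes `data` without error."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def accepting_encodings(data: bytes, candidates=("utf-8", "iso-8859-1", "cp1252")):
    """List every candidate whose decoder raises no error for `data`."""
    return [enc for enc in candidates if accepts(data, enc)]


if __name__ == "__main__":
    print(accepting_encodings("é".encode("utf-8")))  # valid UTF-8 input
    print(accepting_encodings(b"caf\xe9"))           # lone high byte
    print(accepting_encodings(bytes(range(256))))    # every possible byte value
```

Since several decoders can accept the same bytes, the order of the tests decides the result, which is why UTF-8 has to be tried before the 8-bit encodings.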

Comment 48 Jaison Lee 2006-06-07 21:37:04 UTC

*** Bug 128785 has been marked as a duplicate of this bug. ***

Comment 49 Ferdinand Gassauer 2006-06-21 01:12:44 UTC

gedit detects the encoding perfectly and automagically.

I assigned it as a bug, because 
a) Kate's behaviour is not user-friendly
b) kate states that a UTF-16 file is binary, which is obviously wrong and does not guide the user to take the correct action - i.e. changing the encoding manually. BTW - How should one choose the correct encoding out of the many available? Trial and error?

Comment 50 Julian Fleischer 2006-06-22 15:43:19 UTC

Although I'm aware of the problem which is discussed here, I have to say it's not as annoying as it is stated here... I think you can specify different file types in kate, which kate automatically detects... for example I was really annoyed that my PHP files (which are ISO-8859-1) were interpreted as UTF-8 - but there is a settings option where I could add PHP files and tell Kate that these files are to be interpreted as ISO-8859-1 - so, what's the point?

Comment 51 Gaël de Chalendar (aka Kleag) 2006-06-22 17:13:12 UTC

There are a lot of file types that can be either UTF-8, latin1 or any specific encoding when coming from various sources. At a minimum, inconsistencies have to be checked for.

Comment 52 Jakob Petsovits 2006-06-22 17:42:36 UTC

> but there is a settings-options whehre i could add PHP-Files and tell Kate
> that these files are to be interpreted as ISO-8859-1 - so, what's the point? 

The point is that if you don't do that then Kate breaks your files because it has read the wrong character by believing it's encoded in your standard setting.

Worse, you can't generalize a "PHP file encoding" (and neither for HTML files, or normal text files, or...) so if you open the PHP file on another web server and it's UTF-8, Kate gets it wrong. Maybe it only contains one upper-range character and you don't even notice it. And then, by saving the file you break it.

My old text files are still in ISO-8859-1 whereas my newer files are all UTF-8. Not having them detected properly means data loss (or major hurdles to get the upper-range characters back).

So, relying on an LC_* environment variable or even filetype-specific settings is not sufficient (it's only fine for creating new files), especially when there are so many different files from all kinds of sources - legacy files, stuff from the web, different coding policies, and stuff.

How is encoding detection done in GEdit? It sounds like they do it sufficiently well, and maybe it's not so hard to port. Anyways, imho an encoding detection that doesn't work perfectly is still better than no detection at all - after all, you can still change the encoding like you can at the moment.

Comment 53 Kai Krakow 2006-08-01 18:10:37 UTC

I learned to love vim's encoding detection. It never did it wrong for me. It correctly distinguishes latin1 from utf-8, and it leaves the file in that encoding when I save it. If I use characters which cannot be encoded, there's feedback that the file cannot be saved. It would be no problem for Kate to show a dialog to ask for the encoding to use.

Okay, it may wrongly detect plain us-ascii as utf-8 or latin1 or whatever my native locale is set to - but this usually doesn't matter because when the text is plain us-ascii I probably won't use characters from other charsets anyway. If in doubt, Kate could present a dialog to tell me that my document was detected as us-ascii and now contains characters which need another encoding which I could choose then.

Default encoding for new file should then be KDE's global default. Current behaviour of Kate always breaks a file if I forget to set or don't know the correct charset. It would only sometimes break the file if detection goes wrong. And if a file cannot be represented with the detected or default charset, Kate can always provide me with a warning dialog to choose a compatible encoding.

BTW: Also the "file" command supports heuristics to detect a charset. It should be reliable for any case.

Comment 54 Jörg Walter 2006-08-17 13:45:21 UTC

I second the last two comments. If you are afraid of what happens with US-ASCII text after adding international characters, add a plain US-ASCII encoding, then try US-ASCII / UTF-8 / <some user specified 8-bit-encoding> in turn, the first one that parses the file without errors wins. Since Kate already warns when some text characters can't be saved in the given encoding, a US-ASCII file with international chars won't be saved without warning. 

I have munged so many files simply because it was ISO and kate loaded it as UTF-8, and I didn't notice that. Any auto-detection is way better than none. IMHO, it is a bug that a text is loaded as UTF-8 while it obviously isn't.

BTW, it is important not to use ISO-8859-1 as third choice, but leave it up to the user (think eastern Europe, ISO-8859-2, or Euro-land using ISO-8859-15 these days).
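
The try-in-turn order suggested in this comment can be sketched in Python as follows. The function name `detect_in_turn` is hypothetical, and the third choice is a parameter because, as noted above, it should be left to the user.

```python
def detect_in_turn(data: bytes, user_fallback: str = "iso-8859-15") -> str:
    """Try US-ASCII, then UTF-8; the first that decodes without error wins.
    Otherwise return the user's chosen 8-bit encoding."""
    for encoding in ("ascii", "utf-8"):
        try:
            data.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return user_fallback


if __name__ == "__main__":
    print(detect_in_turn(b"plain text"))
    print(detect_in_turn("Jörg".encode("utf-8")))
    print(detect_in_turn("Jörg".encode("iso-8859-2"), user_fallback="iso-8859-2"))
```

Reporting "ascii" separately from UTF-8 is what lets an editor warn at save time once the user has typed a character outside US-ASCII.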

Comment 55 Allan Sandfeld 2006-08-17 14:06:51 UTC

Another thing that is easier than full auto-detection is to recognize text-files that start with UTF-8 BOM. 

For browsers the detection mechanism is Specified -> Inherited -> DOM -> Auto-detect

Currently the BOM are just loaded by kate as an unknown character.

Comment 56 Nicolas Goutte 2006-08-17 19:01:20 UTC

On Thursday 17 August 2006 13:45, Jörg Walter wrote:
(...)
> Additional Comments From trouble garni ch  2006-08-17 13:45 -------
> I second the last two comments. If you are afraid of what happens with
> US-ASCII text after adding international characters, add a plain US-ASCII
> encoding, then try US-ASCII / UTF-8 / <some user specified 8-bit-encoding>
> in turn, the first one that parses the file without errors wins. Since Kate
> already warns when some text characters can't be saved in the given
> encoding, a US-ASCII file with international chars won't be saved without
> warning.


As Qt does not know ASCII, I doubt that Kate can discover that the characters 
are not ASCII-compatible. (If it uses the KCharsets class to get the 
encoding, it will get ISO 8859-1 INSTEAD of ASCII.)

>
> I have munged so many files simply because it was ISO and kate loaded it as
> UTF-8, and I didn't notice that. Any auto-detection is way better than
> none. IMHO, it is a bug that a text is loaded as UTF-8 while it obviously
> isn't.
>


> BTW, it is important not to use ISO-8859-1 as third choice, but leave it up
> to the user (think eastern Europe, ISO-8859-2, or Euro-land using
> ISO-8859-15 these days).


The locale encoding could be used.

Have a nice day!

Comment 57 Jörg Walter 2006-08-17 19:33:17 UTC

Actually, I don't care about ASCII and what happens if I accidentally save in the wrong encoding - if that happens by accident, it can be fixed, it's the lesser evil. It's the data loss that bothers me, situations like the above happen at least once a week.

Using the locale's encoding as third choice is a good idea, unless you are using a UTF-8 locale, which is always true if you are in danger of accidentally trashing files. If you were using an 8-bit locale and opened a UTF-8 file, you would see some weird chars, but at least the file can be saved back safely.

Comment 58 Kai Krakow 2006-08-17 19:53:29 UTC

Jörg wrote:
> I second the last two comments. If you are afraid of what happens with
> US-ASCII text after adding international characters, add a plain US-ASCII
> encoding, then try US-ASCII / UTF-8 / <some user specified 8-bit-encoding>
> in turn, the first one that parses the file without errors wins. Since Kate
> already warns when some text characters can't be saved in the given encoding,
> a US-ASCII file with international chars won't be saved without warning.

I really like this idea. Sounds like a working solution.

Nicolas wrote:
> As Qt does not know ASCII, I doubt that Kate can discover that the characters
> are not ASCII-compatible. (If it uses the KCharsets class to get the
> encoding, it will get ISO 8859-1 INSTEAD of ASCII.)

Maybe Qt needs improving then... ;-) Well I don't think Qt is the problem - does Qt do the detection? An independent detection routine should be used anyway. If US-ASCII is fed as ISO-8859-1 to Qt or KDE it shouldn't be a problem, because before saving another routine should check again if the data is still compatible with the US-ASCII which was detected before. The routines should be smart enough to know that Qt encoded this as ISO-8859-1 silently. As long as no characters other than US-ASCII are used it is of no importance anyway if Qt used ISO-8859-1, because both encodings are the same in the part the two character sets share. Or may I be wrong?

Comment 59 Nicolas Goutte 2006-08-17 20:19:23 UTC

On Thursday 17 August 2006 19:53, Kai Krakow wrote:
(...)
> Jörg wrote:
> > I second the last two comments. If you are afraid of what happens with
> > US-ASCII text after adding international characters, add a plain US-ASCII
> > encoding, then try US-ASCII / UTF-8 / <some user specified
> > 8-bit-encoding> in turn, the first one that parses the file without
> > errors wins. Since Kate already warns when some text characters can't be
> > saved in the given encoding, a US-ASCII file with international chars
> > won't be saved without warning.
>
> I really like this idea. Sounds like a working solution.
>
> Nicolas wrote:
> > As Qt does not know ASCII, I doubt that Kate can discover that the
> > characters are not ASCII-compatible. (If it uses the KCharsets class to
> > get the encoding, it will get ISO 8859-1 INSTEAD of ASCII.)
>
> Maybe Qt needs improving then... ;-) Well I don't think Qt is the problem -
> does Qt do the detection? An independent detection routine should be used
> anyway. If US-ASCII is fed as ISO-8859-1 to Qt or KDE it shouldn't be
> the problem, because before saving another routine should check again if
> the data is still compatible with US-ASCII which was detected before. The
> routines should be smart enough to know that Qt encoded this as ISO-8859-1
> silently. As long as no characters other than US-ASCII are used it is of no
> importance anyway if Qt used ISO-8859-1 because both encodings are the same
> in the matching set of both character sets. Or am I wrong?


Anyway, Kate's loading/saving dialogs use the KCharsets list of encodings,
which does not have ASCII in it.

Have a nice day!

Comment 60 Airbag 2006-10-15 19:49:50 UTC

Well, I can confirm this 3-year old (!) bug is still around for KDE 3.5.4.
And still *very* annoying.
Anyway, I vote for it.

Comment 61 auxsvr 2007-02-23 17:29:01 UTC

There is a program called chardet at http://chardet.feedparser.org/ that does character encoding detection and is based on an algorithm by Netscape, as used in Mozilla. Perhaps this should be included in kdelibs?

Comment 62 Oded Arbel 2007-02-25 14:03:41 UTC

Another character detection program is enca ( http://trific.ath.cx/software/enca ), which has the following benefits over the above-mentioned chardet:
- It is available in many Linux distributions
- Written in portable C code.
- Available as a library with a well documented API
- Seems a bit more active than chardet

Still - IMHO charset detection is at best a "best effort" algorithm and should never be done automatically without user interaction. 

While trying not to contradict the above comment, I would like to note that gedit - the GNOME text editor - seems to implement some kind of auto-detection: I have tested loading a text file with some Hebrew text in each of UTF-8, UTF-16, and the relevant ISO-8859 character set, and in all cases it loaded the file properly.
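How gedit does this is not stated here. One plausible ingredient, mentioned earlier in this report, is checking for a byte order mark before falling back to heuristics. The sketch below assumes that approach; `sniff_bom` is a hypothetical name, and UTF-32 is deliberately ignored.

```python
import codecs

# (BOM bytes, Python codec that consumes that BOM while decoding)
BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)

def sniff_bom(raw: bytes):
    """Return a codec name if raw starts with a known BOM, else None."""
    for bom, codec_name in BOMS:
        if raw.startswith(bom):
            return codec_name
    return None

sample = "שלום"
print(sniff_bom(sample.encode("utf-16")))      # Python writes a BOM for "utf-16"
print(sniff_bom(sample.encode("utf-8-sig")))   # UTF-8 with a BOM
print(sniff_bom(sample.encode("iso-8859-8")))  # no BOM, needs other heuristics
```

A BOM only settles the marked cases; files without one, such as plain UTF-8 or ISO-8859-8, still need a validity check or a statistical guess.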

Comment 63 Kai Krakow 2007-04-03 11:05:02 UTC

I tried to "fix" this by putting a .kateconfig modeline in the project's root directory - but that doesn't seem to work on remote files. I didn't try to test this locally so it may be totally broken. So if someone has time to test whether .kateconfig modelines can set the proper charset used in a project, and that doesn't work locally, that would be worth a bug report. Would be nice if someone could point me there then - thanks.

Comment 64 Nick Shaforostoff 2007-04-09 15:51:13 UTC

http://cia.vc/stats/author/shaforo/.message/58524
hmm, maybe i should put a space after ":"?

Comment 65 Kai Krakow 2007-04-09 16:45:03 UTC

Thanks, nice to see someone working on this instead of discussing whether it is the ultimate perfect detect-anything-right-and-be-smarter-than-the-user solution. Any detection is better than just opening the files braindead wrong.

Comment 66 Nick Shaforostoff 2007-04-09 16:58:24 UTC

and of course, those who don't like it can leave it disabled :)
btw, any heuristics tuning is always welcome.

Comment 67 Malte 2008-10-07 13:13:21 UTC

As of KDE 3.5.10 the detection works nicely. Thanks.

Comment 68 Dominik Haumann 2010-02-19 14:32:34 UTC

*** Bug 110707 has been marked as a duplicate of this bug. ***

Comment 69 Kai Krakow 2010-02-21 12:46:33 UTC

This still doesn't work... In fact it has even become worse: It always opens files that are not UTF-8 encoded in read-only mode with garbled characters - and no matter which encoding you choose from the menu bar, it stays that way.

So this is not fixed (Kate from KDE 4.4)

Comment 70 H.H. 2010-02-21 13:43:47 UTC

I have had a similar experience: with detection, most of the time very strange charsets are selected, which scramble everything.

Comment 71 Szczepan Hołyszewski 2010-06-03 22:22:37 UTC

Same experience here. KWrite failed to detect that a file was in Windows-1250 encoding, then garbled it upon saving, by replacing EVERY character above 127 with THE SAME four-character gibberish sequence.

Reopen this bug!

Comment 72 Dominik Haumann 2010-06-03 22:51:32 UTC

Please build the development version according to the howto (very easy and safe) if you have KDE 4.4 and try again:
http://gitorious.org/kate/pages/Building%20Kate