Bug 236291 - make katepart add BOM mark into the beginning of UTF-8 files
Summary: make katepart add BOM mark into the beginning of UTF-8 files
Status: RESOLVED INTENTIONAL
Alias: None
Product: kate
Classification: Applications
Component: general (other bugs)
Version First Reported In: unspecified
Platform: Debian testing Unspecified
Importance: NOR wishlist
Target Milestone: ---
Assignee: KWrite Developers
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-05-04 10:56 UTC by Nick Shaforostoff
Modified: 2016-01-31 15:53 UTC

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:


Attachments
file produced by kate right now (31.29 KB, application/bzip2)
2010-05-04 10:57 UTC, Nick Shaforostoff
file with exactly same text, but reencoded by notepad (31.32 KB, application/bzip2)
2010-05-04 10:59 UTC, Nick Shaforostoff

Description Nick Shaforostoff 2010-05-04 10:56:06 UTC
Version:            (using KDE 4.3.4)
Installed from:    Debian testing/unstable Packages

If I save a file as UTF-8 in Kate, some Windows programs, including Notepad and video players, cannot open it correctly.

After I opened this file in Notepad++ (an open source editor for Windows), manually specifying its encoding, and then saved the text in Notepad (not as ANSI, but as UTF-8), it immediately became readable by Notepad, the video players, and Kate (it is still UTF-8 for Kate).

If I open the second version of the file in Okteta, I see EF BB BF at the beginning of the file (the BOM, see Wikipedia). But it is not actually the only difference, so I'll attach both versions of the file.
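
For reference, the check done by eye in Okteta can be sketched in a few lines of Python. This is only an illustration, not anything Kate does: the helper names `has_utf8_bom` and `add_utf8_bom` are made up for this sketch, and it relies only on the standard `codecs.BOM_UTF8` constant, which holds the three bytes EF BB BF.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def add_utf8_bom(data: bytes) -> bytes:
    """Prepend the UTF-8 BOM unless the data already starts with one."""
    return data if has_utf8_bom(data) else codecs.BOM_UTF8 + data


# Hypothetical sample content standing in for the attached files.
plain = "привет".encode("utf-8")
marked = add_utf8_bom(plain)

print(has_utf8_bom(plain))
print(has_utf8_bom(marked))
print(marked[:3].hex(" "))
```

The same `has_utf8_bom` check could be run on the first bytes of each attachment, read in binary mode, to confirm which of the two carries the mark.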


Wish: please make Kate encode files so that they are readable by both Windows and Linux applications (i.e. make Kate produce win-utf.txt instead of kate-utf.txt).
Comment 1 Nick Shaforostoff 2010-05-04 10:57:32 UTC
Created attachment 43216 [details]
file produced by kate right now
Comment 2 Nick Shaforostoff 2010-05-04 10:59:00 UTC
Created attachment 43217 [details]
file with exactly same text, but reencoded by notepad

Note that it is readable on both sides.

Also, a BOM would let us always autodetect the UTF-8 encoding with certainty.
Comment 3 Milian Wolff 2010-05-04 11:09:01 UTC
hey Nick!

Please try the same in KDE 4.4 or better yet 4.5 - esp. the latter saw quite some improvements in encoding support.

Furthermore, a BOM should never be inserted by default: many scripting languages (e.g. PHP) have problems with it, and afaik so do some XML parsers.
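
As a rough illustration of the kind of problem meant here (a sketch in Python, not a reproduction of the PHP or XML cases mentioned above): a tool that expects a file to start with specific bytes, or that decodes it as plain UTF-8, sees the BOM as extra leading content. The shell script content is a made-up sample.

```python
import codecs

# A made-up shell script saved with a UTF-8 BOM in front of the shebang line.
with_bom = codecs.BOM_UTF8 + b"#!/bin/sh\necho hello\n"

# A byte-level check for the shebang fails, because the BOM comes first.
starts_with_shebang = with_bom.startswith(b"#!")

# Plain "utf-8" decoding keeps the BOM as the character U+FEFF.
text_plain = with_bom.decode("utf-8")

# The "utf-8-sig" codec strips a leading BOM if one is present.
text_sig = with_bom.decode("utf-8-sig")

print(starts_with_shebang)
print(text_plain[0] == "\ufeff")
print(text_sig.startswith("#!/bin/sh"))
```

Software that is not written to skip the mark, as `utf-8-sig` does, has to cope with the extra character itself.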
Comment 4 Nick Shaforostoff 2010-05-04 12:50:27 UTC
There is one positive change from 4.3 to 4.4: in 4.4, if I open any of the *.txt files and save it under a different name, the result is equal to a binary copy (i.e. everything is preserved). In 4.3, KWrite removes the BOM.

But if I open win-utf8.txt, copy and paste its contents into a new KWrite window, and then save it, it creates a file equal to kate-utf8.txt.

But I would like to have an option to explicitly specify how files are encoded, i.e. Windows-friendly or simple (command-line friendly). It would also be cool to have it automatically select Windows-friendly for certain types of files, for example .srt files.
Comment 5 Matthew Woehlke 2010-05-04 22:43:44 UTC
Tools -> Add Byte Order Mark (BOM) ??

I don't know if there is a corresponding modeline.
Comment 6 Nick Shaforostoff 2010-05-04 23:54:30 UTC
I thought about extending the file save dialog. It already allows us to specify the encoding, so why not extend it with another option?
Comment 7 Nick Shaforostoff 2014-02-11 12:10:26 UTC
What if we add a BOM to a file only if it has a .txt extension?

This way no XML files will be harmed, and we'll get nice interoperability with OS X and Windows.
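
The proposed rule could be sketched roughly as follows. This is a hypothetical Python illustration of the policy only, not Kate code: the function name `should_add_bom` and both sets of values are assumptions made for the example.

```python
from pathlib import PurePath

# Hypothetical policy values, chosen only for illustration.
BOM_EXTENSIONS = {".txt"}
UNICODE_ENCODINGS = {"utf-8", "utf-16", "utf-32"}


def should_add_bom(filename: str, encoding: str) -> bool:
    """Return True if a BOM would be written under the proposed rule.

    The rule: only Unicode-encoded files whose extension is in the
    allow-list get a BOM; everything else (e.g. .xml) is left alone.
    """
    suffix = PurePath(filename).suffix.lower()
    return encoding.lower() in UNICODE_ENCODINGS and suffix in BOM_EXTENSIONS


print(should_add_bom("subtitles.txt", "UTF-8"))
print(should_add_bom("feed.xml", "UTF-8"))
print(should_add_bom("notes.txt", "ISO-8859-1"))
```

Keeping the extensions in a set would make it easy to extend the rule to other file types, such as the .srt files mentioned earlier in this report.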
Comment 8 Dominik Haumann 2014-02-11 12:35:55 UTC
Extending the file dialog is not so easy, as the QFileDialog API in Qt5 must support that first somehow.

You can already add a BOM now by specifying it in the filetype through a variable:
http://docs.kde.org/stable/en/applications/kate/config-variables.html#variable-byte-order-marker

Besides that, are you proposing to add a BOM by default to Unicode-encoded files?
Comment 9 Nick Shaforostoff 2014-02-11 16:57:10 UTC
"are you proposing to add a BOM by default to unicode encoded files?"
Yes, but only for those that get saved with a .txt extension (so we get around the mentioned use case of broken XML processors).

That is its purpose, after all. http://en.wikipedia.org/wiki/Byte_order_mark
Comment 10 Christoph Feck 2014-02-26 01:04:05 UTC
Wikipedia says: "The Unicode Standard neither requires nor recommends the use of the BOM for UTF-8.[ http://www.unicode.org/versions/Unicode6.0.0/ch02.pdf ] The presence of the UTF-8 BOM may cause interoperability problems with existing software that could otherwise handle UTF-8[...]"
Comment 11 Christoph Cullmann 2016-01-31 15:53:49 UTC
Sorry, we won't add BOMs by default; that only leads to problems.