Version: (using KDE KDE 3.2.2)
Installed from: Unspecified Linux
[mimo@localhost mimo]$ cat arabic_file
[mimo@localhost mimo]$ cut -b1 arabic_file
The problem is the leading spaces. I understand that the squares stand for invalid bytes, but I can't understand why there are leading spaces!
This may also prove useful:
[mimo@localhost mimo]$ hexdump arabic_file
0000000 b9d8 d80a 0ab1 a8d8 d90a 0a8a a1d9 d90a
0000010 0aa2 a3d9 d90a 0aa4
[mimo@localhost mimo]$ cut -b1 arabic_file |hexdump
0000000 0ad8 0ad8 0ad8 0ad9 0ad9 0ad9 0ad9 0ad9
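For anyone reading the dumps above: hexdump's default format prints 16-bit words in host byte order, so on a little-endian machine (an assumption here, though likely for this x86 setup) the word b9d8 corresponds to the file bytes d8 b9. A minimal Python sketch that rebuilds the bytes from the words in the first dump and decodes them; the helper name `words_to_bytes` is made up for illustration:

```python
import struct

# Words copied from the `hexdump arabic_file` output above (offset column dropped).
words = "b9d8 d80a 0ab1 a8d8 d90a 0a8a a1d9 d90a 0aa2 a3d9 d90a 0aa4".split()

def words_to_bytes(words):
    """Rebuild file bytes from hexdump's 16-bit words.

    Assumes a little-endian host: the low byte of each printed word
    is the byte that comes first in the file.
    """
    return b"".join(struct.pack("<H", int(w, 16)) for w in words)

data = words_to_bytes(words)
text = data.decode("utf-8")   # succeeds, so the dump is well-formed UTF-8
lines = text.splitlines()

for line in lines:
    # Show each line's code points and its UTF-8 bytes.
    print([f"U+{ord(c):04X}" for c in line], line.encode("utf-8").hex())
```

Under that assumption the dump decodes to eight lines of one Arabic character each, every character taking two bytes followed by a newline (0a).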
Which encoding is this?
The original arabic_file is UTF-8:
$ file arabic_file
arabic_file: UTF-8 Unicode text
But since an Arabic character in UTF-8 takes two bytes, cutting out only the first byte won't produce anything useful. The squares may stand for an invalid UTF-8 sequence, but no leading spaces should be inserted anyway.
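To illustrate why a lone first byte cannot be displayed, here is a small Python sketch that mimics what cut -b1 does to a single-character line and then checks whether the result is valid UTF-8. The function names `cut_first_byte` and `is_valid_utf8` are hypothetical helpers, not part of any tool mentioned here:

```python
def cut_first_byte(line: str) -> bytes:
    """Return only the first byte of the line's UTF-8 encoding (like cut -b1)."""
    return line.encode("utf-8")[:1]

def is_valid_utf8(raw: bytes) -> bool:
    """True if the bytes decode as UTF-8 without error."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# U+0639 (Arabic letter ain), U+0661 (Arabic-Indic digit one), and ASCII 'a'.
for ch in ["\u0639", "\u0661", "a"]:
    first = cut_first_byte(ch)
    print(ch.encode("utf-8").hex(), first.hex(), is_valid_utf8(first))
```

The lead bytes d8 and d9 announce a two-byte sequence, so on their own they are incomplete and a terminal has to decide how to render them. The ASCII case stays valid because it is a single byte to begin with.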
Just curious, does this behavior occur using xterm or another terminal program?
Also, could you post the file you're using? I don't have the skills to fix it, but I'd like to check whether it still happens in KDE 3.3.
Created attachment 8739
a sample arabic file
Yes, it also happens in xterm. I've just attached a sample arabic_file.
Fixed in Konsole for KDE 4 as a side effect of changing the way in which the incoming character stream is decoded for display.
To clarify, the new behaviour when running cut -b1 on the above file is to print 4 blank lines (i.e. invalid character sequences produce no output).