Version: (using KDE KDE 3.2.2)
Installed from: Unspecified Linux
OS: Linux

[mimo@localhost mimo]$ cat arabic_file
ع
ر
ب
ي
١
٢
٣
٤
[mimo@localhost mimo]$ cut -b1 arabic_file
�
�
�
�
�
�
�
�
[mimo@localhost mimo]$

The problem is the leading spaces. I understand that the squares mean invalid bytes, but I can't understand why there are leading spaces!

This may also prove useful:

[mimo@localhost mimo]$ hexdump arabic_file
0000000 b9d8 d80a 0ab1 a8d8 d90a 0a8a a1d9 d90a
0000010 0aa2 a3d9 d90a 0aa4
0000018
[mimo@localhost mimo]$ cut -b1 arabic_file |hexdump
0000000 0ad8 0ad8 0ad8 0ad9 0ad9 0ad9 0ad9 0ad9
0000010
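For anyone reading the hexdump output above: plain hexdump prints 16-bit words in host byte order, so on a little-endian machine (assumed here, e.g. x86) the word b9d8 stands for the bytes d8 b9. The following is only a rough Python sketch for turning those words back into the file's bytes; the helper name words_to_bytes is made up for illustration.

```python
# Words copied from the `hexdump arabic_file` output in the report.
words = "b9d8 d80a 0ab1 a8d8 d90a 0a8a a1d9 d90a 0aa2 a3d9 d90a 0aa4".split()


def words_to_bytes(words):
    """Undo hexdump's 16-bit grouping, assuming a little-endian host."""
    return b"".join(int(w, 16).to_bytes(2, "little") for w in words)


data = words_to_bytes(words)
text = data.decode("utf-8")  # succeeds, so the bytes form valid UTF-8

print(len(data))     # total byte count of the reconstructed file
print(text.split())  # the characters, one per original line
```

Read this way, the file is one two-byte character followed by a newline on each line.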
Which encoding is this?
The original arabic_file is UTF-8:

$ file arabic_file
arabic_file: UTF-8 Unicode text

But since an Arabic character in UTF-8 is two bytes, cutting the first byte won't produce anything useful. The squares may well mean an invalid UTF-8 sequence, but no leading spaces should be inserted in any case.
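To illustrate the two-byte point, here is a small Python sketch that mimics what cut -b1 does to this file. It is not how cut is implemented; the function names cut_first_byte and is_valid_utf8 are hypothetical, and the sketch assumes the input ends with a newline.

```python
# Same content as arabic_file: one Arabic character per line.
data = "ع\nر\nب\nي\n١\n٢\n٣\n٤\n".encode("utf-8")


def cut_first_byte(data):
    """Mimic `cut -b1`: keep only the first byte of each line."""
    lines = data.split(b"\n")[:-1]  # drop the empty piece after the final newline
    return b"".join(line[:1] + b"\n" for line in lines)


def is_valid_utf8(raw):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


out = cut_first_byte(data)
print(out.hex(" "))        # lead bytes d8/d9, each followed by 0a
print(is_valid_utf8(out))  # a lone lead byte is not a complete sequence
```

Each line of the result holds only a lead byte (d8 or d9) with no continuation byte after it, which matches the second hexdump in the report.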
Just curious, does this behavior occur using xterm or another terminal program? Also, could you post the file you're using? I don't have the skills to fix it, but I was wondering if it still does this in KDE 3.3. Thanks
Created attachment 8739: a sample arabic file
Yes, it also happens in xterm. I've just attached a sample arabic_file
Fixed in Konsole for KDE 4 as a side effect of changing the way in which the incoming character stream is decoded for display.
To clarify, the new behaviour when running cut -b1 on the above file is to print 4 blank lines (i.e. invalid character sequences produce nothing at the output).
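The difference between showing a square and showing nothing can be sketched with Python's standard decoder error handlers. This is not Konsole's actual decoding code, only an illustration of the two policies; the four-line input is a hypothetical sample.

```python
# Hypothetical sample: four lines, each holding a lone UTF-8 lead byte,
# as `cut -b1` would produce from two-byte Arabic characters.
cut_output = b"\xd8\n\xd8\n\xd9\n\xd9\n"

# Policy 1: substitute U+FFFD (usually drawn as a square or a "?" glyph).
with_squares = cut_output.decode("utf-8", errors="replace")

# Policy 2: drop invalid sequences entirely, leaving only the newlines.
dropped = cut_output.decode("utf-8", errors="ignore")

print(repr(with_squares))
print(repr(dropped))
```

With the second policy only the newlines survive, so the terminal shows blank lines.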