Bug 209046

Summary:	krosspython: UTF-8 python strings are encoded as ASCII when converted to QString
Product:	[Developer tools] bindings	Reporter:	Daniel Calviño Sánchez <danxuliu>
Component:	general	Assignee:	kde-bindings
Status:	RESOLVED FIXED
Severity:	normal	CC:	mail
Priority:	NOR
Version:	unspecified
Target Milestone:	---
Platform:	Compiled Sources
OS:	Linux
Attachments:	Patch for unittest.py showing the bug; Patch for pythonvariant.h with a workaround to the bug

Description Daniel Calviño Sánchez 2009-10-01 00:38:43 UTC

Version: (using Devel)
Compiler: gcc 4.3.2
OS: Linux
Installed from: Compiled sources

First of all, I have little Python knowledge, so maybe I'm doing something wrong and this is not a real bug. I apologize if that's the case.

In Python scripts encoded in UTF-8, when a string is passed to a C++ method, the QString received by the method holds the UTF-8 bytes interpreted as if they were ASCII. That is, UTF-8 characters that need more than 1 byte aren't stored as a single character in the QString, but as 1 character per byte. For example, a "ñ" in Python would be stored as "Ã±" in the QString.
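The symptom can be reproduced outside Kross. The following is a minimal sketch in Python 3 (not the Python 2 used by the Kross tests), assuming the receiving side turns every byte into one Latin-1 character, which is what the "Ã±" output suggests. The helper name is hypothetical and not part of Kross.

```python
def read_one_char_per_byte(text: str) -> str:
    """Hypothetical helper: encode as UTF-8, then misread each byte as one Latin-1 character."""
    return text.encode("utf-8").decode("latin-1")


original = "ñ"
garbled = read_one_char_per_byte(original)

print(len(original), len(garbled))  # 1 2
print(garbled)                      # Ã±
```

The single character "ñ" is the two bytes 0xC3 0xB1 in UTF-8, and those two bytes read individually are "Ã" and "±".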

This happens not only with hardcoded strings, but also with UTF-8 strings passed from a QString to Python and back again. That means that translated strings obtained from Kross::TranslationModule (http://api.kde.org/4.x-api/kdelibs-apidocs/kross/html/classKross_1_1TranslationModule.html) in the scripts aren't correctly converted to QString.

Note, however, that UTF-8 QStrings are correctly passed to Python. The problem only appears when passing a UTF-8 string from Python to C++.

In the next comments I'm going to post a unit test showing the problem and a workaround for it.

Oh, and although I suppose that it is not related, at least on my system the test in kdelibs/kross/tests/unittest.py:65 fails:
self.assert_( self.object1.func_qstring_qstring(unicode("abcdef")) == "abcdef" )

Printing self.object1.func_qstring_qstring(unicode("abcdef")) I saw that it only printed the first two characters, so I debugged it and found in kdebindings/python/krosspython/pythonvariant.h:225 that the string is set with s.setUtf16( (quint16*)t, sizeof(t) / 4 ), where t is a Py_UNICODE*. sizeof(t) is the size of the pointer, not the length of the string, and it seems to be always 8, so only the first two characters in the string are taken into account. I think that it is not related to the other bug, and I suppose that this happens on other systems too, not only on mine, but you can judge better than me :)
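To illustrate why the count does not depend on the string, here is a small sketch in Python 3 using ctypes to query the platform's pointer size. It only models the arithmetic of sizeof(t) / 4; the function name is hypothetical, and the slicing assumes each copied unit is one character.

```python
import ctypes

# sizeof applied to a pointer measures the pointer itself, not the data it points to.
POINTER_SIZE = ctypes.sizeof(ctypes.c_void_p)  # platform dependent


def units_copied(text: str) -> int:
    """Hypothetical model of the count given by sizeof(t) / 4; text is deliberately ignored."""
    return POINTER_SIZE // 4


for s in ("abcdef", "a much longer string"):
    n = units_copied(s)
    print(repr(s), "->", repr(s[:n]))
```

With 8-byte pointers the count is 2 for every input, which matches the truncated output described above.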

Comment 1 Daniel Calviño Sánchez 2009-10-01 00:40:29 UTC

Created attachment 37271
Patch for unittest.py showing the bug

Here is a unit test showing the problem: the file is encoded in UTF-8, and so are the strings, but the string returned by the method differs from the Python one.

However, as already said, the problem does not happen when the string is returned by the method and thus converted from QString to Python string, but when it is passed to the method and converted from Python string to QString.

Comment 2 Daniel Calviño Sánchez 2009-10-01 00:42:47 UTC

Created attachment 37272
Patch for pythonvariant.h with a workaround to the bug

Here is a workaround for the bug. I am not familiar with the Kross code, so there may be a better way to do it. The problem was that Py::String(obj).as_string().c_str() returned the const char* data of a std::string, so each multibyte UTF-8 character was split across several C++ chars. As toVariant(const Py::Object& obj) (the method this statement belongs to) returns a QString, the const char* was implicitly converted to QString through the QString(const char * str) constructor, which expects an ASCII string and thus ignores that there are multibyte characters.

In order to initialize a QString from a UTF-8 const char* taking into account multibyte characters, static method QString::fromUtf8(const char*) has to be used. As 7 bit characters in UTF-8 are the same as in ASCII, this keeps backwards compatibility with already existing code that used ASCII.
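The difference between the two conversions can be sketched in Python 3. This is only an analogy, assuming that a Latin-1 decode stands in for the old one-character-per-byte conversion and a UTF-8 decode stands in for QString::fromUtf8; it does not call Qt.

```python
raw = "España".encode("utf-8")  # the bytes a const char* would point to

as_latin1 = raw.decode("latin-1")  # one character per byte (stand-in for the old conversion)
as_utf8 = raw.decode("utf-8")      # multibyte aware (stand-in for QString::fromUtf8)

print(as_latin1)  # EspaÃ±a
print(as_utf8)    # España

# Pure 7-bit ASCII input decodes identically either way,
# which is why existing ASCII-only callers are unaffected.
ascii_raw = b"abcdef"
print(ascii_raw.decode("latin-1") == ascii_raw.decode("utf-8"))  # True
```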

Of course, when an encoding other than UTF-8 is used in Python strings, the same problem of badly encoded QStrings arises again, although personally I think that it is pretty safe to assume that the majority of strings will be encoded in UTF-8.
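A rough sketch of that remaining limitation, again in Python 3 and only as an analogy: bytes produced with Latin-1 are not valid UTF-8, so a UTF-8 decode cannot recover the character. The errors="replace" mode is an assumption chosen to keep the example running; how Qt itself reports invalid input is not shown here.

```python
latin1_bytes = "ñ".encode("latin-1")  # a single byte, 0xF1

# Decoding those bytes as UTF-8 cannot succeed; with errors="replace"
# Python substitutes U+FFFD (the replacement character) instead of raising.
decoded = latin1_bytes.decode("utf-8", errors="replace")

print(latin1_bytes)      # b'\xf1'
print(decoded == "ñ")    # False
```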

Comment 3 Sebastian Sauer 2009-10-03 17:55:46 UTC

SVN commit 1030960 by sebsauer:

fix UTF-8 python strings encoding
Patch by Daniel Calviño Sánchez
BUG:209046



 M  +1 -1      pythonvariant.h  


WebSVN link: http://websvn.kde.org/?view=rev&revision=1030960

Comment 4 Sebastian Sauer 2009-10-03 17:56:24 UTC

SVN commit 1030961 by sebsauer:

backport r1030960 from trunk to 4.3 branch;

fix UTF-8 python strings encoding
Patch by Daniel Calviño Sánchez
BUG:209046



 M  +1 -1      pythonvariant.h  


WebSVN link: http://websvn.kde.org/?view=rev&revision=1030961

Comment 5 Sebastian Sauer 2009-10-03 18:04:04 UTC

great work, thanks :)