Bug 382362

Summary: Locale database mismatch between Windows and Linux
Product: [Applications] digikam
Component: Database-Engine
Status: RESOLVED FIXED
Severity: normal
Priority: NOR
Version: 7.4.0
Target Milestone: ---
Platform: Microsoft Windows
OS: Microsoft Windows
Reporter: Thomas Debesse <dev>
Assignee: Digikam Developers <digikam-bugs-null>
CC: caulier.gilles, kde, manuel, matt.z, schultzter
Latest Commit:
Version Fixed In: 8.0.0
Sentry Crash Report:
Attachments: digiKam complaining Linux's locale “UTF-8” mismatch Windows' “System” one

Description Thomas Debesse 2017-07-15 06:52:10 UTC
Created attachment 106638
digiKam complaining Linux's locale “UTF-8” mismatch Windows' “System” one

I'm hosting a database for some digiKam users. Current digiKam users are running digiKam 5.6.0 on Windows because they also use some special Windows-only tools that are unavailable on Linux. So, the database has never seen any Linux client yet and was entirely populated by Windows digiKam instances.

It would be possible in the future to have digiKam users running Linux, so I tried to connect a digiKam 5.6.0 (AppImage build) on Debian Stretch. But once I entered the MySQL database credentials, digiKam raised a pop-up window telling me that my “locale changed since this album was last opened”, and that my current locale is UTF-8 while the database one is “System” (probably an internal Windows name). See the attached screenshot.

So, as a precaution, I canceled the configuration and haven't opened the database from Linux.

Do you think there is a risk of corruption when opening, from Linux, a MySQL digiKam database created by digiKam running on Windows 10?

Also, even if there were no risk to the database content itself, it would be very annoying if a Linux user got the pop-up warning every time a Windows user had opened the database before them, and a Windows user got it every time after a Linux user.

It looks like the databases are currently using the “latin1_swedish_ci” collation for everything including file paths and tags, something that is not making me happy because I wonder if digiKam would handle correctly tags like “Lætitia « mémé » Çitöl” (including the non-breakable thin space you wouldn't notice). But that's probably another issue.
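As a rough illustration of that concern (plain Python, not digiKam code), the sketch below lists which characters of such a tag cannot be stored in a single-byte Western charset. It assumes the thin space is U+202F (NARROW NO-BREAK SPACE), and it uses Python's cp1252 codec as an approximation of MySQL's latin1 character set. The helper name `unencodable_chars` is hypothetical.

```python
# Sketch: find the characters of a tag that a given charset cannot store.
# Assumption: the "non-breakable thin space" is U+202F (NARROW NO-BREAK SPACE).
# Assumption: Python's cp1252 codec approximates MySQL's latin1 character set.

def unencodable_chars(text, encoding):
    """Return the characters of `text` that cannot be encoded in `encoding`."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad

# The thin spaces are written as escapes so that they stay visible here.
TAG = "Lætitia «\u202fmémé\u202f» Çitöl"

if __name__ == "__main__":
    for enc in ("cp1252", "utf-8"):
        missing = unencodable_chars(TAG, enc)
        print(enc, [f"U+{ord(c):04X}" for c in missing])
```

Under these assumptions the accented letters and the guillemets fit the single-byte charset, and only the two thin spaces do not; UTF-8 can store the whole tag.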
Comment 1 Thomas Debesse 2017-07-15 07:25:00 UTC
See bug #375710 for a similar bug on the Kdenlive side.
Comment 2 matt.z 2018-04-21 16:48:04 UTC
I am having a similar issue between my Win10 and a Mac.  Using MariaDB.  Each time digikam is opened it resets the locale based on the computer, Win10="System" and Mac="UTF-8".  Using US English on both.  I can find no way to switch Win10 to UTF-8 (which I think would be the best route).  This conflict causes digikam to become unresponsive or fail to load a few times when computers simultaneously access the database.

Is there a way / can we request to "fake" the locale that gets reported to Digikam so it sees "UTF-8" instead of the undescriptive Win10 "System" locale?
Thank you,
Comment 3 Nick Cross 2020-07-29 10:35:43 UTC
This is happening for me on my Fedora 32 system and Windows 10 system. Linux is en_GB.UTF-8.
Comment 4 manuel 2021-04-02 20:38:42 UTC
I have the same problem. I use one digikam instance on a MacOS machine and a second digikam instance on a Windows machine, both accessing the same MySQL database. Each time I switch to the other system, I get that warning about differing locales.

Fortunately, I was able to change one line in the code to force it to always use UTF-8 for the DB and built my own Windows version of digikam. This is, of course, not the correct approach, but at least I got rid of that warning. And it is working for me because I use Latin characters only anyway, no special characters like German umlauts, in file/folder names. So there are no problems to be expected (for me).

This issue should really be addressed. IMHO, it is not an uncommon case to access a digikam database from different platforms.
Comment 5 Nick Cross 2021-04-03 08:47:45 UTC
Doesn't UTF-8 cover umlauts (etc.) anyway? Maybe it's because in my case it's defaulting to the Windows codeset, which is a subset of the Linux one? So do we need an option to force the codeset in use?
Comment 6 manuel 2021-04-06 07:47:39 UTC
Yes, UTF-8 covers umlauts and of course any other unicode character.

I took a closer look at the code. Since digikam uses Qt, it already uses Unicode strings. Digikam’s database schema for MySQL is hardcoded to UTF-8. For the data that comes into the app, the system’s character encoding is used. So any "incoming" data is turned into Unicode strings and stored using UTF-8 encoding in the database. (Correct me if I am wrong - I am a programmer, but not using cpp at all :) )

What digikam now does is that it stores the name of the system’s character encoding in the database. During startup it compares the current encoding to the one stored in the database and shows a warning in case they differ.
On MacOS a real encoding name like "UTF-8" is detected, while on Windows only "System" is returned (see: https://doc.qt.io/qt-5/qtextcodec.html#codecForLocale ) and MultiByteToWideChar/WideCharToMultiByte is used (see https://wiki.qt.io/QtTextCodec).
Although the warning informs about a potential problem, it might be totally fine to use different encodings across multiple systems as long as only characters are used which are supported by both. The warning message actually says something similar. In the end, strings are stored as UTF-8 in the MySQL database.
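To make the effect concrete, here is a minimal Python sketch of the check as described above. It is not digiKam's actual code: the function name `locale_changed`, the dict standing in for the database settings table, and the `"Locale"` key are all assumptions for illustration.

```python
# Sketch of the startup comparison described above.
# Hypothetical names throughout; `settings` stands in for the shared database.

def locale_changed(settings, current_codec_name):
    """Return True if a warning would be shown for this start-up.

    The stored name is updated afterwards, which models the user accepting
    the dialog and continuing to use the database.
    """
    stored = settings.get("Locale")   # assumed settings key
    settings["Locale"] = current_codec_name
    if stored is None:
        return False                  # first start: nothing to compare against
    return stored != current_codec_name

if __name__ == "__main__":
    shared_db = {}
    sessions = [("Windows", "System"), ("macOS", "UTF-8"), ("Windows", "System")]
    for host, codec in sessions:
        print(host, "warns:", locale_changed(shared_db, codec))
```

Because the comparison is on the codec name only, alternating between a client that reports "System" and one that reports "UTF-8" triggers the warning on every switch, even when both clients would store identical strings.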

And this is probably the use case people usually have when accessing digikam’s MySQL database from multiple systems: they use different systems with different encodings (like when using digikam on Windows and MacOS), but still use the same characters, because both systems are probably run using the same language, let’s say German where umlauts are frequently used. It doesn’t matter that both systems use different encodings as long as the used characters are supported by both encodings.
A problem can only arise if a user uses two systems with two different encodings and then uses characters that cannot be encoded by one of the encodings. For example, one system supports only pure ASCII and on the second system Chinese characters are used.

So, what we actually need is a "don’t show this message again - I know what I am doing" checkbox for the warning message box. Forcing some encoding is IMHO unnecessary.

For me, I just removed that part showing the message box and compiled my own version, because I know that I will run digikam on different platforms using different encodings, but I also know that I will stick to the same language and use the same characters on all systems. So I am happy now, but I would be glad to see some official effort here :)
Comment 7 Nick Cross 2021-04-06 08:02:55 UTC
Interestingly, Windows 10 has recently introduced an option "Beta: Use Unicode UTF-8 for worldwide language support", although I haven't tried it. If both systems are then correctly using UTF-8 and reporting the same codepage, it would seem to solve it.

https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows
Comment 8 caulier.gilles 2021-04-06 08:36:22 UTC
If I remember correctly, Windows has historically used UTF-16, not UTF-8, for in-memory character encoding...

https://stackoverflow.com/questions/13499920/what-unicode-encoding-utf-8-utf-16-other-does-windows-use-for-its-unicode-da

https://en.wikipedia.org/wiki/UTF-16#Usage

Of course, UTF-8 is sufficient to encode any string in the world, and it has the advantage of being backward compatible with the legacy ASCII encoding (which UTF-16 is not). This is why Linux supports UTF-8 instead of UTF-16.

Gilles Caulier
Comment 9 manuel 2021-04-06 09:54:01 UTC
If Windows used UTF-8, it would indeed solve the issue. But only if Qt recognizes that. digikam just gets the QTextCodec of the current locale (QTextCodec::codecForLocale()) and then uses its name as the encoding name. So Qt would need to return the UTF-8 QTextCodec on Windows systems instead of the "System" one when Windows is set up to UTF-8. Hopefully that's the case then ;)

But what I was trying to say:
For now, if it is just you who has access to your database and all your systems support the same characters (for example, all your operating systems are set up to the same language), then it doesn't matter what encoding each system uses. Qt should take care of correctly decoding, e.g., file and folder names, according to the system's character encoding. The results are QStrings (Unicode strings), which are then used by digikam and saved using UTF-8 encoding in the database.
Let's say you have file/folder names containing special characters. When they are retrieved from the database, the UTF-8 strings are turned back into Unicode QStrings. And when accessing the file, Qt can use the system's character encoding to get the correctly encoded path string for the current system.
At least that is my understanding from looking at the code and Qt docs. But again, I am not a cpp or Qt developer ;) I am a little bit unsure especially about the database part: I see that the table schemas are fixed to UTF-8, but I do not know if Qt's MySQL driver uses that information when encoding/decoding QStrings (but I would expect it, and in my case everything works fine so far :) ) Can somebody confirm this?
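The flow described above can be sketched in Python, with `str` playing the role of QString. This only mirrors the understanding stated in this comment, not verified Qt or digikam behaviour. The function names are hypothetical, and cp1252 and utf-8 are assumed stand-ins for a Windows and a macOS/Linux system encoding.

```python
# Sketch of the described string flow: system encoding -> Unicode -> UTF-8 in
# the database -> Unicode -> system encoding of whichever client reads it.
# Assumption: cp1252 stands in for a Windows system encoding, utf-8 for macOS/Linux.

def to_database(native_name, system_encoding):
    """Decode a name from the system encoding and return it as UTF-8 bytes."""
    text = native_name.decode(system_encoding)   # "QString" equivalent
    return text.encode("utf-8")

def from_database(stored, system_encoding):
    """Decode UTF-8 bytes from the database and re-encode for the local system."""
    text = stored.decode("utf-8")
    return text.encode(system_encoding)

if __name__ == "__main__":
    # A name with an umlaut written on the "Windows" side...
    stored = to_database("Übersicht".encode("cp1252"), "cp1252")
    # ...is read back unchanged on a UTF-8 system.
    print(from_database(stored, "utf-8").decode("utf-8"))

    # A name the local encoding cannot represent fails at the last step only.
    try:
        from_database("照片".encode("utf-8"), "ascii")
    except UnicodeEncodeError:
        print("name not representable in the local system encoding")
```

In this model the stored bytes never depend on which client wrote them; only the final re-encoding step can fail, and only for characters the local encoding lacks.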

However, if you have a database with a lot of different users using very different systems and potentially completely different character encodings, then of course, problems have to be expected. That is why I suggest a "do not show again" checkbox for the message box ;) It would be easy to implement.
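A minimal sketch of how such a checkbox could gate the warning, in Python rather than digikam's C++. Every name here (`should_warn`, `on_dialog_closed`, the `"Locale"` and `"SuppressLocaleWarning"` keys) is hypothetical, and the two dicts are assumed stand-ins for the shared database settings and a per-client configuration file.

```python
# Sketch of a "do not show again" option for the locale warning.
# Hypothetical names; `db_settings` models the shared database,
# `local_config` models a configuration file private to one client.

def should_warn(db_settings, local_config, current_codec_name):
    """Warn only if the stored name differs and this client has not opted out."""
    if local_config.get("SuppressLocaleWarning", False):
        return False
    stored = db_settings.get("Locale")
    return stored is not None and stored != current_codec_name

def on_dialog_closed(db_settings, local_config, current_codec_name, dont_show_again):
    """Record the current codec name and, if ticked, the opt-out."""
    db_settings["Locale"] = current_codec_name
    if dont_show_again:
        local_config["SuppressLocaleWarning"] = True

if __name__ == "__main__":
    db, mac_cfg = {"Locale": "System"}, {}
    print(should_warn(db, mac_cfg, "UTF-8"))           # warning shown once
    on_dialog_closed(db, mac_cfg, "UTF-8", dont_show_again=True)
    db["Locale"] = "System"                            # a Windows client used the DB
    print(should_warn(db, mac_cfg, "UTF-8"))           # suppressed from now on
```

Keeping the opt-out in the local configuration rather than in the shared database is a deliberate choice in this sketch: other users of the same database would still see the warning until they dismiss it themselves.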
Comment 10 Maik Qualmann 2022-06-06 12:03:27 UTC
Git commit 31c2aeed866d2eac1a811e74291af7d5d5c550fc by Maik Qualmann.
Committed on 06/06/2022 at 12:02.
Pushed by mqualmann into branch 'master'.

we now check the database character set for changes
FIXED-IN: 8.0.0

M  +2    -1    NEWS
M  +10   -57   core/libs/album/manager/albummanager_database.cpp
M  +23   -0    core/libs/database/coredb/coredb.cpp
M  +7    -0    core/libs/database/coredb/coredb.h

https://invent.kde.org/graphics/digikam/commit/31c2aeed866d2eac1a811e74291af7d5d5c550fc