Bug 377621

Summary:	Incorrect text symbols when seeing non ASCII file names inside ZIP file
Product:	[Applications] krusader	Reporter:	Rafael Linux User <rafael.linux.user>
Component:	krarc	Assignee:	Krusader Bugs Distribution List <krusader-bugs-null>
Status:	RESOLVED INTENTIONAL
Severity:	normal	CC:	alex.bikadorov, krusader-bugs-null, rafael.linux.user
Priority:	NOR
Version First Reported In:	2.5.0
Target Milestone:	---
Platform:	openSUSE
OS:	Linux
Latest Commit:		Version Fixed/Implemented In:
Sentry Crash Report:
Attachments:	Characters non ASCII not showed correctly

Description Rafael Linux User 2017-03-14 16:00:52 UTC

When I open a ZIP file containing file names with non-ASCII characters in Krusader, it shows a wrong character instead of the right one. The KDE archive tool "Ark" shows those file names correctly. If I extract the files, the names are created correctly.

Example ZIP:
https://goo.gl/g6pt8D
with file names containing non-ASCII characters, which Krusader shows as a question mark inside a diamond.

├── 01.- Recorte & Nitidez.jpg
├── 02.- Brillo.jpg
├── 03.- Contraste.jpg
├── 04.- Escala de Colores.jpg
├── 05.- Barras de Color de Calibración HD.jpg
├── 06.- Barras de Color de Calibración HD (Negro).jpg
├── 07.- Patrones de Verificación
│   ├── 01.- Croma 4-4-4 & 4-2-2.png
│   ├── 02.- Prueba de degradado.jpg
│   ├── 03.- 0-100%.jpg
│   ├── 04.- Widscreen.jpg
│   ├── 05.- Academia.jpg
│   ├── 06.- Panavisión.jpg
│   └── 07.- 4-3.jpg
├── 08.- Patrones Avanzados
│   ├── 01.- Patron de Color & Tinte.jpg
│   └── 02.- 75% Para televisores con C.M.S. activado.jpg
└── Información sobre Calibración HD.pdf

Comment 1 Rafael Linux User 2017-03-14 16:01:55 UTC

Created attachment 104563 [details]
Characters non ASCII not showed correctly

Comment 2 Alex Bikadorov 2017-03-14 17:55:16 UTC

Please upload the test zip to some service that does not require a Google account, e.g. here.

Comment 3 Rafael Linux User 2017-03-15 13:19:10 UTC

(In reply to Alex Bikadorov from comment #2)
> Please upload the test zip to some service that does not require a Google
> account, e.g. here.

Sorry, I tried, but it's 9 MiB in size, so I'm looking for another way to share it, but it will be late.

Comment 4 Christoph Feck 2017-03-15 17:00:49 UTC

You could also create a simple ZIP file with the same (broken) names, but 0 byte files. We don't need the 9 MiB data :)

Comment 5 Rafael Linux User 2017-03-15 17:08:38 UTC

Archive file (10MiB) at http://www.filedropper.com/fullhdcalibracionhd

Comment 6 Rafael Linux User 2017-03-15 17:13:51 UTC

(In reply to Christoph Feck from comment #4)
> You could also create a simple ZIP file with the same (broken) names, but 0
> byte files. We don't need the 9 MiB data :)

Sorry, I think from my own computer I could recreate that issue. In fact, when I tried to reduce the size (uncompressing it and deleting the PDF file inside) and then zipped it from Krusader again, I noticed I had TWO ARCHIVES WITH IDENTICAL FILE NAMES. So I investigated a little more and filed a bug (I think it is a bug) in the openSUSE bug tracker:
https://bugzilla.opensuse.org/show_bug.cgi?id=1029568

That's why I prefer to send you the original archive. My apologies.

Comment 7 Alex Bikadorov 2017-03-15 17:40:51 UTC

About having two archives with the same name: this is not a bug. It looks like the files have the same name but the characters are actually

Comment 8 Alex Bikadorov 2017-03-15 17:43:27 UTC

(damn) 

...different. This is due to the encoding with UTF-8.
Try this command in a shell:
> LC_ALL=C ls -1b

Comment 9 Rafael Linux User 2017-03-15 18:01:45 UTC

(In reply to Alex Bikadorov from comment #8)
> (damn) 
> 
> ...different. This is due to the  encoding with UTF-8.
> Try this command in a shell:
> > LC_ALL=C ls -1b

Thank you for sharing your knowledge, I hadn't tried LC_ALL or the "b" option of the "ls" command  ;)

Indeed, the result is:
Full\ HD\ \302\251Calibracio\314\201n\ HD.zip
Full\ HD\ \302\251Calibraci\303\263n\ HD.zip

But I think there should be some OS or file system rule to avoid this problem. Maybe it is a GDrive error (because I can see something weird in the file name when GDrive shows it in the browser page ...). But I don't want to give you a headache about that, because it is not related and I'm sure you are very, very busy.

THANK YOU FOR YOUR GOOD WORK  ;)

Comment 10 Alex Bikadorov 2017-03-15 18:51:49 UTC

The problem is that Unicode (encoded here as UTF-8) allows the same displayed letter to be represented by different byte sequences, see http://stackoverflow.com/a/6153713/6286694.
If different platforms (operating systems) decide on different encodings we can't 

To come back to the filenames *inside* the archive: The zip was created on another OS with an encoding that is not portable (but I couldn't find out which one). You will see the same characters when running "unzip -l", and the krarc protocol does exactly this: it runs the archive tool and parses the output.
So, if unzip can't handle this, krarc can't either.
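
To illustrate how a name stored in a legacy code page turns into the "question mark inside a diamond", here is a minimal Python sketch. It is not taken from krarc or unzip. CP437 is only an assumed example encoding, since the real encoding of the reported archive was never identified in this thread.

```python
# Assumption: the archive stored names in a legacy single-byte code page.
# CP437 is used purely as an example; the thread never identifies the real one.
raw_name = "Calibración".encode("cp437")

# Reading those bytes as UTF-8 fails on the accented letter ...
as_utf8 = raw_name.decode("utf-8", errors="replace")
# ... while decoding with the matching code page recovers the name.
as_cp437 = raw_name.decode("cp437")

print(raw_name)   # b'Calibraci\xa2n'
print(as_utf8)    # the accented letter becomes U+FFFD, the replacement character
print(as_cp437)   # Calibración
```

A tool that only parses the already-decoded output of another program has no access to the raw bytes any more, so it cannot retry with a different code page.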

The zip:/ protocol seems to work differently, but it will probably be impossible to fix this in krarc without changing the entire code.

The archive is actually to blame, so this is a "wontfix" for me. Of course, somebody else can spend more time on this and reopen if wanted.

Comment 11 Alex Bikadorov 2017-03-15 18:54:04 UTC

more to read:
http://unix.stackexchange.com/a/252000
https://marcosc.com/2008/12/zip-files-and-encoding-i-hate-you/
https://github.com/rubyzip/rubyzip/wiki/Files-with-non-ascii-filenames

Comment 12 Rafael Linux User 2017-03-16 11:52:11 UTC

Hi Alex

The articles you sent me are very interesting and show the complexity of the case. Anyway, IMHO, in these cases it's preferable to show a "weird" character (like the openSUSE terminal does) instead of letting the user think that "all is right".

I mean, I think Plasma should not visually represent "o'" as "ó", because they are really NOT the same character and it can't be obtained through natural typing on a keyboard. In fact, "áéíóú" "ÁÉÍÓÚ" (in our case "ó") are obtained by typing first "'" and then "o". So why is Plasma showing "\314\201n\" like "\303\263n\" (the second one is a "typable" character, the first one is not)?
This will cause problems even in Krusader or a KDE terminal, because both characters are shown as equal, but they are not.

Anyway, Krusader is in fact showing those characters as a pair "+?" or "-?", instead of how they are shown by a Linux terminal ("?"). The result is that Krusader doesn't let me extract any file from the ZIP archive  :(  so I think the status should be "SHOULDFIX"  ;)

Thank you

(In reply to Alex Bikadorov from comment #10)
> The problem is the UTF-8 encoding that allows the same shown letter to have
> different encodings, see http://stackoverflow.com/a/6153713/6286694.
> If different platforms (operating systems) decide different a encoding we
> can't 
> 
> To come back to the filenames *inside* the archive: The zip was created on
> another OS with an encoding that is not portable (but i couldn't find out
> which one). You will see the same characters when running "unzip -l" and the
> krarc protocol is doing exactly this: running the archive tool and parsing
> the output.
> So, if unzip can't handle this, krarc can't do this, too.
> 
> The zip:/ protocol seems to work differently but it will probably impossible
> to fix this in krarc without changing the entire code.
> 
> The archive is actually to blame, so this is a "wontfix" for me. Of course,
> somebody else can spend more time on this and reopen if wanted.

Comment 13 Alex Bikadorov 2017-03-16 12:43:18 UTC

Wait, you are mixing something up.

> I mean, I think Plasma should not represent visually "o'" like "ó", because
> really they are NOT the same character and can't get obtained thru a natural
> typing on a keyboard. In fact, "áéíóú" "ÁÉÍÓÚ" (in our case "ó") are
> obtained typing first "'" and after "o". So, why Plasma is showing
> "\314\201n\" like "\303\263n\" (this second one is a "typable" character,
> the first one not)?. 
> This will bring problems even in Krusader o a KDE terminal, cause both
> characters are showed like equals, but they are not.

This has nothing to do with zip archives; it is only about filename representation with UTF-8.

The character "ó" can be represented in two ways in Unicode, which appear in UTF-8 as "\303\263" or as "o\314\201". The first one is a single character 
>U+00F3	ó	0303 0263	LATIN SMALL LETTER O WITH ACUTE
and the second is the accent
>U+0301	́	0314 0201	COMBINING ACUTE ACCENT
which produces the same visible character when it follows a plain "o". Both are valid representations of "ó"; one application or library uses the first, another uses the second. Again: There is nothing wrong here.
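
The two representations described above can be reproduced with a short Python sketch. It is an illustration only, using the standard `unicodedata` module; the helper `octal_escape` is a hypothetical name that mimics the backslash-octal style of `ls -b` for non-ASCII bytes (it does not escape spaces).

```python
import unicodedata

# Precomposed form (NFC): a single code point, U+00F3.
composed = unicodedata.normalize("NFC", "o\u0301")
# Decomposed form (NFD): plain "o" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = unicodedata.normalize("NFD", "\u00f3")


def octal_escape(text: str) -> str:
    """Render non-ASCII UTF-8 bytes as backslash-octal escapes."""
    return "".join(
        chr(b) if b < 0x80 else "\\%03o" % b
        for b in text.encode("utf-8")
    )


print(octal_escape(composed))    # \303\263
print(octal_escape(decomposed))  # o\314\201
print(composed == decomposed)    # False: different code point sequences
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Two file names that differ only in this way look identical on screen but are distinct byte strings, which is why both archives could exist side by side in the same directory.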

(And you should close the bug report for OpenSuse.)

> Anyway, in fact, Krusader is showing that characters like a pair "+?" or
> "-?", instead like they are showed by a Linux terminal ("?"). Result is that
> Krusader don't let me extract any file of the ZIP archive  :(  so I think
> the status should be "SHOULDFIX"  ;)

This is another issue about the filename encoding IN a zip archive.

The point is that the archive was created with an invalid, non-standard, non-portable charset (not UTF-8). The KIO zip:/ protocol uses its own library (KArchive/KZip) and can compensate for this.
But Krusader uses the unzip tool. If unzip cannot read the archive correctly, Krusader can't either.
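
One way to check whether an archive declares its file names as UTF-8 is to look at general purpose bit 11 (0x800) of each entry, which the ZIP format uses for that purpose. The following is a rough sketch with Python's standard `zipfile` module, not the code used by KArchive or unzip; the constant name `UTF8_NAME_FLAG` and the in-memory test archive are assumptions made for illustration.

```python
import io
import zipfile

UTF8_NAME_FLAG = 0x800  # general purpose bit 11: file name is stored as UTF-8

# Build a small in-memory archive with a non-ASCII, zero-byte entry.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("Información.txt", b"")

# Read it back and report whether each entry declares a UTF-8 name.
with zipfile.ZipFile(buffer) as archive:
    utf8_flags = {
        info.filename: bool(info.flag_bits & UTF8_NAME_FLAG)
        for info in archive.infolist()
    }

print(utf8_flags)
```

An entry without this flag gives a reader no reliable hint about the code page that was used, so any decoding is a guess.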

You can also create a correct archive with the very same filenames and everything works correctly. This proves that there is no bug here. (You can blame the creator of the archive.)

Comment 14 Rafael Linux User 2017-03-19 15:48:27 UTC

I wish I could blame him, but there is no way to contact him. Anyway, it's the first time in a year that this has happened. And, as you explained, it is always related to zip archives (I HATE them).

As you suggested, I will close the openSUSE bug right after this message.

Anyway, thank you for ALL THE DETAILED information about this issue.

Regards  ;)