Bug 377621 - Incorrect text symbols when seeing non ASCII file names inside ZIP file
Status: RESOLVED INTENTIONAL
Alias: None
Product: krusader
Classification: Applications
Component: krarc
Version: 2.5.0
Platform: openSUSE Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Krusader Bugs Distribution List
Reported: 2017-03-14 16:00 UTC by Rafael Linux User
Modified: 2018-05-06 00:16 UTC

Attachments
Characters non ASCII not showed correctly (30.72 KB, image/png)
2017-03-14 16:01 UTC, Rafael Linux User

Description Rafael Linux User 2017-03-14 16:00:52 UTC
When I open a ZIP file that contains file names with non-ASCII characters in Krusader, it shows a wrong character instead of the right one. The KDE archive tool "Ark" shows those file names correctly. If I extract the files, the names are created correctly.

Example ZIP:
https://goo.gl/g6pt8D
It contains file names with non-ASCII characters, which Krusader shows as a question mark inside a diamond.

├── 01.- Recorte & Nitidez.jpg
├── 02.- Brillo.jpg
├── 03.- Contraste.jpg
├── 04.- Escala de Colores.jpg
├── 05.- Barras de Color de Calibración HD.jpg
├── 06.- Barras de Color de Calibración HD (Negro).jpg
├── 07.- Patrones de Verificación
│   ├── 01.- Croma 4-4-4 & 4-2-2.png
│   ├── 02.- Prueba de degradado.jpg
│   ├── 03.- 0-100%.jpg
│   ├── 04.- Widscreen.jpg
│   ├── 05.- Academia.jpg
│   ├── 06.- Panavisión.jpg
│   └── 07.- 4-3.jpg
├── 08.- Patrones Avanzados
│   ├── 01.- Patron de Color & Tinte.jpg
│   └── 02.- 75% Para televisores con C.M.S. activado.jpg
└── Información sobre Calibración HD.pdf
Comment 1 Rafael Linux User 2017-03-14 16:01:55 UTC
Created attachment 104563
Characters non ASCII not showed correctly
Comment 2 Alex Bikadorov 2017-03-14 17:55:16 UTC
Please upload the test zip to some service that does not require a Google account, e.g. here.
Comment 3 Rafael Linux User 2017-03-15 13:19:10 UTC
(In reply to Alex Bikadorov from comment #2)
> Please upload the test zip to some service that does not require a Google
> account, e.g. here.

Sorry, I tried, but it is 9 MiB in size, so I'm looking for another way to share it; it will take a while.
Comment 4 Christoph Feck 2017-03-15 17:00:49 UTC
You could also create a simple ZIP file with the same (broken) names, but 0 byte files. We don't need the 9 MiB data :)
Comment 5 Rafael Linux User 2017-03-15 17:08:38 UTC
Archive file (10MiB) at http://www.filedropper.com/fullhdcalibracionhd
Comment 6 Rafael Linux User 2017-03-15 17:13:51 UTC
(In reply to Christoph Feck from comment #4)
> You could also create a simple ZIP file with the same (broken) names, but 0
> byte files. We don't need the 9 MiB data :)

Sorry, I think from my own computer I could recreate that issue. In fact, when I tried to reduce the size (uncompressing, deleting the PDF file inside, and zipping again from Krusader), I noticed I had TWO ARCHIVES WITH IDENTICAL FILE NAMES. So I investigated a little more and filed a bug (I think it is a bug) in the openSUSE Bugzilla:
https://bugzilla.opensuse.org/show_bug.cgi?id=1029568

That's why I prefer to send you the original archive. My apologies.
Comment 7 Alex Bikadorov 2017-03-15 17:40:51 UTC
About having two archives with the same name: this is not a bug. It looks like the files have the same name but the characters are actually
Comment 8 Alex Bikadorov 2017-03-15 17:43:27 UTC
(damn) 

...different. This is due to the encoding with UTF-8.
Try this command in a shell:
> LC_ALL=C ls -1b
Comment 9 Rafael Linux User 2017-03-15 18:01:45 UTC
(In reply to Alex Bikadorov from comment #8)
> (damn) 
> 
> ...different. This is due to the  encoding with UTF-8.
> Try this command in a shell:
> > LC_ALL=C ls -1b

Thank you for sharing your knowledge, I hadn't tried LC_ALL or the "b" option of the "ls" command  ;)

Indeed, the result is:
Full\ HD\ \302\251Calibracio\314\201n\ HD.zip
Full\ HD\ \302\251Calibraci\303\263n\ HD.zip
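
The same check can be scripted. The following is only a sketch: the helper names `escaped` and `find_lookalikes` are made up for illustration, and it assumes the file names are valid UTF-8. It prints octal escapes similar to `ls -b` (spaces are left unescaped) and groups names that differ only in how the accented letter is composed:

```python
import unicodedata
from collections import defaultdict


def escaped(name):
    """Show non-ASCII bytes as octal escapes, similar to `ls -b` (hypothetical helper)."""
    return "".join(
        chr(b) if 0x20 <= b < 0x7F else f"\\{b:03o}"
        for b in name.encode("utf-8")
    )


def find_lookalikes(names):
    """Group names that become identical after NFC normalization (hypothetical helper)."""
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize("NFC", name)].append(name)
    # Only groups with more than one member contain look-alike names.
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    import os

    for group in find_lookalikes(os.listdir(".")):
        for name in group:
            print(escaped(name))
```

Run in the directory that holds the two archives, it should list both names with their differing byte sequences.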

But I think there should be some OS or filesystem rule to avoid this problem. Maybe it is a GDrive error (because I can see something weird in the file name when GDrive shows it in the browser page ...). But I don't want to give you a headache about that, because it is not related and I'm sure you are very, very busy.

THANK YOU FOR YOUR GOOD WORK  ;)
Comment 10 Alex Bikadorov 2017-03-15 18:51:49 UTC
The problem is that Unicode (and therefore UTF-8) allows the same displayed letter to be represented by different code point sequences, see http://stackoverflow.com/a/6153713/6286694.
If different platforms (operating systems) decide on a different encoding we can't 

To come back to the filenames *inside* the archive: The zip was created on another OS with an encoding that is not portable (but I couldn't find out which one). You will see the same characters when running "unzip -l" and the krarc protocol is doing exactly this: running the archive tool and parsing the output.
So, if unzip can't handle this, krarc can't either.

The zip:/ protocol seems to work differently, but it will probably be impossible to fix this in krarc without changing the entire code.

The archive is actually to blame, so this is a "wontfix" for me. Of course, somebody else can spend more time on this and reopen if wanted.
Comment 12 Rafael Linux User 2017-03-16 11:52:11 UTC
Hi Alex

The articles you sent me are very interesting and show the complexity of the case. Anyway, IMHO, in these cases it is preferable to show a "weird" character (as the openSUSE terminal does) instead of letting the user think that "all is right".

I mean, I think Plasma should not visually represent "o'" as "ó", because they really are NOT the same character and cannot be obtained through natural typing on a keyboard. In fact, "áéíóú" "ÁÉÍÓÚ" (in our case "ó") are obtained by typing first "'" and then "o". So why is Plasma showing "\314\201n\" like "\303\263n\" (the second one is a "typable" character, the first one is not)?
This will bring problems even in Krusader or a KDE terminal, because both characters are shown as equal, but they are not.

Anyway, in fact, Krusader shows those characters as a pair "+?" or "-?", instead of how a Linux terminal shows them ("?"). The result is that Krusader doesn't let me extract any file from the ZIP archive  :(  so I think the status should be "SHOULDFIX"  ;)

Thank you

Comment 13 Alex Bikadorov 2017-03-16 12:43:18 UTC
Wait, you are mixing something up.

> I mean, I think Plasma should not represent visually "o'" like "ó", because
> really they are NOT the same character and can't get obtained thru a natural
> typing on a keyboard. In fact, "áéíóú" "ÁÉÍÓÚ" (in our case "ó") are
> obtained typing first "'" and after "o". So, why Plasma is showing
> "\314\201n\" like "\303\263n\" (this second one is a "typable" character,
> the first one not)?. 
> This will bring problems even in Krusader o a KDE terminal, cause both
> characters are showed like equals, but they are not.

This has nothing to do with zip archives; it is only about filename representation in UTF-8.

The character "ó" can have multiple representations in UTF-8, namely "\303\263" and "o\314\201". The first one is a single precomposed character 
>U+00F3	ó	0303 0263	LATIN SMALL LETTER O WITH ACUTE
and the second is a plain "o" followed by the accent
>U+0301	́	0314 0201	COMBINING ACUTE ACCENT
Both are valid representations of "ó"; one application/library uses the first, another the second. Again: There is nothing wrong here.
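
To see the two representations side by side, here is a small sketch using only Python's standard `unicodedata` module. The helper `describe` is a made-up name for illustration; the bytes it prints are in hex, so `c3 b3` corresponds to the octal `\303\263` and `cc 81` to `\314\201`:

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name, UTF-8 bytes) per character (hypothetical helper)."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8"))
        for ch in text
    ]


precomposed = "\u00f3"   # single code point
decomposed = "o\u0301"   # plain "o" followed by the combining accent

if __name__ == "__main__":
    for label, text in (("precomposed", precomposed), ("decomposed", decomposed)):
        print(label)
        for code_point, name, raw in describe(text):
            print(f"  {code_point}  {name}  {raw.hex(' ')}")
    # Normalization converts one form into the other.
    print(unicodedata.normalize("NFC", decomposed) == precomposed)
    print(unicodedata.normalize("NFD", precomposed) == decomposed)
```

Both strings render as the same letter, but a byte-wise comparison, which is what a Linux filesystem typically performs on file names, treats them as different.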

(And you should close the bug report for OpenSuse.)

> Anyway, in fact, Krusader is showing that characters like a pair "+?" or
> "-?", instead like they are showed by a Linux terminal ("?"). Result is that
> Krusader don't let me extract any file of the ZIP archive  :(  so I think
> the status should be "SHOULDFIX"  ;)

This is another issue about the filename encoding IN a zip archive.

The point is that the archive was created with an invalid, non-standard, non-portable charset (not UTF-8). The KIO zip:/ protocol uses its own library (KArchive/KZip) and can compensate for this.
But Krusader is using the unzip tool. If unzip cannot correctly read the archive, Krusader can't either.
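
As an illustration of what "compensating" can look like, here is a hedged sketch in Python. It is not how KArchive or krarc work. It relies on two documented facts: the ZIP format marks UTF-8 names with general-purpose flag bit 11 (0x800), and Python's `zipfile` decodes unflagged names as CP437. The helper names and the `legacy_encoding` parameter are assumptions, and the real encoding of the archive in this report is unknown:

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: file name is stored as UTF-8


def repair_name(name, flag_bits, legacy_encoding):
    """Re-decode a name that zipfile read as CP437 (hypothetical helper)."""
    if flag_bits & UTF8_FLAG:
        return name  # already decoded as UTF-8, nothing to repair
    try:
        # Undo the CP437 decoding to get the raw bytes back, then decode
        # them with the encoding the archive creator presumably used.
        return name.encode("cp437").decode(legacy_encoding)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return name  # the guess was wrong; keep the name as read


def list_names(archive, legacy_encoding="utf-8"):
    """List member names, repairing those without the UTF-8 flag (hypothetical helper)."""
    with zipfile.ZipFile(archive) as zf:
        return [
            repair_name(info.filename, info.flag_bits, legacy_encoding)
            for info in zf.infolist()
        ]
```

A call such as `list_names("archive.zip", legacy_encoding="cp1252")` would try one candidate encoding; since the creator's charset is a guess, the function falls back to the unmodified name when the guess does not decode.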

You can also create a correct archive with the very same filenames and everything works correctly. This proves that there is no bug here. You can blame the creator of the archive.
Comment 14 Rafael Linux User 2017-03-19 15:48:27 UTC
I wish I could blame him, but there is no way to contact him. Anyway, it is the first time in a year that this has happened again. And, as you explained, it is always related to zip archives (I HATE them).

As you suggested, I will close the openSUSE bug right after this message.

Anyway, thank you for ALL THE DETAILED information about this issue.

Regards  ;)