Bug 378904 - Ark should use charset auto-detection for filenames
Summary: Ark should use charset auto-detection for filenames
Status: REPORTED
Alias: None
Product: ark
Classification: Applications
Component: general
Version: unspecified
Platform: unspecified Linux
Importance: NOR wishlist
Target Milestone: ---
Assignee: Elvis Angelaccio
URL:
Keywords:
Duplicates: 324978
Depends on:
Blocks:
 
Reported: 2017-04-18 05:48 UTC by Nicolas F.
Modified: 2024-03-02 09:18 UTC
CC List: 10 users

See Also:
Latest Commit:
Version Fixed In:


Attachments
MARU164.zip opened using KEncodingProber (138.47 KB, image/png)
2019-07-16 19:22 UTC, Ragnar Thomsen

Description Nicolas F. 2017-04-18 05:48:38 UTC
ZIP archives have no standardised encoding for filenames and do not store the charset information in the archive itself. As a result, opening a ZIP created by, say, a Japanese Windows user in Ark yields garbled filenames, because Ark expects a different encoding than the one that was actually used.

This can be worked around by using a library such as uchardet, which can guess the encoding used from the raw filename bytes. Allowing users to manually specify which encoding should be tried would also be a helpful addition.

A file demonstrating this issue can be found here: http://maltinerecords.cs8.biz/release/164/MARU164.zip
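
For illustration, here is a minimal sketch (not Ark code) of asking uchardet to guess the encoding of a raw filename byte string. The sample bytes are a hypothetical Shift_JIS name, not taken from the archive above, and as later comments note, very short inputs may not be detected reliably.

// Sketch: guess the encoding of a raw (non-UTF-8) filename with uchardet.
#include <uchardet/uchardet.h>
#include <cstdio>
#include <cstring>

int main()
{
    const char raw[] = "\x93\xfa\x96\x7b\x8c\xea.txt"; // "日本語.txt" encoded in Shift_JIS
    uchardet_t detector = uchardet_new();
    if (uchardet_handle_data(detector, raw, strlen(raw)) == 0) {
        uchardet_data_end(detector);
        // Returns a charset name such as "SHIFT_JIS", or an empty string if unknown.
        printf("detected: %s\n", uchardet_get_charset(detector));
    }
    uchardet_delete(detector);
    return 0;
}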
Comment 1 sowieso 2018-09-28 07:10:46 UTC
I can confirm this behaviour; it's really annoying when working together with non-UTF-8 systems.
I tried other software, but it looks like none of them handle this correctly: unzip, file-roller, p7zip (sadly dead), and peazip (crashed) all failed. It seems there is currently no software for Linux that can do it.
Comment 2 Patrick Silva 2019-03-01 16:35:02 UTC
ark 18.12.2 has the same problem on Arch Linux.

Operating System: Arch Linux 
KDE Plasma Version: 5.15.2
KDE Frameworks Version: 5.55.0
Qt Version: 5.12.1
Comment 3 Alexander Trufanov 2019-04-22 11:33:22 UTC
As I found out, this is a very old problem rooted in the ZIP specification. A ZIP can contain non-UTF filenames, UTF-8 filenames, or (since 2007) non-UTF filenames with an additional field containing the UTF-8 filename. The same applies to the ZIP archive comment.

The problem is that, by design, the non-UTF charset is IBM 437, which does not support non-Western languages.
In practice, Windows encodes filenames with one of its DOS charsets (CP*); for Russian, for example, that is CP866 (IBM 866). And there is no field in ZIP to specify exactly which charset was used.
Even worse, by default many Windows archivers don't use UTF-8 but this DOS encoding.

As I understand it, the ZIP authors don't want to fix this and suggest that everyone switch to UTF-8 on non-English systems.

Developers proposed several patches, libraries, and tools to work around the problem a decade ago.

Maintainers of some Linux distributions also patch the zip/unzip tools in their systems to work around this. For example, here is the discussion about the unzip patch for Ubuntu: https://bugs.launchpad.net/ubuntu/+source/unzip/+bug/580961/

It ended up with a patch that was accepted into the Ubuntu main branch, but that took years.
I think this is a mirror of that patch:
https://github.com/zip-i18n/unzip/blob/master/debian/patches/20-unzip60-alt-iconv-utf8

As far as I can see from the code, they query the system locale and try to match it to a DOS charset using a hardcoded table, and additionally provide command-line arguments that let the user specify the filename encoding manually. I would say their predefined encoding list is rather small and oriented towards Russian speakers. Or perhaps that's the wrong patch.

Anyway, no GUI archiver has implemented anything like that yet.

I don't believe much in automatic encoding detection, at least unless one bets on all non-UTF encodings coming from Windows being CPxxx rather than Windows-12xx. Even for Russian there are 4-5 charsets, and some of them are very hard to distinguish without a dictionary or text statistics. So it can serve as a heuristic, but not as a 100% reliable method.

But I think Ark could do something like what Ubuntu's unzip has:

1. A small prebuilt table matching the current locale to the encoding expected from Windows-created ZIPs (like here: https://github.com/zip-i18n/unzip/blob/master/debian/patches/20-unzip60-alt-iconv-utf8#L36), on the assumption that the Linux and Windows users speak the same language (a sketch of such a table follows below).

2. Ark could copy the nice menu from Kate (Tools/Encoding) that lets the user switch, in the GUI, to one of the encodings available on their system, and use that choice to display filenames and archive comments as well as for I/O operations while extracting files. This would let the user find the proper charset and get the files extracted.
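
A minimal sketch of the table idea from point 1. The mapping entries, the helper name, and the fallback are illustrative only; they are not taken from Ark or from the unzip patch, and a real table would be much larger.

// Sketch of point 1: map the user's locale language to the OEM (DOS) codepage
// that Windows would likely have used when creating the ZIP.
// The returned names are meant to be passed to QTextCodec::codecForName().
#include <QByteArray>
#include <QLocale>

static QByteArray guessOemCodecForLocale()
{
    switch (QLocale::system().language()) {
    case QLocale::Russian:
        return QByteArrayLiteral("IBM866");     // DOS codepage used by Russian Windows
    case QLocale::Japanese:
        return QByteArrayLiteral("Shift_JIS");  // CP932 on Japanese Windows
    case QLocale::Chinese:
        return QByteArrayLiteral("GBK");        // CP936 on Simplified Chinese Windows
    default:
        return QByteArrayLiteral("IBM850");     // common Western OEM codepage as fallback
    }
}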
Comment 4 Zeno Endemann 2019-07-14 22:11:07 UTC
I've recently run into this problem as well. So I've looked at the Ark sources, but I don't see a good way to add this feature, both from the coding standpoint (no other archive plugin needs special open-time options, so there is understandably no infrastructure for them) and from the UI standpoint (there is no dialog when opening an archive via the main navigation, and adding one would be weird). I would very much understand if an intrusive code change were not acceptable just for this problem.

Thus I would propose the following solution: use auto-detection of the encoding (via KEncodingProber) by default (which should hopefully work for most people; it worked at least on my zip file with Japanese encoding), but also offer an override via a command-line switch or environment variable that would force a specific encoding for all zip files opened by the running Ark process. While not ideal, that would be good enough for me, and it would require only minimal changes and none to the UI.

There is one risk though: using encoding auto-detection could potentially introduce regressions for other users. Note, however, that some kind of encoding auto-detection is already in use (see the ZIP_FL_ENC_GUESS flag here: https://libzip.org/documentation/zip_name_locate.html), but apparently it does not work sufficiently well.
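
For reference, a minimal C-style sketch (not Ark's plugin code) of how libzip exposes the raw stored names versus its own guessed decoding; "archive.zip" is a placeholder path.

// Sketch: list entry names from a ZIP, raw bytes vs. libzip's guessed encoding.
#include <zip.h>
#include <stdio.h>

int main(void)
{
    int err = 0;
    zip_t *za = zip_open("archive.zip", ZIP_RDONLY, &err);
    if (!za)
        return 1;
    zip_int64_t n = zip_get_num_entries(za, 0);
    for (zip_uint64_t i = 0; i < (zip_uint64_t)n; ++i) {
        // ZIP_FL_ENC_RAW returns the undecoded bytes stored in the archive;
        // ZIP_FL_ENC_GUESS (the default) lets libzip guess the encoding.
        const char *raw     = zip_get_name(za, i, ZIP_FL_ENC_RAW);
        const char *guessed = zip_get_name(za, i, ZIP_FL_ENC_GUESS);
        if (raw && guessed)
            printf("%s | %s\n", raw, guessed);
    }
    zip_close(za);
    return 0;
}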

Anyway, if my suggested approach is acceptable, I could prepare a patch.
Comment 5 Ragnar Thomsen 2019-07-16 19:19:35 UTC
I tried using KEncodingProber with the libzip plugin to open the attached Japanese zip archive. It seems it could correctly detect the encoding for all the files (see attached screenshot), so this looks like a promising approach.
I also tried the uchardet library, but it detected ASCII encoding for all the files.

One concern is the overhead of probing the encoding of each archive entry. Opening the Linux kernel source in zip format took 106 seconds with probing vs. 5 seconds without, so there is significant overhead to this approach.
I think we either need to be smart and only probe when needed (I can't see how, though), or we add a menu item in the GUI to reload the archive with probing of filename encodings. If we could assume that all archive entries have the same encoding, we could probe only the first entry, but I don't think this assumption holds in real life; e.g. in the attached archive the first entry is detected as UTF-8 since it doesn't contain Japanese characters.
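
A minimal sketch of the approach being discussed: feed the raw names of only the first few entries to a single KEncodingProber, then decode every name with the resulting codec. The rawNames container, the limit of 30, and the confidence threshold are placeholders, not Ark code.

// Sketch: probe the encoding from the first few raw filenames, then convert all
// names to UTF-16 (QString). Falls back to UTF-8 if the guess looks unreliable.
#include <KEncodingProber>
#include <QByteArray>
#include <QString>
#include <QTextCodec>
#include <QVector>

QVector<QString> decodeNames(const QVector<QByteArray> &rawNames)
{
    KEncodingProber prober(KEncodingProber::Universal);
    const int limit = qMin(rawNames.size(), 30);
    for (int i = 0; i < limit; ++i)
        prober.feed(rawNames.at(i)); // cumulative: more data makes the guess more reliable

    QTextCodec *codec = QTextCodec::codecForName(prober.encoding());
    if (!codec || prober.confidence() < 0.5f)   // arbitrary threshold
        codec = QTextCodec::codecForName("UTF-8");

    QVector<QString> decoded;
    decoded.reserve(rawNames.size());
    for (const QByteArray &raw : rawNames)
        decoded.append(codec->toUnicode(raw));
    return decoded;
}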
Comment 6 Ragnar Thomsen 2019-07-16 19:22:09 UTC
Created attachment 121560 [details]
MARU164.zip opened using KEncodingProber
Comment 7 Zeno Endemann 2019-07-16 20:41:11 UTC
Right, probing should probably be limited to maybe the first 30 entries or so.

But as it is pretty clear that this auto-detection won't always work, I'd really like to have a manual override as well. On the other hand, I can definitely understand not wanting to make the UI more complex for this corner case, which only applies to the zip format and there only to files created by legacy software (does anyone actually know which programs produce these zip files?), so the best compromise I can come up with is a command-line flag (think "--libzip-plugin-force-char-encoding=SHIFT-JIS").

One more thing: in the zip spec there is a flag that, if set, requires the filenames in the zip file to be UTF-8. If that flag is set and we encounter an invalid UTF-8 string, that should probably be treated as an error. Unfortunately I haven't seen any way to get the value of the flag via the libzip API.
Comment 8 Zeno Endemann 2019-07-16 23:09:40 UTC
Oh, and in response to the sentence "If we could assume that all archive entries have the same encoding, we could only probe the first entry": it would not make any sense for a zip file to have multiple entries with different encodings, since no one would be able to decompress such a file reliably. So we don't need to worry about that case, and once we have detected an encoding it should be used for the whole file. But probing only the first entry would be less reliable; after all, character encoding probing gets more reliable the more text it sees. There needs to be a balance between performance and reliability, which is why I said 30 entries or so.
Comment 9 Zeno Endemann 2019-07-28 12:59:22 UTC
After skimming the zip format spec (https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT) a little, there are actually a few flags and optional extra fields that have influence on the character encoding that should be used. Unfortunately, investigating how to deal with this 'properly' looks like a lot of work, and most likely would need changes to libzip as well. I won't have time for that after all, sorry.
Comment 10 2wxsy58236r3 2019-08-05 06:35:12 UTC
The problem can usually be solved by using `unar` with the `-e` parameter (see Bug 324978 Comment 13 for examples). Ark has a cliunarchiverplugin that uses `unar`, but apparently it is only used for RAR archives.
Comment 11 Nicolas F. 2019-10-27 12:01:07 UTC
>and there only to files created by legacy software (does anyone know which programs produce these zip files actually?)

Windows creates ZIP files with filenames encoded in the system locale's charset. So any ZIP file created on Windows with "send to->zip compressed folder" by someone using a locale that doesn't map to utf8 is affected.
Comment 12 unxed 2020-06-24 12:36:04 UTC
I recently wrote patches for p7zip and unzip that do OEM charset detection based on the system locale. It's exactly what the Windows internal zip encoder does.

https://sourceforge.net/p/infozip/patches/29/
https://sourceforge.net/p/p7zip/bugs/187/

To get correct file names you just need to install the patched p7zip and set your system locale correctly. Or do something like
alias 7z='LC_ALL=el_GR.UTF-8 7z'
if you prefer opening archives using a locale different from the system one.

Alkis Georgopoulos is planning to package the patched p7zip as .debs and upload it to a PPA: https://github.com/mate-desktop/engrampa/issues/5#issuecomment-648410042
Comment 13 leohearts 2022-01-20 08:29:45 UTC
I'm also having trouble with this problem. I often get archives with GBK-encoded filenames and end up having to use 7z.exe with Wine to unzip them. Maybe adding a command-line option that provides encoding auto-detection would be acceptable?
Comment 14 Firestar-Reimu 2022-11-14 03:43:09 UTC
In reply to leohearts@leohearts.com:

You can use `unarchiver`, which provides the `unar` command. I can confirm this issue; I use Arch Linux with Ark 22.08.3.

This happens on GBK zips from some of my teachers. Filenames unarchived will be awful like `▒╛┐╞-88-2000018412-▓╠░╪│σ-í╢▓╗┐╔─µ╡─╝╙║═ú║▓╩╔½íó║┌░╫╝░╞Σ╦√í¬í¬╙░╩╙╓╨╜¿╓■╡─╔½╙δ╣Γí╖.docx`.

PS: I think Ark has too many plugins; it would be good to have one best plugin per archive format.
Comment 15 Elvis Angelaccio 2022-12-04 12:18:21 UTC
*** Bug 324978 has been marked as a duplicate of this bug. ***
Comment 16 Elvis Angelaccio 2022-12-04 12:20:24 UTC
Update on this issue: I played a bit with encoding probing using both KEncodingProber and ICU.

The biggest issue with this approach is that filenames are usually very short, so the prober does not have enough data to properly guess the correct encoding.
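
One possible way around the short-filename problem (just an idea, not something Ark does) is to concatenate all raw names and give the detector a single larger sample. A sketch using ICU's charset detector, which is one of the two libraries I tried; the function name is a placeholder and error handling is minimal.

// Sketch: give the detector more data by concatenating all raw filenames.
#include <unicode/ucsdet.h>
#include <string>
#include <vector>

std::string detectCharsetOfNames(const std::vector<std::string> &rawNames)
{
    std::string sample;
    for (const std::string &name : rawNames) {
        sample += name;
        sample += '\n';            // separator so the detector sees one large text
    }

    UErrorCode status = U_ZERO_ERROR;
    UCharsetDetector *detector = ucsdet_open(&status);
    ucsdet_setText(detector, sample.data(), static_cast<int32_t>(sample.size()), &status);
    const UCharsetMatch *match = ucsdet_detect(detector, &status);

    std::string result;
    if (match && U_SUCCESS(status))
        result = ucsdet_getName(match, &status);   // e.g. "Shift_JIS", "windows-1251"
    ucsdet_close(detector);
    return result;
}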

One possible solution could be the following: we add KEncodingProber support in the libzip plugin (Ark's default plugin for zip files). If KEncodingProber detects one or more non-unicode encodings, Ark shows a notification asking the user whether they want to attempt to fix garbled filenames, if any. If the user confirms, the libzip plugin reloads the archive and converts the filenames from the detected encoding to the UTF-16 encoding used internally by Qt. This "opt-in" step is required because doing it automatically could break the normal workflow for valid zip archives that only contain UTF-8 filenames (again, the probing is not precise and could detect a wrong encoding for a valid UTF-8 filename, and there is the additional overhead problem mentioned in previous comments).
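
A rough sketch of that opt-in flow, under the assumption that "non-unicode" is detected by checking whether the raw names are valid UTF-8. The function names and the commented UI hook are placeholders, not Ark's actual plugin API.

// Sketch of the proposed opt-in flow: 1) detect raw filenames that are not valid
// UTF-8, 2) probe only those names, 3) on user confirmation, reload and convert.
#include <KEncodingProber>
#include <QByteArray>
#include <QTextCodec>
#include <QVector>

static bool isValidUtf8(const QByteArray &raw)
{
    QTextCodec::ConverterState state;
    QTextCodec::codecForName("UTF-8")->toUnicode(raw.constData(), raw.size(), &state);
    return state.invalidChars == 0;
}

QByteArray probeNonUtf8Encoding(const QVector<QByteArray> &rawNames)
{
    KEncodingProber prober(KEncodingProber::Universal);
    bool foundNonUtf8 = false;
    for (const QByteArray &raw : rawNames) {
        if (!isValidUtf8(raw)) {
            prober.feed(raw);          // only feed the suspicious names
            foundNonUtf8 = true;
        }
    }
    return foundNonUtf8 ? prober.encoding() : QByteArray();
}

// Hypothetical call site in the libzip plugin (placeholder names):
//   const QByteArray enc = probeNonUtf8Encoding(rawNames);
//   if (!enc.isEmpty() && askUserToFixFilenames())
//       reloadArchiveWithCodec(QTextCodec::codecForName(enc));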