Bug 324978 - Ark couldn't correctly extract files with non-Unicode filename
Summary: Ark couldn't correctly extract files with non-Unicode filename
Status: RESOLVED DUPLICATE of bug 378904
Alias: None
Product: ark
Classification: Applications
Component: plugins
Version: 2.19
Platform: unspecified Linux
Importance: NOR wishlist
Target Milestone: ---
Assignee: Ragnar Thomsen
URL:
Keywords:
Duplicates: 312478 439392
Depends on:
Blocks:
 
Reported: 2013-09-17 01:46 UTC by Franklin Weng
Modified: 2022-12-04 12:18 UTC
5 users

See Also:
Latest Commit:
Version Fixed In:


Attachments
screenshot of ark (117.60 KB, image/jpeg)
2013-09-17 01:47 UTC, Franklin Weng
sample zip file that ark couldn't correctly Chinese filename (250.02 KB, application/zip)
2013-09-17 23:48 UTC, Franklin Weng
japanese zip file - ark can not read the character (121.16 KB, application/zip)
2017-01-20 07:22 UTC, R. Sato
I add a screenshot also (52.00 KB, image/png)
2017-01-20 07:25 UTC, R. Sato

Description Franklin Weng 2013-09-17 01:46:25 UTC
When a zip file contains files with Chinese characters in their filenames, the preview shows question marks in place of those characters, and the extracted filenames are wrong too.

Reproducible: Always

Steps to Reproduce:
1. Open a zip file containing files with Chinese filenames
2. Look at it in the preview
3. Extract it
Actual Results:  
Couldn't show or extract the filename correctly

Expected Results:  
Should show or extract it correctly
Comment 1 Franklin Weng 2013-09-17 01:47:14 UTC
Created attachment 82348 [details]
screenshot of ark

As you can see, the title of the zip file is displayed correctly, but the filenames inside it are not.
Comment 2 Jekyll Wu 2013-09-17 15:02:47 UTC
Could you please upload a sample archive?
Comment 3 Franklin Weng 2013-09-17 23:48:12 UTC
Created attachment 82385 [details]
sample zip file that ark couldn't correctly Chinese filename
Comment 4 Raphael Kubo da Costa 2013-11-21 16:27:01 UTC
*** Bug 312478 has been marked as a duplicate of this bug. ***
Comment 5 Franklin Weng 2014-09-11 14:02:11 UTC
In plugins/libarchive/libarchivehandler.cpp, in emitEntryFromArchiveEntry(), archive_entry_pathname_w() returns nothing for a Big5 filename, while archive_entry_pathname() returns the original (Big5) filename.

ark(18998) LibArchiveInterface::emitEntryFromArchiveEntry: 2:  0x0
ark(18998) LibArchiveInterface::emitEntryFromArchiveEntry: 2.1:  ¶}©ñ¦¡¥­¥x¨t²Î³nÅéºûÅ@³Ò°È©e¥~/

Here "2:" prints archive_entry_pathname_w(aentry) and "2.1:" prints archive_entry_pathname(aentry).

The filename is therefore empty, and ArchiveModel emits the following debug message:
ark(18998) ArchiveModel::newEntry: Weird, received empty entry (no filename) - skipping

This was tested on the latest git version.
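The garbled "2.1:" line is consistent with Big5 bytes being printed as if they were Latin-1. The following Python sketch reproduces that effect. The filename in it is a made-up example, not the one from the log, and the snippet has nothing to do with how Ark or libarchive are implemented.

```python
# Illustration only: this filename is a hypothetical example,
# not the one that appears in the debug output above.
original = "中文檔案.txt"

# Bytes as a Big5-producing zip tool would store them in the archive.
raw = original.encode("big5")

# Showing those bytes as Latin-1 gives mojibake of the same kind as line 2.1.
garbled = raw.decode("latin-1")

# The damage is reversible as long as the raw bytes are preserved.
recovered = garbled.encode("latin-1").decode("big5")

# The same bytes are not valid UTF-8, so a strict conversion in a
# UTF-8 locale fails instead of producing a usable name.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(repr(garbled), recovered, valid_utf8)
```

The point is that the narrow-string name still carries all the information needed; only the conversion step loses it.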
Comment 6 Franklin Weng 2014-09-30 07:13:08 UTC
The problem is in libarchive. When my environment locale is zh_TW.UTF-8 and a zip file contains Big5 filenames, mbstowcs fails with EILSEQ because the Big5 bytes are not a valid multibyte sequence in the locale's encoding.

Would it be possible to add a fallback encoding option to Ark, so that when it fails to get an archive filename (the "Weird ..." message above), it retries with the fallback encoding?
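The fallback idea can be sketched in a few lines of Python. This is only an illustration of the retry logic, not a proposal for Ark's actual code: the function name `decode_entry_name` and the default fallback of Big5 are assumptions made for the example.

```python
def decode_entry_name(raw: bytes, fallback: str = "big5") -> str:
    """Decode an archive entry name, retrying with a fallback encoding.

    Hypothetical helper for illustration; 'fallback' stands in for a
    user-configurable option.
    """
    try:
        # First attempt: strict UTF-8, which is what a UTF-8 locale expects.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    try:
        # Second attempt: the fallback encoding chosen by the user.
        return raw.decode(fallback)
    except (UnicodeDecodeError, LookupError):
        # Last resort: keep the name non-empty so the entry is not skipped.
        return raw.decode("utf-8", errors="replace")


print(decode_entry_name("報告.txt".encode("big5")))
```

Trying strict UTF-8 first means that archives with correct UTF-8 names are never affected by the fallback setting.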
Comment 7 Elvis Angelaccio 2016-01-14 15:56:04 UTC
Hi Franklin. Could you upload a test archive for the libarchive plugin? (e.g. a .tar.gz).

Could you also check whether Ark 15.12 with a Chinese locale can extract zip files (e.g. the one you already attached here)?
Comment 8 R. Sato 2017-01-16 08:18:43 UTC
I can confirm the issue, because I got the same issue with Japanese. Do you need screenshots or anything else to fix the issue?
Comment 9 Elvis Angelaccio 2017-01-16 10:15:42 UTC
(In reply to 佐藤 from comment #8)
> I can confirm the issue, because I got the same issue with Japanese. Do you
> need screenshots or anything else to fix the issue?

Yes, screenshots and test archives please. It would be awesome if you could attach both .zip and .tar.gz test archives.
Comment 10 R. Sato 2017-01-20 07:22:36 UTC
Created attachment 103555 [details]
japanese zip file - ark can not read the character

I added a ZIP file. It includes Japanese filenames. Ark cannot show the right characters within the software. If you extract the ZIP file, you get wrong filenames as well.
Comment 11 R. Sato 2017-01-20 07:25:27 UTC
Created attachment 103556 [details]
I add a screenshot also
Comment 12 Elvis Angelaccio 2017-01-21 11:03:00 UTC
(In reply to 佐藤 from comment #10)
> Created attachment 103555 [details]
> japanese zip file - ark can not read the character
> 
> I added a ZIP file. It includes Japanese filenames. Ark cannot show the
> right characters within the software. If you extract the ZIP file, you get
> wrong filenames as well.

Thanks! Can you also add a tar.gz file?
Comment 13 2wxsy58236r3 2018-12-15 13:11:18 UTC
I believe that the problem is related to the filename encoding.

In Franklin Weng's case, the zip (attachment 82385 [details]) can be extracted with `unar -e Big5 test.zip`, and in R. Sato's case (attachment 103555 [details]), `unar -e Shift_JIS nenngajyou-data.zip`.
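For comparison, the same "tell the tool which encoding to use" approach can be sketched with Python's standard zipfile module. The helper names `real_name` and `list_names` are made up for this example. The sketch relies on zipfile decoding names as CP437 when the UTF-8 flag (general-purpose bit 11) is not set, so re-encoding to CP437 recovers the raw bytes.

```python
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: the name is stored as UTF-8


def real_name(info: zipfile.ZipInfo, encoding: str) -> str:
    """Return the entry name, re-decoded with the given legacy encoding."""
    if info.flag_bits & UTF8_FLAG:
        # The archive declares UTF-8, so zipfile already decoded it correctly.
        return info.filename
    # zipfile decoded the raw bytes as CP437; undo that and decode again.
    return info.filename.encode("cp437").decode(encoding)


def list_names(path: str, encoding: str) -> list:
    """List entry names of the zip at 'path', assuming 'encoding' for legacy names."""
    with zipfile.ZipFile(path) as zf:
        return [real_name(info, encoding) for info in zf.infolist()]
```

With the attachments above, a call such as `list_names("test.zip", "big5")` would be the rough counterpart of `unar -e Big5 test.zip`. Python 3.11 and later also accept a `metadata_encoding` argument to `zipfile.ZipFile` for the same purpose.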
Comment 14 Franklin Weng 2018-12-15 13:39:39 UTC
(In reply to qdzcuypq from comment #13)
> I believe that the problem is related to the filename encoding.
> 
> In Franklin Weng's case, the zip (attachment 82385 [details]) can be
> extracted with `unar -e Big5 test.zip`, and in R. Sato's case (attachment
> 103555 [details]), `unar -e Shift_JIS nenngajyou-data.zip`.

It is, and has been from the very beginning. Windows still seems to use legacy encodings in some cases, and files generated by WinZip are mostly problematic. In the old days I would use Wine to run 7-Zip, which could extract the (Chinese-named) files successfully, but in recent years there are more and more files that 7-Zip fails to extract.
Comment 15 unxed 2020-06-24 12:36:19 UTC
I recently wrote patches for p7zip and unzip that detect the OEM charset based on the system locale. That is exactly what the Windows built-in zip encoder does.

https://sourceforge.net/p/infozip/patches/29/
https://sourceforge.net/p/p7zip/bugs/187/

To get correct file names you just need to install the patched p7zip and set your system locale correctly. Or do something like
alias 7z='LC_ALL=el_GR.UTF-8 7z'
if you prefer opening archives using a locale different from the system one.

Alkis Georgopoulos is planning to package the patched p7zip as .deb packages and upload them to a PPA: https://github.com/mate-desktop/engrampa/issues/5#issuecomment-648410042
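The locale-based detection described in this comment amounts to a lookup from the locale name to a legacy OEM code page. A rough Python sketch of that idea follows. The table is a small illustrative subset of commonly cited Windows OEM code pages and is not taken from the p7zip or unzip patches. The function name and the CP437 default are likewise assumptions.

```python
# Illustrative subset only; NOT the table used by the actual patches.
OEM_CODEPAGE_BY_LOCALE = {
    "el_GR": "cp737",  # Greek
    "ru_RU": "cp866",  # Russian
    "ja_JP": "cp932",  # Japanese (Shift_JIS family)
    "zh_TW": "cp950",  # Traditional Chinese (Big5 family)
    "zh_CN": "cp936",  # Simplified Chinese (GBK)
    "ko_KR": "cp949",  # Korean
}


def oem_codepage_for_locale(locale_name: str, default: str = "cp437") -> str:
    """Map a POSIX locale name such as 'el_GR.UTF-8' to an OEM code page."""
    # Strip the codeset ('.UTF-8') and any modifier ('@euro') from the name.
    base = locale_name.split(".", 1)[0].split("@", 1)[0]
    return OEM_CODEPAGE_BY_LOCALE.get(base, default)


print(oem_codepage_for_locale("el_GR.UTF-8"))
```

This also shows why the `LC_ALL=el_GR.UTF-8` alias works as an override: changing the locale changes which code page is assumed for legacy names.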
Comment 16 2wxsy58236r3 2021-07-04 04:18:40 UTC
*** Bug 439392 has been marked as a duplicate of this bug. ***
Comment 17 Elvis Angelaccio 2022-12-04 11:46:51 UTC
Update on this issue: I played a bit with encoding probing using both KEncodingProber and ICU.

The biggest issue with this approach is that filenames are usually very short, so the prober does not have enough data to properly guess the correct encoding.

One possible solution could be the following: we add KEncodingProber support in the libzip plugin (Ark's default plugin for zip files). If KEncodingProber detects one or more non-unicode encodings, Ark would show a notification to the user asking if they want to attempt to fix garbled filenames, if any. If the user confirms, the libzip plugin would then reload the archive and convert the filenames from the detected encoding to the standard UTF-16 encoding used by Qt. This "opt-in" step is required because if we do it automatically we could break the normal workflow for valid zip archives that only contain UTF-8 filenames (since again, the probing is not precise and could detect a wrong encoding for a valid UTF-8 filename).
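The ambiguity with short filenames, and the reason for the opt-in step, can be illustrated with a small Python sketch. It does not use KEncodingProber or ICU. It simply checks which encodings from an arbitrary candidate list decode the bytes without error, which is a much cruder test than real probing. All names here are hypothetical.

```python
# Arbitrary candidate list chosen for the example.
CANDIDATES = ["utf-8", "big5", "gbk", "shift_jis", "euc_kr"]


def plausible_encodings(raw: bytes, candidates=CANDIDATES) -> list:
    """Return every candidate encoding that decodes 'raw' without error."""
    ok = []
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue
        ok.append(enc)
    return ok


def needs_user_confirmation(raw: bytes) -> bool:
    """Only offer to fix names that are not already valid UTF-8."""
    return "utf-8" not in plausible_encodings(raw)


name = "中文".encode("big5")
print(plausible_encodings(name), needs_user_confirmation(name))
```

For a short name, more than one legacy encoding often decodes cleanly, so the bytes alone cannot settle the question. Names that are valid UTF-8 are left alone, which matches the concern about not breaking normal archives.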
Comment 18 Elvis Angelaccio 2022-12-04 12:18:21 UTC
Actually, there is bug #378904, which tracks the same issue and has more information. Let's keep the discussion in a single place.

*** This bug has been marked as a duplicate of bug 378904 ***