Bug 450597 - Incorrect handling of zip files with data descriptors
Summary: Incorrect handling of zip files with data descriptors
Status: RESOLVED FIXED
Alias: None
Product: frameworks-karchive
Classification: Frameworks and Libraries
Component: general (other bugs)
Version First Reported In: unspecified
Platform: Compiled Sources Linux
: NOR normal
Target Milestone: ---
Assignee: David Faure
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-02-20 05:04 UTC by Lauri Koskela
Modified: 2025-03-30 19:05 UTC (History)
6 users (show)

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:


Attachments
A zip file that karchive cannot read (1.09 KB, application/zip)
2022-02-20 05:04 UTC, Lauri Koskela
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Lauri Koskela 2022-02-20 05:04:41 UTC
Created attachment 146961 [details]
A zip file that karchive cannot read

SUMMARY
Kzip in karchive does not correctly handle zip files with data descriptors (ones where file headers do not include the lengths and CRCs, but they must be read from the data descriptor after the file contents).


STEPS TO REPRODUCE
1. Download test.zip file from attachments
2. Try to open the zip in Dolphin, or run `kziptest list test.zip`

OBSERVED RESULT
Reading the zip file will not succeed

EXPECTED RESULT
Opening the file in Dolphin should work

SOFTWARE/OS VERSIONS
Linux/KDE Plasma: Kubuntu 21.10
(available in About System)
KDE Plasma Version: 5.22.5
KDE Frameworks Version: 5.91.0 (karchive compiled from sources) 
Qt Version: 5.15.2

ADDITIONAL INFORMATION

I stumbled into this when trying to open a zip generated by Google Drive in Dolphin. Trying to open the zip resulted in this error: "Could not open the file, probably due to an unsupported file format." Ark and other tools that I tried opened the zip just fine.

My problematic zip included multiple epub files that are zips themselves. This seems to confuse karchive. Running `kziptest list` on this file prints the following: "Invalid ZIP file. Unrecognized header at offset 22988782". Right before that offset in my file there's a data descriptor block from _inside_ an epub file in the zip. 

When parsing a zip, after a file header, karchive seeks to the next data descriptor magic value after it. This is incorrect, because the compressed file contents may include the magic value. This apparently can happen with nested zips, and likely others, too. A zip can be altered to trigger this issue (that breaks CRCs, but other tools will still read the file):

$ head -c 1000 /dev/random | zip -fd -fz- test.zip -
$ printf 'PK\x07\x08' | dd of=test.zip seek=300 bs=1 count=4 conv=notrunc
$ unzip -v test.zip                                             
Archive:  test.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
    1000  Defl:N     1005  -1% 2022-02-19 19:17 edfd80a2  -
--------          -------  ---                            -------
    1000             1005  -1%                            1 file

$ ./kziptest list test.zip
Could not open "test.zip" for reading. ZIP file doesn't exist or is invalid: "Invalid ZIP file. Unrecognized header at offset 316"


Other similar files do not cause errors but can cause incorrect decompression results:

$ echo "aaaa" > a; echo "bbbb" > b; echo "cccc" > c
$ zip -fd -fz- inner.zip a b c
  adding: a (deflated -39%)
  adding: b (deflated -39%)
  adding: c (deflated -39%)
$ zip -fd -fz- -0 outer.zip inner.zip
  adding: inner.zip (stored 0%)
$ unzip -v outer.zip 
Archive:  outer.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
     481  Stored      481   0% 2022-02-20 06:40 72170cd6  inner.zip
--------          -------  ---                            -------
     481              481   0%                            1 file

$ ./kziptest list outer.zip 
mode=0100664 luryus luryus "c" size: 5 pos: 223 isdir=0
mode=0100664 luryus luryus "a" size: 5 pos: 31 isdir=0
mode=0100664 luryus luryus "b" size: 5 pos: 141 isdir=0
$ ./kziptest print-all foo/outer.zip
Opening zip file
Listing toplevel of zip file
Printing "b"
SIZE=0
DATA=
Printing "c"
SIZE=0
DATA=
Printing "a"
SIZE=0
DATA=

Here Kzip incorrectly jumps to read the inner zip contents while reading the outer file. Even then it cannot correctly read the files inside the inner zip (this can be seen in the `print-all` output).


Wikipedia's ZIP file page (https://en.wikipedia.org/wiki/ZIP_(file_format)#Structure) has this notion about parsing the files:
"Tools that correctly read ZIP archives must scan for the end of central directory record signature, and then, as appropriate, the other, indicated, central directory records. They must not scan for entries from the top of the ZIP file, because (as previously mentioned in this section) only the central directory specifies where a file chunk starts and that it has not been deleted. Scanning could lead to false positives, as the format does not forbid other data to be between chunks, nor file data streams from containing such signatures."

Karchive seems use the bad method, resulting in this issue.
Comment 1 Bug Janitor Service 2022-03-23 23:20:04 UTC
A possibly relevant merge request was started @ https://invent.kde.org/frameworks/karchive/-/merge_requests/34
Comment 2 Bug Janitor Service 2025-03-27 01:42:57 UTC
A possibly relevant merge request was started @ https://invent.kde.org/frameworks/karchive/-/merge_requests/109
Comment 3 Stefan Brüns 2025-03-30 19:05:12 UTC
Git commit ce749a2cb1bccc02f89cfe3536207d71b8cf0657 by Stefan Brüns.
Committed on 30/03/2025 at 19:03.
Pushed by bruns into branch 'master'.

kzip: Add various test cases

testZip64NestedStored: Verify KZip does not trip over the signatures of
the nested ZIP file, but only uses the signatures/headers of the outer
file.

testZip64NestedStoredStreamed: Verify KZip does not trip over the
signatures of the nested ZIP file, but only uses the signatures/headers
of the outer file. Currently fails, as data descriptors of the inner file
are taken instead of the outer ones.

testZip64EndOfCentralDirectory: Zip64 files may contain "Zip64 end
of central directory record"s, with signature "PK\6\6". Currently fails.

testZip64DataDescriptor: Zip64 data descriptors use 64 bit fields
for compressed and uncompressed size instead of the normal 32 bit fields,
thus the next record offset differs by 8 bytes. Currently fails.

The new files have been created with Info-ZIP `zip`:

> echo -ne "abcd" | zip zip64_end_of_central_directory.zip -

> echo -ne "abcd" | zip - - | cat > zip64_datadescriptor.zip
> zip - zip64_datadescriptor.zip zip64_end_of_central_directory.zip \
  | cat > zip64_nested_stored_streamed.zip
Related: bug 403899

A  +-    --    autotests/data/zip64_datadescriptor.zip
A  +-    --    autotests/data/zip64_end_of_central_directory.zip
A  +-    --    autotests/data/zip64_nested_stored_streamed.zip
M  +94   -0    autotests/karchivetest.cpp
M  +4    -0    autotests/karchivetest.h

https://invent.kde.org/frameworks/karchive/-/commit/ce749a2cb1bccc02f89cfe3536207d71b8cf0657
Comment 4 Stefan Brüns 2025-03-30 19:05:15 UTC
Git commit 6db83e409131de9f3e00ae18a1cd6113250f9611 by Stefan Brüns.
Committed on 30/03/2025 at 19:03.
Pushed by bruns into branch 'master'.

kzip: Fix misdetection of nested signatures, handle Zip64 data descriptors

In case a zip file entry has an undetermined size in the local header
the end of the binary data is marked by a data descriptor, with
signature 'PK\7\8'.

This will occasionally fail for two different reasons:
1. Compressed data randomly matching the signature
2. Stored, nested ZIP files.

In the first case, almost any sanity check for the following data
will do.

In the second case, the 'compressed size' field of the data descriptor
will not match the current position difference. Only for the data
descriptor of the outer ZIP file (start position + uncompressed size ==
current position) will match.

Also fix the size of the Zip64 data descriptors, and skip to the next
header correctly. Zip64 data descriptors are marked by a "needed version"
of 45 or later in the preceding local header.

The header seeking is mostly rewritten. Instead of seeking backwards
and forwards, and reading one character at a time from the underlying
QIODevice, a larger block is read using peek(). Scanning through
the block for a 'PK' signature is significantly faster than the
one-byte-at-a-time approach.

M  +0    -2    autotests/karchivetest.cpp
M  +92   -79   src/kzip.cpp

https://invent.kde.org/frameworks/karchive/-/commit/6db83e409131de9f3e00ae18a1cd6113250f9611