Created attachment 146961 [details] A zip file that karchive cannot read SUMMARY Kzip in karchive does not correctly handle zip files with data descriptors (ones where file headers do not include the lengths and CRCs, but they must be read from the data descriptor after the file contents). STEPS TO REPRODUCE 1. Download test.zip file from attachments 2. Try to open the zip in Dolphin, or run `kziptest list test.zip` OBSERVED RESULT Reading the zip file will not succeed EXPECTED RESULT Opening the file in Dolphin should work SOFTWARE/OS VERSIONS Linux/KDE Plasma: Kubuntu 21.10 (available in About System) KDE Plasma Version: 5.22.5 KDE Frameworks Version: 5.91.0 (karchive compiled from sources) Qt Version: 5.15.2 ADDITIONAL INFORMATION I stumbled into this when trying to open a zip generated by Google Drive in Dolphin. Trying to open the zip resulted in this error: "Could not open the file, probably due to an unsupported file format." Ark and other tools that I tried opened the zip just fine. My problematic zip included multiple epub files that are zips themselves. This seems to confuse karchive. Running `kziptest list` on this file prints the following: "Invalid ZIP file. Unrecognized header at offset 22988782". Right before that offset in my file there's a data descriptor block from _inside_ an epub file in the zip. When parsing a zip, after a file header, karchive seeks to the next data descriptor magic value after it. This is incorrect, because the compressed file contents may include the magic value. This apparently can happen with nested zips, and likely others, too. A zip can be altered to trigger this issue (that breaks CRCs, but other tools will still read the file): $ head -c 1000 /dev/random | zip -fd -fz- test.zip - $ printf 'PK\x07\x08' | dd of=test.zip seek=300 bs=1 count=4 conv=notrunc $ unzip -v test.zip Archive: test.zip Length Method Size Cmpr Date Time CRC-32 Name -------- ------ ------- ---- ---------- ----- -------- ---- 1000 Defl:N 1005 -1% 2022-02-19 19:17 edfd80a2 - -------- ------- --- ------- 1000 1005 -1% 1 file $ ./kziptest list test.zip Could not open "test.zip" for reading. ZIP file doesn't exist or is invalid: "Invalid ZIP file. Unrecognized header at offset 316" Other similar files do not cause errors but can cause incorrect decompression results: $ echo "aaaa" > a; echo "bbbb" > b; echo "cccc" > c $ zip -fd -fz- inner.zip a b c adding: a (deflated -39%) adding: b (deflated -39%) adding: c (deflated -39%) $ zip -fd -fz- -0 outer.zip inner.zip adding: inner.zip (stored 0%) $ unzip -v outer.zip Archive: outer.zip Length Method Size Cmpr Date Time CRC-32 Name -------- ------ ------- ---- ---------- ----- -------- ---- 481 Stored 481 0% 2022-02-20 06:40 72170cd6 inner.zip -------- ------- --- ------- 481 481 0% 1 file $ ./kziptest list outer.zip mode=0100664 luryus luryus "c" size: 5 pos: 223 isdir=0 mode=0100664 luryus luryus "a" size: 5 pos: 31 isdir=0 mode=0100664 luryus luryus "b" size: 5 pos: 141 isdir=0 $ ./kziptest print-all foo/outer.zip Opening zip file Listing toplevel of zip file Printing "b" SIZE=0 DATA= Printing "c" SIZE=0 DATA= Printing "a" SIZE=0 DATA= Here Kzip incorrectly jumps to read the inner zip contents while reading the outer file. Even then it cannot correctly read the files inside the inner zip (this can be seen in the `print-all` output). Wikipedia's ZIP file page (https://en.wikipedia.org/wiki/ZIP_(file_format)#Structure) has this notion about parsing the files: "Tools that correctly read ZIP archives must scan for the end of central directory record signature, and then, as appropriate, the other, indicated, central directory records. They must not scan for entries from the top of the ZIP file, because (as previously mentioned in this section) only the central directory specifies where a file chunk starts and that it has not been deleted. Scanning could lead to false positives, as the format does not forbid other data to be between chunks, nor file data streams from containing such signatures." Karchive seems use the bad method, resulting in this issue.
A possibly relevant merge request was started @ https://invent.kde.org/frameworks/karchive/-/merge_requests/34
A possibly relevant merge request was started @ https://invent.kde.org/frameworks/karchive/-/merge_requests/109
Git commit ce749a2cb1bccc02f89cfe3536207d71b8cf0657 by Stefan Brüns. Committed on 30/03/2025 at 19:03. Pushed by bruns into branch 'master'. kzip: Add various test cases testZip64NestedStored: Verify KZip does not trip over the signatures of the nested ZIP file, but only uses the signatures/headers of the outer file. testZip64NestedStoredStreamed: Verify KZip does not trip over the signatures of the nested ZIP file, but only uses the signatures/headers of the outer file. Currently fails, as data descriptors of the inner file are taken instead of the outer ones. testZip64EndOfCentralDirectory: Zip64 files may contain "Zip64 end of central directory record"s, with signature "PK\6\6". Currently fails. testZip64DataDescriptor: Zip64 data descriptors use 64 bit fields for compressed and uncompressed size instead of the normal 32 bit fields, thus the next record offset differs by 8 bytes. Currently fails. The new files have been created with Info-ZIP `zip`: > echo -ne "abcd" | zip zip64_end_of_central_directory.zip - > echo -ne "abcd" | zip - - | cat > zip64_datadescriptor.zip > zip - zip64_datadescriptor.zip zip64_end_of_central_directory.zip \ | cat > zip64_nested_stored_streamed.zip Related: bug 403899 A +- -- autotests/data/zip64_datadescriptor.zip A +- -- autotests/data/zip64_end_of_central_directory.zip A +- -- autotests/data/zip64_nested_stored_streamed.zip M +94 -0 autotests/karchivetest.cpp M +4 -0 autotests/karchivetest.h https://invent.kde.org/frameworks/karchive/-/commit/ce749a2cb1bccc02f89cfe3536207d71b8cf0657
Git commit 6db83e409131de9f3e00ae18a1cd6113250f9611 by Stefan Brüns. Committed on 30/03/2025 at 19:03. Pushed by bruns into branch 'master'. kzip: Fix misdetection of nested signatures, handle Zip64 data descriptors In case a zip file entry has an undetermined size in the local header the end of the binary data is marked by a data descriptor, with signature 'PK\7\8'. This will occasionally fail for two different reasons: 1. Compressed data randomly matching the signature 2. Stored, nested ZIP files. In the first case, almost any sanity check for the following data will do. In the second case, the 'compressed size' field of the data descriptor will not match the current position difference. Only for the data descriptor of the outer ZIP file (start position + uncompressed size == current position) will match. Also fix the size of the Zip64 data descriptors, and skip to the next header correctly. Zip64 data descriptors are marked by a "needed version" of 45 or later in the preceding local header. The header seeking is mostly rewritten. Instead of seeking backwards and forwards, and reading one character at a time from the underlying QIODevice, a larger block is read using peek(). Scanning through the block for a 'PK' signature is significantly faster than the one-byte-at-a-time approach. M +0 -2 autotests/karchivetest.cpp M +92 -79 src/kzip.cpp https://invent.kde.org/frameworks/karchive/-/commit/6db83e409131de9f3e00ae18a1cd6113250f9611