Bug 503364

Summary: K7Zip Appears to Extract the Whole Archive When Calling KArchive::open
Product: [Frameworks and Libraries] frameworks-karchive
Reporter: bryantcoke
Component: general
Assignee: KIO Bugs <kio-bugs-null>
Status: REPORTED
Severity: major
CC: azhar-momin, kdelibs-bugs-null
Priority: NOR
Version First Reported In: 6.13.0
Target Milestone: ---
Platform: Other
OS: Linux

Description bryantcoke 2025-04-25 20:19:46 UTC
SUMMARY
KArchive takes a long time to open 7z archives compared to other software. Opening a 7z archive and getting the entry list in Ark is essentially instant, but I noticed that the time it took KArchive to open the same 7z archive was about the same as it would take Ark to extract that archive. Looking at memory usage while my software was opening these archives, I noticed it was much higher than when opening other archive types: it would spike initially, to several times the size of the archive, then drop down. Even after the archive was opened, the K7Zip instance still used far more memory than instances for other archive types; simply having it open appeared to consume about as much memory as the entire archive. This makes me think that K7Zip is extracting the whole archive into memory when calling KArchive::open.

STEPS TO REPRODUCE
For repro steps I will give two options, since the behavior can be seen with any program that uses KArchive for 7z support. For reference, the archive I was using for testing was a folder of 200 JPEGs totaling around 150 MB.

Option 1: Okular
1. Using Okular, open a .cb7 archive compressed with LZMA, LZMA2, or BZip2. Any of them works, but BZip2 is recommended since it takes the longest, making the effect most noticeable.
2. Check the memory it is using
3. Now make a cbz of the contents of that archive and look at the difference in memory usage when opening it in Okular

Option 2: Create a small snippet of code
1. Create a small snippet of code that simply opens an archive and gets the entry list, with debug messages between the steps (see the sketch below).
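
For reference, here is a minimal sketch of such a snippet (built against KArchive; the archive path is a placeholder, and timing output is added to make the slow open() visible):

#include <QCoreApplication>
#include <QDebug>
#include <QElapsedTimer>
#include <K7Zip>

int main(int argc, char *argv[])
{
    QCoreApplication app(argc, argv);

    K7Zip archive(QStringLiteral("test.cb7")); // placeholder path
    QElapsedTimer timer;
    timer.start();

    qDebug() << "opening archive...";
    if (!archive.open(QIODevice::ReadOnly)) {
        qWarning() << "failed to open archive";
        return 1;
    }
    qDebug() << "open() finished after" << timer.elapsed() << "ms";

    // Listing the entries is cheap; nearly all of the time (and the
    // memory spike) happens inside open().
    const QStringList entries = archive.directory()->entries();
    qDebug() << "entry count:" << entries.size() << "- total" << timer.elapsed() << "ms";

    return 0;
}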

OBSERVED RESULT
For Okular you will notice that it takes significantly longer to open a cb7 archive compared to a cbz, while using more memory.

For the snippet of code you will notice it hanging on the open.


EXPECTED RESULT
For it to open in a timely manner and not have essentially the whole archive loaded into RAM.

SOFTWARE/OS VERSIONS
Linux: 6.14.2
KDE Plasma Version: 6.3.4
KDE Frameworks Version: 6.13.0
Qt Version: 6.8.2

ADDITIONAL INFORMATION
This behavior dates back quite far. I have a Pop!_OS 20.04 system I still use for some things, and it experiences the exact same issue: a long time to open and high memory usage with K7Zip. So this issue appears to have been around for at least 5 years.

Hope my explanation of the bug/issue is good enough; the wording around "opening" might be a little confusing. Some may think I am referring to extracting the archive, but I am only referring to essentially creating an IODevice and reading the entry list, which shouldn't take any real amount of time on modern processors. It certainly shouldn't, given that Ark and other archive tools that don't rely on KArchive can show the entry list near instantly.
Comment 1 bryantcoke 2025-04-25 20:23:33 UTC
I should correct my statement about it "hanging on open": by that I meant it takes significantly longer to finish compared to other KArchive types, but it does not crash. I meant "hanging" as in it takes a long time; I should have phrased that better.
Comment 2 bryantcoke 2025-05-01 22:11:25 UTC
After poking around and trying to wrap my head around the code every now and then, I think I've somewhat understood what is going on.

It isn't a bug, but how K7Zip is designed. It extracts the data for every file entry during openArchive() and returns that when data() is called, rather than doing what KZip, KTar, and the other KArchive archive types I am aware of do: use a KLimitedIODevice to read the data at that file entry's position and size, then decompress it with a KCompressionDevice. This difference means that, unlike the other archive types, which open quickly because all they need is the position and size, K7Zip has to decompress everything into memory.
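
For illustration, a rough sketch of that lazy pattern, assuming an entry's position and size are already known (note KLimitedIODevice is internal to KArchive, not public API, and the compression type here is arbitrary):

// Conceptual sketch only, not the actual KArchive internals: store just
// position + size at open() time, and build a bounded, decompressing
// QIODevice on demand when an entry's data is requested.
QIODevice *lazyEntryDevice(QIODevice *archiveDev, qint64 pos, qint64 size)
{
    // Expose only `size` bytes starting at `pos` of the underlying file.
    QIODevice *limited = new KLimitedIODevice(archiveDev, pos, size);
    // Decompress transparently while reading (type is illustrative;
    // it depends on how the entry was stored).
    auto *dev = new KCompressionDevice(limited, /*autoDeleteInputDevice=*/true,
                                       KCompressionDevice::BZip2);
    dev->open(QIODevice::ReadOnly);
    return dev;
}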

This is why, when you try to open an archive with a compression type K7Zip doesn't support, it will near instantly have the entry list with all its information; the only thing missing from the private entry is the data, since it could not be decompressed. Which is honestly how I expected it to behave in general, since none of the other archive types in KArchive extract the data when calling open().

The design leads to other issues: for example, opening archives larger than system memory is not possible as far as I can tell. When I made a 7z archive larger than my memory and tried to open it via open(), it just printed "decode error", which comes from line 1844, in the function K7Zip::K7ZipPrivate::readAndDecodePackedStreams. Unlike with unsupported compression types, though, no entry list was generated. That makes sense, I guess, since building the entry list is one of the last things done in openArchive() in K7Zip. Although this could be some other issue, because I still have yet to fully wrap my head around all the code in K7Zip.

So while this is not a bug, in the sense that the code is functioning as designed, I would call the design itself a bug as far as the library is concerned. No other archive type in KArchive functions this way, as far as I am aware. The general expectation is that it's an IO device for a file, so the memory footprint should be small, at least compared to the size of the archive; it is simply a way to interface with it. Unless there is some specific reason 7z can't be handled that way, but seeing how other software out there handles it, I don't think that's the case.
Comment 3 Azhar Momin 2025-05-08 14:46:48 UTC
I think K7Zip extracts the whole archive up front because 7z uses solid compression (https://en.wikipedia.org/wiki/Solid_compression#Costs), i.e., multiple files can be packed into a single compressed block (or "folder"), so to get one file, you often need to decompress the entire block it's in. This makes lazy loading a bit tricky.

That said, the 7z format does support splitting archives into smaller solid blocks. So in theory, K7Zip could be smarter: instead of extracting everything in open(), it could defer that and just decompress the relevant block when K7ZipFileEntry::data() is called.
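
Conceptually it might look something like this (decodeFolder() and the m_* members are made up for illustration; this is not how K7Zip is currently structured):

// Hypothetical sketch of deferred per-block decompression. Each entry
// remembers which solid block ("folder") it lives in, plus its offset
// and size inside that block; data() then decodes only that one block,
// ideally caching it for neighboring entries.
QByteArray K7ZipFileEntry::data() const
{
    const QByteArray block = m_archive->decodeFolder(m_folderIndex);
    return block.mid(m_offsetInBlock, m_size);
}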

I’d actually be interested in trying to implement something like that, if only the 7z format had proper official documentation.

A simpler option though might be to add API like K7Zip::setFlags(K7ZIP_DO_NOT_DECOMPRESS) or something, so users can choose whether they want just the entry list or full data up front. That would help apps avoid the big memory hit just for listing files.
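
Usage might look something like this (names entirely illustrative; no such API exists in KArchive today):

// Hypothetical list-only mode; K7ZIP_DO_NOT_DECOMPRESS is a made-up flag.
K7Zip archive(QStringLiteral("big.cb7"));
archive.setFlags(K7ZIP_DO_NOT_DECOMPRESS); // skip payload decompression
archive.open(QIODevice::ReadOnly);         // fast: headers only
const QStringList names = archive.directory()->entries();
// names is complete, but entry data() would be unavailable in this mode.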
Comment 4 bryantcoke 2025-05-09 01:54:29 UTC
I was aware that 7z archives generally use solid compression, but wasn't too sure how K7Zip was handling it, since I saw no mention of blocks anywhere. After your comment I now realize the "folders" bit that confused me in the source code refers to the blocks.

The thing is, though, 7z archives are not always solidly compressed; you can choose to make them non-solid when creating the archive. The default block size is not fully solid either: by default they're not just one giant contiguous block of data. For a 7z, as far as I am aware, the default block size varies depending on the compression method and rate. There is also the Store method, which does not compress the data at all and, I would assume, also has no blocks. So at the moment it looks like K7Zip only supports 7z archives made up of blocks of solidly compressed data, and not non-solid ones, if I am understanding things correctly? Ideally I would like K7Zip to support non-solid 7z archives as well, but that might be quite a bit of work to add.

Now, the flag solves part of my problem, but sadly not all of it. Ideally I really would like a way to get a single file from just the relevant blocks, like you mentioned. That should be possible, since Ark can extract a single file from a 7z archive in a timely manner. Due to how things currently function, there are serious slowdowns when interacting with 7z archives in some programs: Dolphin's thumbnail generator shouldn't take forever on cb7 archives just because the whole archive gets extracted, when Dolphin and other programs generating thumbnails for comic book archives are only interested in a single file. What led me to discover this is that I am working on a comic viewer/image viewer, and when testing with a 7z archive of images to make sure everything worked as it should, it took over a minute to open. That felt like an absurd length of time just to display a single image, so I decided to look into things.

This is a bit of a tangent, but it's somewhat related to the topic and something I stumbled upon while investigating this issue. It may warrant its own ticket, but I will just ask it here since it may be minor enough: why does isOpen() return true even when the archive's compression method is unsupported? I believe it also returned true for the 7z archive that was too large to fit into memory. Shouldn't it only return true if the archive was properly opened and you can interact with it? As far as I am aware, archives generally don't mix multiple compression methods, so what are you going to do with an "open" archive that you are unable to read any data from and, I would assume, unable to write any data to?
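
To illustrate the point (the file name is a placeholder; the behavior shown in the comment is as reported above, not independently verified):

K7Zip archive(QStringLiteral("unsupported-method.7z")); // placeholder
archive.open(QIODevice::ReadOnly);
qDebug() << archive.isOpen(); // reportedly true, even though entry
                              // data could not be decompressed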
Comment 5 bryantcoke 2025-05-09 02:08:45 UTC
I'll just add a clarification on why I said "The default block size is not fully solid either: by default they're not just one giant contiguous block of data." I said that because of this line in your post: "That said, the 7z format does support splitting archives into smaller solid blocks." That made it sound as if, by default, a 7z archive is always one solid contiguous block of data rather than being made up of smaller blocks, whereas the norm is that 7z archives are generally made up of smaller blocks. Although it can vary with settings, as I mentioned, and it also depends on the total size of the archive and the files being dealt with, of course.