Bug 73821

Summary: kioslaves should be nestable; e.g. browsing a zip inside a zip should work
Product: [Frameworks and Libraries] frameworks-kio Reporter: Stéphane Gourichon <stephane_kde>
Component: general Assignee: Thiago Macieira <thiago>
Status: RESOLVED FIXED    
Severity: wishlist CC: 2Kmm, andy.koppe, ansla80, dmoyne, faure, gfraiteur, ingomar, jos, jpalecek, kdelibs-bugs, laurens, m.debruijne, mail, nate, nicolasg, obxtarheel, oded, projects.gg.aaron, pscholl, rebel, samjnaa, thiago, tim-ri, tonysivori
Priority: NOR    
Version: 5.45.0   
Target Milestone: ---   
Platform: Mandrake RPMs   
OS: Linux   
Latest Commit: Version Fixed In: 5.47

Description Stéphane Gourichon 2004-01-30 14:22:19 UTC
Version:            (using KDE 3.1.3)
Installed from:    Mandrake RPMs
OS:          Linux

* Summary :
Browsing a zip inside a zip or through http doesn't work. The URL scheme used by midnight commander can express that, but the current URL scheme used by KDE cannot. Real and realistic examples and discussion included.

* How to reproduce : http://www.gutenberg.net/etext95/cavep11.zip is a zip archive containing files and two zips (prehistoric painting in a cave in France). Download the archive to /tmp then use konqueror to browse /tmp and click on the zip. konqueror uses kio_zip to display the files. That's ok. Now click on CAVEJ11.ZIP (which is inside the cavep11.zip).

Expected result : the content of CAVEJ11.ZIP shows up in the Konqueror window

Actual result : If ark is available, konqueror asks if it should use it. If not available, konqueror asks for an application to open it.

OK, a zip within a zip is not ideal, but it has to be dealt with. A zip accessed through http is also a problem at the moment.
* How to reproduce : browse http://www.gutenberg.net/browse/BIBREC/BR249.HTM , click on a link to a zip (bottom of the page, for example cavep11.zip).

Expected result : konqueror downloads the zip into the local disk cache and shows the content in the window, so that it can be browsed as if it were local (it is actually in the local cache).

Actual result : Same as other case. If ark is available, konqueror asks if it should use it. If not available, konqueror asks for an application to open it.

** Comments : 

* (rant, sorry) is ark useful at all ? It looks like an obsolete piece of software unaware of the power of kioslaves.

* (more powerful) kioslaves should be stackable, it looks like they aren't, but I may be wrong.

The zip:/path/ URL scheme is insufficient for that. IMHO when browsing a zip locally, the URL should not be zip:/tmp/cavep11.zip/ but something like file:/tmp/cavep11.zip#zip/ like midnight commander does.

With that scheme, one would be able to browse from a website a zip containing a gzipped tar containing a zip containing another zip and have KDE do the right thing. Now the URL would say :
http://server/path/file.zip#zip/mygippedtar.tar.gz#gz#tar/a.zip#zip/another.zip#zip/afileinside.txt

("#" has to be a character that is different from what is allowed to send to a server. If you simply ask http://server/path/file.zip/afileinside.txt the whole path would be sent to the server, not relying on local chains of kioslaves to do the trick. It could work with a very smart server but we're talking client issues here).

* philosophy from above

An http prefix is ok because http means the information is really elsewhere (http is in essence a *link*). But zip and the rest are only *containers* which may contain other containers, hence needing a reentrant chaining mechanism. The presence of a #kioslavename like #zip or #gz, etc. in the URL after the name of a file should be enough to trigger the "browse" method of the corresponding kioslave. In this example : file:/tmp/cavep11.zip#zip/cavep.eng and http://www.gutenberg.net/etext95/cavep11.zip#zip/cavep.eng 
http://www.gutenberg.net/etext95/cavep11.zip#zip/CAVEJ11.ZIP#zip/HYENE11.JPG
should all do what you understand.
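To make the proposed syntax concrete, here is a minimal Python sketch that splits such a chained address into its base URL and a list of (handler, inner path) steps. The function name split_chain is hypothetical and is not part of KIO or Midnight Commander; the sketch only assumes that every "#name" marker selects a handler and that whatever follows the first "/" is the path inside that container.

```python
def split_chain(url):
    """Split 'base#handler/inner#handler/inner...' into (base, steps).

    Each step is a (handler_name, inner_path) tuple; inner_path is empty
    for pure filters such as '#gz' that expose a single stream.
    """
    base, *markers = url.split("#")
    steps = []
    for marker in markers:
        handler, _, inner_path = marker.partition("/")
        steps.append((handler, inner_path))
    return base, steps


base, steps = split_chain(
    "http://www.gutenberg.net/etext95/cavep11.zip#zip/CAVEJ11.ZIP#zip/HYENE11.JPG"
)
print(base)
for handler, inner_path in steps:
    print(handler, inner_path)
```

Only the base part would ever be sent to the server; the steps would be handled locally, one container at a time.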

* (more powerful ideas left unsaid so far)

* Conclusion

Sorry I don't know the internals of KDE well enough to suggest something more precise. I'm open to discussion.

Thanks a lot for KDE. It looks like a powerful and useful system, getting better all the time. Kudos to everyone involved !
Comment 1 Sashmit Bhaduri 2004-03-16 07:06:43 UTC
Doesn't kio already use # for sub-protocols? I'm pretty sure it does..
Comment 2 Stéphane Gourichon 2004-05-03 20:41:28 UTC
Sashmit Bhaduri wrote :
> Doesn't kio already use # for sub-protocols? I'm pretty sure it does.. 

On a fresh account on a vanilla Mandrake 10.0 community (KDE 3.2 BRANCH >= 20040204)
I tried things like
http://www.gutenberg.net/etext95/cavep10.zip#zip/cavep.fr
http://www.gutenberg.net/etext95/cavep10.zip#uzip/cavep.fr
http://www.gutenberg.net/etext95/cavep10.zip#cavep.fr
The # and the rest is always ignored.

Can you give examples ? I've never seen Konqueror display or understand # for sub-protocols.

Comment 3 Stéphane Gourichon 2004-05-03 20:46:56 UTC
** The experiments below show that Konqueror (KDE 3.2) does not use any reentrancy property of kioslaves, nor # for sub protocols. **

I recently installed Mandrake 10.0 community. I used a new empty account, to avoid problems with old config files. The KDE there is labelled KDE 3.2 BRANCH >= 20040204. I did again the tests in this bug description.

First test : browsing a local zip file in Konqueror (downloaded from http://www.gutenberg.net/etext95/cavep10.zip ), which contains another zip. What happens when clicking on the outer then the inner zip ?

Result : on clicking on the outer zip, Konqueror uses the zip kioslave. Fine. The URL becomes zip:/tmp/cavep10.zip/ which is a limited scheme (see bug description). When I click on the first file, CAVEG10.ZIP, Konqueror asks "Open « zip:/tmp/cavep10.zip/CAVEG10.ZIP » ?". I answer yes. Now the URL becomes zip:/tmp/cavep10.zip/CAVEG10.ZIP and the display looks like a zip kpart, not a zip kioslave.

Comment : This is different from KDE 3.1 and better. Still, it is a bit strange that a kioslave is used first, then a kpart. The display is different in a kpart (no big icon views, different context menu, etc...), whereas a kioslave looks more integrated. Still no # in URL.


Second test : clicking on a remote zip accessible via http http://www.gutenberg.net/etext95/cavep10.zip . Http server sets  Content-Type: application/zip which is correct IMHO. 

Result : Konqueror displays the ark kpart, with its drawbacks. URL displayed is http://www.gutenberg.net/etext95/cavep10.zip which is nice.

Comment : similar to the zip-in-a-zip case. It is indeed zip-in-a-http, or more precisely, a kpart accessing a file through the http kioslave.

Second test one step further : now clicking on a zip inside the zip accessible through http. Konqueror opens the ark application and displays the content.

Overall conclusion : this version of KDE, compared with KDE 3.1 (vanilla Mandrake 9.1), inserts an ark kpart between the kioslave and the ark application. But this is not reentrant. The limited URL scheme zip:/path is still used. I couldn't make Konqueror understand any usage of # for sub-protocols.

Is this intentional ? I can understand that sometimes things mustn't become too transparent and seem too easy, or the user may abuse them, like innocently editing a text file in an enormous archive on a remote server and wondering why it crawls.

Please comment.
Comment 4 Tim Tye 2004-05-26 16:47:11 UTC
A very common use for this feature would be browsing Java Enterprise Archives (ear) files.  The ear file (basically a zip file) contains jar files (also zip files) which may also contain more jar files.  Currently you must extract the jar in the ear and then extract the jar in the jar to finally browse for the information you were looking for.
Comment 5 Nicolas Goutte 2005-05-15 17:57:56 UTC
See also bug #96629
Comment 6 Thiago Macieira 2005-05-15 19:29:20 UTC
*** Bug 96629 has been marked as a duplicate of this bug. ***
Comment 7 Thiago Macieira 2005-05-15 19:31:40 UTC
*** Bug 102265 has been marked as a duplicate of this bug. ***
Comment 8 Jos van den Oever 2005-05-16 11:30:21 UTC
Please read the comments in bug 102265. They contain useful suggestions for url schemes.
Comment 9 Thiago Macieira 2005-06-08 05:54:38 UTC
*** Bug 106957 has been marked as a duplicate of this bug. ***
Comment 10 Michal Svec 2005-06-08 14:17:11 UTC
BTW midnight commander uses # for referencing further recursion. For example
if you want to access a file C from ta.tar.bz2 which is in tb.zip, use:

  /tmp/x/tb.zip#uzip/ta.tar.bz2#utar/C
Comment 11 Nick Shaforostoff 2005-11-13 11:25:30 UTC
remote zip files should not be downloaded (cached locally) by the ioslave just to list their contents - the ioslave must check if there is a dlresume available and download _zip header only_

based on the header there is a way to dl only selected files from zip-archive, but not all of them
Comment 12 Nicolas Goutte 2006-02-12 17:04:17 UTC
See also bug #121773 

To comment #11: the header of a ZIP file is split between somewhere in the file (local headers) and at the end (central directory). So you will have to download the whole ZIP file.

Have a nice day!
Comment 13 Shriramana Sharma 2006-03-03 03:49:07 UTC
Even Windows nowadays uses D:\Docs\foo.zip\readme.txt. The kioslaves zip:/ and tar:/ should be made transparent to file:// and http:// so that we do not get a separate zip:/ and tar:/ display which we cannot pass as URLs to other people. Added 20 votes.
Comment 14 Stéphane Gourichon 2006-03-03 09:35:00 UTC
Following comment #12 : I think that Nick in comment #11 meant that if the protocol used can download only part of the file (ftp command REST and many http servers can), then the kioslave does not need to download everything.

To browse the zip, it may only download a small header at the beginning, see at what position in the remote file the "central directory" is, then download only the "central directory" to list files by asking the server only a specific range of bytes to download. So, no need to download the whole zip file to list the content.

Again, when opening a file inside the zip, the same trick can be used to transfer only the part of the zip needed to decode that file.
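As a rough illustration of the partial-download idea, the Python sketch below locates the central directory from the tail of an archive only. In the ZIP format the pointer to the central directory lives in the end-of-central-directory record at the end of the file, so the sketch reads the tail rather than the beginning. The function name locate_central_directory is hypothetical, the archive is built in memory instead of being fetched, and ZIP64 archives or a signature that happens to appear inside the archive comment are not handled.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"
EOCD_FIXED_SIZE = 22
MAX_TAIL = EOCD_FIXED_SIZE + 65535  # record plus the longest possible comment


def locate_central_directory(tail):
    """Return (offset, size, entry_count) from the last bytes of a zip."""
    pos = tail.rfind(EOCD_SIGNATURE)
    if pos < 0 or len(tail) - pos < EOCD_FIXED_SIZE:
        raise ValueError("no end-of-central-directory record in tail")
    # signature, disk numbers, entry counts, directory size and offset, comment length
    fields = struct.unpack("<4sHHHHIIH", tail[pos:pos + EOCD_FIXED_SIZE])
    _sig, _disk, _cd_disk, _entries_here, entries, cd_size, cd_offset, _comment = fields
    return cd_offset, cd_size, entries


# Build a small archive in memory to stand in for the remote file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("cavep.eng", b"english text")
    archive.writestr("cavep.fr", b"texte francais")
data = buffer.getvalue()

# A client would request only these last bytes, then only the directory range.
tail = data[-MAX_TAIL:]
cd_offset, cd_size, entries = locate_central_directory(tail)
central_directory = data[cd_offset:cd_offset + cd_size]
print(cd_offset, cd_size, entries)
```

Whether this saves anything in practice depends on the transport being able to serve an arbitrary byte range.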
Comment 15 Thiago Macieira 2006-03-03 10:14:15 UTC
To Comment #13: yes, that's right. Windows has caught up with us now. Who will implement Zip-over-HTTP first is the question now (since they don't).

Comment #14: you guys still do not understand the concept of retrieving a remote file. You cannot retrieve a *section* of a remote file from FTP: you can only restart the transfer from a certain point and it will continue until the *end* of the file.

If I have the time, I'll work on support for this kind of filtering for local files *only*. This is what this wish is asking for.
Comment 16 Andy 2006-03-03 10:22:45 UTC
If you can (re)start the transfer at any point, isn't that sufficient? You don't have to download the file to the end, you can always terminate the connection earlier. Of course that incurs the overhead of having to reestablish the connection for the next access, but depending on the size of the zip and the speed of the connection that might well be worth it compared to having to download the whole thing.
Comment 17 Michal Svec 2006-03-03 10:33:15 UTC
Please note that, IMHO, any form of transparent access is sufficient for most users; they don't care very much about what gets transferred (right now they have to download the complete file anyway).

So I'd suggest to start with some simple implementation (like download/cache complete files if not possible to list them remotely) and improve later on.

Using that Midnight Commander as an example again, it of course downloads whole ZIP files over FTP, but nobody cares, because it _just works_.
Comment 18 Stéphane Gourichon 2006-03-03 11:13:32 UTC
To Thiago, comment #15 : the http Content-Range field allows any segment to be specified. RFC 2616 gives this example on http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.16

>       . The second 500 bytes:
>       bytes 500-999/1234
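For readers unfamiliar with these headers, here is a small Python sketch of the two sides of a byte-range exchange. In HTTP the client asks for a segment with a Range request header, and the server describes what it returned in Content-Range, which is the form quoted above. The helper names range_header and parse_content_range are hypothetical, and no network access is performed.

```python
import re


def range_header(first, last):
    """Build the request header asking for bytes first..last inclusive."""
    return {"Range": f"bytes={first}-{last}"}


def parse_content_range(value):
    """Parse 'bytes first-last/total' into (first, last, total).

    total is None when the server reports an unknown length as '*'.
    """
    match = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+|\*)", value.strip())
    if match is None:
        raise ValueError(f"unsupported Content-Range: {value!r}")
    first, last, total = match.groups()
    return int(first), int(last), None if total == "*" else int(total)


print(range_header(500, 999))
print(parse_content_range("bytes 500-999/1234"))
```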

To Michal, comment #17 : you're right. This bug is not about optimizing bandwidth issues in zip kioslave but about a handy URL scheme and its support to access files in zip-in-zip and zip-through-another-kioslave.

To Shriramana, comment #13 : the feature seems to work only partially in WinXP SP2.

C:\path\myzip.zip\mydir opens mydir in explorer (but does not show the whole path, only "mydir", although explorer preferences are set to "show full path")

C:\path\myzip.zip\mydir\myfile.txt does not open myfile. 

Browsing in a zip and opening a file associated with a browser shows that the URL sent to the browser looks like C:\SomePathToATempDir\Temporary Directory X for myzip.zip\mydir\my\path\to\file.html .
So, zip in Windows explorer does not have an integrated design. KDE is more advanced than Windows IMHO.

By the way, it is funny how explorer uses forward slashes when browsing inside zips. Using forward slashes where backslashes are expected has worked for ages in Windows (since DOS actually), did you know that ?

Back to the issue, what we need now is attention of a/some person(s) fluent enough in C++ / KDE internals (or motivated to learn/try) to offer a patch... and, in the meantime, bring more votes, bring more attention.
Comment 19 Thiago Macieira 2006-03-03 11:19:16 UTC
You have me already.
Comment 20 Andy 2006-03-03 11:22:01 UTC
Comment #13: the suggestion to make zip transparent to just http: and file: is not a good  one. What about ftp:, sftp:, media:, and so on? Hacking transparency into all these is not a scalable solution.

The Windows solution is dodgy anyway. It relies on file name extensions, which were always a bad idea and which might lead to problems when combined with MIME types. It's also ambiguous, because foo.zip in D:\Docs\foo.zip\readme.txt could just be the name of a directory.

On the KDE side, the problem with things like zip or tar is that they don't really fit in with the concept of URIs and IO slaves. They are data filters rather than data sources like http: or file: are.

Midnight commander has the concept right, but the '#' character is an unfortunate choice for the actual syntax, because that's already used as the fragment identifier in URIs.
Comment 21 Stefan Monov 2006-03-03 11:31:24 UTC
Re: comment #20
It's not ambiguous, since you can't have a file and directory with the same name in the same directory.
To the user zip, tar.gz, etc. are just directories that take up less space. And a tar is just a directory to them - the user should not *need* to care about implementation.
Therefore, the scheme proposed in comment 13 seems reasonable.
The # syntax would do nothing but complicate things for users because it'd require them to be careful about things that they don't want to know about.
Comment 22 Nicolas Goutte 2006-03-03 12:14:02 UTC
On Friday 03 March 2006 09:35, Stéphane wrote:
(...)
>
> To browse the zip, it may only download a small header at the beginning,
> see at what position in the remote file the "central directory" is, then
> download only the "central directory" to list files by asking the server
> only a specific range of bytes to download. So, no need to download the
> whole zip file to list the content.


To tell it again: retrieving the central directory means retrieving 100% of the 
ZIP file, as this directory is at the end of the file at a non-predetermined 
place. (There is no "small header at the beginning" of a ZIP file. A ZIP file 
starts with the header of the first file.)

(...)

Have a nice day!
Comment 23 P Scholl 2006-03-03 12:32:10 UTC
Perhaps this would be possible using Fuse KIO Gateway (http://kde.ground.cz/tiki-index.php?page=KIO+Fuse+Gateway) or a similar mechanism. FUSE stands for File System in Userspace, i.e. the possibility to write applications that handle Linux VFS layer requests.

Think of some special directory (/home/user/kioslaves/) which includes all possible kioslaves (/home/user/kioslaves/file:/ or /home/user/kioslaves/http:/ or /home/user/kioslaves/smb:/ or /home/user/kioslaves/zip:/). If implemented right chaining would be possible without changing any kioslaves, simply by opening a "local file" like /home/user/kioslaves/zip://../http://somehost.org/test.zip, which would select the zip kioslave opening /home/user/http://somehost.org/test.zip. 
Comment 24 Thiago Macieira 2006-03-03 12:38:30 UTC
KDE will not use FUSE in KDE4: we will use KIO. If people want to write FUSE gateways, we'll be happy to accommodate. It's even possible to distribute it as part of the standard KDE for the benefit of non-KDE applications.

But the semantics of remote file access and local file access are very different. We believe applications should know they're accessing slow, remote files, so we won't use FUSE in KDE applications.
Comment 25 Nicolas Goutte 2006-03-03 12:54:28 UTC
On Friday 03 March 2006 10:14, Thiago Macieira wrote:
(...)
> If I have the time, I'll work on support for this kind of filtering for
> local files *only*. This is what this wish is asking for.


This exact wish perhaps but not some of the duplicate wishes. The wanted goal 
is really chaining of KIO slaves.

Have a nice day!
Comment 26 Andy 2006-03-03 13:25:18 UTC
Re comment #20:

A Windows-style scheme would be ambiguous in the sense that you couldn't tell from the address alone how to actually access a file. Instead of just requesting a file with its full path from the server or local file system, you'd need to look at every "directory" in the path to see whether it isn't in fact an archive of some sort and handle it accordingly.

This complicates the implementation and it also has a (small) performance impact.

Question is, are those drawbacks worth it? With the Midnight Commander approach archives can already be used like directories. The only visible difference to the Windows approach is in what appears in the address bar, which users usually don't care about anyway. Yet sometimes they might actually want to know exactly where the file they're looking at is located.
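The per-component check described in this comment can be sketched in Python as follows. The helper name split_at_archive is hypothetical; it walks the path from the root downwards and stops at the first component that is a regular file, treating the rest as a path inside that file. Each step costs one file-system lookup, which is the small performance impact mentioned above, and a directory that merely happens to be named foo.zip is correctly left alone.

```python
import tempfile
from pathlib import Path


def split_at_archive(path):
    """Return (file_path, inner_path) for a Windows-style transparent path.

    inner_path is '' when no component before the end is a regular file.
    """
    target = Path(path)
    # Prefixes ordered from the root down to the full path.
    prefixes = list(target.parents)[::-1] + [target]
    for prefix in prefixes:
        if prefix.is_file():  # one file-system lookup per component
            if prefix == target:
                return prefix, ""
            return prefix, target.relative_to(prefix).as_posix()
    return target, ""


with tempfile.TemporaryDirectory() as root:
    # The content does not matter here; only the file/directory distinction does.
    (Path(root) / "foo.zip").write_bytes(b"placeholder archive bytes")
    archive, inner = split_at_archive(f"{root}/foo.zip/docs/readme.txt")
    demo_result = (archive.name, inner)

print(demo_result)
```

A real implementation would additionally have to inspect the file to decide which archive handler applies.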
Comment 27 Thiago Macieira 2006-03-03 13:58:51 UTC
I will probably use # or an ugly syntax starting with multi:.

(I'm not asking for input; I'm saying what I plan to do)
Comment 28 Jos van den Oever 2006-03-03 15:49:59 UTC
With respect to chaining io slaves, the original topic.

What I want for an indexer for my desktop is to search all files even if they are in a nested zip file. I am now writing code to access this data easily and fast for any type of nested file. This means that something like
/home/oever/work.zip/old.zip/resume.html will be indexed.
When this work is finished, I will propose it for Kat.
Indexing this data is less useful if Konqueror cannot handle the mentioned URL. I'm not exactly sure how to implement that.

My current implementation is similar to the java class java.io.InputStream. There you can also read nested files as a stream. This is great for indexing all content, but less suited, though doable for random access.

Discussing an extension to the KIO API for nested files in a streaming manner would be ok and such an implementation is doable.

If someone's interested I can post the work in progress. Currently I can read zip and gzip. Work on tar and bz2 is next. Other nested formats such as rpm may follow. Jar files and opendocument files are really zip files, so they are covered.

Comment 29 Thiago Macieira 2006-03-03 16:15:22 UTC
The URI format I will use for "chained" IOSlaves will probably be:

multi:(original-URL)#(filter1)#(filter2)#...

So, your URL will become:
multi:file:///home/oever/work.zip#zip:/old.zip#zip:/resume.html

Why repeat "zip:/"? Because "extensions" are overrated. You could have:
multi:file:///home/oever/MyZipFile#zip:/FileInsideZip#zip:/HtmlFile

I don't care if it's ugly. I don't care if users can't type it: they are not *supposed* to. This URI should be generated by browsing through files.

This will also allow for:
multi:fish://username@remotehost/home/oever/MyZipFile#zip:/FileInsideZip#zip:/HtmlFile

But it will retrieve the whole file.
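Since these URIs are meant to be generated while browsing rather than typed, the following Python sketch shows how a file manager might build one step by step. The helper names enter and make_multi are hypothetical and the format is only the tentative one described in this comment; an original URL that itself contains "#" would need escaping, which the sketch does not attempt.

```python
def enter(uri, filter_name, inner_path):
    """Append one filter step, as if the user opened inner_path in the view."""
    return f"{uri}#{filter_name}:/{inner_path.lstrip('/')}"


def make_multi(original_url, steps=()):
    """Build 'multi:(original-URL)#(filter1)#(filter2)...' from browsing steps."""
    uri = "multi:" + original_url
    for filter_name, inner_path in steps:
        uri = enter(uri, filter_name, inner_path)
    return uri


uri = make_multi("file:///home/oever/work.zip")
uri = enter(uri, "zip", "old.zip")      # user opens old.zip inside work.zip
uri = enter(uri, "zip", "resume.html")  # user opens resume.html inside old.zip
print(uri)
```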
Comment 30 Jos van den Oever 2006-03-03 16:19:36 UTC
I completely agree that extensions are overrated. So are the first 4 bytes of a file. I have no strong opinion on the URL, but a nicely readable format would be nice. The icons in KDE are not determined by extension alone, and neither should the choice of kioslave be. It should also look at content.

This is the reason why the streaming implementation I'm working on has a rewind functionality. If one kio_slave fails, just rewind and try another one.
The size of the rewind buffer is of course finite and not very big, but it needn't be.
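The rewind-and-retry idea can be sketched in Python with the standard library decoders standing in for kioslaves. This is not the Strigi code; the function names are hypothetical, a seekable in-memory stream replaces the bounded rewind buffer, and each handler returns its decoded result directly rather than a new stream.

```python
import bz2
import gzip
import io
import zipfile


def try_zip(stream):
    with zipfile.ZipFile(stream) as archive:
        return "zip", archive.namelist()


def try_gzip(stream):
    return "gzip", gzip.GzipFile(fileobj=stream).read()


def try_bzip2(stream):
    return "bzip2", bz2.decompress(stream.read())


def open_any(stream):
    """Try each handler on the content; rewind after every failure."""
    start = stream.tell()
    for handler in (try_zip, try_gzip, try_bzip2):
        try:
            return handler(stream)
        except (zipfile.BadZipFile, OSError, EOFError, ValueError):
            stream.seek(start)  # rewind so the next handler sees the same bytes
    return "raw", stream.read()


# The stream carries no file name, so only the content can decide.
print(open_any(io.BytesIO(gzip.compress(b"hello"))))
print(open_any(io.BytesIO(bz2.compress(b"hello"))))
print(open_any(io.BytesIO(b"plain text")))
```

The order of the handlers is arbitrary here; a real implementation would presumably try the cheapest checks first.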

Comment 31 Michal Svec 2006-03-03 18:15:46 UTC
Comment #29: sounds very good.

But is there a need for multi: keyword in the beginning? I mean do we need to differentiate between chained and unchained URIs?
Comment 32 Thiago Macieira 2006-03-03 18:24:46 UTC
That depends on how I implement it. But it's very likely, yes.

After discussing on IRC, we came to the conclusion that the filter name after the # will be unnecessary. But I'll refrain from further speculation until I or someone else writes the code.
Comment 33 Michal Svec 2006-03-03 18:30:26 UTC
OK, that sounds surprising (how will you know the file type then), but anyways, let's see how it works out. Thanks!
Comment 34 Stefan Monov 2006-03-03 18:51:49 UTC
Re: comment #26
Exposing technical details to the user may be worse than complicating internals, depending on mindset and resources. However, Thiago said that the user won't deal with this, so it's okay.
Comment 35 Thiago Macieira 2006-03-09 21:58:10 UTC
*** Bug 19229 has been marked as a duplicate of this bug. ***
Comment 36 Kevin Ottens 2006-06-10 12:08:52 UTC
*** Bug 122078 has been marked as a duplicate of this bug. ***
Comment 37 Jos van den Oever 2006-09-14 09:41:46 UTC
An actual implementation of chaining kio-slaves is now provided by Strigi.
http://websvn.kde.org/trunk/playground/base/strigiapplet/src/jstream/
This kioslave is required there because it must be possible to open search results from nested files.
The jstream kioslave can open any file that Strigi can parse, which includes files nested in rpm, deb, tar, bz2, gz, zip and email files.
The uris are very simple, e.g. jstream:/home/a/b.zip/c.rpm/README

Although this does not solve the complete problem as described above, it does handle the most common use case, and therefore this code might be a good start for doing this in KDE4.
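To illustrate how a plain path such as b.zip/c.rpm/README can be resolved without any marker in the URI, here is a Python sketch restricted to zip-in-zip. It is not the jstream implementation; the function name read_nested is hypothetical, whole members are read into memory instead of being streamed, and formats other than zip are not handled.

```python
import io
import zipfile


def read_nested(data, inner_path):
    """Resolve 'old.zip/dir/README' inside the zip archive held in `data`.

    The longest-matching rule is not needed here: the first path prefix that
    names a member is opened, and the remainder is resolved inside it.
    """
    parts = inner_path.split("/")
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = set(archive.namelist())
        for i in range(1, len(parts) + 1):
            candidate = "/".join(parts[:i])
            if candidate in names:
                payload = archive.read(candidate)
                rest = "/".join(parts[i:])
                return read_nested(payload, rest) if rest else payload
    raise FileNotFoundError(inner_path)


def build_zip(entries):
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, payload in entries:
            archive.writestr(name, payload)
    return buffer.getvalue()


inner = build_zip([("docs/README", b"hello from the inner archive")])
outer = build_zip([("old.zip", inner), ("notes.txt", b"outer notes")])
print(read_nested(outer, "old.zip/docs/README"))
```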
Comment 38 Martin Rehn 2006-09-14 18:15:27 UTC
What does "jstream" mean? (e.g., what is the "j" and the "stream"?)
Comment 39 Jos van den Oever 2006-09-14 21:44:02 UTC
In java you have a class InputStream. Many derived classes take an InputStream as input. You are then chaining these streams together. There is e.g. the ZipInputStream. This takes one InputStream and gives out a new InputStream for each file in the ZipInputStream. Strigi uses this idea, hence the 'j'. The streams in Strigi are much more efficient than those in java because they share their buffers as much as possible.
Comment 40 Christian Loitsch 2006-10-30 16:10:18 UTC
I think one of the summer of code projects implemented random access with kio-slaves.  Would this avoid "downloading" the whole file as mentioned in comment #29, or is there another reason I don't see?

Comment 41 Tommi Tervo 2006-12-05 14:17:16 UTC
*** Bug 138388 has been marked as a duplicate of this bug. ***
Comment 42 Richard Hartmann 2008-02-17 11:40:38 UTC
As it has been more than a year since the last message, I just wanted to ask if KDE4 & Strigi solve this item. Sorry, no current KDE4 handy to check :/
Comment 43 Laurens Vanhove 2009-01-26 20:42:28 UTC
One more year has passed and nothing seems to have changed :-(
Is anyone working on this ?
Comment 44 Jos van den Oever 2009-01-26 20:52:09 UTC
I am porting jstream:/ to KDE4. This allows you to open e.g. jstream://home/you/x.zip/y.tar.bz2/hello.jpg

http://websvn.kde.org/trunk/playground/base/strigiplasmoid/src/jstream/
Comment 45 Tony Sivori 2010-02-08 17:51:26 UTC
Konqueror won't open Web Archive files (.war) that have been created and then copied to CD or to a second internal hard drive. KDE 3.5.10 on Kubuntu 8.04.

Is no one interested in fixing this six year old bug?
Comment 46 Nate Graham 2018-04-16 19:31:34 UTC
Still an issue in KDE Frameworks 5.45.
Comment 47 David Faure 2018-05-09 21:22:24 UTC
https://phabricator.kde.org/D11155 is the pending patch to parse nested zip files correctly.
Comment 48 Albert Astals Cid 2018-05-20 21:48:48 UTC
Git commit 07e5678981c61158aeb7226ca02bfa5891e1b5e8 by Albert Astals Cid, on behalf of Martin Tobias Holmedahl Sandsmark.
Committed on 20/05/2018 at 21:48.
Pushed by aacid into branch 'master'.

handle zip files embedded within zip files

Summary:
if we need to fetch the sizes from a PK78 header, there might be a PK34 header before it if there is a zip file embedded (e. g. an epub within a zip).
FIXED-IN: 5.47

Test Plan: autotest in a separate commit

Reviewers: dfaure, #frameworks

Reviewed By: dfaure

Subscribers: ngraham, #frameworks

Tags: #frameworks

Differential Revision: https://phabricator.kde.org/D11155

R  +10   -9    kzip.cpp [from: src/kzip.cpp - 098% similarity]

https://commits.kde.org/karchive/07e5678981c61158aeb7226ca02bfa5891e1b5e8
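As background to the commit summary above, the Python sketch below shows why an embedded archive confuses a reader that scans for signatures. "PK34" refers to the local file header signature (the bytes P, K, 0x03, 0x04) and "PK78" to the optional data descriptor signature (P, K, 0x07, 0x08). When a zip is stored uncompressed inside another zip, the inner local header signature appears verbatim in the outer entry's data. The sketch does not reproduce the KArchive change; the helper names and file names are hypothetical.

```python
import io
import zipfile

LOCAL_HEADER = b"PK\x03\x04"  # "PK34": start of a local file header


def build_zip(entries):
    """Create an uncompressed (stored) zip archive in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
        for name, payload in entries:
            archive.writestr(name, payload)
    return buffer.getvalue()


def signature_offsets(data, signature):
    """Return every offset at which `signature` occurs in `data`."""
    offsets, pos = [], data.find(signature)
    while pos >= 0:
        offsets.append(pos)
        pos = data.find(signature, pos + 1)
    return offsets


inner = build_zip([("chapter1.txt", b"hello")])
outer = build_zip([("book.epub", inner)])

offsets = signature_offsets(outer, LOCAL_HEADER)
# The first hit is the outer entry's own header; a later hit lies inside its data.
print(offsets, outer.find(inner))
```

A scanner that treats the next signature as the end of the current entry would therefore stop too early on such a file.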