zip-archive performance/memory usage

Hello,

I'm trying some Haskell scripting. I'm writing a script to print some
information from a zip archive. The zip-archive library does look nice, but
the performance of zip-archive/lazy bytestring doesn't seem to scale.

Executing:

eRelativePath $ head $ zEntries archive

on an archive of around 12 MB with around 20 files yields:

Stack space overflow: current size 8388608 bytes.

The script in question can be found at:

http://github.com/plaeremans/HaskellSnipplets/blob/master/ZipList.hs

I'm using the latest version of the Haskell Platform. Are these libraries
not production ready, or am I doing something terribly wrong?
kind regards,
Pieter
--
Pieter Laeremans
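
For reference, a minimal sketch of the pattern being described (a
reconstruction, not the exact ZipList.hs; it assumes only the documented
Codec.Archive.Zip API):

import qualified Data.ByteString.Lazy as B
import Codec.Archive.Zip
import System.Environment (getArgs)

main :: IO ()
main = do
  [path] <- getArgs
  -- toArchive parses the entire archive out of a lazy ByteString
  archive <- fmap toArchive (B.readFile path)
  -- print the relative path of the first entry
  putStrLn (eRelativePath (head (zEntries archive)))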

On 10 August 2010 09:29, Pieter Laeremans wrote:
> Are these libraries not production ready, or am I doing something
> terribly wrong?
I'm not familiar with the library, but I have a sneaking suspicion that
you're doing something terribly wrong, namely keeping the entire archive in
memory rather than using it lazily.

--
Ivan Lazar Miljenovic
Ivan.Miljenovic@gmail.com
IvanMiljenovic.wordpress.com

On Mon, Aug 9, 2010 at 4:29 PM, Pieter Laeremans wrote:
> Executing:
> eRelativePath $ head $ zEntries archive
> on an archive of around 12 MB with around 20 files yields:
> Stack space overflow: current size 8388608 bytes.
So it's a stack overflow at about 8 megs. I don't have a strong sense of
what is normal, but that seems like a small stack to me. Oh, actually I just
checked, and that is the default stack size :)

I looked at Zip.hs (included as an example). The closest thing I see to your
example is some code for listing the files in the archive. Perhaps you
should try the supplied program on your archive and see if it too has a
stack overflow. The line the author uses to list files is:

List -> mapM_ putStrLn $ filesInArchive archive

But you're taking the head of the entries, so I don't see how you'd be
holding on to too much data. I just don't see anything wrong with your
program. Did you remember to compile with optimizations? Perhaps try the
author's way of listing entries and see if performance changes?
> The script in question can be found at:
> http://github.com/plaeremans/HaskellSnipplets/blob/master/ZipList.hs
> I'm using the latest version of the Haskell Platform. Are these libraries
> not production ready, or am I doing something terribly wrong?
Not production ready would be my assumption. I think an iteratee style might
be more appropriate for these sorts of nested streams of potentially large
size anyway. I'm skeptical of anything that depends on lazy bytestrings or
lazy IO. In this case, the performance would appear to depend on lazy
bytestrings.

You might want to experiment with increasing the stack size. Something like
this:

./ZipList +RTS -K100M -RTS foo.zip

Jason
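
A sketch of both suggestions combined (a hypothetical variant of the
script, listing entries the way the bundled Zip.hs does, via
filesInArchive):

import qualified Data.ByteString.Lazy as B
import Codec.Archive.Zip
import System.Environment (getArgs)

main :: IO ()
main = do
  [path] <- getArgs
  archive <- fmap toArchive (B.readFile path)
  -- filesInArchive returns the relative paths of all entries
  mapM_ putStrLn (filesInArchive archive)

Compiled with optimizations and run with a larger stack, that would be
something like:

ghc -O2 --make ZipList.hs
./ZipList +RTS -K100M -RTS foo.zip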

I was interested to see if I could determine what was happening with this.
After some playing around, I noticed the code ran significantly faster if I
*didn't* compile it, but ran it with 'runghc' instead (running under ghci
was also fast).

Here are the running times I found. The 'Zip.hs' program comes with the
zip-archive package. The runtime of the compiled version didn't seem to be
affected by optimisations. Regardless, I'm quite surprised that running
interpreted was significantly faster than compiled.
time runghc ./Zip.hs -l ~/jdk1.6.0_05-src.zip 1.48s user 0.17s system 97% cpu 1.680 total
time ./dist/build/Zip/Zip -l ~/jdk1.6.0_05-src.zip 89.00s user 1.06s system 98% cpu 1:31.84 total
The file 'jdk1.6.0_05-src.zip' was just an 18 MB zip file I had lying
around. I'm using GHC 6.12.1.
Cheers,
--
David Powell

On 10/08/10 00:29, Pieter Laeremans wrote:
> The zip-archive library does look nice, but the performance of
> zip-archive/lazy bytestring doesn't seem to scale.
I downloaded your program and compiled it (GHC 6.12.1, zip-archive 0.1.1.6,
bytestring 0.9.1.5). I ran it on the JVM src.zip (20 MB, ~8000 files) and it
sat there for a minute (67s), taking 2.2% memory according to top, then
completed successfully. Same behaviour with -O2. That compares very badly in
time to the instant return when I ran unzip -l on the same file, but I
didn't see any memory problems. Presumably your archive is valid and works
with unzip and other tools?

Thanks,

Neil.

Hi all,

I'm the author of zip-archive. I wrote it for a fairly special purpose -- I
wanted to create and read ODT files in pandoc -- and I know it could be
improved.

The main problem is that the parsing algorithm is kind of stupid; it just
reads the whole archive in sequence, storing the files as it comes to them.
So a file listing will take almost as much time as a full extract.

There's a better way. The zip archive ends with an "end of central
directory record", which contains (among other things) the offset of the
central directory from the beginning of the file. So one could use
something like the following strategy:

1. Read the "end of central directory record", which should be the last 22
bytes of the file. I think it should be possible to do this without
allocating memory for the whole file.

2. Parse that to determine the offset of the central directory.

3. Seek to the offset of the central directory and parse it. This will give
you a list of file headers. Each file header will tell you the name of a
file in the archive, how it is compressed, and where to find it (its
offset) in the file.

At this point you'd have the list of files, and enough information to seek
to any file and read it from the archive. The API could be changed to allow
lazy reading of a single file without reading all of them.

I don't think these changes would be too difficult, since you wouldn't have
to change any of the functions that do the binary parsing -- it would just
be a matter of changing the top-level functions.

I don't have time to do this right now, but if one of you wants to tackle
the problem, patches are more than welcome! There's some documentation on
the ZIP format in comments in the source code.

John
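
A sketch of steps 1 and 2 (a hypothetical helper, not part of zip-archive;
it assumes the archive has no trailing comment, so the end of central
directory record is exactly the last 22 bytes of the file):

import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (runGet, skip, getWord32le)
import Data.Word (Word32)
import System.IO

-- EOCD layout: signature (4 bytes), disk numbers (2+2), entry counts
-- (2+2), central directory size (4), central directory offset (4),
-- comment length (2) -- 22 bytes in all, so the offset field starts
-- at byte 16 of the record.
centralDirOffset :: FilePath -> IO Word32
centralDirOffset path =
  withBinaryFile path ReadMode $ \h -> do
    hSeek h SeekFromEnd (-22)      -- read only the tail of the file
    eocd <- BL.hGet h 22
    return (runGet (skip 16 >> getWord32le) eocd)

Step 3 would then hSeek to that offset and run the existing central
directory parser over what follows.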

Thanks for your comments, John.
I appreciate your work. I think pandoc is fantastic!
I'm interested in solving this problem, but time is also an issue.
I'll try to toy around with it.
Thanks,
Pieter
--
Pieter Laeremans
participants (6)

- David Powell
- Ivan Lazar Miljenovic
- Jason Dagit
- John MacFarlane
- Neil Brown
- Pieter Laeremans