
How to take lazy bytestring from zip archive without heap overflow

I want to take the first five bytes from the first file in a zip archive. I use the zip-archive package for decompression:

import qualified Data.ByteString.Lazy as L
import Data.Maybe
import System.Environment (getArgs)

import Codec.Archive.Zip

main = do
    f:_ <- getArgs
    print . L.take 5 . fromEntry . head . zEntries . toArchive =<< L.readFile f

This code works for small archives, but I get a heap overflow with big ones. For example:

./zip-arch test.zip +RTS -p -hy -M100M

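run on a large test archive, this exceeds the 100 MB heap limit set by -M100M (the linked archive and the heap profile image are not included here).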

asked Feb 10 '12 10:02 by tymmym


2 Answers

Consider calling out to unzip. It's not super haskelly but it does the job. Perhaps all the haters out there should spend more time fixing or replacing broken libraries like zip-archive and less time on stackoverflow.

Standard disclaimer: no error checking is present, this may leak handles, and lazy I/O is lazy.

import System.Environment (getArgs)
import System.IO (hSetBinaryMode)
import System.Process (StdStream(..), createProcess, proc, close_fds, std_out)

import qualified Data.ByteString.Lazy as L

unzipLBS :: FilePath -> IO L.ByteString
unzipLBS file = do
  let args = proc "unzip" ["-p", file]
      args' = args { std_out = CreatePipe, close_fds = True }

  (_, Just hOut, _, _) <- createProcess args'
  hSetBinaryMode hOut True
  L.hGetContents hOut

main :: IO ()
main = do
  f:_ <- getArgs
  print . L.take 5 =<< unzipLBS f

Seems to work:

$ runghc -Wall unzip.hs  ~/Downloads/test.zip
Chunk ",+\227F\149" Empty
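
For the handle-leak and lazy I/O caveats in the disclaimer above, here is a minimal sketch of a stricter variant. It is not part of the original answer: the names commandPrefix and unzipPrefix are hypothetical, and it assumes a version of the process package that provides withCreateProcess (1.4.3 or later) and an unzip binary on the PATH. It reads at most n bytes strictly and lets withCreateProcess close the pipe and clean up the child process.

```haskell
import qualified Data.ByteString as B
import System.Environment (getArgs)
import System.IO (hSetBinaryMode)
import System.Process
  ( CreateProcess (std_out), StdStream (CreatePipe), proc, withCreateProcess )

-- Hypothetical helper: run a command and strictly read at most n bytes of
-- its stdout. withCreateProcess closes the pipe and cleans up the child
-- process when the callback returns or throws.
commandPrefix :: Int -> FilePath -> [String] -> IO B.ByteString
commandPrefix n cmd args =
  withCreateProcess spec $ \_ mOut _ _ ->
    case mOut of
      Just hOut -> do
        hSetBinaryMode hOut True
        B.hGet hOut n -- strict read: blocks until n bytes arrive or EOF
      Nothing -> ioError (userError "no stdout pipe was created")
  where
    spec = (proc cmd args) { std_out = CreatePipe }

-- Hypothetical wrapper mirroring unzipLBS: `unzip -p` writes the extracted
-- data to stdout, so the prefix comes from the first file in the archive.
unzipPrefix :: Int -> FilePath -> IO B.ByteString
unzipPrefix n file = commandPrefix n "unzip" ["-p", file]

main :: IO ()
main = do
  f : _ <- getArgs
  print =<< unzipPrefix 5 f
```

Because the read is strict and bounded, memory use does not depend on the archive size, at the cost of having to choose the byte count up front. The exit code of unzip is still not checked.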
answered Sep 28 '22 08:09 by Nathan Howell


I read the explanation by the zip-archive author and decided to make the recommended repairs. I ended up with a new library, zip-conduit. Its main feature is constant memory usage without lazy IO. To take the first five bytes from the first file in a zip archive, you can write:

import           System.Environment
import           Data.Conduit
import qualified Data.Conduit.Binary as CB

import           Codec.Archive.Zip

main = do
    f:_ <- getArgs
    res <- withArchive f $ do
               name:_ <- fileNames
               source <- getSource name
               runResourceT $ source $$ CB.take 5
    print res
answered Sep 28 '22 08:09 by tymmym