Fastest way to read large binary file in Haskell?

Tags: haskell

I want to process a binary file that is too large to read into memory. Currently I use ByteString.Lazy.readFile to stream the bytes. I thought it would be a good idea to use the streaming package to make my program faster. However, the documentation for readFile says:

readFile :: FilePath -> (Stream (Of String) IO () -> IO a) -> IO a

Read the lines of a file, using a function of the type: 'Stream (Of String) IO () -> IO a' to turn the stream into a value of type 'IO a'.

So the streaming package only reads ASCII text files? Can I use this package to read a binary file as bytes?

paperduck asked Apr 23 '19 05:04


2 Answers

To elaborate on @Cubic's comment, while there's a general consensus that lazy I/O should be avoided in production code and replaced with a streaming approach, this is not directly related to performance. If you're writing a program to do some one-off processing of a large file, as long as you have a lazy I/O version running fine now, there's probably no good performance reason to convert it over to a streaming package.

In fact, streaming is more likely to add some overhead, so I suspect that in most cases a well-optimized lazy I/O solution would outperform a well-optimized streaming solution.

The main reasons for avoiding Lazy I/O have been previously discussed on SO. In a nutshell, lazy I/O makes it difficult to consistently manage resources (e.g., file handles and network sockets), makes it hard to reason about space usage (e.g., a small program change can cause your memory usage to explode), and is occasionally "unsafe" if the timing and ordering of the I/O in question matters (usually not a problem if you're just reading in one set of files and/or writing out another set of files).

Short-running utility programs for reading and/or writing large files are probably good candidates to be written in a lazy I/O style. As long as they don't have any obvious space leaks when they're run, they're probably fine.
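
As a rough illustration of the lazy I/O style described above, here is a minimal sketch that counts the zero bytes in a file with Data.ByteString.Lazy. The names countByte and countZerosInFile are made up for this example, and counting zero bytes is only a stand-in for whatever processing you actually need.

```haskell
import qualified Data.ByteString.Lazy as BL
import           Data.Int             (Int64)
import           Data.Word            (Word8)

-- | Count occurrences of a byte in a lazy ByteString.
-- BL.count walks the chunks one at a time, so chunks that have already
-- been consumed can be garbage collected, provided nothing else still
-- holds a reference to the input.
countByte :: Word8 -> BL.ByteString -> Int64
countByte = BL.count

-- | Hypothetical helper: count the zero bytes in a file using lazy I/O.
countZerosInFile :: FilePath -> IO Int64
countZerosInFile path = do
  contents <- BL.readFile path   -- the file is read lazily, chunk by chunk
  let n = countByte 0 contents
  n `seq` pure n                 -- force the count before returning
```

Since the count is forced before returning and nothing else keeps a reference to contents, this should run in roughly constant memory. Holding on to contents elsewhere in the program is the kind of small change that can cause the memory blow-up mentioned above.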

K. A. Buhr answered Sep 22 '22 21:09


With only streaming and bytestring, one can write something like:

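import           Data.ByteString (ByteString)
import qualified Data.ByteString   -- qualified, to avoid clashes with Prelude names
import           Streaming
import qualified Streaming.Prelude as S
import           System.IO

-- Yield the contents of a Handle as a stream of strict chunks,
-- each at most chunkSize bytes long.
fromHandle :: Int -> Handle -> Stream (Of ByteString) IO ()
fromHandle chunkSize h =
    -- Left yields a chunk and continues; Right ends the stream.
    S.untilRight $ do bytes <- Data.ByteString.hGet h chunkSize
                      pure $ if Data.ByteString.null bytes then Right ()
                                                           else Left bytes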

This uses hGet and null from bytestring, and untilRight from streaming. You will need to use withFile to get the Handle, and you must consume the Stream within the callback, because the Handle is closed once the callback returns:

dump :: FilePath -> IO ()
dump file = withFile file ReadMode go
 where
   go :: Handle -> IO ()
   go = S.mapM_ (Data.ByteString.hPut stdout) . fromHandle 4096 
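
To show the stream being consumed by something other than a copy to stdout, here is a sketch that sums the chunk lengths to get the file size. fileSize and the 32768-byte chunk size are my own choices for illustration, not part of either package. fromHandle is repeated from above so the block compiles on its own, and withBinaryFile is used in place of withFile as a precaution for binary data.

```haskell
import qualified Data.ByteString   as B
import           Streaming         (Of, Stream)
import qualified Streaming.Prelude as S
import           System.IO         (Handle, IOMode (ReadMode), withBinaryFile)

-- Same chunked reader as above, repeated so this block stands alone.
fromHandle :: Int -> Handle -> Stream (Of B.ByteString) IO ()
fromHandle chunkSize h =
  S.untilRight $ do
    bytes <- B.hGet h chunkSize
    pure $ if B.null bytes then Right () else Left bytes

-- | Hypothetical helper: total size in bytes, computed one chunk at a time.
-- The stream is fully consumed inside the withBinaryFile callback, so the
-- Handle is still open whenever a chunk is requested.
fileSize :: FilePath -> IO Int
fileSize path =
  withBinaryFile path ReadMode $ \h ->
    S.sum_ (S.map B.length (fromHandle 32768 h))
```

Only one chunk is held at a time, so memory use should stay bounded by the chunk size regardless of how large the file is.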
danidiaz answered Sep 23 '22 21:09
