What's the most efficient way to process really large binary files in Haskell?
The standard answer is to read the entire file as a lazy ByteString and then use something like the binary package to write a parser over it. There are a couple of problems with that...
First, libraries like Binary don't really handle parse failure, and I'm explicitly expecting parsing to fail sometimes.
Second, I'm not parsing the entire file contents. I'm going to be skipping over large chunks of it. And reading gigabytes of data from disk into RAM only to have the garbage collector throw it away again seems rather unperformant.
Related to that, I need to be able to tell if the skip I want to perform will take me off the end of the file or not (and error out if it does).
I may also need to seek backwards, or maybe to a specific byte offset within the file, which does not appear to be well-supported by a lazy ByteString approach. (There's a severe danger of ending up holding the entire file in RAM.)
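For concreteness, here's roughly what the standard approach looks like (big.bin is a made-up file name; newer versions of binary do at least provide runGetOrFail for reporting failures, but skip still streams every skipped byte through RAM, which is exactly the problem):
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get
import Data.Word

main :: IO ()
main = do
  bytes <- BL.readFile "big.bin"  -- lazily reads the whole file
  case runGetOrFail parser bytes of
    Left (_, off, err) -> putStrLn ("parse failed at byte " ++ show off ++ ": " ++ err)
    Right (_, _, v)    -> print v
  where
    parser :: Get Word32
    parser = do
      skip 1024     -- "skipping" still forces these bytes to be read
      getWord32be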
The alternative, of course, is to read individual bytes one by one, interleaved with hSeek commands. But now the problem is: how efficient is reading a file one byte at a time? That sounds like it could also be quite slow. I'm not sure whether hSetBuffering has any effect on this.
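For what it's worth, here's a rough sketch of the Handle-based alternative, reading in chunks rather than single bytes (readChunkAt is a made-up helper, not from any library):
import qualified Data.ByteString as BS
import System.IO

-- Seek to an absolute offset and read a fixed-size chunk, checking
-- against the file size first so a bad skip errors out instead of
-- silently returning a short read.
readChunkAt :: Handle -> Integer -> Int -> IO BS.ByteString
readChunkAt h off len = do
  size <- hFileSize h
  if off + fromIntegral len > size
    then ioError (userError "seek/read past end of file")
    else do
      hSeek h AbsoluteSeek off
      BS.hGet h len  -- one buffered read, not byte-at-a-time I/O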
Then of course there's mmap. But that seems to freak out the virtual memory system if used on large files. (Which is odd, considering that's the entire reason it exists...)
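A minimal sketch of a windowed variant, assuming the mmap package (file name and offsets are made up): mapping only a range of the file avoids asking the VM system to fault in the whole thing at once.
import qualified Data.ByteString as BS
import System.IO.MMap (mmapFileByteString)

main :: IO ()
main = do
  -- map a 4096-byte window starting at byte offset 1048576
  bs <- mmapFileByteString "big.bin" (Just (1048576, 4096))
  print (BS.take 16 bs)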
What do we think, folks? What's the best way to approach this, in terms of I/O performance and code maintainability?
I had a similar issue when working on a PDF parser. Initially I used the iteratee package (it supports random access); AFAIK it is the only IO library with random access support.
My current approach is based on the io-streams package. I found it more convenient: performance is good enough, attoparsec integration comes out of the box, and a lot of combinators are included.
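Here is a minimal sketch of the io-streams approach, under the assumption that you position the Handle before wrapping it (once the stream has buffered data, seeking the underlying Handle is not safe); it reuses the in.data file from the iteratee example below:
import qualified Data.Attoparsec.ByteString.Char8 as P
import qualified System.IO.Streams as Streams
import qualified System.IO.Streams.Attoparsec as Streams
import System.IO

main :: IO ()
main = withBinaryFile "in.data" ReadMode $ \h -> do
  hSeek h AbsoluteSeek 20                 -- position before wrapping
  is <- Streams.handleToInputStream h
  n  <- Streams.parseFromStream (P.decimal :: P.Parser Integer) is
  print n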
Here is a basic example of how to use iteratee for random I/O:
shum@shum-laptop:/tmp/shum$ cat test.hs
import qualified Data.Iteratee as I
import qualified Data.Attoparsec.Iteratee as I
import qualified Data.Attoparsec.Char8 as P
import Control.Monad.IO.Class
import System.Environment
main :: IO ()
main = do
  [file] <- getArgs
  flip I.fileDriverRandom file $ do
    I.seek 20                            -- jump to absolute byte offset 20
    num1 <- I.parserToIteratee P.number  -- run an attoparsec parser from there
    liftIO $ print num1
    I.seek 10                            -- seek backwards to offset 10
    num2 <- I.parserToIteratee P.number
    liftIO $ print num2
shum@shum-laptop:/tmp/shum$ cat in.data
111111111
222222222
333333333
shum@shum-laptop:/tmp/shum$ runhaskell test.hs in.data
333333333
222222222
shum@shum-laptop:/tmp/shum$