Most efficient way to seek around in a large file

What's the most efficient way to process really large binary files in Haskell?

The standard answer is to read the entire file as a lazy ByteString and then use something like the binary package to write a parser over it. There are a couple of problems with that...

First, libraries like Binary don't really handle parse failure, and I'm explicitly expecting parsing to fail sometimes.

Second, I'm not parsing the entire file contents. I'm going to be skipping over large chunks of it. And reading gigabytes of data from disk into RAM only to have the garbage collector throw it away again seems rather inefficient.

Related to that, I need to be able to tell if the skip I want to perform will take me off the end of the file or not (and error out if it does).

I may also need to seek backwards, or maybe to a specific byte offset within the file, which does not appear to be well-supported by a lazy ByteString approach. (There's a severe danger of ending up holding the entire file in RAM.)
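In principle the bounds check itself is easy enough with plain Handle IO, since hFileSize gives you the length up front. Something like this is what I have in mind (seekChecked is just a name I made up, not a library function):

import System.IO
import Control.Monad (when)

-- Seek to an absolute byte offset, erroring out if it lies outside the file.
seekChecked :: Handle -> Integer -> IO ()
seekChecked h off = do
  size <- hFileSize h
  when (off < 0 || off > size) $
    ioError (userError ("seek out of range: " ++ show off))
  hSeek h AbsoluteSeek off

main :: IO ()
main = withBinaryFile "big.bin" ReadMode $ \h ->
  seekChecked h 1024 >> putStrLn "seek ok"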

The alternative, of course, is to read individual bytes one by one, interleaved with hSeek commands. But now the problem is, how efficient is reading a file one byte at a time? That sounds like it could also be quite slow. I'm not sure whether hSetBuffering has an effect on this.
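What I have in mind is one hSetBuffering call after opening, then Data.ByteString.hGet per read, which as far as I understand should hit the Handle's internal buffer rather than doing a syscall per byte. A rough sketch (readChunkAt is just an illustrative name, not a library function):

import qualified Data.ByteString as B
import System.IO

-- Hypothetical helper: seek to an absolute offset and read up to len bytes.
-- B.hGet returns a short (possibly empty) ByteString near EOF, so the
-- caller can detect running off the end of the file.
readChunkAt :: Handle -> Integer -> Int -> IO B.ByteString
readChunkAt h off len = do
  hSeek h AbsoluteSeek off
  B.hGet h len

main :: IO ()
main = withBinaryFile "big.bin" ReadMode $ \h -> do
  hSetBuffering h (BlockBuffering Nothing)  -- set up buffered reads once
  chunk <- readChunkAt h 10 9
  print chunk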

Then of course there's mmap. But that seems to freak out the virtual memory system if used on large files. (Which is odd, considering that's the entire purpose for it existing...)
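(For completeness: the mmap package appears to let you map just a window of the file rather than the whole thing; if I'm reading its API right, mmapFileByteString takes an optional (offset, length) range:)

import qualified Data.ByteString as B
import System.IO.MMap (mmapFileByteString)

-- Map only a 9-byte window starting at byte offset 10, not the whole file.
main :: IO ()
main = do
  window <- mmapFileByteString "big.bin" (Just (10, 9))
  print (B.length window)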

What do we think, folks? What's the best way to approach this, in terms of I/O performance and code maintainability?

asked Feb 24 '13 by MathematicalOrchid


1 Answer

I had a similar issue when working on a PDF parser. Initially I used the iteratee package (it supports random access). AFAIK it is the only IO library with random access support.

My current approach is based on the io-streams package. I find it more convenient. Performance is good enough, attoparsec integration works out of the box, and a lot of combinators are included.
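For comparison, here is roughly what the io-streams version of the same idea looks like: seek with the plain Handle first, then wrap the rest of the file in an InputStream and run an attoparsec parser with parseFromStream. An untested sketch, just to show the shape:

import qualified Data.Attoparsec.ByteString.Char8 as P
import qualified System.IO.Streams as Streams
import System.IO.Streams.Attoparsec (parseFromStream)
import System.IO

main :: IO ()
main = withBinaryFile "in.data" ReadMode $ \h -> do
  hSeek h AbsoluteSeek 10          -- position the handle ourselves
  is <- Streams.handleToInputStream h
  num <- parseFromStream P.number is
  print num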

Here is a basic example of how to use iteratee for random IO:

shum@shum-laptop:/tmp/shum$ cat test.hs 

import qualified Data.Iteratee as I
import qualified Data.Attoparsec.Iteratee as I
import qualified Data.Attoparsec.Char8 as P
import Control.Monad.IO.Class
import System.Environment

main :: IO ()
main = do
  [file] <- getArgs
  flip I.fileDriverRandom file $ do
    I.seek 20                            -- jump forward to byte offset 20
    num1 <- I.parserToIteratee P.number  -- run an attoparsec parser there
    liftIO $ print num1
    I.seek 10                            -- seek backwards to offset 10
    num2 <- I.parserToIteratee P.number
    liftIO $ print num2
shum@shum-laptop:/tmp/shum$ cat in.data 
111111111
222222222
333333333
shum@shum-laptop:/tmp/shum$ runhaskell test.hs in.data 
333333333
222222222
shum@shum-laptop:/tmp/shum$
answered Oct 26 '22 by Yuras