What's the most efficient way to process really large binary files in Haskell?
The standard answer is to read the entire file as a lazy ByteString and then use something like the binary package to write a parser over it. There are a couple of problems with that...
First, libraries like Binary don't really handle parse failure, and I'm explicitly expecting parsing to fail sometimes.
Second, I'm not parsing the entire file contents. I'm going to be skipping over large chunks of it. And reading gigabytes of data from disk into RAM only to have the garbage collector throw it away again seems rather unperformant.
Related to that, I need to be able to tell if the skip I want to perform will take me off the end of the file or not (and error out if it does).
I may also need to seek backwards, or maybe to a specific byte offset within the file, which does not appear to be well-supported by a lazy ByteString approach. (There's a severe danger of ending up holding the entire file in RAM.)
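For concreteness, here's roughly what the standard approach looks like (big.bin is a made-up file name; newer versions of binary do at least provide runGetOrFail for reporting failures, but skip still streams every skipped byte through RAM, which is exactly the problem):
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get
import Data.Word

main :: IO ()
main = do
  bytes <- BL.readFile "big.bin"  -- lazily reads the whole file
  case runGetOrFail parser bytes of
    Left (_, off, err) -> putStrLn ("parse failed at byte " ++ show off ++ ": " ++ err)
    Right (_, _, v)    -> print v
  where
    parser :: Get Word32
    parser = do
      skip 1024     -- "skipping" still forces these bytes to be read
      getWord32be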
The alternative, of course, is to read individual bytes one by one, interleaved with hSeek commands. But now the problem is: how efficient is reading a file one byte at a time? That sounds like it could also be quite slow. I'm not sure whether hSetBuffering has any effect on this.
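For what it's worth, here's a rough sketch of the Handle-based alternative, reading in chunks rather than single bytes (readChunkAt is a made-up helper, not from any library):
import qualified Data.ByteString as BS
import System.IO

-- Seek to an absolute offset and read a fixed-size chunk, checking
-- against the file size first so a bad skip errors out instead of
-- silently returning a short read.
readChunkAt :: Handle -> Integer -> Int -> IO BS.ByteString
readChunkAt h off len = do
  size <- hFileSize h
  if off + fromIntegral len > size
    then ioError (userError "seek/read past end of file")
    else do
      hSeek h AbsoluteSeek off
      BS.hGet h len  -- one buffered read, not byte-at-a-time I/O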
Then of course there's mmap. But that seems to freak out the virtual memory system if used on large files. (Which is odd, considering that's the entire reason it exists...)
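A minimal sketch of a windowed variant, assuming the mmap package (file name and offsets are made up): mapping only a range of the file avoids asking the VM system to fault in the whole thing at once.
import qualified Data.ByteString as BS
import System.IO.MMap (mmapFileByteString)

main :: IO ()
main = do
  -- map a 4096-byte window starting at byte offset 1048576
  bs <- mmapFileByteString "big.bin" (Just (1048576, 4096))
  print (BS.take 16 bs)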
What do we think, folks? What's the best way to approach this, in terms of I/O performance and code maintainability?
I had a similar issue when working on a PDF parser. Initially I used the iteratee package (it supports random access); AFAIK it is the only IO library with random access support.
My current approach is based on the io-streams package. I found it more convenient: performance is good enough, attoparsec integration comes out of the box, and a lot of combinators are included.
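Here is a minimal sketch of the io-streams approach, under the assumption that you position the Handle before wrapping it (once the stream has buffered data, seeking the underlying Handle is not safe); it reuses the in.data file from the iteratee example below:
import qualified Data.Attoparsec.ByteString.Char8 as P
import qualified System.IO.Streams as Streams
import qualified System.IO.Streams.Attoparsec as Streams
import System.IO

main :: IO ()
main = withBinaryFile "in.data" ReadMode $ \h -> do
  hSeek h AbsoluteSeek 20                 -- position before wrapping
  is <- Streams.handleToInputStream h
  n  <- Streams.parseFromStream (P.decimal :: P.Parser Integer) is
  print n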
Here is a basic example of how to use iteratee for random I/O:
shum@shum-laptop:/tmp/shum$ cat test.hs
import qualified Data.Iteratee as I
import qualified Data.Attoparsec.Iteratee as I
import qualified Data.Attoparsec.Char8 as P
import Control.Monad.IO.Class
import System.Environment
main :: IO ()
main = do
  [file] <- getArgs
  flip I.fileDriverRandom file $ do
    I.seek 20                            -- jump to absolute byte offset 20
    num1 <- I.parserToIteratee P.number  -- run an attoparsec parser from there
    liftIO $ print num1
    I.seek 10                            -- seek backwards to offset 10
    num2 <- I.parserToIteratee P.number
    liftIO $ print num2
shum@shum-laptop:/tmp/shum$ cat in.data
111111111
222222222
333333333
shum@shum-laptop:/tmp/shum$ runhaskell test.hs in.data
333333333
222222222
shum@shum-laptop:/tmp/shum$