Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Loading a CSV in memory with Cassava

I am trying to load a CSV into memory as a Vector of Vector with Cassava. My program does work but uses a huge amount of memory for a 50MB csv file and I don't understand why.

I know that working with Data.Csv.Streaming should be better for big files, but I thought 50MB would still be ok. I tried both Data.Csv and Data.Csv.Streaming with more or less canonical examples from the github project page, I also tried to implement my own parser that outputs Vector of Vector (I based my code on attoparsec-csv https://hackage.haskell.org/package/attoparsec-csv), and all these solutions use about 2000MB of memory! I am sure there is something wrong in what I am doing. What is the right way of doing this?

My final goal is to have the data fully loaded into memory for further processing later on. For example, I could split my data into interesting matrices and work with those using Hmatrix.

Here are the 2 programs I tried with Cassava:

1/ Using Data.Csv

import qualified Data.ByteString.Lazy as BL
import qualified Data.Vector as V
import Data.Csv
import Data.Foldable


main = do
   csv <- BL.readFile "train.csv"
   let Right res = decode HasHeader csv :: Either String (V.Vector(V.Vector(BL.ByteString)))
   print $ res V.! 0

2/ Using Data.Csv.Streaming

{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy as BL
import qualified Data.Vector as V
import Data.Csv.Streaming
import Data.Foldable


main = do
   csv <- BL.readFile "train.csv"
   let !a = decode HasHeader csv :: Records(V.Vector(BL.ByteString))
   let !res = V.fromList $ Data.Foldable.toList a
   print $ res V.! 0

Note that I don't give you the program I made based on attoparsec-csv because it is almost exactly the same with Vector instead of List. The memory usage of that solution is still quite poor.

Interestingly, in the Data.Csv.Streaming solution, if I simply print my data using a Data.Foldable.for_, everything is super fast with a 2MB memory usage. This made me think that my problem is linked to the way I construct my Vector. Probably accumulating thunks instead of stacking raw data into a compact data structure.

Thank you for your help,

Antoine

like image 846
Antoine Genton Avatar asked Mar 25 '16 13:03

Antoine Genton


1 Answers

The difference between Data.CSV and Data.CSV.Streaming is probably not quite what you'd expect. The first produces a Data.Vector.Vector of the csv contents, as you see. I'm not sure why the construction of this vector should take up so much space -- though it starts not to surprise me when I reflect that the resulting vector-of-pointers-to-vectors-of-pointers-to-lazy-bytestrings here contains 28203420 distinct pointers to lazy bytestrings, 371 for each line, each pointing to a tiny morsel of the original stream of bytes, usually to '0'. Following http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.html this means that the typical two byte sequence in the original stream of bytes - almost all of them look like this: ",0" ie. [44,48] - is replaced by a number of pointers and constructors: the lazy bytestring content alone makes each pair of bytes take up something like 11 words (the Chunk and Empty constructors for a lazy bytestring, plus the material for a strict bytestring which J Tibell puts at 9 words) ... plus the original bytes (minus those representing commas and spaces). In a 64 bit system this is a pretty gigantic escalation in size.

Data.CSV.Streaming is not really so different: basically it constructs a slightly decorated list rather than a vector, so in principle it can be lazily evaluated, and in ideal cases the whole thing won't need to be realized in memory, as you noticed. In a monadic context like this, though, you will be 'extracting a list from IO', which is not quite guaranteed to produce chaos and confusion.

If you want to properly stream csv contents, you should use ... one of the streaming libraries. (I have no advice for getting the whole thing into memory, except the obvious one of arranging that cassava reads each line into a nice compact data type rather than a vector of pointers to lazy bytestrings; here though we have 371 "fields").

So here is your program using cassava-streams, which uses cassava's (genuine) incremental interface, and then uses io-streams to make a stream of records:

  {-# LANGUAGE BangPatterns #-}

  import qualified Data.ByteString.Lazy as BL
  import qualified Data.Vector as V
  import Data.Foldable
  import System.IO.Streams (InputStream, OutputStream)
  import qualified System.IO.Streams as Streams
  import qualified System.IO.Streams.Csv as CSV
  import System.IO

  type StreamOfCSV = InputStream (V.Vector(BL.ByteString))

  main = withFile "train.csv" ReadMode $ \h -> do
     input          <- Streams.handleToInputStream h 
     raw_csv_stream <- CSV.decodeStream HasHeader input
     csv_stream     <- CSV.onlyValidRecords raw_csv_stream :: IO StreamOfCSV
     m <- Streams.read csv_stream
     print m

This finishes immediately using no more memory than hello-world, printing the first record. You can see a little more manipulation in tutorial source https://github.com/pjones/cassava-streams/blob/master/src/System/IO/Streams/Csv/Tutorial.hs There are similar libraries for the other streaming libraries. If the data structure (like a matrix) that you want to constitute can fit in memory, you should be able to construct it by folding over the lines with Streams.fold and there should be no problem if the information you are trying to extract from each line is properly evaluated before it is consumed by the fold operation. If you can arrange that cassava outputs a non-recursive data structure with unboxed fields, then it will be possible to write an Unbox instance for that type, and fold the whole csv into a single closely packed unboxed vector. In this case, there are 371 distinct fields in each line, so that isn't an option I guess.

The following is the equivalent of the Data.CSV.Streaming program:

  main = withFile "train.csv" ReadMode $ \h -> do
    input          <- Streams.handleToInputStream h 
    raw_csv_stream <- CSV.decodeStream HasHeader input
    csv_stream     <- CSV.onlyValidRecords raw_csv_stream :: IO StreamOfCSV
    csvs <- Streams.toList csv_stream
    print (csvs !! 0)

It has all the same trouble, since it uses Streams.toList to collect the gigantic list before trying to find out the first element.

-- Addendum

Here for what it's worth is a pipes-csv variant, which just compresses each parsed row into an unboxed vector of Ints by hand (this easier than finding Doubles, which is what this csv is really storing, using readInt from the bytestring package.)

import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as B
import qualified Data.Vector as V
import qualified Data.Vector.Unboxed as U
import Data.Csv

import qualified Pipes.Prelude as P
import qualified Pipes.ByteString as Bytes
import Pipes
import qualified Pipes.Csv as Csv
import System.IO
import Control.Applicative

import qualified Control.Foldl as L

main = withFile "train.csv" ReadMode $ \h -> do
  let csvs :: Producer (V.Vector ByteString) IO ()
      csvs = Csv.decode HasHeader (Bytes.fromHandle h) >-> P.concat
      -- shamelessly reading integral part only, counting bad parses as 0
      simplify bs = case B.readInt bs of
        Nothing       -> 0
        Just (n, bs') -> n
      uvectors :: Producer (U.Vector Int) IO ()
      uvectors = csvs  >-> P.map (V.map simplify) >-> P.map (V.foldr U.cons U.empty)
  runEffect $ uvectors >-> P.print

You can fold over the rows using the folds in the foldl library or any you care to write by swapping out the last line for something like this

  let myfolds = liftA3 (,,) (L.generalize (L.index 13))   -- the thirteenth row, if it exists
                            (L.randomN 3)   -- three random rows
                            (L.generalize L.length) -- number of rows
  (thirteen,mvs,len) <- L.impurely P.foldM myfolds uvectors

  case mvs of 
    Nothing -> return ()
    Just vs -> print (vs :: V.Vector (U.Vector Int))
  print thirteen
  print len

in this case I am collecting the thirteenth line, three random lines and the total number of records - any number of other folds can be combined with these. In particular, we could as well collect all of the rows into a giant vector using L.vector, which will probably still be a bad idea given the size of this csv file. Below we come back to our starting point, we collect everything and print the 17th line of the completed vector of vectors, i.e. a sort of big matrix.

  vec_vec <- L.impurely P.foldM  L.vector uvectors
  print $ (vec_vec :: V.Vector (U.Vector Int)) V.! 17

This takes plenty of memory, but doesn't particularly strain my little laptop.

like image 116
Michael Avatar answered Nov 19 '22 14:11

Michael