
textscan in Matlab uses excessive RAM compared to similar method in R

I run Matlab R2011b and R version 2.13.1 on Linux Mint v12 with 16 GB of RAM.

I have a csv file. The header and first 5 rows are:

#RIC,Date[G],Time[G],GMT Offset,Type,Price,Volume
DAEG.OQ,07-JUL-2011,15:10:03.424,-4,Trade,1.68,1008
DAEG.OQ,07-JUL-2011,15:10:03.424,-4,Trade,1.68,1008
DAEG.OQ,07-JUL-2011,15:10:03.424,-4,Trade,1.66,300
DAEG.OQ,07-JUL-2011,15:10:03.424,-4,Trade,1.65,1000
DAEG.OQ,07-JUL-2011,15:10:03.464,-4,Trade,1.65,3180

The file is large (approx 900MB). Given the combination of character and numeric data, one might read this file into Matlab as follows:

fid1 = fopen('/home/MyUserName/Temp/X.csv');
D = textscan(fid1, '%s%s%s%f%s%f%f', 'Delimiter', ',', 'HeaderLines', 1);
fclose(fid1);

Although the file is 900MB, when running the above code, System Monitor indicates my RAM usage jumps from about 2GB to 10GB. Worse, if I attempt this same procedure with a slightly larger csv file (about 1.2 GB) my RAM maxes out at 16GB and Matlab never manages to finish reading in the data (it just stays stuck in "busy" mode).

If I wanted to read the same file into R, I might use:

D <- read.csv("/home/MyUserName/Temp/X.csv", stringsAsFactors=FALSE)

This takes a bit longer than Matlab, but system monitor indicates my RAM usage only jumps from 2GB to 3.3GB (much more reasonable given the original file size).

My question has two parts:

1) Why is textscan such a memory hog in this scenario?

2) Is there another approach I could use to get a 1.2GB csv file of this type into Matlab on my system without maxing out the RAM?

EDIT: Just to clarify, I'm curious as to whether there exists a Matlab-only solution, i.e. I'm not interested in a solution that involves using a different language to break up the csv file into smaller chunks (as this is what I'm already doing). Sorry Trav1s, I should have made this clear from the start.

Colin T Bowers asked Sep 18 '12 10:09



1 Answer

The problem is probably that those "%s" strings are being read into Matlab cellstrs, which are a memory-inefficient data structure for low-cardinality strings. Cellstrs are lousy for big tabular data like this. Each string ends up getting stored in a separate primitive char array, each with some 400 bytes of overhead and fragmentation issues. With your 900MB file, that looks like 18 million rows; 4 strings per row, and that's about 10-20 GB of cellstrs to hold those strings. Ugh.
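The per-string cost is easy to check on your own installation. The following is only a rough sketch: the variable names are made up for illustration, and `whos` reports nominal storage, so any allocator fragmentation comes on top of what it shows.

```matlab
% Compare the storage of n short strings held as a cellstr
% versus the same strings held as rows of a 2-d char array.
n = 1000;
asCell = repmat({'DAEG.OQ'}, n, 1);   % n-by-1 cell, one char array per cell
asChar = repmat('DAEG.OQ', n, 1);     % n-by-7 char matrix

infoCell = whos('asCell');
infoChar = whos('asChar');

bytesPerCellString = infoCell.bytes / n;   % includes the per-cell header
bytesPerCharRow    = infoChar.bytes / n;   % 2 bytes per character, no headers

fprintf('cellstr: %.0f bytes/string, char matrix: %.0f bytes/string\n', ...
        bytesPerCellString, bytesPerCharRow);
```

Multiplying the cellstr figure by the number of rows and by the number of "%s" columns gives a lower bound on what the textscan result will occupy.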

What you want is to convert those strings into compact primitive datatypes as they're coming in, instead of getting all 18 million rows slurped into bulky cell strings at once. Store the dates and timestamps as datenums or whatever numeric representation you're using, and those low-cardinality strings either as 2-d char arrays or some equivalent of a categorical variable. (Given your data set size, you probably want those strings represented as simple numeric identifiers with a lookup table, not chars.)

Once you've decided on your compact data structure, there are a couple of approaches to loading it in. You could just break the read into chunks in pure Matlab: use textscan() calls in a loop to read in 1000 lines at a time, parse and convert the cellstrs in that chunk into their compact forms, buffer all the results, and cat them together at the end of the read. That'll keep the peak memory requirements lower.
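A minimal sketch of that chunked read, using the file and format string from the question. Everything else is an assumption made for illustration: the variable names, the choice of uint32 ids with a lookup table for the RIC and Type columns, the 'dd-mmm-yyyy' date format (inferred from the sample rows), and storing the time as seconds since midnight.

```matlab
fname     = '/home/MyUserName/Temp/X.csv';
fmt       = '%s%s%s%f%s%f%f';
chunkSize = 1000;                    % rows per textscan call; larger also works
strCols   = [1 5];                   % low-cardinality string columns (RIC, Type)
lookup    = {cell(0,1), cell(0,1)};  % growing table of distinct names per column
buf       = cell(0, 7);              % one row of converted columns per chunk

fid = fopen(fname);
fgetl(fid);                          % skip the header line once
while ~feof(fid)
    C = textscan(fid, fmt, chunkSize, 'Delimiter', ',');
    if isempty(C{1}), break; end     % nothing left but trailing whitespace

    % Replace repeated strings by integer ids into the lookup tables.
    ids = cell(1, 2);
    for i = 1:2
        [u, ~, j] = unique(C{strCols(i)});
        [tf, loc] = ismember(u, lookup{i});
        nNew      = sum(~tf);
        loc(~tf)  = numel(lookup{i}) + (1:nNew);   % ids for unseen names
        lookup{i} = [lookup{i}; u(~tf)];
        ids{i}    = uint32(loc(j));
    end

    % Dates: parse only the distinct date strings, then expand.
    [ud, ~, jd] = unique(C{2});
    dn     = datenum(ud, 'dd-mmm-yyyy');
    dayNum = dn(jd);

    % Times 'HH:MM:SS.FFF' -> seconds since midnight, parsed in one sscanf.
    t = sscanf(sprintf('%s:', C{3}{:}), '%d:%d:%f:', [3 Inf]);
    secOfDay = (t(1,:)*3600 + t(2,:)*60 + t(3,:)).';

    buf(end+1, :) = {ids{1}, dayNum, secOfDay, C{4}, ids{2}, C{6}, C{7}};
end
fclose(fid);

% Concatenate the per-chunk pieces into one compact column each.
D = cell(1, 7);
for c = 1:7
    D{c} = vertcat(buf{:, c});
end
```

Only one chunk of cellstrs is alive at any time, so peak memory is governed by the numeric columns rather than by the strings. The original names remain recoverable, for example `lookup{1}(D{1}(1:5))` for the first five RICs.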

If you're going to do a lot of work like this, and performance matters, you might want to drop down to Java and write your own parser that can convert the strings and dates as they come in, before handing them back to Matlab as more compact datatypes. It's not hard, and the Java method can be called directly from Matlab, so this may only kind of count as using a separate language.

Andrew Janke answered Sep 20 '22 16:09
