
Loading very large CSV dataset into Python and R, Pandas struggles

I am loading a huge CSV (18 GB) into memory and noticing very large differences between R and Python. This is on an AWS EC2 r4.8xlarge, which has 244 GB of memory. Obviously this is an extreme example, but the principle holds for smaller files on real machines too.

When using pd.read_csv, my file took ~30 mins to load and took up 174 GB of memory, essentially so much that I then can't do anything with it. By contrast, R's fread() from the data.table package took ~7 mins and only ~55 GB of memory.
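A quick way to see where that 174 GB goes is pandas' per-column memory_usage (a sketch only; 'data.csv' stands in for the real file):

import pandas as pd

df = pd.read_csv('data.csv')               # stand-in for the real 18 GB file
usage = df.memory_usage(deep=True)         # bytes per column, counting object overhead
print(usage.sort_values(ascending=False).head(10))
print('total GB:', usage.sum() / 1e9)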

Why does the pandas object take up so much more memory than the data.table object? Furthermore, why fundamentally is the pandas object almost 10x larger than the text file on disk? It's not like .csv is a particularly efficient way to store data in the first place.

asked Oct 31 '17 by seth127



1 Answer

You won't be able to beat the speed of fread, but as far as memory usage goes, my guess is that your integers are being read in as 64-bit integers in Python.

Assuming your file looks like this:

a,b
1234567890123456789,12345

In R, you'll get:

sapply(fread('test.txt'), class)
#          a          b
#"integer64"  "integer"

Whereas in python (on a 64-bit machine):

pandas.read_csv('test.txt').dtypes
#a   int64
#b   int64

Thus you'll use more memory in Python. You can force the type in read_csv as a workaround:

pandas.read_csv('test.txt', dtype={'b': numpy.int32}).dtypes
#a   int64
#b   int32
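If you don't want to work out each dtype by hand, another option is to downcast the integer columns after loading. A minimal sketch using pandas.to_numeric (it picks the smallest integer type that holds the values):

import pandas

df = pandas.read_csv('test.txt')
# downcast every integer column to the smallest integer dtype that fits its values
for col in df.select_dtypes('integer').columns:
    df[col] = pandas.to_numeric(df[col], downcast='integer')
df.dtypes
#a   int64
#b   int16

Note this only shrinks the frame after the fact; the peak memory during read_csv is unchanged, so for a file this big the dtype argument is the better fix.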

Small integers are also part of the reason both the R and Python objects take up more space than the .csv file, since e.g. "1" in a .csv file takes up 2 bytes (the character plus either a comma or an end of line), but either 4 or 8 bytes in memory.
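To put rough numbers on that (back-of-the-envelope, using numpy's itemsize):

import numpy

numpy.dtype('int64').itemsize   # 8 bytes per value in memory
numpy.dtype('int32').itemsize   # 4 bytes per value
# a single-digit integer plus its comma is 2 bytes in the .csv,
# so the same column is roughly 2x (int32) or 4x (int64) bigger in memory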

answered Nov 15 '22 by eddi