
Loading very large CSV dataset into Python and R, Pandas struggles

I am loading a huge CSV (18 GB) into memory and noticing very large differences between R and Python. This is on an AWS EC2 r4.8xlarge, which has 244 GB of memory. Obviously this is an extreme example, but the principle holds for smaller files on real machines too.

When using pd.read_csv, my file took ~30 mins to load and took up 174 GB of memory, essentially so much that I then can't do anything with it. By contrast, R's fread() from the data.table package took ~7 mins and only ~55 GB of memory.
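A quick way to see where that 174 GB goes is pandas' per-column memory_usage (a sketch only; 'data.csv' stands in for the real file):

import pandas as pd

df = pd.read_csv('data.csv')               # stand-in for the real 18 GB file
usage = df.memory_usage(deep=True)         # bytes per column, counting object overhead
print(usage.sort_values(ascending=False).head(10))
print('total GB:', usage.sum() / 1e9)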

Why does the pandas object take up so much more memory than the data.table object? Furthermore, why fundamentally is the pandas object almost 10x larger than the text file on disk? It's not like .csv is a particularly efficient way to store data in the first place.

asked Oct 31 '17 by seth127



1 Answer

You won't be able to beat the speed of fread, but as far as memory usage goes, my guess is that your integers are being read in as 64-bit integers in Python.

Assuming your file looks like this:

a,b
1234567890123456789,12345

In R, you'll get:

sapply(fread('test.txt'), class)
#          a          b
#"integer64"  "integer"

Whereas in python (on a 64-bit machine):

pandas.read_csv('test.txt').dtypes
#a   int64
#b   int64

Thus you'll use more memory in Python. You can force the type in read_csv as a workaround:

pandas.read_csv('test.txt', dtype={'b': numpy.int32}).dtypes
#a   int64
#b   int32
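If you don't want to work out each dtype by hand, another option is to downcast the integer columns after loading. A minimal sketch using pandas.to_numeric (it picks the smallest integer type that holds the values):

import pandas

df = pandas.read_csv('test.txt')
# downcast every integer column to the smallest integer dtype that fits its values
for col in df.select_dtypes('integer').columns:
    df[col] = pandas.to_numeric(df[col], downcast='integer')
df.dtypes
#a   int64
#b   int16

Note this only shrinks the frame after the fact; the peak memory during read_csv is unchanged, so for a file this big the dtype argument is the better fix.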

Small integers are also part of the reason both the R and Python objects take up more space than the .csv file, since e.g. "1" in a .csv file takes up 2 bytes (the character plus either a comma or an end of line), but either 4 or 8 bytes in memory.
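To put rough numbers on that (back-of-the-envelope, using numpy's itemsize):

import numpy

numpy.dtype('int64').itemsize   # 8 bytes per value in memory
numpy.dtype('int32').itemsize   # 4 bytes per value
# a single-digit integer plus its comma is 2 bytes in the .csv,
# so the same column is roughly 2x (int32) or 4x (int64) bigger in memory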

answered Nov 15 '22 by eddi