I am loading a huge CSV (18 GB) into memory and noticing very large differences between R and Python. This is on an AWS EC2 r4.8xlarge, which has 244 GB of memory. Obviously this is an extreme example, but the principle holds for smaller files on more modest machines too.
When using pd.read_csv, my file took ~30 minutes to load and took up 174 GB of memory, essentially so much that I then can't do anything with it. By contrast, R's fread() from the data.table package took ~7 minutes and only ~55 GB of memory.
Why does the pandas object take up so much more memory than the data.table object? Furthermore, why, fundamentally, is the pandas object almost 10x larger than the text file on disk? It's not as if .csv is a particularly efficient way to store data in the first place.
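For what it's worth, here is a minimal sketch of how I could check where the memory actually goes, per column (the filename is just a placeholder for the real file):
import pandas
df = pandas.read_csv('big_file.csv')  # placeholder path
# deep=True also counts the Python string objects held inside
# object-dtype columns, which is where most of the overhead usually hides
print(df.memory_usage(deep=True))
print(df.dtypes)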
You won't be able to beat the speed of fread, but as far as memory usage goes, my guess is that you have integers that are being read in as 64-bit integers in Python.
Assuming your file looks like this:
a,b
1234567890123456789,12345
In R, you'll get:
sapply(fread('test.txt'), class)
# a b
#"integer64" "integer"
Whereas in Python (on a 64-bit machine):
pandas.read_csv('test.txt').dtypes
#a int64
#b int64
Thus you'll use more memory in Python. You can force the type in read_csv as a workaround:
pandas.read_csv('test.txt', dtype={'b': numpy.int32}).dtypes
#a int64
#b int32
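If you don't want to list every column's type up front, another option (just a sketch, and it still pays the 64-bit cost during the read itself) is to downcast after reading with pandas.to_numeric:
import pandas
df = pandas.read_csv('test.txt')
# downcast='integer' picks the smallest integer dtype that can hold the values;
# for the two-row example above, column b ends up as int16 since 12345 fits in 16 bits
df['b'] = pandas.to_numeric(df['b'], downcast='integer')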
Small integers are also the reason that both the R and Python objects take up more space than the .csv file, since e.g. "1" in a .csv file takes up 2 bytes (the character plus either a comma or a newline), but either 4 or 8 bytes in memory.
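To put rough numbers on that, here is a quick sketch with a synthetic column of a million small integers:
import numpy
values = numpy.tile(numpy.arange(100), 10_000)    # a million integers in 0-99
csv_bytes = sum(len(str(v)) + 1 for v in values)  # digits plus one delimiter each
print(csv_bytes)                                  # 2,900,000 bytes as text
print(values.astype(numpy.int64).nbytes)          # 8,000,000 bytes in memory as int64
print(values.astype(numpy.int32).nbytes)          # 4,000,000 bytes as int32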