Why does a pandas DataFrame consume much more RAM than the size of the original text file?

Tags: python, pandas

I'm trying to import a large tab-separated text file (about 3 GB) into Python with pandas, using pd.read_csv("file.txt", sep="\t"). The file was originally a ".tab" file whose extension I changed to ".txt" so I could import it with read_csv(). It has 305 columns and roughly 1,000,000 rows.

When I execute the code, Python raises a MemoryError after a while. From what I found, this basically means there is not enough RAM available. When I specify nrows=20 in read_csv(), it works fine.

The computer I'm using has 46 GB of RAM, of which roughly 20 GB is available to Python.

My question: how can a 3 GB file need more than 20 GB of RAM to be imported into Python with pandas read_csv()? Am I doing anything wrong?

EDIT: when I run df.dtypes, the types are a mix of object, float64, and int64.
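One quick way to see where the memory goes is to load a small sample and ask pandas for a deep memory report (a sketch, assuming a 10,000-row sample is representative of the full file):

import pandas as pd

# Load only a small sample so it fits in RAM, then inspect dtypes and per-column memory.
sample = pd.read_csv("file.txt", sep="\t", nrows=10_000)
print(sample.dtypes.value_counts())                # how many object/float64/int64 columns
print(sample.memory_usage(deep=True).sum() / 1e6)  # MB for the sample, incl. Python strings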

UPDATE: I used the following code to overcome the problem and perform my calculations:

import pandas as pd

summed_cols = pd.DataFrame(columns=["sample", "read sum"])
x = 0  # column counter; the loop reads columns 1 through 352 one at a time
while x < 352:
    x = x + 1
    # read a single column so only one column is in memory at a time
    sample_col = pd.read_csv("file.txt", sep="\t", usecols=[x])
    summed_cols = summed_cols.append(pd.DataFrame({"sample": [sample_col.columns[0]],
                                                   "read sum": [sample_col[sample_col.columns[0]].sum()]}))
    del sample_col  # free the column before reading the next one

It now reads one column at a time, sums it, stores the result in a DataFrame, deletes the column, and moves on to the next one.
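For reference, a single pass with read_csv's chunksize can compute the same per-column sums without re-reading the 3 GB file once per column. This is a sketch of an alternative, assuming the sample columns are all numeric:

import pandas as pd

# Sum every numeric column across chunks of 100,000 rows; only one chunk is in RAM at a time.
totals = None
for chunk in pd.read_csv("file.txt", sep="\t", chunksize=100_000):
    chunk_sums = chunk.sum(numeric_only=True)
    totals = chunk_sums if totals is None else totals.add(chunk_sums, fill_value=0)

summed_cols = totals.rename("read sum").rename_axis("sample").reset_index()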

asked Jun 19 '19 by Robvh

1 Answer

Pandas is cutting up the file and storing the data value by value. I don't know your data types, so I'll assume the worst: strings.

In Python (on my machine), an empty string needs 49 bytes, plus 1 byte for each character if it is ASCII (or 74 bytes plus 2 bytes per character for strings containing non-Latin-1 characters). That's roughly 15 KB for a row of 305 empty fields. A million and a half such rows would take roughly 22 GB in memory, while they would take about 437 MB in a CSV file.
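You can check these per-string sizes yourself (a quick sketch; the exact numbers are for 64-bit CPython 3 and can vary a little between versions):

import sys

print(sys.getsizeof(""))             # ~49 bytes for an empty str
print(sys.getsizeof("abc"))          # ~49 + 1 byte per ASCII character
print(sys.getsizeof("\u0100" * 3))   # ~74 + 2 bytes per character for non-Latin-1 text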

Pandas/numpy are good with numbers, as they can represent a numerical series very compactly (like a C program would). As soon as you step away from C-compatible data types, pandas uses memory the way Python does, which is... not very frugal.
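To see that difference, compare a million values stored as int64 with the same values stored as Python strings in an object column (a sketch for illustration; memory_usage(deep=True) counts the Python objects behind object dtype):

import numpy as np
import pandas as pd

ints = pd.Series(np.arange(1_000_000, dtype="int64"))
strs = ints.astype(str)                       # object dtype: one Python str per value

print(ints.memory_usage(deep=True) / 1e6)     # ~8 MB (8 bytes per value)
print(strs.memory_usage(deep=True) / 1e6)     # tens of MB, mostly per-string overhead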

answered Sep 27 '22 by Amadan