 

Large, persistent DataFrame in pandas

Tags:

python

pandas

sas

I am exploring switching to python and pandas as a long-time SAS user.

However, when running some tests today, I was surprised that Python ran out of memory when trying to pandas.read_csv() a 128 MB CSV file. It had about 200,000 rows and 200 columns of mostly numeric data.

With SAS, I can import a CSV file into a SAS dataset, and it can be as large as my hard drive.

Is there something analogous in pandas?

I regularly work with large files and do not have access to a distributed computing network.

asked Jul 24 '12 by Zelazny7

People also ask

How does pandas deal with large DataFrames?

When data is too large to fit into memory, you can use the chunksize option of pandas.read_csv() to process the data in chunks instead of dealing with one big block.
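For instance, here is a minimal sketch of that approach, assuming a file named large_dataset.csv (the name used in the answer below) with a numeric column called value — the column name is a made-up placeholder:

import pandas as pd

# Read the file 100,000 rows at a time; each chunk is an ordinary DataFrame.
total = 0.0
for chunk in pd.read_csv('large_dataset.csv', chunksize=100_000):
    # Aggregate each chunk as it is read, so the whole file never sits in memory at once.
    total += chunk['value'].sum()

print(total)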

Is there a size limit for a pandas DataFrame?

In practice, the size limit for a pandas DataFrame is the amount of memory available on your machine rather than a fixed number of rows or cells.

Is pandas efficient for large data sets?

The default pandas data types are not the most memory efficient. This is especially true for text data columns with relatively few unique values (commonly referred to as “low-cardinality” data). By using more efficient data types, you can store larger datasets in memory.
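As an illustration, here is a small sketch of converting a low-cardinality text column to the category dtype and comparing memory use; the DataFrame and its state column are made up for the example:

import pandas as pd

# A toy frame with a repetitive text column standing in for low-cardinality data.
df = pd.DataFrame({'state': ['CA', 'NY', 'CA', 'TX'] * 250_000})

print(df.memory_usage(deep=True))   # object dtype: one Python string object per row

df['state'] = df['state'].astype('category')
print(df.memory_usage(deep=True))   # category dtype: small integer codes plus a lookup table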


1 Answer

Wes is of course right! I'm just chiming in to provide a slightly more complete code example. I had the same issue with a 129 MB file, which was solved by:

import pandas as pd

# read_csv with iterator=True and chunksize=1000 gives a TextFileReader,
# which is iterable and yields DataFrames of 1,000 rows at a time.
tp = pd.read_csv('large_dataset.csv', iterator=True, chunksize=1000)

# Concatenate the chunks into a single DataFrame.
# If this raises an error, pass list(tp) instead of tp.
df = pd.concat(tp, ignore_index=True)
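If even the concatenated DataFrame is too big, you can also cut memory at read time by loading only the columns you need and narrowing their dtypes. A hedged sketch, with col_a and col_b as hypothetical column names:

# usecols and dtype are standard pandas.read_csv parameters;
# the column names below are placeholders for columns in your own file.
df = pd.read_csv(
    'large_dataset.csv',
    usecols=['col_a', 'col_b'],
    dtype={'col_a': 'float32', 'col_b': 'float32'},
)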
answered Nov 07 '22 by fickludd