 

Maximum size of pandas dataframe

Tags:

python

pandas

I'm trying to read in a somewhat large dataset using pandas read_csv or read_stata functions, but I keep running into Memory Errors. What is the maximum size of a dataframe? My understanding is that dataframes should be okay as long as the data fits into memory, which shouldn't be a problem for me. What else could cause the memory error?

For context, I'm trying to read in the Survey of Consumer Finances 2007, both in ASCII format (using read_csv) and in Stata format (using read_stata). The file is around 200MB as dta and around 1.2GB as ASCII, and opening it in Stata tells me that there are 5,800 variables/columns for 22,000 observations/rows.
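For reference, a minimal sketch of a chunked read with pandas (the file names are placeholders for the actual SCF files); chunksize keeps only a slice of the rows in memory at a time:

import pandas as pd

# read the ASCII version in chunks so only part of it is in memory at once
row_count = 0
for chunk in pd.read_csv("scf2007.csv", chunksize=5_000):
    row_count += len(chunk)   # replace with real per-chunk processing
print(row_count)

# read_stata supports chunked reading as well
for chunk in pd.read_stata("scf2007.dta", chunksize=5_000):
    pass                      # process each Stata chunk here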

asked May 09 '14 by Nils Gudat


People also ask

Can pandas be used for big data?

Pandas uses in-memory computation, which makes it ideal for small to medium sized datasets. However, pandas' ability to process big datasets is limited by out-of-memory errors.

Can pandas handle billions of rows?

The answer is a big no. Pandas is still the best tool for data analysis in Python, with well-supported functions for the most common data analysis tasks, but when it comes to bigger files it might not be the fastest tool.

How big a CSV can pandas handle?

Reading a ~1 GB CSV into memory with pandas is feasible; the various importing options can be compared by how long they take to load the file into memory.

How many dimensions is a pandas DataFrame?

A pandas DataFrame has two dimensions: the rows and the columns.


1 Answer

I'm going to post this answer as it was discussed in the comments. I've seen this question come up numerous times without an accepted answer.

The MemoryError itself is intuitive: you ran out of memory. But sometimes debugging or solving this error is frustrating because you appear to have enough memory, yet the error remains.

1) Check for code errors

This may be a "dumb step", but that's why it's first. Make sure there are no infinite loops or things that will knowingly take a long time (like using something in the os module that will search your entire computer and put the output in an Excel file).

2) Make your code more efficient

This goes along the lines of Step 1. But if something simple is taking a long time, there's usually a module or a better way of doing it that is faster and more memory efficient. That's the beauty of Python and/or open source languages!
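For example, here's a hedged sketch of trimming a read_csv call so it only loads what's needed; the column names and dtypes are hypothetical:

import pandas as pd

# load only the columns you need and give them compact dtypes;
# "income" and "age" are hypothetical column names
df = pd.read_csv(
    "scf2007.csv",
    usecols=["income", "age"],
    dtype={"income": "float32", "age": "int16"},
)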

3) Check The Total Memory of the object

The first step is to check the memory of the object. There are a ton of threads on Stack Overflow about this, so you can search for them. Popular answers are here and here.

To find the size of an object in bytes you can always use sys.getsizeof():

import sys
print(sys.getsizeof(OBJECT_NAME_HERE))

Now, the error might happen before anything is even created, but if you read the CSV in chunks you can see how much memory is being used per chunk.
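A minimal sketch of that chunk-by-chunk check (for a DataFrame, memory_usage(deep=True) is usually more informative than sys.getsizeof; the file name is a placeholder):

import pandas as pd

for i, chunk in enumerate(pd.read_csv("scf2007.csv", chunksize=5_000)):
    # deep=True also counts the memory held by object (string) columns
    mib = chunk.memory_usage(deep=True).sum() / 1024 ** 2
    print(f"chunk {i}: {mib:.1f} MiB")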

4) Check the memory while running

Sometimes you have enough memory, but the function you are running consumes a lot of memory at runtime. This causes memory to spike beyond the actual size of the finished object, causing the code/process to error. Checking memory in real time is lengthy, but it can be done. IPython is good for that; check its documentation.

Use the code below to see the documentation directly in a Jupyter Notebook:

%mprun?
%memit?

Sample use:

%load_ext memory_profiler

def lol(x):
    return x

%memit lol(500)
# output: peak memory: 48.31 MiB, increment: 0.00 MiB

If you need help with magic functions, this is a great post.
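If you're outside IPython/Jupyter, here's a rough sketch using memory_profiler's memory_usage function (build_frame is just a placeholder workload):

from memory_profiler import memory_usage
import pandas as pd

def build_frame():
    # placeholder workload: build a modest DataFrame
    return pd.DataFrame({"x": range(1_000_000)})

# sample this process's memory every 0.1 s while the call runs
samples = memory_usage((build_frame, (), {}), interval=0.1)
print(f"peak memory during call: {max(samples):.1f} MiB")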

5) This one may belong first... but check for simple things like the bit version

As in your case, simply switching the bit version of Python you were running solved the issue.
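A quick sketch for checking whether you're on a 32-bit or 64-bit Python build (a 32-bit process typically tops out around 2 GB of addressable memory):

import platform
import struct
import sys

print(platform.architecture()[0])   # e.g. '64bit'
print(struct.calcsize("P") * 8)     # pointer size in bits: 32 or 64
print(sys.maxsize > 2 ** 32)        # True on a 64-bit build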

Usually the above steps solve my issues.

answered Oct 07 '22 by MattR