 

Maximum size of pandas dataframe

Tags:

python

pandas

I'm trying to read in a somewhat large dataset using pandas read_csv or read_stata functions, but I keep running into Memory Errors. What is the maximum size of a dataframe? My understanding is that dataframes should be okay as long as the data fits into memory, which shouldn't be a problem for me. What else could cause the memory error?

For context, I'm trying to read in the Survey of Consumer Finances 2007, both in ASCII format (using read_csv) and in Stata format (using read_stata). The file is around 200MB as dta and around 1.2GB as ASCII, and opening it in Stata tells me that there are 5,800 variables/columns for 22,000 observations/rows.
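For reference, a minimal sketch of a chunked read with pandas (the file names are placeholders for the actual SCF files); chunksize keeps only a slice of the rows in memory at a time:

import pandas as pd

# read the ASCII version in chunks so only part of it is in memory at once
row_count = 0
for chunk in pd.read_csv("scf2007.csv", chunksize=5_000):
    row_count += len(chunk)   # replace with real per-chunk processing
print(row_count)

# read_stata supports chunked reading as well
for chunk in pd.read_stata("scf2007.dta", chunksize=5_000):
    pass                      # process each Stata chunk here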

asked May 09 '14 by Nils Gudat


People also ask

Can pandas be used for big data?

Pandas uses in-memory computation, which makes it ideal for small to medium sized datasets. However, pandas' ability to process big datasets is limited by out-of-memory errors.

Can pandas handle billions of rows?

The answer is a big no. Pandas is still the best tool for data analysis in Python, with well-supported functions for the most common data analysis tasks, but when it comes to bigger files it might not be the fastest tool.

How big a CSV can pandas handle?

Reading a ~1 GB CSV into memory with pandas is feasible; the various importing options can be compared by how long they take to load the file into memory.

How many dimensions is a pandas DataFrame?

A pandas DataFrame has two dimensions: the rows and the columns.


1 Answer

I'm going to post this answer as it was discussed in the comments. I've seen this question come up numerous times without an accepted answer.

The MemoryError itself is intuitive: you ran out of memory. But sometimes debugging or solving this error is frustrating because you appear to have enough memory, yet the error remains.

1) Check for code errors

This may be a "dumb step", but that's why it's first. Make sure there are no infinite loops or things that will knowingly take a long time (like using something in the os module that will search your entire computer and put the output in an Excel file).

2) Make your code more efficient

This goes along the lines of Step 1. But if something simple is taking a long time, there's usually a module or a better way of doing it that is faster and more memory efficient. That's the beauty of Python and/or open source languages!
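For example, here's a hedged sketch of trimming a read_csv call so it only loads what's needed; the column names and dtypes are hypothetical:

import pandas as pd

# load only the columns you need and give them compact dtypes;
# "income" and "age" are hypothetical column names
df = pd.read_csv(
    "scf2007.csv",
    usecols=["income", "age"],
    dtype={"income": "float32", "age": "int16"},
)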

3) Check The Total Memory of the object

The first step is to check the memory of the object. There are a ton of threads on Stack Overflow about this, so you can search for them. Popular answers are here and here.

To find the size of an object in bytes you can always use sys.getsizeof():

import sys
print(sys.getsizeof(OBJECT_NAME_HERE))

Now, the error might happen before anything is even created, but if you read the CSV in chunks you can see how much memory is being used per chunk.
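A minimal sketch of that chunk-by-chunk check (for a DataFrame, memory_usage(deep=True) is usually more informative than sys.getsizeof; the file name is a placeholder):

import pandas as pd

for i, chunk in enumerate(pd.read_csv("scf2007.csv", chunksize=5_000)):
    # deep=True also counts the memory held by object (string) columns
    mib = chunk.memory_usage(deep=True).sum() / 1024 ** 2
    print(f"chunk {i}: {mib:.1f} MiB")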

4) Check the memory while running

Sometimes you have enough memory, but the function you are running consumes a lot of memory at runtime. This causes memory to spike beyond the actual size of the finished object, causing the code/process to error. Checking memory in real time is lengthy, but it can be done. IPython is good for that; check its documentation.

Use the code below to see the documentation directly in a Jupyter Notebook:

%mprun?
%memit?

Sample use:

%load_ext memory_profiler

def lol(x):
    return x

%memit lol(500)
# output: peak memory: 48.31 MiB, increment: 0.00 MiB

If you need help with magic functions, this is a great post.
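If you're outside IPython/Jupyter, here's a rough sketch using memory_profiler's memory_usage function (build_frame is just a placeholder workload):

from memory_profiler import memory_usage
import pandas as pd

def build_frame():
    # placeholder workload: build a modest DataFrame
    return pd.DataFrame({"x": range(1_000_000)})

# sample this process's memory every 0.1 s while the call runs
samples = memory_usage((build_frame, (), {}), interval=0.1)
print(f"peak memory during call: {max(samples):.1f} MiB")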

5) This one may belong first... but check for simple things like the bit version

As in your case, simply switching the bit version of Python you were running solved the issue.
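A quick sketch for checking whether you're on a 32-bit or 64-bit Python build (a 32-bit process typically tops out around 2 GB of addressable memory):

import platform
import struct
import sys

print(platform.architecture()[0])   # e.g. '64bit'
print(struct.calcsize("P") * 8)     # pointer size in bits: 32 or 64
print(sys.maxsize > 2 ** 32)        # True on a 64-bit build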

Usually the above steps solve my issues.

answered Oct 07 '22 by MattR