I'm trying to read in a somewhat large dataset using pandas' read_csv or read_stata functions, but I keep running into MemoryErrors. What is the maximum size of a DataFrame? My understanding is that DataFrames should be okay as long as the data fits into memory, which shouldn't be a problem for me. What else could cause the memory error?
For context, I'm trying to read in the Survey of Consumer Finances 2007, both in ASCII format (using read_csv) and in Stata format (using read_stata). The file is around 200 MB as a .dta and around 1.2 GB as ASCII, and opening it in Stata tells me there are 5,800 variables/columns for 22,000 observations/rows.
I'm going to post this answer since it was discussed in the comments, and I've seen this question come up numerous times without an accepted answer.
The MemoryError itself is intuitive - you ran out of memory. But sometimes debugging it is frustrating because you appear to have enough memory, yet the error remains.
1) Check for code errors
This may be a "dumb step", but that's why it's first. Make sure there are no infinite loops or things that will knowingly take a long time (like using something from the os module that will search your entire computer and put the output in an Excel file).
2) Make your code more efficient
This goes along the lines of Step 1. But if something simple is taking a long time, there's usually a module or a better way of doing it that is faster and more memory efficient. That's the beauty of Python and/or open-source languages!
3) Check the total memory of the object
The first step is to check the memory of an object. There are a ton of threads on Stack Overflow about this, so you can search for them. Popular answers are here and here.
To find the size of an object in bytes you can always use sys.getsizeof():
import sys
print(sys.getsizeof(OBJECT_NAME_HERE))
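For a pandas DataFrame specifically, sys.getsizeof() can under-report, so it is worth comparing it with pandas' own accounting. A minimal sketch (the DataFrame here is a throwaway example, not your data):
import sys
import pandas as pd

df = pd.DataFrame({"a": range(1000), "b": ["x"] * 1000})   # placeholder frame just for illustration
print(sys.getsizeof(df))                   # what Python reports for the whole object
print(df.memory_usage(deep=True).sum())    # pandas' per-column byte count, including string columns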
Now the error might happen before anything is created, but if you read the csv in chunks you can see how much memory is being used per chunk.
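A rough sketch of what that chunked read could look like (the file name and chunk size are placeholders, not from the question):
import pandas as pd

# read the CSV 10,000 rows at a time instead of all at once
for i, chunk in enumerate(pd.read_csv("scf2007_ascii.csv", chunksize=10_000)):
    mem_mb = chunk.memory_usage(deep=True).sum() / 1024 ** 2
    print(f"chunk {i}: {len(chunk)} rows, ~{mem_mb:.1f} MiB")

# pandas.read_stata also accepts chunksize= if you would rather work from the .dta file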
4) Check the memory while running
Sometimes you have enough memory, but the function you are running consumes a lot of memory at runtime. This causes memory usage to spike beyond the actual size of the finished object, causing the code/process to error. Checking memory in real time is lengthy, but it can be done. IPython is good for that; check their documentation.
Use the code below to see the documentation straight in a Jupyter Notebook:
%mprun?
%memit?
Sample use:
%load_ext memory_profiler

def lol(x):
    return x

%memit lol(500)
# output: peak memory: 48.31 MiB, increment: 0.00 MiB
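If you are not in IPython/Jupyter, the same memory_profiler package works as a decorator in a plain script. A minimal sketch (the function and its body are just stand-ins):
# pip install memory_profiler
from memory_profiler import profile

@profile            # prints a line-by-line memory report when the function is called
def load_data():
    data = list(range(1_000_000))   # stand-in for your real read_csv / read_stata call
    return data

if __name__ == "__main__":
    load_data()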
If you need help with magic functions, this is a great post.
5) This one may be first.... but check for simple things like the bit version
As in your case, simply switching the bit version of Python you were running (from 32-bit to 64-bit) solved the issue.
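A quick way to check which build you are on (standard library only, nothing specific to this question):
import struct
import sys

print(sys.version)                        # full interpreter version string
print(struct.calcsize("P") * 8, "bit")    # 32 on a 32-bit build, 64 on a 64-bit build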
Usually the above steps solve my issues.