I am trying to do something fairly simple, reading a large csv file into a pandas dataframe.
data = pandas.read_csv(filepath, header=0, sep=DELIMITER, skiprows=2)
The code either fails with a MemoryError, or just never finishes.
Memory usage in the Task Manager stopped at 506 MB, and after 5 minutes of no change and no CPU activity in the process I stopped it.
I am using pandas version 0.11.0.
I am aware that there used to be a memory problem with the file parser, but according to http://wesmckinney.com/blog/?p=543 this should have been fixed.
The file I am trying to read is 366 MB; the code above works if I cut the file down to something short (25 MB).
It has also happened that I get a pop-up telling me that it can't write to address 0x1e0baf93...
Stacktrace:
Traceback (most recent call last):
  File "F:\QA ALM\Python\new WIM data\new WIM data\new_WIM_data.py", line 25, in <module>
    wimdata = pandas.read_csv(filepath, header = 0, sep = DELIMITER,skiprows = 2 )
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\io\parsers.py", line 401, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\io\parsers.py", line 216, in _read
    return parser.read()
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\io\parsers.py", line 643, in read
    df = DataFrame(col_dict, columns=columns, index=index)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\frame.py", line 394, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\frame.py", line 525, in _init_dict
    dtype=dtype)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\frame.py", line 5338, in _arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals.py", line 1820, in create_block_manager_from_arrays
    blocks = form_blocks(arrays, names, axes)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals.py", line 1872, in form_blocks
    float_blocks = _multi_blockify(float_items, items)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals.py", line 1930, in _multi_blockify
    block_items, values = _stack_arrays(list(tup_block), ref_items, dtype)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals.py", line 1962, in _stack_arrays
    stacked = np.empty(shape, dtype=dtype)
MemoryError
Press any key to continue . . .
A bit of background - I am trying to convince people that Python can do the same as R. For this I am trying to replicate an R script that does
data <- read.table(paste(INPUTDIR,config[i,]$TOEXTRACT,sep=""), HASHEADER, DELIMITER,skip=2,fill=TRUE)
R not only manages to read the above file just fine, it even reads several of these files in a for loop (and then does some stuff with the data). If Python does have a problem with files of that size, I might be fighting a losing battle...
My guess is that if pandas could release memory when a DataFrame is no longer in use, I would be able to use memory effectively. Don't you think so? That said, releasing memory from a DataFrame would lead to more complicated errors; it could cause frequent I/O operations, which would slow pandas down significantly.
"MemoryError: Unable to allocate …" is the last thing that you want to see during data loading into Pandas Dataframe. I get this error here and there and my first reaction usually is "I need a bigger machine with more memory!". But I will show you in this tip how to avoid unnecessary expenses.
Memory errors happen a lot with Python when using the 32-bit version on Windows. This is because 32-bit processes only get 2 GB of memory to play with by default.
If you are not using 32-bit Python on Windows but are looking to improve memory efficiency while reading csv files, there is a trick.
The pandas.read_csv function takes an option called dtype. This lets pandas know what types exist inside your csv data.
By default, pandas will try to guess what dtypes your csv file has. This is a very heavy operation because while it is determining the dtype, it has to keep all raw data as objects (strings) in memory.
Let's say your csv looks like this:
name, age, birthday
Alice, 30, 1985-01-01
Bob, 35, 1980-01-01
Charlie, 25, 1990-01-01
This example is of course no problem to read into memory, but it's just an example.
If pandas were to read the above csv file without any dtype option, the age would be stored as strings in memory until pandas has read enough lines of the csv file to make a qualified guess.
I think the default in pandas is to read 1,000,000 rows before guessing the dtype.
Specifying dtype={'age': int} as an option to .read_csv() will let pandas know that age should be interpreted as a number. This saves you lots of memory.
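As a minimal sketch of that idea (the filename people.csv and the skipinitialspace option are illustrative assumptions, not from the original post), reading the sample file with an explicit dtype could look like this:

import pandas as pd

# skipinitialspace strips the blanks after the commas in the sample file,
# so the column is named 'age' rather than ' age'.
df = pd.read_csv('people.csv', dtype={'age': int}, skipinitialspace=True)
print(df.dtypes)  # 'age' comes back as int64 instead of object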
However, if your csv file were corrupted, like this:
name, age, birthday
Alice, 30, 1985-01-01
Bob, 35, 1980-01-01
Charlie, 25, 1990-01-01
Dennis, 40+, None-Ur-Bz
then specifying dtype={'age': int} will break the .read_csv() command, because it cannot cast "40+" to int. So sanitize your data carefully!
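One hedged way to handle that (an alternative approach, not part of the original answer): read the suspect column as strings and coerce it afterwards with pd.to_numeric, so unparseable values such as "40+" become NaN instead of crashing read_csv. The filename is again hypothetical:

import pandas as pd

# Read 'age' as plain strings first, then convert; bad entries become NaN.
df = pd.read_csv('people.csv', dtype={'age': str}, skipinitialspace=True)
df['age'] = pd.to_numeric(df['age'], errors='coerce')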
Here you can see how the memory usage of a pandas dataframe is a lot higher when floats are kept as strings:
import resource
import numpy as np
import pandas as pd

# Floats kept as strings end up as object dtype and use far more memory.
df = pd.DataFrame(np.random.choice(['1.0', '0.6666667', '150000.1'], (100000, 10)))
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# 224544 (~224 MB)

# The same values stored as float64.
df = pd.DataFrame(np.random.choice([1.0, 0.6666667, 150000.1], (100000, 10)))
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# 79560 (~79 MB)
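As a rough follow-up sketch (assuming every column really does hold numeric strings, as in the example above), casting the object columns to a float dtype shrinks the DataFrame itself; the process high-water mark reported by ru_maxrss will not drop, but the frame's own footprint does:

# Cast the string columns to float64 and check the frame's actual size.
df = df.astype(float)
df.memory_usage(deep=True).sum()  # total bytes held by the DataFrame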
I had the same memory problem with a simple read of a tab delimited text file around 1 GB in size (over 5.5 million records) and this solved the memory problem:
df = pd.read_csv(myfile, sep='\t')                    # didn't work, memory error
df = pd.read_csv(myfile, sep='\t', low_memory=False)  # worked fine and in less than 30 seconds
Spyder 3.2.3, Python 2.7.13, 64-bit