How can I load two large CSV files (about 5 GB each) into a Jupyter Notebook on a local system using Python pandas? Please suggest any configuration to handle big CSV files for data analysis.
Local System Configuration:
OS: Windows 10
RAM: 16 GB
Processor: Intel-Core-i7
Code:
import pandas as pd

dpath = 'p_flg_tmp1.csv'
pdf = pd.read_csv(dpath, sep="|")
Error:
MemoryError: Unable to allocate array
or
pd.read_csv(po_cust_data, sep="|", low_memory=False)
Error:
ParserError: Error tokenizing data. C error: out of memory
How can I handle two such large CSV files on a local system for data analysis? Please suggest a better configuration, if possible, for a local system using Python pandas.
If you do not need to process everything at once, you can read the files in chunks:
reader = pd.read_csv('tmp.sv', sep='|', chunksize=4000)
for chunk in reader:
    print(chunk)
See the pandas documentation on iterating through files chunk by chunk for further information.
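As a rough sketch of how chunking helps in practice, the loop below filters each chunk and keeps only the rows needed for the analysis, so the full 5 GB file never has to fit in RAM at once. The file name, separator, chunk size, and the column and value used in the filter are assumptions for illustration; adjust them to your own data.

import pandas as pd

# Hypothetical file name, separator, and filter condition -- adapt to your data.
filtered_parts = []
reader = pd.read_csv('p_flg_tmp1.csv', sep='|', chunksize=100_000)
for chunk in reader:
    # Keep only the rows you actually need so memory usage stays bounded.
    filtered_parts.append(chunk[chunk['status'] == 'ACTIVE'])

# The concatenated result holds only the filtered rows, not the whole file.
pdf = pd.concat(filtered_parts, ignore_index=True)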
If you need to process everything at once and chunking really isn't an option, you have only two options left.
A CSV file takes an enormous amount of memory once loaded into RAM. See this article for more information; even though it is about another piece of software, it gives a good idea of the problem:
Memory Usage
You can estimate the memory usage of your CSV file with this simple formula:
memory = 25 * R * C + F
where R is the number of rows, C the number of columns and F the file size in bytes.
One of my test files is 524 MB in size and contains 10 columns across 4.4 million rows. Using the formula above, the RAM usage will be about 1.6 GB:
memory = 25 * 4,400,000 * 10 + 524,000,000 = 1,624,000,000 bytes
With this file opened in Tablecruncher, the Activity Monitor reports 1.4 GB of RAM used, so the formula gives a reasonably accurate estimate.
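For a quick sanity check before loading, the formula can be wrapped in a small helper; the values below simply reproduce the 524 MB example above, and you can plug in estimates for your own 5 GB files (the function name is hypothetical).

def estimate_csv_ram(rows, cols, file_size_bytes):
    # memory = 25 * R * C + F, the estimation formula described above.
    return 25 * rows * cols + file_size_bytes

# The 524 MB test file: 4.4 million rows, 10 columns.
print(estimate_csv_ram(4_400_000, 10, 524_000_000))  # 1624000000 bytes, about 1.6 GB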