I am using a 20GB (compressed) .csv file and I load a couple of columns from it using pandas pd.read_csv()
with a chunksize=10,000 parameter.
However, this value is completely arbitrary, and I wonder whether a simple formula could give me a better chunksize that would speed up the loading of the data.
Any ideas?
The read_csv() method has many parameters, but the one we are interested in is chunksize. Technically, chunksize is the number of rows pandas reads from the file at a time. If the chunksize is 100, pandas will load the first 100 rows, then the next 100, and so on.
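For reference, a chunked read looks like this. This is a minimal sketch; the file name and column names are placeholders for your own data, and the per-chunk print is a stand-in for whatever processing you actually do:

    import pandas as pd

    # "data.csv.gz" and the column names are placeholders
    reader = pd.read_csv("data.csv.gz", usecols=["col_a", "col_b"],
                         chunksize=10_000)
    for chunk in reader:        # each chunk is a DataFrame of up to 10,000 rows
        print(chunk.shape)      # replace with your own per-chunk processing

read_csv() with chunksize returns an iterator of DataFrames instead of a single DataFrame, so only one chunk needs to fit in memory at a time.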
Python loads CSV files roughly 100 times faster than Excel files, so use CSVs. The downside: .csv files are nearly always bigger than .xlsx files.
For a pandas DataFrame, a basic idea would be to divide the DataFrame into as many pieces as you have CPU cores and let each core run the calculation on its piece. At the end, we aggregate the results, which is a computationally cheap operation. That is how a multi-core system can process data faster; a sketch of the idea follows.
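A minimal sketch of that split-and-aggregate idea, assuming a CPU-bound per-piece function. The function heavy_calculation, the column name, and the toy DataFrame are made up for illustration:

    import numpy as np
    import pandas as pd
    from multiprocessing import Pool, cpu_count

    def heavy_calculation(piece: pd.DataFrame) -> float:
        # stand-in for whatever per-piece work you need
        return piece["col_a"].sum()

    if __name__ == "__main__":
        df = pd.DataFrame({"col_a": np.arange(1_000_000)})
        pieces = np.array_split(df, cpu_count())   # one piece per CPU core
        with Pool(cpu_count()) as pool:
            partial_results = pool.map(heavy_calculation, pieces)
        total = sum(partial_results)               # cheap final aggregation

Note that this only pays off when the per-piece work is expensive enough to outweigh the cost of copying the pieces to worker processes.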
There is no "optimal chunksize" [*]. chunksize only tells you the number of rows per chunk, not the memory size of a single row, so it's meaningless to try to make a rule of thumb based on it alone. ([*] Although, generally, I've only ever seen chunksizes in the range 100..64K.)
To reason about memory, you'd have to convert that into a memory size per chunk or per row, by looking at your number of columns, their dtypes, and the size of each. Use df.info(memory_usage='deep'), or, for more in-depth memory usage by column:
print('df Memory usage by column...')
print(df.memory_usage(index=False, deep=True) / df.shape[0])
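Once you know the bytes per row, you can derive a chunksize from a memory budget. A sketch, assuming you sample the first 10,000 rows and are willing to dedicate roughly 256 MB of RAM to each chunk (the file name, columns, and budget are all assumptions to adjust):

    import pandas as pd

    sample = pd.read_csv("data.csv.gz", usecols=["col_a", "col_b"], nrows=10_000)
    bytes_per_row = sample.memory_usage(index=False, deep=True).sum() / len(sample)

    target_chunk_bytes = 256 * 1024**2            # assumed memory budget per chunk
    chunksize = int(target_chunk_bytes // bytes_per_row)
    print(f"bytes/row ~ {bytes_per_row:.1f}, suggested chunksize ~ {chunksize}")

The estimate is only as good as the sample, so keep some headroom if later rows contain longer strings than the first ones.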
Make sure you're not blowing out all your free memory while reading the csv: use your OS tools (Unix top / Windows Task Manager / macOS Activity Monitor / etc.) to see how much memory is being used.
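If you prefer to check from inside the loop rather than with OS tools, the third-party psutil package can report your process's resident memory. This is an optional, purely illustrative alternative:

    import psutil

    rss_mb = psutil.Process().memory_info().rss / 1024**2
    print(f"current RSS: {rss_mb:.0f} MB")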
One pitfall with pandas is that missing/NaN values, Python strs, and objects take 32 or 48 bytes each, instead of the expected 4 bytes for an np.int32 column or 1 byte for an np.int8 column. Even one NaN value in an entire column will cause that memory blow-up on the entire column, and the pandas.read_csv() dtype, converters, and na_values arguments will not prevent the np.nan and will ignore the desired dtype(!). A workaround is to manually post-process each chunk before inserting it into the dataframe.
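A sketch of that workaround, assuming "col_a" should end up as int8 and that filling missing values with 0 is acceptable for your data (file name and column names are placeholders):

    import numpy as np
    import pandas as pd

    chunks = []
    for chunk in pd.read_csv("data.csv.gz", usecols=["col_a", "col_b"],
                             chunksize=10_000):
        # NaNs forced the column to float64/object; fix it before keeping the chunk
        chunk["col_a"] = chunk["col_a"].fillna(0).astype(np.int8)
        chunks.append(chunk)
    df = pd.concat(chunks, ignore_index=True)

Because the conversion happens per chunk, you never hold the inflated float64/object version of the whole column in memory at once.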
And use all the standard pandas read_csv tricks (see the combined sketch after this list), like:
- dtype for each column, to reduce memory usage: absolutely avoid every entry being read as a string, especially long unique strings like datetimes, which are terrible for memory usage
- usecols, if you only want to keep a subset of columns
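Putting those tricks together might look like this. The column names, dtypes, and the datetime column are assumptions for illustration:

    import numpy as np
    import pandas as pd

    reader = pd.read_csv(
        "data.csv.gz",
        usecols=["col_a", "col_b", "timestamp"],   # keep only the columns you need
        dtype={"col_a": np.int32, "col_b": "category"},
        parse_dates=["timestamp"],                 # avoid storing datetimes as long strings
        chunksize=10_000,
    )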