 

what is the optimal chunksize in pandas read_csv to maximize speed?

I am using a 20GB (compressed) .csv file and I load a couple of columns from it using pandas pd.read_csv() with a chunksize=10,000 parameter.

However, this parameter is completely arbitrary and I wonder whether a simple formula could give me better chunksize that would speed-up the loading of the data.

Any ideas?

asked Feb 05 '16 by ℕʘʘḆḽḘ


People also ask

What is Chunksize in read_csv?

The read_csv() method has many parameters, but the one we are interested in here is chunksize. Technically, chunksize is the number of rows pandas reads from the file at a time. For example, if chunksize is 100, pandas will load the file 100 rows at a time, as in the sketch below.
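A minimal sketch of chunked reading (the file name is a placeholder, not from the question):

import pandas as pd

total_rows = 0
# passing chunksize makes read_csv return an iterator of DataFrames instead of one big DataFrame
for chunk in pd.read_csv("big_file.csv", chunksize=100):
    # each chunk is an ordinary DataFrame with up to `chunksize` rows
    total_rows += len(chunk)
print("rows processed:", total_rows)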

Is read_csv faster than read_excel?

Python loads CSV files roughly 100 times faster than Excel files. Use CSVs. Con: csv files are nearly always bigger than .xlsx files.
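If you want to sanity-check that on your own data, a rough timing sketch (assuming you have the same data saved as both data.csv and data.xlsx, and that an Excel engine such as openpyxl is installed) could look like this:

import time
import pandas as pd

start = time.perf_counter()
pd.read_csv("data.csv")
print("read_csv:", time.perf_counter() - start, "s")

start = time.perf_counter()
pd.read_excel("data.xlsx")
print("read_excel:", time.perf_counter() - start, "s")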

How do you speed up pandas?

For a Pandas DataFrame, a basic idea would be to divide the DataFrame into a few pieces, as many pieces as you have CPU cores, let each CPU core run the calculation on its piece, and then aggregate the results, which is a computationally cheap operation. That is how a multi-core system can process data faster.
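A minimal sketch of that split/apply/combine idea using the standard library (the DataFrame and the per-piece calculation are made up for illustration):

import multiprocessing as mp
import numpy as np
import pandas as pd

def process_piece(piece):
    # the per-piece calculation; here just a column sum
    return piece["value"].sum()

if __name__ == "__main__":
    df = pd.DataFrame({"value": np.random.rand(1_000_000)})
    pieces = np.array_split(df, mp.cpu_count())      # one piece per CPU core
    with mp.Pool(mp.cpu_count()) as pool:
        partial_results = pool.map(process_piece, pieces)
    print(sum(partial_results))                      # cheap aggregation step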


1 Answer

There is no "optimal chunksize" [*]. chunksize only tells pandas the number of rows per chunk, not the memory size of a chunk (which depends on the memory size of each row), so it's meaningless to try to make a rule of thumb based on row count alone. ([*] although generally I've only ever seen chunksizes in the range 100..64K)

To get memory size, you'd have to convert that to a memory-size-per-chunk or -per-row...

by looking at your number of columns, their dtypes, and the size of each; use df.info(memory_usage='deep'), or, for more in-depth memory usage by column:

print('df memory usage by column (bytes per row)...')
print(df.memory_usage(index=False, deep=True) / df.shape[0])
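Building on that, a hedged way to turn a memory budget into a chunksize (the file name, column names, sample size and the ~256 MB budget below are all assumptions, not values from the question):

import pandas as pd

# estimate bytes per row from a small sample (sample size is arbitrary)
sample = pd.read_csv("big_file.csv", usecols=["col_a", "col_b"], nrows=10_000)
bytes_per_row = sample.memory_usage(index=False, deep=True).sum() / len(sample)

# convert a per-chunk memory budget (here ~256 MB) into a row count
target_chunk_bytes = 256 * 1024 ** 2
chunksize = max(1, int(target_chunk_bytes / bytes_per_row))

for chunk in pd.read_csv("big_file.csv", usecols=["col_a", "col_b"], chunksize=chunksize):
    pass  # process each chunk here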
  • Make sure you're not blowing out all your free memory while reading the csv: use your OS (Unix top/Windows Task Manager/MacOS Activity Monitor/etc) to see how much memory is being used.

  • One pitfall with pandas is that missing/NaN values, Python strs and objects take 32 or 48 bytes, instead of the expected 4 bytes for an np.int32 column or 1 byte for an np.int8 column. Even one NaN value in a column will cause that memory blowup for the entire column, and pandas.read_csv()'s dtype, converters and na_values arguments will not prevent the np.nan, and will ignore the desired dtype(!). A workaround is to manually post-process each chunk before inserting it into the dataframe (see the combined sketch after this list).

  • And use all the standard pandas read_csv tricks, like:

    • specify dtypes for each column to reduce memory usage - absolutely avoid every entry being read in as a string, especially long unique strings like datetimes, which is terrible for memory usage
    • specify usecols if you only want to keep a subset of columns
    • use pd.Categorical for repeated/low-cardinality strings, and date/time converters for datetime strings, if you want to reduce each entry from ~48 bytes down to 1, 4 or 8 bytes.
    • read large files in chunks. And if you know upfront what you're going to impute NA/missing values with, do as much of that filling as you can while you process each chunk, instead of at the end. If you can't impute with the final value yet, you can probably at least replace it with a sentinel value like -1, 999, -Inf etc., and do the proper imputation later; see the combined sketch after this list.
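A combined, hedged sketch of those tricks (the file name, column names, dtypes and the -1 sentinel are assumptions for illustration only):

import numpy as np
import pandas as pd

wanted_cols = ["user_id", "score", "country"]
dtypes = {"user_id": np.int32, "score": np.float32}   # assumes user_id has no missing values

chunks = []
for chunk in pd.read_csv(
    "big_file.csv.gz",       # compression is inferred from the file extension
    usecols=wanted_cols,     # only load the columns you actually need
    dtype=dtypes,            # stop everything being read in as object/str
    chunksize=100_000,       # rows per chunk; see the memory-based estimate above
):
    # post-process each chunk: fill missing scores with a sentinel now,
    # rather than on the full concatenated frame at the end
    chunk["score"] = chunk["score"].fillna(-1).astype(np.float32)
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)
df["country"] = df["country"].astype("category")   # shrink repeated strings
print(df.memory_usage(index=False, deep=True))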
answered Oct 04 '22 by smci