 

what is the optimal chunksize in pandas read_csv to maximize speed?

I am using a 20GB (compressed) .csv file and I load a couple of columns from it using pandas pd.read_csv() with a chunksize=10,000 parameter.

However, this parameter is completely arbitrary and I wonder whether a simple formula could give me better chunksize that would speed-up the loading of the data.

Any ideas?

asked Feb 05 '16 by ℕʘʘḆḽḘ


People also ask

What is Chunksize in read_csv?

The read_csv() method has many parameters, but the one we are interested in here is chunksize. Technically, chunksize is the number of rows pandas reads from the file at a time. For example, if chunksize is 100, pandas will load the file 100 rows at a time, as in the sketch below.
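A minimal sketch of chunked reading (the file name is a placeholder, not from the question):

import pandas as pd

total_rows = 0
# passing chunksize makes read_csv return an iterator of DataFrames instead of one big DataFrame
for chunk in pd.read_csv("big_file.csv", chunksize=100):
    # each chunk is an ordinary DataFrame with up to `chunksize` rows
    total_rows += len(chunk)
print("rows processed:", total_rows)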

Is read_csv faster than read_excel?

Python loads CSV files roughly 100 times faster than Excel files. Use CSVs. Con: csv files are nearly always bigger than .xlsx files.
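If you want to sanity-check that on your own data, a rough timing sketch (assuming you have the same data saved as both data.csv and data.xlsx, and that an Excel engine such as openpyxl is installed) could look like this:

import time
import pandas as pd

start = time.perf_counter()
pd.read_csv("data.csv")
print("read_csv:", time.perf_counter() - start, "s")

start = time.perf_counter()
pd.read_excel("data.xlsx")
print("read_excel:", time.perf_counter() - start, "s")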

How do you speed up pandas?

For a Pandas DataFrame, a basic idea would be to divide the DataFrame into a few pieces, as many pieces as you have CPU cores, let each CPU core run the calculation on its piece, and then aggregate the results, which is a computationally cheap operation. That is how a multi-core system can process data faster.
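A minimal sketch of that split/apply/combine idea using the standard library (the DataFrame and the per-piece calculation are made up for illustration):

import multiprocessing as mp
import numpy as np
import pandas as pd

def process_piece(piece):
    # the per-piece calculation; here just a column sum
    return piece["value"].sum()

if __name__ == "__main__":
    df = pd.DataFrame({"value": np.random.rand(1_000_000)})
    pieces = np.array_split(df, mp.cpu_count())      # one piece per CPU core
    with mp.Pool(mp.cpu_count()) as pool:
        partial_results = pool.map(process_piece, pieces)
    print(sum(partial_results))                      # cheap aggregation step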


1 Answer

There is no "optimal chunksize" [*]. chunksize only tells pandas the number of rows per chunk, not the memory size of a chunk (which depends on the memory size of each row), so it's meaningless to try to make a rule of thumb based on row count alone. ([*] although generally I've only ever seen chunksizes in the range 100..64K)

To get memory size, you'd have to convert that to a memory-size-per-chunk or -per-row...

by looking at your number of columns, their dtypes, and the size of each; use df.info(memory_usage='deep'), or, for more in-depth memory usage by column:

print('df memory usage by column (bytes per row)...')
print(df.memory_usage(index=False, deep=True) / df.shape[0])
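Building on that, a hedged way to turn a memory budget into a chunksize (the file name, column names, sample size and the ~256 MB budget below are all assumptions, not values from the question):

import pandas as pd

# estimate bytes per row from a small sample (sample size is arbitrary)
sample = pd.read_csv("big_file.csv", usecols=["col_a", "col_b"], nrows=10_000)
bytes_per_row = sample.memory_usage(index=False, deep=True).sum() / len(sample)

# convert a per-chunk memory budget (here ~256 MB) into a row count
target_chunk_bytes = 256 * 1024 ** 2
chunksize = max(1, int(target_chunk_bytes / bytes_per_row))

for chunk in pd.read_csv("big_file.csv", usecols=["col_a", "col_b"], chunksize=chunksize):
    pass  # process each chunk here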
  • Make sure you're not blowing out all your free memory while reading the csv: use your OS (Unix top/Windows Task Manager/MacOS Activity Monitor/etc) to see how much memory is being used.

  • One pitfall with pandas is that missing/NaN values, Python strs and objects take 32 or 48 bytes, instead of the expected 4 bytes for an np.int32 column or 1 byte for an np.int8 column. Even one NaN value in a column will cause that memory blowup for the entire column, and pandas.read_csv()'s dtype, converters and na_values arguments will not prevent the np.nan, and will ignore the desired dtype(!). A workaround is to manually post-process each chunk before inserting it into the dataframe (see the combined sketch after this list).

  • And use all the standard pandas read_csv tricks, like:

    • specify dtypes for each column to reduce memory usage - absolutely avoid every entry being read in as a string, especially long unique strings like datetimes, which is terrible for memory usage
    • specify usecols if you only want to keep a subset of columns
    • use pd.Categorical for repeated/low-cardinality strings, and date/time converters for datetime strings, if you want to reduce each entry from ~48 bytes down to 1, 4 or 8 bytes.
    • read large files in chunks. And if you know upfront what you're going to impute NA/missing values with, do as much of that filling as you can while you process each chunk, instead of at the end. If you can't impute with the final value yet, you can probably at least replace it with a sentinel value like -1, 999, -Inf etc., and do the proper imputation later; see the combined sketch after this list.
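A combined, hedged sketch of those tricks (the file name, column names, dtypes and the -1 sentinel are assumptions for illustration only):

import numpy as np
import pandas as pd

wanted_cols = ["user_id", "score", "country"]
dtypes = {"user_id": np.int32, "score": np.float32}   # assumes user_id has no missing values

chunks = []
for chunk in pd.read_csv(
    "big_file.csv.gz",       # compression is inferred from the file extension
    usecols=wanted_cols,     # only load the columns you actually need
    dtype=dtypes,            # stop everything being read in as object/str
    chunksize=100_000,       # rows per chunk; see the memory-based estimate above
):
    # post-process each chunk: fill missing scores with a sentinel now,
    # rather than on the full concatenated frame at the end
    chunk["score"] = chunk["score"].fillna(-1).astype(np.float32)
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)
df["country"] = df["country"].astype("category")   # shrink repeated strings
print(df.memory_usage(index=False, deep=True))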
answered Oct 04 '22 by smci