Why is it that, when vectorized functions are passed to the read_csv converters parameter, they get executed on each value individually instead of on all values at once? This is much slower (~60x) than manually converting the corresponding columns afterwards.
While reading large csv files (several million rows of mixed data), I used the converters parameter of pandas' read_csv method to conveniently pass functions that convert strings to datetime objects etc.
However, using the converters parameter is very slow in comparison to manually converting the corresponding columns.
For illustration, let's convert strings to datetime objects using 3 different methods:
Note, the conversion from string to datetime is arbitrary here. It could be replaced with any other conversion function (which would not, however, have the dedicated parse_dates/date_parser parameters available).
import pandas as pd  # 0.19.2 with python 3.5

# create dummy data
rows = 100000
data = {"dates": pd.date_range("2017-02-27 20:44:23", periods=rows, freq="S")}

# save as temporary file for timeit
pd.DataFrame(data).to_csv("dummy")

# define converter
def convert_datetime(series):
    return pd.to_datetime(series, format="%Y-%m-%d %H:%M:%S")
Now, let's look at the timeit (IPython) comparisons:
# METHOD 1
%%timeit
df = pd.read_csv("dummy", converters={"dates": convert_datetime})
# 1 loop, best of 3: 7.76 s per loop
# METHOD 2
%%timeit
df = pd.read_csv("dummy", parse_dates=["dates"], date_parser=convert_datetime)
# 10 loops, best of 3: 125 ms per loop
# METHOD 3
%%timeit
df = pd.read_csv("dummy")
df["dates"] = convert_datetime(df["dates"])
# 10 loops, best of 3: 129 ms per loop
The converters version is roughly 60 times slower than the other two. To understand this a bit better, I wrapped the convert_datetime function in a little decorator class to count the number of calls:
class Counter:
    def __init__(self, func):
        self.func = func
        self.count = 0

    def __call__(self, *args, **kwargs):
        self.count += 1
        return self.func(*args, **kwargs)

@Counter
def convert_datetime(series):
    return pd.to_datetime(series, format="%Y-%m-%d %H:%M:%S")
It revealed that using the converters parameter calls the convert_datetime function once for every single value, whereas the other versions call the converter function only once. This explains the performance deficit.
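The counts can be read straight off the decorated function (a minimal sketch, reusing the Counter-wrapped convert_datetime and the "dummy" file created above):

# reset the counter, read via converters, and inspect the number of calls
convert_datetime.count = 0
pd.read_csv("dummy", converters={"dates": convert_datetime})
print(convert_datetime.count)  # one call per value, i.e. on the order of `rows`

# same check for the date_parser route
convert_datetime.count = 0
pd.read_csv("dummy", parse_dates=["dates"], date_parser=convert_datetime)
print(convert_datetime.count)  # a single call on the whole column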
From the docs for read_csv (emphasis mine):

converters : dict, default None
    Dict of functions for converting values in certain columns. Keys can either be integers or column labels
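As a minimal illustration of that interface (a sketch against the "dummy" file from the question, with str.strip as a stand-in converter), the dict may be keyed by column label or by integer position, and the function receives one string value at a time:

# key by column label: str.strip is called once per cell of the "dates" column
pd.read_csv("dummy", converters={"dates": str.strip})

# key by integer position (column 0 here is the unnamed index written by to_csv)
pd.read_csv("dummy", converters={1: str.strip})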
The idea of the converters keyword parameter is to provide functions which act upon individual values, not entire columns. This can be seen by redefining the converter function:
import datetime

def convert_datetime(val):
    return datetime.datetime.strptime(val, "%Y-%m-%d %H:%M:%S")

%timeit pd.read_csv("dummy", converters={"dates": convert_datetime})
1 loop, best of 3: 2.81 s per loop
And as you can imagine, this amounts to a lot of Python function calls.
As for why the converters parameter doesn't accept vectorized functions, my best guess would be that they offer somewhat less flexibility than the current implementation. The columns that genuinely need parsing logic during the read, such as dates, are already covered by the vectorized parse_dates/date_parser parameters, and most further column operations can be done post-read with vectorized methods.
In other words, for elements that have to end up as a different type (like datetimes), it is convenient to have vectorized handling inside read_csv. Beyond that, converters is just a convenience parameter which can flexibly act on individual values; any further vectorized operations can just be done post-read anyway.
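For example (a hypothetical sketch, not the question's data: prices.csv, the price column and the clean_price helper are made up), a value-level clean-up fits converters naturally, while the equivalent vectorized clean-up is just as easy after the read:

# per-value convenience while reading: strip a currency symbol from each cell
def clean_price(val):
    return float(val.lstrip("$"))

df = pd.read_csv("prices.csv", converters={"price": clean_price})

# equivalent vectorized clean-up, done post-read on the whole column
df = pd.read_csv("prices.csv")
df["price"] = df["price"].str.lstrip("$").astype(float)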