I am facing a strange performance issue when parsing a lot of dates with Pandas 0.17.1. For demonstration, I created CSV files with exactly one column, containing datetimes in the format "2015-12-31 13:01:01". The sample files contain 10k, 100k, 1M, and 10M records. I am parsing them like this:
import pandas as pd
from timeit import default_timer as timer

start = timer()
pd.read_csv('10k_records.csv', parse_dates=['date'])
end = timer()
print(end - start)
The elapsed times are:
10k: 0.011 s
100k: 0.10 s
1m: 1.2 s
10m: 300 s
As you can see, the time scales roughly linearly with the number of records up to 1 million, but then there is a huge slowdown.
It's not a memory issue. I have 16 GB of RAM, and I work with DataFrames of this size in Pandas without any problems; only the parsing of dates appears to be slow.
I tried infer_datetime_format=True, but the speed was similar, with the same huge slowdown for the 10M records.
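For reference, this is roughly the call I mean (the file name is just my 10M sample):
# let pandas try to infer the datetime format and, if it can,
# switch to a faster parsing path for the whole column
pd.read_csv('10m_records.csv', parse_dates=['date'], infer_datetime_format=True)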
Then I tried to register my own naive date parser:
import datetime

def parseDate(t):
    if type(t) is str:
        st = str(t)
        try:
            # slice out year, month, day, hour, minute and second at fixed positions
            return datetime.datetime(int(st[:4]), int(st[5:7]), int(st[8:10]),
                                     int(st[11:13]), int(st[14:16]), int(st[17:19]))
        except:
            return None
    # fallback for non-string values; datetime.datetime(0, 0, 0, ...) is not a
    # valid datetime, so use the minimum representable one instead
    return datetime.datetime(1, 1, 1, 0, 0, 0)
pd.read_csv(
'10k_records.csv', parse_dates=['date'],
date_parser=parseDate
)
And the times are now:
10k: 0.045 s
100k: 0.36 s
1m: 3.7 s
10m: 36 s
The routine is slower than the default Pandas parser on the smaller files, but it scales perfectly linearly for the larger ones. So it really looks like some kind of performance problem in the standard date parsing routine.
Well, I could use my parser, but it's very simple, stupid, and apparently slow. I would prefer to use the intelligent, robust, and fast Pandas parser, if only I could somehow solve the scalability issue. Does anyone have an idea whether it could be solved, possibly by some esoteric parameter or something?
Thanks to all of you for your help so far.
After all, it seems that there is a reproducible performance problem with date parsing, but it has nothing to do with scalability. I was wrong in my original analysis.
You can try to download this file https://www.dropbox.com/s/c5m21s1uif329f1/slow.csv.tar.gz?dl=0 and parse it in Pandas. The format is correct and all the data are valid. There are only 100k records, but it takes 3 seconds to parse them - while it takes 0.1 s to parse 100k records from the generated regular sequence.
What happened: I did not generate my original testing data as a regular sequence, as @exp1orer did. I took samples of our real data, and their distribution is not that regular. Overall the sequence grows at a constant pace, but there are some local irregularities and unordered pieces. And, apparently, in my 10M sample there happened to be one section that made Pandas particularly unhappy and made parsing take so long. Only a tiny fraction of the file content is responsible for all the slowness, but I was not able to spot any fundamental difference between that fraction and the rest of the file.
So, the cause of slowness was that there were some weird dates, like 20124-10-20. Apparently, I will need to do some more pre-processing before importing the data to Pandas.
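For example, a minimal sketch of what that pre-processing could look like (the column name 'date' and the exact format string are assumptions based on my sample files):
import pandas as pd

# read the column as plain strings first, then parse with an explicit format;
# errors='coerce' turns malformed values such as '20124-10-20 10:12:00' into NaT
# instead of leaving the whole column as object dtype
df = pd.read_csv('10m_records.csv')
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S', errors='coerce')

# the rows that failed to parse can then be inspected or dropped
bad_rows = df[df['date'].isnull()]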
UPDATE:
look at this comparison:
In [507]: fn
Out[507]: 'D:\\download\\slow.csv.tar.gz'
In [508]: fn2
Out[508]: 'D:\\download\\slow_filtered.csv.gz'
In [509]: %timeit df = pd.read_csv(fn, parse_dates=['from'], index_col=0)
1 loop, best of 3: 15.7 s per loop
In [510]: %timeit df2 = pd.read_csv(fn2, parse_dates=['from'], index_col=0)
1 loop, best of 3: 399 ms per loop
In [511]: len(df)
Out[511]: 99831
In [512]: len(df2)
Out[512]: 99831
In [513]: df.dtypes
Out[513]:
from object
dtype: object
In [514]: df2.dtypes
Out[514]:
from datetime64[ns]
dtype: object
The only difference between those two DataFrames is in row # 36867, which I've manually corrected in the D:\\download\\slow_filtered.csv.gz file:
In [518]: df.iloc[36867]
Out[518]:
from 20124-10-20 10:12:00
Name: 36867, dtype: object
In [519]: df2.iloc[36867]
Out[519]:
from 2014-10-20 10:12:00
Name: 36867, dtype: datetime64[ns]
Conclusion: it took Pandas 39 times longer because of one single row with a "bad" date, and in the end Pandas left the from column in the df DataFrame as a string (object dtype).
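A sketch of how such rows can be located (assuming the file has been extracted to slow.csv and using the from column shown above): parse the raw strings with an explicit format and errors='coerce', then look at the values that came back as NaT:
import pandas as pd

# read without parse_dates so the raw strings are preserved
raw = pd.read_csv('slow.csv', index_col=0)

# strict parse: anything that doesn't match the format becomes NaT
parsed = pd.to_datetime(raw['from'], format='%Y-%m-%d %H:%M:%S', errors='coerce')

# the offending rows are the ones that failed the strict parse
print(raw.loc[parsed.isnull(), 'from'])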
OLD answer:
It works pretty well for me (pandas 0.18.0):
setup:
start_ts = '2000-01-01 00:00:00'
pd.DataFrame({'date': pd.date_range(start_ts, freq='1S', periods=10**4)}).to_csv('d:/temp/10k.csv', index=False)
pd.DataFrame({'date': pd.date_range(start_ts, freq='1S', periods=10**5)}).to_csv('d:/temp/100k.csv', index=False)
pd.DataFrame({'date': pd.date_range(start_ts, freq='1S', periods=10**6)}).to_csv('d:/temp/1m.csv', index=False)
pd.DataFrame({'date': pd.date_range(start_ts, freq='1S', periods=10**7)}).to_csv('d:/temp/10m.csv', index=False)
dt_parser = lambda x: pd.to_datetime(x, format="%Y-%m-%d %H:%M:%S")
checks:
In [360]: fn = 'd:/temp/10m.csv'
In [361]: %timeit pd.read_csv(fn, parse_dates=['date'], dtype={0: pd.datetime}, date_parser=dt_parser)
1 loop, best of 3: 22.6 s per loop
In [362]: %timeit pd.read_csv(fn, parse_dates=['date'], dtype={0: pd.datetime})
1 loop, best of 3: 29.9 s per loop
In [363]: %timeit pd.read_csv(fn, parse_dates=['date'])
1 loop, best of 3: 29.9 s per loop
In [364]: fn = 'd:/temp/1m.csv'
In [365]: %timeit pd.read_csv(fn, parse_dates=['date'], dtype={0: pd.datetime}, date_parser=dt_parser)
1 loop, best of 3: 2.32 s per loop
In [366]: %timeit pd.read_csv(fn, parse_dates=['date'], dtype={0: pd.datetime})
1 loop, best of 3: 3.06 s per loop
In [367]: %timeit pd.read_csv(fn, parse_dates=['date'])
1 loop, best of 3: 3.06 s per loop
In [368]: %timeit pd.read_csv(fn)
1 loop, best of 3: 1.53 s per loop
Conclusion: it's a bit faster when I'm using date_parser with an explicit date format, so read_csv doesn't have to guess it. The difference is approx. 30%.
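An equivalent approach (a sketch; same generated files and column name as in the setup above) is to skip date parsing in read_csv entirely and convert the column afterwards with an explicit format, so Pandas never has to guess how the strings are laid out:
import pandas as pd

# read the column as plain strings first
df = pd.read_csv('d:/temp/10m.csv')

# then convert it in one vectorized call with a fixed format
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S')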
Okay -- based on the discussion in the comments and in the chat room, it seems that there is a problem with the OP's data. Using the code below, he is unable to reproduce his own error:
import pandas as pd
import datetime
from time import time

format_string = '%Y-%m-%d %H:%M:%S'
base_dt = datetime.datetime(2016, 1, 1)
exponent_range = range(2, 8)

def dump(number_records):
    print('now dumping %s records' % number_records)
    dts = pd.date_range(base_dt, periods=number_records, freq='1s')
    df = pd.DataFrame({'date': [dt.strftime(format_string) for dt in dts]})
    df.to_csv('%s_records.csv' % number_records)

def test(number_records):
    start = time()
    pd.read_csv('%s_records.csv' % number_records, parse_dates=['date'])
    end = time()
    print(number_records, end - start)

def main():
    for i in exponent_range:
        number_records = 10**i
        dump(number_records)
        test(number_records)

if __name__ == '__main__':
    main()