
Parsing a large number of dates with pandas - scalability - performance drops faster than linear

I am facing a strange performance issue when parsing a lot of dates with Pandas 0.17.1. For demonstration, I created CSV files with exactly one column, containing datetimes in the format "2015-12-31 13:01:01". The sample files contain 10k, 100k, 1M, and 10M records. I am parsing them like this:

from timeit import default_timer as timer  # or whichever timer was actually used
import pandas as pd

start = timer()
pd.read_csv('10k_records.csv', parse_dates=['date'])
end = timer()
print(end - start)

The elapsed times are:
10k: 0.011 s
100k: 0.10 s
1M: 1.2 s
10M: 300 s

As you can see, the time scales linearly with the number of records up to 1 million, but then performance drops off dramatically.

It's not a memory issue. I have 16 GB of RAM and work with dataframes of this size in Pandas without any problems; only the parsing of dates appears to be slow.

I tried to use infer_datetime_format=True, but the speed was similar: again a huge slowdown for 10M records.
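For reference, that attempt looked roughly like this (a sketch; the exact call is not shown in the question):

pd.read_csv('10m_records.csv', parse_dates=['date'], infer_datetime_format=True)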

Then I tried to register my own naive date parser:

import datetime

def parseDate(t):
    # Naive fixed-width parser for "YYYY-MM-DD HH:MM:SS" strings.
    if isinstance(t, str):
        try:
            return datetime.datetime(int(t[:4]), int(t[5:7]), int(t[8:10]),
                                     int(t[11:13]), int(t[14:16]), int(t[17:19]))
        except ValueError:
            return None
    # datetime fields must be >= 1, so use a minimal sentinel for non-string input.
    return datetime.datetime(1, 1, 1)

pd.read_csv(
    '10k_records.csv', parse_dates=['date'],
    date_parser=parseDate
)

And the times are now:
10k: 0.045 s
100k: 0.36 s
1M: 3.7 s
10M: 36 s

This routine is slower than the default pandas parser on the smaller files, but it scales linearly for the larger ones. So it really looks like some kind of performance leak in the standard date parsing routine.
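One way to confirm that the slowness is in the date conversion itself rather than in reading the CSV (a sketch, not part of my original measurements; the 10M-record file name is illustrative):

import pandas as pd
from timeit import default_timer as timer

# Read the column as plain strings first, then time only the conversion step.
df = pd.read_csv('10m_records.csv')
start = timer()
dates = pd.to_datetime(df['date'])
print(timer() - start)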

Well, I could use my own parser, but it's very simple, stupid, and apparently slow. I would prefer to use the intelligent, robust, and fast Pandas parser, if only I could somehow solve the scalability issue. Does anyone have an idea whether it can be solved, possibly by some esoteric parameter or something?

UPDATE

Thank you all for your help so far.

In the end, it seems that there is a reproducible performance problem with date parsing, but it has nothing to do with scalability. I was wrong in my original analysis.

You can download this file https://www.dropbox.com/s/c5m21s1uif329f1/slow.csv.tar.gz?dl=0 and parse it in Pandas. The format is correct and all the data appear valid. It contains only 100k records, yet it takes 3 seconds to parse them, while 100k records from the generated regular sequence parse in 0.1 s.

What happened: I did not generate my original test data as a regular sequence, as @exp1orer did. I took samples of our real data, and their distribution is not that regular. The sequence grows at a roughly constant pace overall, but there are local irregularities and unordered pieces. Apparently, my 10M sample happened to contain one section that made pandas particularly unhappy, and parsing took very long. Only a tiny fraction of the file content is responsible for all the slowness, but I was not able to spot any fundamental difference between that fraction and the rest of the file.

UPDATE 2

So, the cause of the slowness was that there were some malformed dates, like 20124-10-20. Apparently, I will need to do some more pre-processing before importing the data into Pandas.
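One possible pre-processing step (a sketch, not the exact code I ended up using): read the date column as plain strings and coerce anything unparseable to NaT with pd.to_datetime, so the bad rows can be inspected or dropped before the real import.

import pandas as pd

df = pd.read_csv('10m_records.csv', dtype={'date': str})
parsed = pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S', errors='coerce')
print(df.loc[parsed.isna(), 'date'])   # reveals rows like "20124-10-20 10:12:00"
df['date'] = parsed                    # keep the coerced datetimes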

Asked Apr 07 '16 by Jan X Marek


2 Answers

UPDATE:

look at this comparison:

In [507]: fn
Out[507]: 'D:\\download\\slow.csv.tar.gz'

In [508]: fn2
Out[508]: 'D:\\download\\slow_filtered.csv.gz'

In [509]: %timeit df = pd.read_csv(fn, parse_dates=['from'], index_col=0)
1 loop, best of 3: 15.7 s per loop

In [510]: %timeit df2 = pd.read_csv(fn2, parse_dates=['from'], index_col=0)
1 loop, best of 3: 399 ms per loop

In [511]: len(df)
Out[511]: 99831

In [512]: len(df2)
Out[512]: 99831

In [513]: df.dtypes
Out[513]:
from    object
dtype: object

In [514]: df2.dtypes
Out[514]:
from    datetime64[ns]
dtype: object

The only difference between those two DataFrames is in row # 36867, which I've manually corrected in the D:\\download\\slow_filtered.csv.gz file:

In [518]: df.iloc[36867]
Out[518]:
from    20124-10-20 10:12:00
Name: 36867, dtype: object

In [519]: df2.iloc[36867]
Out[519]:
from   2014-10-20 10:12:00
Name: 36867, dtype: datetime64[ns]

Conclusion: it took Pandas roughly 39 times longer because of one row with a "bad" date, and in the end Pandas left the from column in the df DataFrame as strings (dtype object).
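A quick way to detect this situation in general (a sketch, not part of the original answer; it assumes the file has been extracted to slow.csv): check whether the parsed column actually came back as datetime64[ns], and if not, re-parse it with an explicit format and errors='coerce' to locate the offending values.

import pandas as pd

df = pd.read_csv('slow.csv', parse_dates=['from'], index_col=0)
if df['from'].dtype == object:  # parsing silently fell back to plain strings
    parsed = pd.to_datetime(df['from'], format='%Y-%m-%d %H:%M:%S', errors='coerce')
    print(df.loc[parsed.isna(), 'from'])  # e.g. 36867    20124-10-20 10:12:00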

OLD answer:

It works reasonably well for me (pandas 0.18.0):

setup:

start_ts = '2000-01-01 00:00:00'

pd.DataFrame({'date': pd.date_range(start_ts, freq='1S', periods=10**4)}).to_csv('d:/temp/10k.csv', index=False)

pd.DataFrame({'date': pd.date_range(start_ts, freq='1S', periods=10**5)}).to_csv('d:/temp/100k.csv', index=False)

pd.DataFrame({'date': pd.date_range(start_ts, freq='1S', periods=10**6)}).to_csv('d:/temp/1m.csv', index=False)

pd.DataFrame({'date': pd.date_range(start_ts, freq='1S', periods=10**7)}).to_csv('d:/temp/10m.csv', index=False)

dt_parser = lambda x: pd.to_datetime(x, format="%Y-%m-%d %H:%M:%S")

checks:

In [360]: fn = 'd:/temp/10m.csv'

In [361]: %timeit pd.read_csv(fn, parse_dates=['date'], dtype={0: pd.datetime}, date_parser=dt_parser)
1 loop, best of 3: 22.6 s per loop

In [362]: %timeit pd.read_csv(fn, parse_dates=['date'], dtype={0: pd.datetime})
1 loop, best of 3: 29.9 s per loop

In [363]: %timeit pd.read_csv(fn, parse_dates=['date'])
1 loop, best of 3: 29.9 s per loop

In [364]: fn = 'd:/temp/1m.csv'

In [365]: %timeit pd.read_csv(fn, parse_dates=['date'], dtype={0: pd.datetime}, date_parser=dt_parser)
1 loop, best of 3: 2.32 s per loop

In [366]: %timeit pd.read_csv(fn, parse_dates=['date'], dtype={0: pd.datetime})
1 loop, best of 3: 3.06 s per loop

In [367]: %timeit pd.read_csv(fn, parse_dates=['date'])
1 loop, best of 3: 3.06 s per loop

In [368]: %timeit pd.read_csv(fn)
1 loop, best of 3: 1.53 s per loop

Conclusion: it's a bit faster when I use date_parser with an explicitly specified date format, so read_csv doesn't have to guess it. The difference is approx. 30%.

Answered Oct 17 '22 by MaxU - stop WAR against UA


Okay -- based on the discussion in the comments and in the chat room, it seems that there is a problem with the OP's data. Using the code below, he is unable to reproduce his own error:

import pandas as pd
import datetime
from time import time

format_string = '%Y-%m-%d %H:%M:%S'
base_dt = datetime.datetime(2016,1,1)
exponent_range = range(2,8)


def dump(number_records):
    print('now dumping %s records' % number_records)
    dts = pd.date_range(base_dt,periods=number_records,freq='1s')
    df = pd.DataFrame({'date': [dt.strftime(format_string) for dt in dts]})
    df.to_csv('%s_records.csv' % number_records)


def test(number_records):
    start = time()
    pd.read_csv('%s_records.csv' % number_records, parse_dates=['date'])
    end = time()
    print(number_records, end - start)

def main():
    for i in exponent_range:
        number_records = 10**i
        dump(number_records)
        test(number_records)

if __name__ == '__main__':
    main()
Answered Oct 17 '22 by exp1orer