 

Efficiently check if dataframe has date between a range, and return a count

Say we have a dataframe df with a list of dates in chronological order by day.

The goal is to obtain the number of people on a given day that had a date range that included the given date.

import datetime

import pandas as pd

df = pd.DataFrame(data={'date': [datetime.date(2007, 12, 1), 
                                 datetime.date(2007, 12, 2), 
                                 datetime.date(2007, 12, 3)], 
                        'num_people_on_day': [0,0,0]})

dg = pd.DataFrame(data={'person': ['Alice', 'Bob', 'Chuck'],
                        'start': [datetime.date(2007, 11, 5), 
                                  datetime.date(2007, 12, 8), 
                                  datetime.date(2007, 1, 5)],
                        'end': [datetime.date(2007, 12, 6), 
                                datetime.date(2008, 1, 3), 
                                datetime.date(2007, 11, 30)]})


So for each date in df, how can I efficiently check all of dg, count the matching rows, and put that count into df?

I am not even sure a merge is necessary here (I am also trying to save memory), and I really want this to be as fast as possible.

EDIT: OK, so I have come up with a different way to do this, but I hate using apply. Is there a way to do this without using .apply?

import pandas as pd
import datetime

df = pd.DataFrame(data={'date': [datetime.date(2007, 12, 1), 
                                 datetime.date(2007, 12, 2), 
                                 datetime.date(2007, 12, 3)]})

dg = pd.DataFrame(data={'person': ['Alice', 'Bob', 'Chuck', 'Dave'],
                        'start': [datetime.date(2007, 11, 5), 
                                  datetime.date(2007, 12, 8), 
                                  datetime.date(2007, 1, 5),
                                  datetime.date(2007, 11, 6)],
                        'end': [datetime.date(2007, 12, 1), 
                                datetime.date(2008, 1, 3), 
                                datetime.date(2007, 11, 30),
                                datetime.date(2007, 12, 2)]})

def get_num_persons(date, vec_starts, vec_ends):
    """
    Helper function for .apply to get number of persons.
    For each given date, if start and end date is 
    between the given date, then both results are True.
    The bitwise AND then only sums these True and True values.
    """
    return (((vec_starts <= date) & (vec_ends >= date)).sum())

def num_of_persons(persons, dates_df):
    """
    Obtains the number of persons for each day.
    """
    dates_df['num_persons'] = dates_df['date'].apply(lambda row: 
                                                   get_num_persons(row, 
                                                   persons['start'],
                                                   persons['end']))
    return dates_df

num_of_persons(dg, df.copy())
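For reference, one way to drop .apply entirely is NumPy broadcasting (a sketch, not code from the question: it builds a len(df) × len(dg) boolean grid, so it trades memory for speed much like a merge would):

```python
import datetime

import numpy as np
import pandas as pd

df = pd.DataFrame(data={'date': [datetime.date(2007, 12, 1),
                                 datetime.date(2007, 12, 2),
                                 datetime.date(2007, 12, 3)]})

dg = pd.DataFrame(data={'person': ['Alice', 'Bob', 'Chuck', 'Dave'],
                        'start': [datetime.date(2007, 11, 5),
                                  datetime.date(2007, 12, 8),
                                  datetime.date(2007, 1, 5),
                                  datetime.date(2007, 11, 6)],
                        'end': [datetime.date(2007, 12, 1),
                                datetime.date(2008, 1, 3),
                                datetime.date(2007, 11, 30),
                                datetime.date(2007, 12, 2)]})

# Convert to datetime64 so the comparisons vectorize.
dates = pd.to_datetime(df['date']).to_numpy()
starts = pd.to_datetime(dg['start']).to_numpy()
ends = pd.to_datetime(dg['end']).to_numpy()

# Shape (len(df), len(dg)): True where a person's interval contains the date.
in_range = (starts <= dates[:, None]) & (dates[:, None] <= ends)
df['num_persons'] = in_range.sum(axis=1)
```

The whole computation happens in one pass at the NumPy level; the cost is the intermediate boolean matrix, which only matters once both frames get large.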
asked Oct 24 '25 at 06:10 by bbd108

1 Answer

With sufficient memory, cross-merge the two frames, then count the rows where the date falls between start and end. .reindex ensures dates with no matches still get a 0.

# Convert to datetime64 first so .between is vectorized:
df['date'] = pd.to_datetime(df.date)
dg['start'] = pd.to_datetime(dg.start)
dg['end'] = pd.to_datetime(dg.end)

m = df[['date']].assign(k=1).merge(dg.assign(k=1))

(m[m.date.between(m.start, m.end)].groupby('date').size()
   .reindex(df.date).fillna(0).astype(int)
   .rename('num_people_on_day').reset_index())

         date  num_people_on_day
0  2007-12-01                  1
1  2007-12-02                  1
2  2007-12-03                  1

The other option is an IntervalIndex with .apply. This is still a Python-level loop, so performance suffers as df grows.

def get_len(x, dg):
    try:
        # Note: this relies on the intervals overlapping, so get_loc
        # returns an array of positions; with a non-overlapping
        # IntervalIndex it returns a scalar, and len() would then
        # count columns instead of rows.
        return len(dg.iloc[dg.index.get_loc(x)])
    except KeyError:  # dates that fall in no interval count as 0
        return 0

dg.index = pd.IntervalIndex.from_arrays(dg['start'], dg['end'], closed='both')
df['num_people_on_day'] = df['date'].apply(get_len, dg=dg)

To illustrate the timings, look at your small set, then a much larger df.

%%timeit 
m = df[['date']].assign(k=1).merge(dg.assign(k=1))
(m[m.date.between(m.start, m.end)].groupby('date').size()
   .reindex(df.date).fillna(0)
   .rename('num_people_on_day').reset_index())
#9.39 ms ± 52 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit 
dg.index = pd.IntervalIndex.from_arrays(dg['start'], dg['end'], closed='both')
df['num_people_on_day'] = df['date'].apply(get_len, dg=dg)
#4.06 ms ± 27.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

But once df is longer (even just 90 rows), the difference becomes apparent.

df = pd.DataFrame({'date': pd.date_range('2007-01-01', '2007-03-31')})

%%timeit merge
#9.78 ms ± 75.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit IntervalIndex
#65.5 ms ± 418 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
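As an aside, on pandas 1.2+ the dummy-key trick can be written as an explicit cross join with how='cross' (a sketch; the result and memory profile are the same as the k=1 merge above):

```python
import datetime

import pandas as pd

df = pd.DataFrame(data={'date': [datetime.date(2007, 12, 1),
                                 datetime.date(2007, 12, 2),
                                 datetime.date(2007, 12, 3)]})
dg = pd.DataFrame(data={'person': ['Alice', 'Bob', 'Chuck'],
                        'start': [datetime.date(2007, 11, 5),
                                  datetime.date(2007, 12, 8),
                                  datetime.date(2007, 1, 5)],
                        'end': [datetime.date(2007, 12, 6),
                                datetime.date(2008, 1, 3),
                                datetime.date(2007, 11, 30)]})

df['date'] = pd.to_datetime(df.date)
dg['start'] = pd.to_datetime(dg.start)
dg['end'] = pd.to_datetime(dg.end)

# Cross join (pandas >= 1.2): every date paired with every person.
m = df[['date']].merge(dg, how='cross')

out = (m[m.date.between(m.start, m.end)].groupby('date').size()
         .reindex(df.date).fillna(0).astype(int)
         .rename('num_people_on_day').reset_index())
```

Same logic, but the intent is clearer and there is no throwaway k column to create and drop.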
answered Oct 26 '25 at 19:10 by ALollz