 

How to efficiently compute a rolling unique count in a pandas time series?

I have a time series of people visiting a building. Each person has a unique ID. For every record in the time series, I want to know the number of unique people visiting the building in the last 365 days (i.e. a rolling unique count with a window of 365 days).

pandas does not seem to have a built-in method for this calculation. The calculation becomes computationally intensive when there are a large number of unique visitors and/or a large window. (The actual data is larger than this example.)

Is there a better way to do this calculation than what I've done below? Also, I'm not sure why the fast method I made, windowed_nunique (under "Speed test 3"), is off by 1.

Thanks for any help!

Related links:

  • Source Jupyter Notebook: https://gist.github.com/stharrold/17589e6809d249942debe3a5c43d38cc
  • Related pandas issue: https://github.com/pandas-dev/pandas/issues/14336
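As a minimal illustration of the goal (toy data, not the generated dataset below): for each row, count the distinct PersonId values seen in the trailing window.

```python
import pandas as pd

toy = pd.DataFrame({
    'Date': pd.to_datetime(['2010-01-01', '2010-01-02', '2010-01-03']),
    'PersonId': [7, 7, 8]})
# Rolling count of distinct PersonId values in a trailing 365-day window:
# person 7 repeats on day 2 (still 1 unique), person 8 appears on day 3 (2 unique).
result = toy.rolling(window='365D', on='Date').apply(
    lambda arr: pd.Series(arr).nunique())['PersonId'].tolist()
# result == [1.0, 1.0, 2.0]
```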

Initialization

In [1]:

# Import libraries.
import pandas as pd
import numba
import numpy as np

In [2]:

# Create data of people visiting a building.

np.random.seed(seed=0)
dates = pd.date_range(start='2010-01-01', end='2015-01-01', freq='D')
window = 365 # days
num_pids = 100
probs = np.linspace(start=0.001, stop=0.1, num=num_pids)

df = pd\
    .DataFrame(
        data=[(date, pid)
              for (pid, prob) in zip(range(num_pids), probs)
              for date in np.compress(np.random.binomial(n=1, p=prob, size=len(dates)), dates)],
        columns=['Date', 'PersonId'])\
    .sort_values(by='Date')\
    .reset_index(drop=True)

print("Created data of people visiting a building:")
df.head() # 9181 rows × 2 columns

Out[2]:

Created data of people visiting a building:

|   | Date       | PersonId | 
|---|------------|----------| 
| 0 | 2010-01-01 | 76       | 
| 1 | 2010-01-01 | 63       | 
| 2 | 2010-01-01 | 89       | 
| 3 | 2010-01-01 | 81       | 
| 4 | 2010-01-01 | 7        | 

Speed reference

In [3]:

%%timeit
# This counts the number of people visiting the building, not the number of unique people.
# Provided as a speed reference.
df.rolling(window='{:d}D'.format(window), on='Date').count()

3.32 ms ± 124 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Speed test 1

In [4]:

%%timeit
df.rolling(window='{:d}D'.format(window), on='Date').apply(lambda arr: pd.Series(arr).nunique())

2.42 s ± 282 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]:

# Save results as a reference to check calculation accuracy.
ref = df.rolling(window='{:d}D'.format(window), on='Date').apply(lambda arr: pd.Series(arr).nunique())['PersonId'].values

Speed test 2

In [6]:

# Define a custom function and JIT-compile it with numba.
@numba.jit(nopython=True)
def nunique(arr):
    return len(set(arr))

In [7]:

%%timeit
df.rolling(window='{:d}D'.format(window), on='Date').apply(nunique)

430 ms ± 31.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]:

# Check accuracy of results.
test = df.rolling(window='{:d}D'.format(window), on='Date').apply(nunique)['PersonId'].values
assert all(ref == test)

Speed test 3

In [9]:

# Define a custom function and JIT-compile it with numba.
@numba.jit(nopython=True)
def windowed_nunique(dates, pids, window):
    r"""Track number of unique persons in window,
    reading through arrays only once.

    Args:
        dates (numpy.ndarray): Array of dates as number of days since epoch.
        pids (numpy.ndarray): Array of integer person identifiers.
        window (int): Width of window in units of difference of `dates`.

    Returns:
        ucts (numpy.ndarray): Array of unique counts.

    Raises:
        AssertionError: Raised if `len(dates) != len(pids)`

    Notes:
        * May be off by 1 compared to `pandas.core.window.Rolling`
            with a time series alias offset.

    """

    # Check arguments.
    assert dates.shape == pids.shape

    # Initialize counters.
    idx_min = 0
    idx_max = dates.shape[0]
    date_min = dates[idx_min]
    pid_min = pids[idx_min]
    pid_max = np.max(pids)
    pid_cts = np.zeros(pid_max, dtype=np.int64)
    pid_cts[pid_min] = 1
    uct = 1
    ucts = np.zeros(idx_max, dtype=np.int64)
    ucts[idx_min] = uct
    idx = 1

    # For each (date, person)...
    while idx < idx_max:

        # If person count went from 0 to 1, increment unique person count.
        date = dates[idx]
        pid = pids[idx]
        pid_cts[pid] += 1
        if pid_cts[pid] == 1:
            uct += 1

        # For past dates outside of window...
        while (date - date_min) > window:

            # If person count went from 1 to 0, decrement unique person count.
            pid_cts[pid_min] -= 1
            if pid_cts[pid_min] == 0:
                uct -= 1
            idx_min += 1
            date_min = dates[idx_min]
            pid_min = pids[idx_min]

        # Record unique person count.
        ucts[idx] = uct
        idx += 1

    return ucts
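For reference, the same two-pointer idea behind windowed_nunique can be sketched in plain Python with a dict of per-person counts (rolling_unique is a hypothetical helper, not part of the notebook; note its eviction rule `dates[left] <= date - window` treats both window edges as inclusive integer days):

```python
def rolling_unique(dates, pids, window):
    """Count distinct pids among records within `window` inclusive days of each date."""
    counts = {}  # per-person occurrence count within the current window
    left = 0     # index of the oldest record still inside the window
    out = []
    for date, pid in zip(dates, pids):
        counts[pid] = counts.get(pid, 0) + 1
        # Evict records whose date has fallen out of the trailing window.
        while dates[left] <= date - window:
            old = pids[left]
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
            left += 1
        out.append(len(counts))
    return out
```

A dict avoids preallocating an array sized by the maximum person ID, at the cost of hashing; the numba array version trades memory for speed.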

In [10]:

# Cast dates to integers.
df['DateEpoch'] = (df['Date'] - pd.to_datetime('1970-01-01'))/pd.to_timedelta(1, unit='D')
df['DateEpoch'] = df['DateEpoch'].astype(int)
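As an aside, the epoch-day cast above can also be done directly on the underlying int64 nanoseconds (a sketch; gives the same result for datetime64[ns] dates at midnight):

```python
import pandas as pd

dates = pd.to_datetime(['1970-01-01', '1970-01-02', '2010-01-19'])
# datetime64[ns] stores integer nanoseconds since epoch; floor-divide to days.
days = dates.astype('int64') // (24 * 60 * 60 * 10**9)
# days → [0, 1, 14628]
```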

In [11]:

%%timeit
windowed_nunique(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)

107 µs ± 63.5 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [12]:

# Check accuracy of results.
test = windowed_nunique(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)
# Note: Method may be off by 1.
assert all(np.isclose(ref, np.asarray(test), atol=1))

In [13]:

# Show where the calculation doesn't match.
print("Where reference ('ref') calculation of number of unique people doesn't match 'test':")
df['ref'] = ref
df['test'] = test
df.loc[df['ref'] != df['test']].head() # 9044 rows × 5 columns

Out[13]:

Where reference ('ref') calculation of number of unique people doesn't match 'test':

|    | Date       | PersonId | DateEpoch | ref  | test | 
|----|------------|----------|-----------|------|------| 
| 78 | 2010-01-19 | 99       | 14628     | 56.0 | 55   | 
| 79 | 2010-01-19 | 96       | 14628     | 56.0 | 55   | 
| 80 | 2010-01-19 | 88       | 14628     | 56.0 | 55   | 
| 81 | 2010-01-20 | 94       | 14629     | 56.0 | 55   | 
| 82 | 2010-01-20 | 48       | 14629     | 57.0 | 56   | 
Samuel Harrold asked Sep 28 '17




1 Answer

I had 2 errors in the fast method windowed_nunique, now corrected in windowed_nunique_corrected below:

  1. The array pid_cts, which tracks how many times each person ID occurs within the window, was too small: since person IDs are used as indices, it needs np.max(pids) + 1 elements, not np.max(pids).
  2. Because the leading and trailing edges of the window both include integer days, date_min should be updated when (date - date_min + 1) > window, not when (date - date_min) > window.
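To see the edge convention concretely: pandas offset-based windows are closed on the right, so a '2D' window at day t covers days t-1 and t, i.e. 2 inclusive integer days (a minimal sketch with toy data):

```python
import pandas as pd

df2 = pd.DataFrame({
    'Date': pd.to_datetime(['2010-01-01', '2010-01-02', '2010-01-03']),
    'PersonId': [0, 1, 2]})
# The '2D' window at 2010-01-03 spans (2010-01-01, 2010-01-03], i.e. 2 whole
# days, which is why the eviction test uses (date - date_min + 1) > window.
counts = df2.rolling(window='2D', on='Date').count()['PersonId'].tolist()
# counts == [1.0, 2.0, 2.0]
```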

Related links:

  • Source Jupyter Notebook updated with solution: https://gist.github.com/stharrold/17589e6809d249942debe3a5c43d38cc

In [14]:

# Define a custom function and JIT-compile it with numba.
@numba.jit(nopython=True)
def windowed_nunique_corrected(dates, pids, window):
    r"""Track number of unique persons in window,
    reading through arrays only once.

    Args:
        dates (numpy.ndarray): Array of dates as number of days since epoch.
        pids (numpy.ndarray): Array of integer person identifiers.
            Required: min(pids) >= 0
        window (int): Width of window in units of difference of `dates`.
            Required: window >= 1

    Returns:
        ucts (numpy.ndarray): Array of unique counts.

    Raises:
        AssertionError: Raised if not...
            * len(dates) == len(pids)
            * min(pids) >= 0
            * window >= 1

    Notes:
        * Matches `pandas.core.window.Rolling`
            with a time series alias offset.

    """

    # Check arguments.
    assert len(dates) == len(pids)
    assert np.min(pids) >= 0
    assert window >= 1

    # Initialize counters.
    idx_min = 0
    idx_max = dates.shape[0]
    date_min = dates[idx_min]
    pid_min = pids[idx_min]
    pid_max = np.max(pids) + 1
    pid_cts = np.zeros(pid_max, dtype=np.int64)
    pid_cts[pid_min] = 1
    uct = 1
    ucts = np.zeros(idx_max, dtype=np.int64)
    ucts[idx_min] = uct
    idx = 1

    # For each (date, person)...
    while idx < idx_max:

        # Lookup date, person.
        date = dates[idx]
        pid = pids[idx]

        # If person count went from 0 to 1, increment unique person count.
        pid_cts[pid] += 1
        if pid_cts[pid] == 1:
            uct += 1

        # For past dates outside of window...
        # Note: If window=3, it includes day0,day1,day2.
        while (date - date_min + 1) > window:

            # If person count went from 1 to 0, decrement unique person count.
            pid_cts[pid_min] -= 1
            if pid_cts[pid_min] == 0:
                uct -= 1
            idx_min += 1
            date_min = dates[idx_min]
            pid_min = pids[idx_min]

        # Record unique person count.
        ucts[idx] = uct
        idx += 1

    return ucts

In [15]:

# Cast dates to integers.
df['DateEpoch'] = (df['Date'] - pd.to_datetime('1970-01-01'))/pd.to_timedelta(1, unit='D')
df['DateEpoch'] = df['DateEpoch'].astype(int)

In [16]:

%%timeit
windowed_nunique_corrected(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)

98.8 µs ± 41.3 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [17]:

# Check accuracy of results.
test = windowed_nunique_corrected(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)
assert all(ref == test)
Samuel Harrold answered Oct 14 '22