I'm working with financial data, which is recorded at irregular intervals. Some of the timestamps are duplicates, which is making analysis tricky. Here is an example of the data - note that there are four 2016-08-23 00:00:17.664193 timestamps:
In [167]: ts
Out[167]:
                               last  last_sz      bid      ask
datetime
2016-08-23 00:00:14.161128  2170.75        1  2170.75  2171.00
2016-08-23 00:00:14.901180  2171.00        1  2170.75  2171.00
2016-08-23 00:00:17.196639  2170.75        1  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        1  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        1  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        2  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        1  2170.75  2171.00
2016-08-23 00:00:26.206108  2170.75        2  2170.75  2171.00
2016-08-23 00:00:28.322456  2170.75        7  2170.75  2171.00
2016-08-23 00:00:28.322456  2170.75        1  2170.75  2171.00
In this example there are only a few duplicates, but in some cases there are hundreds of consecutive rows all sharing the same timestamp. I'm aiming to solve this by adding 1 extra nanosecond to each duplicate (so in the case of 4 consecutive identical timestamps, I'd add 1 ns to the second, 2 ns to the third, and 3 ns to the fourth). For example, the data above would be converted to:
In [169]: make_timestamps_unique(ts)
Out[169]:
                                  last  last_sz      bid     ask
newindex
2016-08-23 00:00:14.161128000  2170.75        1  2170.75  2171.0
2016-08-23 00:00:14.901180000  2171.00        1  2170.75  2171.0
2016-08-23 00:00:17.196639000  2170.75        1  2170.75  2171.0
2016-08-23 00:00:17.664193000  2171.00        1  2170.75  2171.0
2016-08-23 00:00:17.664193001  2171.00        1  2170.75  2171.0
2016-08-23 00:00:17.664193002  2171.00        2  2170.75  2171.0
2016-08-23 00:00:17.664193003  2171.00        1  2170.75  2171.0
2016-08-23 00:00:26.206108000  2170.75        2  2170.75  2171.0
2016-08-23 00:00:28.322456000  2170.75        7  2170.75  2171.0
2016-08-23 00:00:28.322456001  2170.75        1  2170.75  2171.0
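In other words, whatever make_timestamps_unique ends up doing, the converted index should be unique and still in time order. A rough sanity check I'd like to be able to run (hypothetical, just restating the requirement as code):

converted = make_timestamps_unique(ts)
# no duplicate timestamps should remain
assert converted.index.is_unique
# and the time ordering should be preserved (assuming the added nanoseconds
# never push a row past the next distinct timestamp)
assert converted.index.is_monotonic_increasing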
I've struggled to find a good way to do this - my current solution is to make multiple passes, checking for duplicates each time, and adding 1ns to all but the first in a series of identical timestamps. Here's the code:
import numpy as np
import pandas as pd

def make_timestamps_unique(ts):
    # repeatedly shift all but the first of each run of duplicates by 1 ns
    # until no duplicate timestamps remain
    mask = ts.index.duplicated('first')
    duplicate_count = np.sum(mask)
    passes = 0
    while duplicate_count > 0:
        ts.loc[:, 'newindex'] = ts.index
        ts.loc[mask, 'newindex'] += pd.Timedelta('1ns')
        ts = ts.set_index('newindex')
        mask = ts.index.duplicated('first')
        duplicate_count = np.sum(mask)
        passes += 1
    print('%d passes of duplication loop' % passes)
    return ts
This is obviously quite inefficient - it often requires hundreds of passes, and if I try it on a 2-million-row DataFrame, I get a MemoryError. Any ideas for a better way to achieve this?
Here is a faster numpy version (although a little less readable), inspired by this SO article. The idea is to use cumsum on the duplicated timestamp flags, resetting the cumulative sum each time a np.nan is encountered:
import numpy as np

# mark duplicated timestamps (keep=False flags every row of a run),
# cast to float and replace 0 with NaN so unique rows become reset points
values = df.index.duplicated(keep=False).astype(float)
values[values == 0] = np.nan

# cumulative count that restarts after every NaN: at each NaN, subtract
# whatever has accumulated since the previous NaN
missings = np.isnan(values)
cumsum = np.cumsum(~missings)
diff = np.diff(np.concatenate(([0.], cumsum[missings])))
values[missings] = -diff

# add the per-row offsets (in nanoseconds) to the original index
result = df.index + np.cumsum(values).astype('timedelta64[ns]')
print(result)
DatetimeIndex(['2016-08-23 00:00:14.161128',
               '2016-08-23 00:00:14.901180',
               '2016-08-23 00:00:17.196639',
               '2016-08-23 00:00:17.664193001',
               '2016-08-23 00:00:17.664193002',
               '2016-08-23 00:00:17.664193003',
               '2016-08-23 00:00:17.664193004',
               '2016-08-23 00:00:26.206108',
               '2016-08-23 00:00:28.322456001',
               '2016-08-23 00:00:28.322456002'],
              dtype='datetime64[ns]', name='datetime', freq=None)
Timing this solution yields 10000 loops, best of 3: 107 µs per loop, whereas the groupby/apply approach from @DYZ (which is more readable) is roughly 50 times slower on the dummy data: 100 loops, best of 3: 5.3 ms per loop.
Finally, of course, you have to assign the result back to your index:
df.index = result
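To make the reset-at-NaN step more concrete, here is the same trick on a toy float array (illustrative values only, not taken from the data above):

import numpy as np

# toy illustration of the "cumulative sum that restarts after each NaN" trick:
# 1.0 marks a duplicated position, NaN marks a unique (reset) position
values = np.array([np.nan, 1., 1., 1., np.nan, 1.])
missings = np.isnan(values)
cumsum = np.cumsum(~missings)                              # [0 1 2 3 3 4]
diff = np.diff(np.concatenate(([0.], cumsum[missings])))   # count accumulated since the previous NaN
values[missings] = -diff                                   # subtracting it resets the running sum
print(np.cumsum(values))                                   # [0. 1. 2. 3. 0. 1.]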
You can group the rows by the index and then add a range of sequential timedeltas to the index of each group. I am not sure if this can be done directly with the index, but you can first convert the index to an ordinary column, apply the operation to the column, and set the column as the index again:
import numpy as np

# within each group of identical timestamps, add 0, 1, 2, ... nanoseconds
newindex = ts.reset_index() \
    .groupby('datetime')['datetime'] \
    .apply(lambda x: x + np.arange(x.size).astype('timedelta64[ns]'))
ts.index = newindex
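As a side note (my own sketch, not part of either answer above): the same grouping idea can be written without apply by using cumcount, which numbers the rows of each duplicate run 0, 1, 2, ..., so the first row of a run keeps its original timestamp, matching the desired output in the question. This assumes ts still has its original (duplicated) DatetimeIndex:

import pandas as pd

# cumcount gives 0, 1, 2, ... within each group of identical index values,
# so the first row of each run of duplicates is left untouched
offsets = ts.groupby(level=0).cumcount().values
ts.index = ts.index + pd.to_timedelta(offsets, unit='ns')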