Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Date Difference Between Two Device Failures

I am trying to calculate the # of days between failures. I'd like to know on each day in the series the # of days passed since the last failure where failure = 1. There may be anywhere from 1 to 1500 devices.

For Example, Id like my dataframe to look like this (please pull data from url in the second code block. This is just a short example of a larger dataframe.):

date        device      failure      elapsed    
10/01/2015  S1F0KYCR    1            0           
10/07/2015  S1F0KYCR    1            7           
10/08/2015  S1F0KYCR    0            0           
10/09/2015  S1F0KYCR    0            0           
10/17/2015  S1F0KYCR    1            11          
10/31/2015  S1F0KYCR    0            0           
10/01/2015  S8KLM011    1            0           
10/02/2015  S8KLM011    1            2           
10/07/2015  S8KLM011    0            0
10/09/2015  S8KLM011    0            0
10/11/2015  S8KLM011    0            0
10/21/2015  S8KLM011    1            20 

Sample Code:

Edit: Please pull actual data from code block below. The above sample data is an short example. Thanks.

url = "https://raw.githubusercontent.com/dsdaveh/device-failure-analysis/master/device_failure.csv"

df = pd.read_csv(url, encoding = "ISO-8859-1")

df = df.sort_values(by = ['date', 'device'], ascending = True) #Sort by date and device
df['date'] = pd.to_datetime(df['date'],format='%Y/%m/%d') #format date to datetime

This is where I am running into obstacles. However the new column should contain the # of days since last failure, where failure = 1.

test['date'] = 0
for i in test.index[1:]:
    if not test['failure'][i]:
        test['elapsed'][i] = test['elapsed'][i-1] + 1

I have also tried

fails = df[df.failure==1]
fails.Dates = trues.index #need this because .diff() won't work on the index..
fails.Elapsed = trues.Dates.diff()
like image 288
Starbucks Avatar asked May 21 '19 02:05

Starbucks


1 Answers

Using pandas.DataFrame.groupby with diff and apply:

import pandas as pd
import numpy as np

df['date'] = pd.to_datetime(df['date'])
s = df.groupby(['device', 'failure'])['date'].diff().dt.days.add(1)
s = s.fillna(0)
df['elapsed'] = np.where(df['failure'], s, 0)

Output:

         Date    Device  Failure  Elapsed
0  2015-10-01  S1F0KYCR        1      0.0
1  2015-10-07  S1F0KYCR        1      7.0
2  2015-10-08  S1F0KYCR        0      0.0
3  2015-10-09  S1F0KYCR        0      0.0
4  2015-10-17  S1F0KYCR        1     11.0
5  2015-10-31  S1F0KYCR        0      0.0
6  2015-10-01  S8KLM011        1      0.0
7  2015-10-02  S8KLM011        1      2.0
8  2015-10-07  S8KLM011        0      0.0
9  2015-10-09  S8KLM011        0      0.0
10 2015-10-11  S8KLM011        0      0.0
11 2015-10-21  S8KLM011        1     20.0

Update:

Found out the actual data linked in the OP contains No device that has more than two failure cases, making the final result all zeros (i.e. no second failure has ever happened and thus nothing to calculate for elapsed). Using OP's original snippet:

import pandas as pd

url = "http://aws-proserve-data-science.s3.amazonaws.com/device_failure.csv"

df = pd.read_csv(url, encoding = "ISO-8859-1")
df = df.sort_values(by = ['date', 'device'], ascending = True) 
df['date'] = pd.to_datetime(df['date'],format='%Y/%m/%d')

Find if any device has more than 1 failure:

df.groupby(['device'])['failure'].sum().gt(1).any()
# False

Which actually confirms that the all zeros in df['elapsed'] is actually a correct answer :)

If you tweak your data a bit, it does yield elapsed just as expected.

df.loc[6879, 'device'] = 'S1F0RRB1'
# Making two occurrence of failure for device S1F0RRB1

s = df.groupby(['device', 'failure'])['date'].diff().dt.days.add(1)
s = s.fillna(0)
df['elapsed'] = np.where(df['failure'], s, 0)
df['elapsed'].value_counts()
# 0.0    124493
# 3.0         1
like image 69
Chris Avatar answered Nov 02 '22 23:11

Chris