I have a data file that comprises time series read asynchronously from multiple sensors. Basically for every data element in my file, I have a sensor ID and time at which it was read, but I do not always have all sensors for every time, and read times may not be evenly spaced. Something like:
ID,time,data
0,0,1
1,0,2
2,0,3
0,1,4
2,1,5 # skip some sensors for some time steps
0,2,6
2,2,7
2,3,8
1,5,9 # skip some time steps
2,5,10
Important note: the actual time column is of datetime type.
What I want is to be able to zero-order hold (forward fill) values for every sensor for any time steps where that sensor does not exist, and either set to zero or back fill any sensors that are not read at the earliest time steps. What I want is a dataframe that looks like it was read from:
ID,time,data
0,0,1
1,0,2
2,0,3
0,1,4
1,1,2 # ID 1 hold value from time step 0
2,1,5
0,2,6
1,2,2 # ID 1 still holding
2,2,7
0,3,6 # ID 0 holding
1,3,2 # ID 1 still holding
2,3,8
0,5,6 # ID 0 still holding, can skip totally missing time steps
1,5,9 # ID 1 finally updates
2,5,10
I initialize my dataframe and set my indices:
import pandas as pd
df = pd.read_csv(filename, dtype=int)
df.set_index(['ID', 'time'], inplace=True)
I tried things like:
filled = df.reindex(method='ffill')
or the like with various values passed to the index keyword argument like df.index, ['time'], etc. This always either throws an error because I passed an invalid keyword argument, or does nothing visible to the dataframe. I think it is not recognizing that the data I am looking for is "missing".
I also tried:
df.update(df.groupby(level=0).ffill())
or level=1 based on Multi-Indexed fillna in Pandas, but I get no visible change to the dataframe again, I think because I don't have anything currently where I want my values to go.
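For reference, both attempts appear to share one cause: forward fill only acts on rows that already exist and hold NaN, and here the missing (ID, time) rows are absent altogether. A minimal sketch, assuming the sample CSV above is inlined as a string (csv_text, full_index, expanded, and filled are illustrative names, not part of the original code), that first creates those rows by reindexing against every combination of observed time step and sensor ID:

```python
import io
import pandas as pd

# The sample data from the question, inlined for a self-contained example.
csv_text = """ID,time,data
0,0,1
1,0,2
2,0,3
0,1,4
2,1,5
0,2,6
2,2,7
2,3,8
1,5,9
2,5,10
"""
df = pd.read_csv(io.StringIO(csv_text)).set_index(['time', 'ID'])

# Every (time, ID) pair, using only time steps that occur at least once.
full_index = pd.MultiIndex.from_product(
    [df.index.levels[0], df.index.levels[1]], names=['time', 'ID'])

# reindex inserts the missing rows as NaN; only then can ffill act on them.
expanded = df.reindex(full_index)

# Zero-order hold within each sensor, then back fill any leading gaps
# (swap the bfill for .fillna(0) to zero them instead).
filled = expanded.groupby(level='ID').ffill()
filled = filled.groupby(level='ID').bfill()
print(filled.reset_index())
```

Because the filled values start out as NaN, the data column comes back as float rather than int.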
I have had some luck with numpy and non-integer indexing using something like:
data = [np.array(df.loc[level].data) for level in df.index.levels[0]]
shapes = [arr.shape for arr in data]
print(shapes)
# [(3,), (2,), (5,)]
data = [np.array([arr[i] for i in np.linspace(0, arr.shape[0]-1, num=max(shapes)[0])]) for arr in data]
print([arr.shape for arr in data])
# [(5,), (5,), (5,)]
But this has two problems:
data column is not just one column but a ton of values from a sensor suite).
Ultimately this is just the data-cleaning step for training a recurrent neural network, where for each time step I will need to feed a feature vector that always has the same structure (one set of measurements for each sensor ID for each time step).
Thank you for your help!
Here is one way, using reindex and a categorical time column:
df['time'] = df['time'].astype(pd.CategoricalDtype(categories=[0, 1, 2, 3, 4, 5]))
new_df = df.groupby('time', as_index=False).apply(lambda x: x.set_index('ID').reindex([0, 1, 2])).reset_index()
new_df['data'] = new_df.groupby('ID')['data'].ffill()
new_df.drop(columns='time').rename(columns={'level_0': 'time'})
Out[311]:
    time  ID  data
0      0   0   1.0
1      0   1   2.0
2      0   2   3.0
3      1   0   4.0
4      1   1   2.0
5      1   2   5.0
6      2   0   6.0
7      2   1   2.0
8      2   2   7.0
9      3   0   6.0
10     3   1   2.0
11     3   2   8.0
12     4   0   6.0
13     4   1   2.0
14     4   2   8.0
15     5   0   6.0
16     5   1   9.0
17     5   2  10.0
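An alternative that may suit the datetime case better is to pivot to a wide table, fill down each sensor column, and read the feature matrix straight off it. This is only a sketch under stated assumptions: the timestamps below are made up to stand in for the question's integer steps, and csv_text, wide, features, and long_df are illustrative names.

```python
import io
import pandas as pd

# Made-up timestamps standing in for the question's integer time steps.
csv_text = """ID,time,data
0,2020-01-01 00:00:00,1
1,2020-01-01 00:00:00,2
2,2020-01-01 00:00:00,3
0,2020-01-01 00:00:01,4
2,2020-01-01 00:00:01,5
0,2020-01-01 00:00:02,6
2,2020-01-01 00:00:02,7
2,2020-01-01 00:00:03,8
1,2020-01-01 00:00:05,9
2,2020-01-01 00:00:05,10
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['time'])

# One row per observed timestamp, one column per sensor ID; gaps become NaN.
wide = df.pivot(index='time', columns='ID', values='data')

# Zero-order hold down each sensor column, then back fill any leading gaps.
wide = wide.ffill().bfill()

# Fixed-width feature matrix of shape (n_time_steps, n_sensors).
features = wide.to_numpy()

# Back to the long layout, if that is preferred.
long_df = wide.stack().rename('data').reset_index()
print(long_df)
```

Only timestamps that occur at least once become rows, so the totally missing step (time 4 in the output above) is skipped, and no hard-coded list of times or IDs is needed.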