
How can I remove sharp jumps in data?

I have some skin temperature data (collected at 1Hz) which I intend to analyse.

However, the sensors were not always in contact with the skin. So I have a challenge of removing this non-skin temperature data, whilst preserving the actual skin temperature data. I have about 100 files to analyse, so I need to make this automated.

I'm aware that there is already a similar post; however, I've not been able to use it to solve my problem.

My data roughly looks like this:

df =

timeStamp                 Temp
2018-05-04 10:08:00       28.63
         .                  . 
         .                  .
2018-05-04 21:00:00       31.63

The first step I've taken is to simply apply a minimum threshold; this has got rid of the majority of the non-skin data. However, I'm left with the sharp jumps where the sensor was either removed or attached:

[Figure: data after basic threshold filtering]
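
For reference, that thresholding step is just a simple boolean mask; the 28.0 cutoff below is a placeholder rather than my actual value:

df = df[df['Temp'] > 28.0]  # drop obvious non-skin readings below a minimum plausible skin temperature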

To remove these jumps, I was thinking about taking an approach where I use the first-order difference of the temperature and then use another set of thresholds to get rid of the data I'm not interested in.

e.g.

df_diff = df.diff(60) # period of about 60 makes jumps stick out

filter_index = np.nonzero((df_diff.Temp < -1) | (df_diff.Temp > 0.5)) # where the diff is less than -1 or greater than 0.5, it is most likely a data jump

[Figure: first-order diff of the data]

However, I find myself stuck here. The main problem is that:

1) I don't know how to use this index list to delete the non-skin data in df. What is the best way to do this?

The more minor problem is that 2) I think I will still be left with some residual artefacts from the data jumps near the edges (e.g. where a tighter threshold would start to throw away good data). Is there either a better filtering strategy or a way to then get rid of these artefacts? (One idea I've had is sketched below.)
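
For concreteness, this is roughly what I'm imagining for both points; the rolling window of 60 samples (one minute at 1 Hz) used to widen the mask is a guess on my part:

jump_mask = (df_diff.Temp < -1) | (df_diff.Temp > 0.5)  # candidate jump locations
# widen the mask so residual artefacts around each jump get dropped too
jump_mask = jump_mask.astype(float).rolling(60, center=True, min_periods=1).max() > 0
df_clean = df[~jump_mask]  # keep only rows away from any jump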

Edit: as suggested, I've also calculated the second-order diff, but to be honest, I think the first-order diff would allow for tighter thresholds (see below):

[Figure: second-order diff of the data]

Edit 2: Link to sample data

asked Jun 05 '18 by user3168953


2 Answers

Try the code below (I used a tangent function to generate data). I used the second order difference idea from Mad Physicist in the comments.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame()
df[0] = np.arange(0,10,0.005)
df[1] = np.tan(df[0])

#the following line calculates the absolute value of a second order finite 
#difference (derivative)
df[2] = 0.5*(df[1].diff()+df[1].diff(periods=-1)).abs()

df.loc[df[2] < .05][1].plot() #keep only regions with a low rate-of-change (jumps removed)
df[1].plot()                  #plot original data

plt.show()

Following is a zoom of the output showing what got filtered. Matplotlib plots a line from beginning to end of the removed data.

[Figure: zoom showing the filtered-out region]

I believe your first question is answered by the .loc selection above.

Your second question will take some experimentation with your dataset. The code above only removes high-derivative data; you'll also need your threshold selection to remove zeroes and the like. You can experiment with where to make the derivative cutoff, and you can plot a histogram of the derivative to give you a hint as to what to remove (see below).
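
For instance, continuing the snippet above, a histogram makes it easy to eyeball where the bulk of the derivative values sit versus the jump tail:

df[2].hist(bins=100)  # jump regions show up in the long right tail
plt.show()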

Also, higher-order difference equations can help with smoothing; this should remove artifacts without your having to trim around the cuts.

Edit:

A fourth-order finite difference can be applied using this:

df[2] = (df[1].diff(periods=1)-df[1].diff(periods=-1))*8/12 - \
    (df[1].diff(periods=2)-df[1].diff(periods=-2))*1/12
df[2] = df[2].abs()

It's reasonable to think that it may help. The coefficients above can be worked out by hand, or looked up for higher orders, at the following link: Finite Difference Coefficients Calculator

Note: The above expressions are not proper first derivatives. The fourth-order expression must be divided by the interval length (in this case 0.005) to give an actual derivative, and the second-order expression is really a second difference, which scales with the square of the interval. For simple thresholding the scaling doesn't matter, but it does if you want thresholds that carry over between sampling rates.
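
For example, to rescale the fourth-order expression above into an actual derivative estimate (continuing the snippet, where df[0] is spaced 0.005 apart):

h = df[0].diff().mean()  # the sample spacing (0.005 in this example)
df[2] = df[2] / h        # now an estimate of |f'(x)|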

answered Sep 30 '22 by Boergler


Here's a suggestion that targets your issues regarding

  1. [...] an approach where I use the first-order difference of the temperature and then use another set of thresholds to get rid of the data I'm not interested in.

  2. [...] I don't know how to use this index list to delete the non-skin data in df. What is the best way to do this?

using stats.zscore() and pandas.merge()

As it is, it will still have a minor issue with your concerns regarding

[...]left with some residual artefacts from the data jumps near the edges[...]

But we'll get to that later.

First, here's a snippet to produce a dataframe that shares some of the challenges with your dataset:

# Imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats

np.random.seed(22)

# A function for noisy data with a trend element
def sample():

    base = 100
    nsample = 50
    sigma = 10
    
    # Basic df with trend and sinus seasonality 
    trend1 = np.linspace(0,1, nsample)
    y1 = np.sin(trend1)
    dates = pd.date_range('2016-01-01', periods=nsample).tolist()
    df = pd.DataFrame({'dates':dates, 'trend1':trend1, 'y1':y1})
    df = df.set_index(['dates'])
    df.index = pd.to_datetime(df.index)
    
    # Gaussian Noise with amplitude sigma
    df['y2'] = sigma * np.random.normal(size=nsample)
    df['y3'] = df['y2'] + base + (np.sin(trend1))
    df['trend2'] = 1/(np.cos(trend1)/1.05)
    df['y4'] = df['y3'] * df['trend2']
    
    df=df['y4'].to_frame()
    df.columns = ['Temp']
    
    # Insert missing values and spikes
    df.loc[df.index[20:31], 'Temp'] = np.nan
    df.loc[df.index[19], 'Temp'] = df['Temp'].iloc[39] / 4000
    df.loc[df.index[31], 'Temp'] = df['Temp'].iloc[15] / 4000
    
    return(df)
    
# Dataframe with random data
df_raw = sample()
df_raw.plot()

enter image description here

As you can see, there are two distinct spikes with missing numbers between them. And it's really the missing numbers that are causing the problems here if you prefer to isolate values where the differences are large. The first spike is not a problem since you'll find the difference between a very small number and a number that is more similar to the rest of the data:

[Figure: diff around the first spike]

But for the second spike, the difference at the spike itself is computed against a missing number and comes out as NaN, so the extreme difference (and hence the data-point you'll end up removing) is the one between your outlier and the next observation:

[Figure: diff around the second spike]

This is not a huge problem for one single observation; you could just fill it right back in. But for larger data sets that would not be a very viable solution. Anyway, if you can manage without that particular value, the below code should solve your problem. You will also have a similar problem with your very first observation, but I think it would be far more trivial to decide whether or not to keep that one value.
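
You can see this directly in the sample data by listing the largest absolute first differences; they sit at the first spike and at the observation just after the gap, not at the second spike itself:

df_raw['Temp'].diff().abs().nlargest(3)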

The steps:

# 1. Get some info about the original data:
firstVal = df_raw[:1]
colName = df_raw.columns

# 2. Take the first difference
df_diff = df_raw.diff()

# 3. Remove missing values
df_clean = df_diff.dropna()

# 4. Select a level for a Z-score to identify and remove outliers
level = 3
df_Z = df_clean[(np.abs(stats.zscore(df_clean)) < level).all(axis=1)]
ix_keep = df_Z.index

# 5. Subset the raw dataframe with the indexes you'd like to keep
df_keep = df_raw.loc[ix_keep]

# 6. 
# df_keep will be missing some indexes.
# Do the following if you'd like to keep those indexes
# and, for example, fill missing values with the previous values
df_out = pd.merge(df_keep, df_raw, how='outer', left_index=True, right_index=True)

# 7. Keep only the first column
df_out = df_out.iloc[:, 0].to_frame()

# 8. Fill missing values
df_complete = df_out.ffill()

# 9. Replace first value
df_complete.iloc[0] = firstVal.iloc[0]

# 10. Reset column names
df_complete.columns = colName

# Result
df_complete.plot()

[Figure: cleaned data]
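
Before the forward-fill, you can also inspect exactly which timestamps were removed (either as NaNs or as outliers) and decide whether any of them deserve to be put back:

dropped = df_raw.index.difference(df_keep.index)
print(dropped)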

Here's the whole thing for an easy copy-paste:

# Imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats

np.random.seed(22)

# A function for noisy data with a trend element
def sample():

    base = 100
    nsample = 50
    sigma = 10
    
    # Basic df with trend and sinus seasonality 
    trend1 = np.linspace(0,1, nsample)
    y1 = np.sin(trend1)
    dates = pd.date_range('2016-01-01', periods=nsample).tolist()
    df = pd.DataFrame({'dates':dates, 'trend1':trend1, 'y1':y1})
    df = df.set_index(['dates'])
    df.index = pd.to_datetime(df.index)
    
    # Gaussian Noise with amplitude sigma
    df['y2'] = sigma * np.random.normal(size=nsample)
    df['y3'] = df['y2'] + base + (np.sin(trend1))
    df['trend2'] = 1/(np.cos(trend1)/1.05)
    df['y4'] = df['y3'] * df['trend2']
    
    df=df['y4'].to_frame()
    df.columns = ['Temp']
    
    # Insert missing values and spikes
    df.loc[df.index[20:31], 'Temp'] = np.nan
    df.loc[df.index[19], 'Temp'] = df['Temp'].iloc[39] / 4000
    df.loc[df.index[31], 'Temp'] = df['Temp'].iloc[15] / 4000
    
    return(df)

# A function for removing outliers
def noSpikes(df, level, keepFirst):

    # 1. Get some info about the original data:
    firstVal = df[:1]
    colName = df.columns
    
    # 2. Take the first difference
    df_diff = df.diff()
    
    # 3. Remove missing values
    df_clean = df_diff.dropna()
    
    # 4. Select a level for a Z-score to identify and remove outliers
    df_Z = df_clean[(np.abs(stats.zscore(df_clean)) < level).all(axis=1)]
    ix_keep = df_Z.index
    
    # 5. Subset the raw dataframe with the indexes you'd like to keep
    df_keep = df.loc[ix_keep]
    
    # 6. 
    # df_keep will be missing some indexes.
    # Do the following if you'd like to keep those indexes
    # and, for example, fill missing values with the previous values
    df_out = pd.merge(df_keep, df, how='outer', left_index=True, right_index=True)
    
    # 7. Keep only the first column
    df_out = df_out.iloc[:, 0].to_frame()
    
    # 8. Fill missing values
    df_complete = df_out.ffill()
    
    # 9. Reset column names
    df_complete.columns = colName
    
    # Keep the first value
    if keepFirst:
        df_complete.iloc[0] = firstVal.iloc[0]
    
    return(df_complete)

# Dataframe with random data
df_raw = sample()
df_raw.plot()

# Remove outliers
df_cleaned = noSpikes(df=df_raw, level=3, keepFirst=True)
        
df_cleaned.plot()
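
And since you mention needing to automate this across roughly 100 files, you can just loop noSpikes() over them. A sketch, assuming CSV files with the timeStamp and Temp columns from your question (the paths here are hypothetical):

import glob

for path in glob.glob('data/*.csv'):  # wherever your files live
    df_file = pd.read_csv(path, parse_dates=['timeStamp'], index_col='timeStamp')
    df_clean = noSpikes(df=df_file, level=3, keepFirst=True)
    df_clean.to_csv(path.replace('.csv', '_clean.csv'))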

answered Sep 30 '22 by vestland