How to use time-series data in classification in sklearn

I have a time-series dataset, as follows, where I record two time series for each of my sensors. The Label column indicates whether the sensor is faulty (1) or not (0).

sensor, time-series 1, time-series 2, Label
x1, [38, 38, 35, 33, 32], [18, 18, 12, 11, 9], 1
x2, [33, 32, 35, 36, 32], [13, 12, 15, 16, 12], 0
and so on ..

Currently, I extract various features from the two time series (e.g., min, max, median, slope) and use them for classification with a RandomForestClassifier in sklearn, as follows.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

df = pd.read_csv(input_file)
X = df[[myfeatures]]
y = df['Label']

# Random Forest classifier
clf = RandomForestClassifier(random_state=42, class_weight="balanced",
                             criterion='gini', max_depth=3,
                             max_features='auto', n_estimators=500)

k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

output = cross_validate(clf, X, y, cv=k_fold, scoring='roc_auc', return_estimator=True)
for idx, estimator in enumerate(output['estimator']):
    print("Features sorted by their score for estimator {}:".format(idx))
    feature_temp_importances = pd.DataFrame(estimator.feature_importances_,
                                            index=mylist,
                                            columns=['importance']).sort_values('importance', ascending=False)
    print(feature_temp_importances)
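For reference, the summary features mentioned above could be computed per series along these lines (`extract_features` is a hypothetical helper written for illustration, not part of the original code):

```python
import numpy as np

def extract_features(series):
    """Hypothetical helper: summary features for one time series."""
    series = np.asarray(series, dtype=float)
    x = np.arange(len(series))
    # Least-squares slope of the series against its time index.
    slope = np.polyfit(x, series, 1)[0]
    return {
        'min': series.min(),
        'max': series.max(),
        'median': np.median(series),
        'slope': slope,
    }

# Features for sensor x1's first series from the question.
features = extract_features([38, 38, 35, 33, 32])
```

Each sensor's feature dictionary would then become one row of the feature matrix `X`.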

However, my results are very poor. I am wondering whether it is possible to give the time-series data as-is to the random forest classifier, e.g., giving x1's features as [38, 38, 35, 33, 32], [18, 18, 12, 11, 9]. If so, how can I do it in sklearn?

I am happy to provide more details if needed.

asked Jan 31 '26 by EmJ

2 Answers

Yes, you can use the entire time series as the features for your classifier.

To do that, just use the raw data: concatenate the two time series for each sensor and feed the result into the classifier.

from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.ensemble import RandomForestClassifier
import numpy as np

n_samples = 100

# Generate n_samples random time series of length 5 with integer values from 0 to 99.
x1 = np.array([np.random.randint(0, 100, 5) for _ in range(n_samples)])
x2 = np.array([np.random.randint(0, 100, 5) for _ in range(n_samples)])

# Concatenate the two series for each sensor: each row now has 10 features.
X = np.hstack((x1, x2))

# Generate n_samples random binary labels.
y = np.random.randint(0, 2, n_samples)

# Random Forest classifier
clf = RandomForestClassifier(random_state=42, class_weight="balanced",
                             criterion='gini', max_depth=3,
                             max_features='auto', n_estimators=500)

k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

output = cross_validate(clf, X, y, cv=k_fold, scoring='roc_auc', return_estimator=True)
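To gauge performance, you can average the per-fold ROC AUC scores that `cross_validate` returns under `'test_score'`. A minimal, self-contained sketch (with random data standing in for the real features, so the score itself is meaningless here):

```python
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.ensemble import RandomForestClassifier
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(100, 10))  # 100 sensors, 2 concatenated series of length 5
y = rng.integers(0, 2, size=100)          # random binary labels

clf = RandomForestClassifier(random_state=42, n_estimators=50)
k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_results = cross_validate(clf, X, y, cv=k_fold, scoring='roc_auc')

# One ROC AUC per fold; the mean summarizes overall performance.
mean_auc = cv_results['test_score'].mean()
```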

However, a random forest might not be the best fit for these features. Have a look at LSTMs or even 1-D CNNs; they may be better suited to this approach of using the entire time series as input.
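sklearn itself has no LSTM or CNN, so those would require a library like Keras or PyTorch. If you want a neural baseline while staying inside sklearn, an `MLPClassifier` on the raw concatenated series is one rough stand-in (a sketch with synthetic data, not a substitute for a proper sequence model):

```python
from sklearn.neural_network import MLPClassifier
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 10))            # 100 sensors, 10 raw time-series values each
y = rng.integers(0, 2, size=100)     # random binary labels

# A small feed-forward network; it ignores the temporal ordering,
# unlike an LSTM or 1-D CNN, but accepts the raw values directly.
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=42)
mlp.fit(X, y)
preds = mlp.predict(X)
```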

answered Feb 01 '26 by jpnadas



If you want to feed the whole time series to the model and use it to make predictions, you should try RNNs.

Another option, if you want to stay with sklearn, is to apply a rolling mean or rolling std to your time series, so that x at time t is influenced by x at time t - 1, and so on. With this correlation you can classify each point into a specific class, and then classify the whole time series according to the points' majority label.
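The rolling statistics are straightforward with pandas; a minimal sketch on one sensor's series from the question:

```python
import pandas as pd

# One sensor's raw series (x1's first series from the question).
s = pd.Series([38, 38, 35, 33, 32])

# Rolling statistics over a window of 3 samples; the first window-1
# entries are NaN because the window is not yet full.
roll_mean = s.rolling(window=3).mean()
roll_std = s.rolling(window=3).std()
```

Each rolling value could then serve as a per-point feature for an ordinary sklearn classifier.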

answered Feb 01 '26 by Guillem