
tsfresh library for Python is taking way too long to process

I came across the tsfresh library as a way to featurize time series data. The documentation is great, and it seems like the perfect fit for the project I am working on.

I wanted to run the following code from the quick start section of the tsfresh documentation, which seems simple enough.

from tsfresh import extract_relevant_features
feature_filtered_direct=extract_relevant_features(result,y,column_id=0,column_sort=1)

My data consists of 400,000 rows of sensor data, with 6 sensors for each of 15 different ids. I started running the code, and 17 hours later it still had not finished. I figured the data set might be too large for the relevant feature extractor, so I trimmed it down to 3000, and then further down to 300. Neither reduction brought the run time under an hour, and each time I shut it down after an hour or so of waiting. I tried the standard feature extractor as well:

from tsfresh import extract_features
extracted_features = extract_features(timeseries, column_id="id", column_sort="time")

I also tried the example dataset from the tsfresh quick start section, which is very similar to my original data and has about the same number of data points as my reduced set.

Does anybody have any experience with this code? How would you go about making it run faster? I'm using Anaconda for Python 2.7.

Update: It seems to be related to multiprocessing. Because I am on Windows, code that uses multiprocessing has to be protected by

if __name__ == "__main__":
    main()

Once I added

if __name__ == "__main__":
    extracted_features = extract_features(timeseries, column_id="id", column_sort="time")

to my code, the example data worked. I'm still having issues running the extract_relevant_features function, and running extract_features on my own data set: both continue to run slowly. I have a feeling it's related to the multiprocessing freeze as well, but without any errors popping up it's impossible to tell. It takes about 30 minutes to extract features on less than 1% of my data set.
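For readers unfamiliar with the guard, here is a minimal sketch of the pattern using only the standard library. It is a toy stand-in, not tsfresh itself: the function summarize, the sample data, and the pool size are all hypothetical, and it uses Python 3 syntax rather than the 2.7 mentioned above. The point is only that anything that starts worker processes should sit behind the guard, because on Windows each worker re-imports the main module.

```python
import multiprocessing as mp


def summarize(values):
    """Toy stand-in for an expensive per-series feature calculation."""
    return {"min": min(values), "max": max(values), "sum": sum(values)}


def main():
    # Hypothetical data: one short series per id.
    series_by_id = {1: [3, 1, 2], 2: [10, 20, 30]}
    # Creating the pool starts worker processes. On Windows each worker
    # re-imports this module, so the pool must only be created from code
    # that runs behind the __main__ guard below.
    with mp.Pool(processes=2) as pool:
        results = pool.map(summarize, list(series_by_id.values()))
    return dict(zip(series_by_id.keys(), results))


if __name__ == "__main__":
    # Without this guard, a re-import in a worker would try to start
    # another pool instead of just defining the functions.
    print(main())
```

Keeping the work inside a function and calling it only from the guard means importing the module has no side effects, which is what the worker processes rely on.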

Michael Bawol asked Dec 14 '16 16:12



2 Answers

The syntax has changed slightly (see the docs). The current approach would be:

from tsfresh import extract_features
from tsfresh.feature_extraction import EfficientFCParameters, MinimalFCParameters
extract_features(timeseries, column_id="id", column_sort="time", default_fc_parameters=MinimalFCParameters())

Or, for a larger feature set that leaves out the calculators with high computational cost:

extract_features(timeseries, column_id="id", column_sort="time", default_fc_parameters=EfficientFCParameters())
Guido answered Sep 24 '22 16:09



Which version of tsfresh did you use? Which OS?

We are aware of the high computational costs of some feature calculators. There is little we can do about it. In the future we will implement some tricks like caching to increase the efficiency of tsfresh further.

Have you tried calculating only the basic features by using MinimalFeatureExtractionSettings? It contains only basic features such as max, min, median and so on, but should run way, way faster.

from tsfresh import extract_features
from tsfresh.feature_extraction import MinimalFeatureExtractionSettings
extracted_features = extract_features(timeseries, column_id="id", column_sort="time", feature_extraction_settings=MinimalFeatureExtractionSettings())
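To make "basic features such as max, min, median" concrete, here is a rough plain-Python sketch of that kind of per-id summary. It is not tsfresh's implementation and does not reproduce its output format. The function name basic_features and the (id, time, value) tuple layout are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, median


def basic_features(rows):
    """Compute a few cheap summary features per series id.

    rows: iterable of (series_id, time, value) tuples (hypothetical layout).
    """
    grouped = defaultdict(list)
    # Sort by id and then time so each series is collected in time order.
    for series_id, _time, value in sorted(rows, key=lambda r: (r[0], r[1])):
        grouped[series_id].append(value)
    return {
        series_id: {
            "minimum": min(values),
            "maximum": max(values),
            "median": median(values),
            "mean": mean(values),
        }
        for series_id, values in grouped.items()
    }
```

Each of these is a single pass over the values of one series, which is why a minimal feature set finishes quickly where the full set of calculators does not.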

Also, it is probably a good idea to install the latest version from the repo with pip install git+https://github.com/blue-yonder/tsfresh. We are actively developing it, and master should contain the newest and freshest version ;).

MaxBenChrist answered Sep 22 '22 16:09
