 

PCA with several time series as features of one instance with sklearn

I want to apply PCA on a data set where I have 20 time series as the features of one instance. I have about 1000 instances of this kind and am looking for a way to reduce dimensionality. For every instance I have a pandas DataFrame, like:

import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.normal(0, 1, (300, 20)))

Is there a way to use sklearn.fit on all instances at once, with each instance having a set of time series as its feature space? I mean, I could apply sklearn.fit to each instance separately, but I want the same principal components for all of them.

Is there a way? The only (unsatisfying) idea I have so far is to concatenate all the series of one instance into a single series, so that each instance has exactly one time series.

asked Sep 21 '18 by Mina L.

2 Answers

I do not find the other answers satisfactory, mainly because you should account for both the time series structure of the data and the cross-sectional information. You can't simply treat the features of each instance as a single series; doing so would inevitably lead to a loss of information and is, simply speaking, statistically wrong.

That said, if you really need to go for PCA, you should at least preserve the time series information:

PCA

Following silgon we transform the data into a numpy array:

# your 1000 pandas instances
instances = [pd.DataFrame(data=np.random.normal(0, 1, (300, 20))) for _ in range(1000)]
# stack them into a single numpy array so the data are easier to process
data = np.array([d.values for d in instances])   # shape: (1000, 300, 20)

This makes applying PCA way easier:

from sklearn.decomposition import PCA

reshaped_data = data.reshape((1000*300, 20))   # one big data panel with 20 series and 300,000 datapoints
n_comp = 10                                    # number of features to keep after dimensionality reduction
pca = PCA(n_components=n_comp)                 # create the PCA object
pca.fit(reshaped_data)                         # fit it to your transformed data
transformed_data = np.empty([1000, 300, n_comp])
for i in range(len(data)):
    transformed_data[i] = pca.transform(data[i])   # apply the transformation to each instance of the original dataset

Final output shape: transformed_data.shape: Out[]: (1000,300,n_comp).
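As a quick sanity check (my addition, not part of the original answer), you can look at how much variance the chosen components retain; explained_variance_ratio_ is an attribute of the fitted sklearn PCA object:

import numpy as np

# share of variance captured by each of the n_comp components (assumes `pca` was fitted as above)
explained = pca.explained_variance_ratio_
print(explained.round(3))        # per-component variance ratio
print(explained.sum().round(3))  # total variance retained by the kept components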

PLS

However, you can (and should, in my opinion) construct the factors from your matrix of features using partial least squares (PLS). This will also give you a further dimensionality reduction.

Let's say your data has the following shape: T=1000, N=300, P=20.

Then we have y=[T,1], X=[N,P,T].

Now, it's easy to see that for this to work our matrices need to be conformable for multiplication. In our case we will have: y = [T,1] = [1000,1] and X = [T, P*N] = [1000, 20*300] (the reshaped feature matrix fed to PLS).

Intuitively, what we are doing is creating a new feature for each of the N-1 = 299 lags of each of the P = 20 basic features.

I.e. for a given instance i, we will have something like this:

Instance_i: x_{1,i}, x_{1,i-1}, ..., x_{1,i-j}, x_{2,i}, x_{2,i-1}, ..., x_{2,i-j}, ..., x_{P,i}, x_{P,i-1}, ..., x_{P,i-j}, with j = 1, ..., N-1
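As a small illustration (my addition): the column ordering does not matter for the factor construction, but if you want the flattened row of an instance to follow exactly that layout (all time points of feature 1, then all of feature 2, and so on), you can transpose before reshaping:

import numpy as np

# reusing the (1000, 300, 20) `data` array built in the PCA section above
# feature-major flattening: all 300 values of feature 1, then feature 2, ..., then feature 20
feature_major = np.transpose(data, (0, 2, 1)).reshape(1000, 20*300)

# the plain reshape used below instead interleaves the 20 features time step by time step;
# both are valid PLS inputs, they only permute the columns
time_major = data.reshape(1000, 300*20)

print(feature_major.shape, time_major.shape)   # (1000, 6000) (1000, 6000)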

Now, implementing PLS in Python is pretty straightforward.

# your 1000 pandas instances
instances = [pd.DataFrame(data=np.random.normal(0, 1, (300, 20))) for _ in range(1000)]
# stack them into a single numpy array so the data are easier to process
data = np.array([d.values for d in instances])   # shape: (1000, 300, 20)

# reshape your data: one row per instance, P*N = 6000 columns
reshaped_data = data.reshape((1000, 20*300))

# y is your target variable of shape (T, 1) = (1000, 1); here a random placeholder
y = np.random.normal(0, 1, (1000, 1))

from sklearn.cross_decomposition import PLSRegression

n_comp = 10
pls_obj = PLSRegression(n_components=n_comp)
factorsPLS = pls_obj.fit_transform(reshaped_data, y)[0]
factorsPLS.shape
Out[]: (1000, n_comp)
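Since PLSRegression is a full regressor, the same fitted object also gives you in-sample predictions of y (a quick usage note, my addition; it assumes the pls_obj, reshaped_data and y defined above):

# fitted values of y from the PLS factors
y_hat = pls_obj.predict(reshaped_data)   # shape: (1000, 1)
print(factorsPLS.shape, y_hat.shape)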

What is PLS doing?

To make things easier to grasp, we can look at the three-pass regression filter (3PRF) of Kelly and Pruitt (see their working paper). They show that PLS is just a special case of the 3PRF:

(The three passes, roughly: first, time-series regressions of each predictor on the proxies Z to get loadings; second, cross-section regressions of the predictors on those loadings at each date to get the factors; third, a time-series regression of y on the estimated factors.)

Here Z represents a matrix of proxies. We don't have those, but luckily Kelly and Pruitt have shown that we can live without them: all we need to do is make sure the regressors (our features) are standardized and run the first two regressions without an intercept. Doing so, the proxies are selected automatically.
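On the standardization point: sklearn's PLSRegression already centers and scales X and y by default (scale=True), but if you prefer to standardize the features explicitly beforehand, here is a minimal sketch with StandardScaler (my addition, reusing reshaped_data and y from above):

from sklearn.preprocessing import StandardScaler
from sklearn.cross_decomposition import PLSRegression

# standardize each of the P*N columns to zero mean and unit variance
scaler = StandardScaler()
X_std = scaler.fit_transform(reshaped_data)

# fit PLS on the standardized features; scale=False because we already standardized
pls_std = PLSRegression(n_components=10, scale=False)
factors_std = pls_std.fit_transform(X_std, y)[0]   # shape: (1000, 10)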

So, in short, PLS allows you to:

  1. achieve further dimensionality reduction than PCA;
  2. account for both the cross-sectional variability among the features and the time series information of each series when creating the factors.
answered Oct 12 '22 by CAPSLOCK


First of all, I would recommend looking at this link to get a better understanding of PCA and time series data.

Please take into account that if you have 1000 pandas instances, your data needs to be converted into a numpy array so it can be processed more easily. You'd have something like the following:

# your 1000 pandas instances
instances = [pd.DataFrame(data=np.random.normal(0, 1, (300, 20))) for _ in range(1000)]
# stack them into a single numpy array so the data are easier to process
data = np.array([d.values for d in instances])   # shape: (1000, 300, 20)

That said, let's tackle two different solutions.

Simple Solution

The easiest solution is to ignore that you have time series and just concatenate the information to perform the PCA analysis on all instances at once:

import numpy as np
from sklearn.decomposition import PCA
data = np.random.randn(1000, 300, 20) # n_instances, n_steps, n_features
# combine the features and the time steps, then
# perform PCA on your 1000 instances
preprocessed = data.reshape((1000, 20*300))
pca = PCA(n_components=100)
pca.fit(preprocessed)
# test it on one sample
sample = pca.transform(preprocessed[0].reshape(1, -1))
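As a rough check of how much information the 100 components throw away (my addition, not in the original answer), you can reconstruct the flattened instances with inverse_transform, available on the fitted sklearn PCA object, and look at the error:

import numpy as np

# project all instances down to 100 components and back up to the original 6000 columns
reduced = pca.transform(preprocessed)            # shape: (1000, 100)
reconstructed = pca.inverse_transform(reduced)   # shape: (1000, 6000)

# mean squared reconstruction error over all instances
mse = np.mean((preprocessed - reconstructed) ** 2)
print(reduced.shape, round(float(mse), 4))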

Variation with Fourier Transform

Another solution could be to use a Fourier transform to try to extract more information from your time series.

import numpy as np
from sklearn.decomposition import PCA
data = np.random.randn(1000, 300, 20) # n_instances, n_steps, n_features
# perform a fast fourier transform along the time axis and keep the magnitude,
# since sklearn's PCA does not accept complex input
preprocessed_1 = np.abs(np.fft.fft(data, axis=1))
# combine the features and the time steps, then
# perform PCA on your 1000 instances
preprocessed_2 = preprocessed_1.reshape((1000, 20*300))
pca = PCA(n_components=100)
pca.fit(preprocessed_2)
# test it on one sample
pca.transform(preprocessed_2[0].reshape(1, -1))
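Since the input is real-valued, a variant you could consider (my addition, not part of the original answer) is np.fft.rfft, which keeps only the non-redundant half of the spectrum and roughly halves the number of columns fed to PCA:

import numpy as np
from sklearn.decomposition import PCA

data = np.random.randn(1000, 300, 20)          # n_instances, n_steps, n_features

# real FFT along the time axis: 300 real samples -> 151 frequency bins per feature
spectrum = np.abs(np.fft.rfft(data, axis=1))   # shape: (1000, 151, 20)

# flatten and run PCA exactly as before, just on fewer columns
flattened = spectrum.reshape((1000, 151*20))
pca_rfft = PCA(n_components=100)
reduced = pca_rfft.fit_transform(flattened)    # shape: (1000, 100)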

Note: be careful, in both cases I'm assuming that every time series has the same length.
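If you are not sure about that, here is a minimal sanity check before stacking the DataFrames (my addition; it assumes the `instances` list from the snippets above):

# every instance must have the same (n_steps, n_features) shape before np.array can stack them
shapes = {df.shape for df in instances}
assert len(shapes) == 1, f"instances have differing shapes: {shapes}"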

answered Oct 12 '22 by silgon