I want to check the stationarity of a time series saved in TS.csv. However, R's tseries::adf.test() and Python's statsmodels.tsa.stattools.adfuller() give completely different results: adf.test() says it's stationary (p < 0.05), while adfuller() says it's non-stationary (p > 0.05). Is there any problem in the following code? What's the right process to test the stationarity of a time series in R and Python? Thanks.
R code:
> rd <- read.table('Data/TS.csv', sep = ',', header = TRUE)
> inp <- ts(rd$Sales, frequency = 12, start = c(1965, 1))
> inp
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1965 154 96 73 49 36 59 95 169 210 278 298 245
1966 200 118 90 79 78 91 167 169 289 347 375 203
1967 223 104 107 85 75 99 135 211 335 460 488 326
1968 346 261 224 141 148 145 223 272 445 560 612 467
1969 518 404 300 210 196 186 247 343 464 680 711 610
1970 613 392 273 322 189 257 324 404 677 858 895 664
1971 628 308 324 248 272
> library(tseries)
> adf.test(inp)
Augmented Dickey-Fuller Test
data: inp
Dickey-Fuller = -7.2564, Lag order = 4, p-value = 0.01
alternative hypothesis: stationary
Python code (from Time_Series.ipynb):
import pandas as pd
from statsmodels.tsa.stattools import adfuller
df = pd.read_csv('Data/TS.csv')
ts = pd.Series(list(df['Sales']), index=pd.to_datetime(df['Month'],format='%Y-%m'))
s_test = adfuller(ts, autolag='AIC')
print("p value > 0.05 means data is non-stationary: ", s_test[1])
# output: p value > 0.05 means data is non-stationary: 0.988889420517
@gfgm gives an excellent explanation of why the R and Python results are different, and of how to make them the same by changing the parameters.
For the second question above, "What's the right process to test the stationarity of a time series in R and Python?", I'd like to provide some details:
When forecasting a time series, an ARIMA model needs the input series to be stationary. If the input isn't stationary, it should be log-transformed and/or differenced (log(), diff()) to make it stationary, and then fitted to the model.
So the problem is: should I treat the input as stationary (with R's default parameters) and fit it directly into the ARIMA model, or treat it as non-stationary (with Python's default parameters) and make it stationary with extra functions (like log() or diff()) first?
If the test statistic < critical value and p-value < 0.05, reject the null hypothesis (H0); i.e., the time series does not have a unit root, meaning it is stationary: it does not have a time-dependent structure.
Stationary Time Series
A time series is stationary if it has no trend or seasonal effects. Summary statistics calculated on the series, such as the mean or variance of the observations, are consistent over time.
A stationary process' distribution does not change over time. An intuitive example: you flip a coin. 50% heads, regardless of whether you flip it today or tomorrow or next year. A more complex example: by the efficient market hypothesis, excess stock returns should always fluctuate around zero.
Examples of non-stationary processes are random walk with or without a drift (a slow steady change) and deterministic trends (trends that are constant, positive, or negative, independent of time for the whole life of the series).
The results are different because the models being fit are slightly different and because the lag orders of the models are completely different. The Python test includes a constant 'drift' term (a constant is estimated, thus centering the time series at zero), but the R test includes both a constant and a linear trend term. This can be specified in the Python code with the argument regression = 'ct'.
R's adf.test() picks its lag order with
nlag = trunc((length(x)-1)^(1/3))
while Python's adfuller() defaults to a maximum lag of
12*(nobs/100)^(1/4)
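For this series (77 monthly observations, Jan 1965 through May 1971), the two defaults work out as follows (a sketch that just mirrors the formulas quoted above):

```python
import math

n = 77  # length of the Sales series

# tseries::adf.test's default lag order
r_lag = math.trunc((n - 1) ** (1 / 3))

# adfuller's default maximum lag (Schwert's rule of thumb)
py_maxlag = math.ceil(12 * (n / 100) ** 0.25)

print(r_lag, py_maxlag)  # -> 4 12
```

This matches the "Lag order = 4" in the R output above and the lag order of 12 used in the cross-check below.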
When you ran the Python code, you told the function to pick the optimal lag length by the AIC criterion. If we tell Python to run a centered and detrended model, and to use R's lag-length criterion, we get:
In [5]: adfuller(ts, regression="ct", maxlag = 4)[1]
Out[5]: 3.6892966741832268e-09
It's hard to see if this is identical to the R result, as R rounds its p-value to 0.01, but we can tell R to use Python's lag length, and Python to use R's model (I can't change the model in R with this function). We get:
adf.test(inp, k = ceiling(12*(length(inp)/100)^(1/4)))
Augmented Dickey-Fuller Test
data: inp
Dickey-Fuller = -2.0253, Lag order = 12, p-value = 0.5652
alternative hypothesis: stationary
And in python:
In [6]: adfuller(ts, regression="ct")[1]
Out[6]: 0.58756464088883864
Not perfect, but pretty close.
The actual Dickey-Fuller test-statistic for the python model is
In [8]: adfuller(ts, regression="ct")[0]
Out[8]: -2.025340637385288
which is identical to the R result. The tests probably use different ways of computing the p-value from the stat.
The p-values of the Augmented Dickey-Fuller test are rather sensitive to the choice of lag order. For example, here is the same test in R with a higher lag order:
> adf.test(rd$Sales, k=9)
Augmented Dickey-Fuller Test
data: rd$Sales
Dickey-Fuller = -2.9186, Lag order = 9,
p-value = 0.2004
alternative hypothesis: stationary
The documentation for adf.test says that it uses a regression with a constant and a linear trend. We should pass the parameter regression = 'ct' to adfuller to use the same regression method.
I'm having some trouble with statsmodels on my machine at the moment, but I suggest you try the following parameters and see if you get closer correspondence:
adfuller(a, maxlag=9, autolag=None, regression='ct')
What you want to look for is whether the two are showing the same test statistic because the p-values are determined differently between the two packages.