Convert date to float for linear regression on Pandas data frame

It seems that for OLS linear regression to work well in Pandas, the arguments must be floats. I'm starting with a csv (called "gameAct.csv") of the form:

date, city, players, sales
2014-04-28,London,111,1091.28
2014-04-29,London,100,1100.44
2014-04-28,Paris,87,1001.33
...

I want to perform linear regression of how sales depend on date (as time moves forward, how do sales move?). The problem with my code below seems to be with dates not being float values. I would appreciate help on how to resolve this indexing problem in Pandas.

My current code (it runs, but gives wrong results):

import pandas as pd
from pandas import DataFrame, Series
import statsmodels.formula.api as sm
df = pd.read_csv('gameAct.csv')
df.columns = ['date', 'city', 'players', 'sales']
city_data = df[df['city'] == 'London']
result = sm.ols(formula = 'sales ~ date', data = city_data).fit()

As I vary the city value, I get R^2 = 1 results, which is wrong. I have also tried passing index_col=0 and parse_dates=True when defining the dataframe df, but without success.

I suspect there is a better way to read in such csv files to perform basic regression over dates, and also for more general time series analysis. Help, examples, and resources are appreciated!

Note, with the above code, if I convert the dates index (for a given city) to an array, the values in this array are of the form:

'\xef\xbb\xbf2014-04-28'

How does one produce an AIC analysis over all of the non-sales parameters? (e.g. the result might be that sales depend most linearly on date and city).

asked Jul 05 '14 16:07

Quetzalcoatl

1 Answer

For this kind of regression, I usually convert the dates or timestamps to a number of days since the start of the data.

This does the trick nicely:

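import numpy as np
import pandas as pd
import statsmodels.formula.api as sm
df = pd.read_csv('test.csv')
df['date'] = pd.to_datetime(df['date'])
# Days elapsed since the earliest date in the data, as a float
df['date_delta'] = (df['date'] - df['date'].min()) / np.timedelta64(1, 'D')
city_data = df[df['city'] == 'London']
result = sm.ols(formula='sales ~ date_delta', data=city_data).fit()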

The advantage of this method is that you're sure of the units involved in the regression (days), whereas an automatic conversion may implicitly use other units, creating confusing coefficients in your linear model. It also allows you to combine data from multiple sales campaigns that started at different times into your regression (say you're interested in effectiveness of a campaign as a function of days into the campaign). You could also pick Jan 1st as your 0 if you're interested in measuring the day of year trend. Picking your own 0 date puts you in control of all that.
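To make the point about choosing your own zero date concrete, here is a minimal sketch using only the three sample rows from the question. The column names days_since_start and days_since_jan1 are made up for illustration, and np.polyfit stands in for the statsmodels OLS fit so the snippet needs only pandas and NumPy. It is not meant as the definitive way to run the regression.

```python
import io

import numpy as np
import pandas as pd

# The three sample rows from the question, with a clean header (no spaces)
csv_text = """date,city,players,sales
2014-04-28,London,111,1091.28
2014-04-29,London,100,1100.44
2014-04-28,Paris,87,1001.33
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

one_day = np.timedelta64(1, "D")
# Zero point = earliest date in the data
df["days_since_start"] = (df["date"] - df["date"].min()) / one_day
# Zero point = Jan 1 of the same year (an illustrative alternative origin)
df["days_since_jan1"] = (df["date"] - pd.Timestamp("2014-01-01")) / one_day

london = df[df["city"] == "London"]
# np.polyfit with degree 1 returns (slope, intercept); it is a stand-in for OLS here
slope_a, intercept_a = np.polyfit(london["days_since_start"], london["sales"], 1)
slope_b, intercept_b = np.polyfit(london["days_since_jan1"], london["sales"], 1)

print(slope_a, intercept_a)
print(slope_b, intercept_b)
```

Both fits should report the same slope (sales per day), since shifting the origin only moves the intercept. That is the sense in which picking the zero date changes what the intercept means but not the trend.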

statsmodels also appears to support pandas time series directly. You may be able to apply this to linear models as well: http://statsmodels.sourceforge.net/stable/examples/generated/ex_dates.html

Also, a quick note: You should be able to read column names directly out of the csv automatically as in the sample code I posted. In your example I see there are spaces after the commas in the first line of the csv file, resulting in column names like ' city'. Remove the spaces and automatic csv header reading should just work.
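If editing the file is not convenient, the same cleanup can probably be done at read time. The sketch below assumes the file begins with a UTF-8 byte order mark (the '\xef\xbb\xbf' prefix reported in the question is what a BOM looks like when left undecoded) and has spaces after the commas in its header. The in-memory bytes are a stand-in for the real file. encoding and skipinitialspace are standard read_csv parameters.

```python
import io

import pandas as pd

# Bytes mimicking the assumed file: a UTF-8 BOM, then a header with spaces after commas
raw = (
    b"\xef\xbb\xbfdate, city, players, sales\n"
    b"2014-04-28,London,111,1091.28\n"
    b"2014-04-29,London,100,1100.44\n"
)

df = pd.read_csv(
    io.BytesIO(raw),
    encoding="utf-8-sig",    # decodes the file and drops a leading BOM
    skipinitialspace=True,   # ignores spaces immediately after each delimiter
    parse_dates=["date"],    # parse the date column into datetimes while reading
)

print(df.columns.tolist())
print(df.dtypes)
```

With a real file, pass its path in place of io.BytesIO(raw). After this, df['date'] is already a datetime column, so the pd.to_datetime step becomes unnecessary.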

answered Oct 10 '22 23:10

Tom Q.

Tags: python, pandas, time-series
