Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Storage of timeseries data in python

I've a amazon price data for around 8.5k products from a period of Feb. 1, 2015 to Oct. 31, 2015. Currently, it is in the form of a dictionary with key as number of days from base date and value as the new price starting that day. Example, here the price is $10 from day 1 and changes to $15 on 45th day, and then changes to $9 on 173rd day and doesn't change after that.

{1:10,
 45:15,
 .
 .
 .
 173:9}

What is the best way to store such a timeseries for easy manipulation using python? I would like to perform a lot of aggregations and also would be querying a price on a particular date. Lastly, I would be performing some fixed effect regressions and am confused what would be best way to store this timeseries, so that my programming job becomes comparatively simpler. I could possibly store as table with 273 columns (each for a day) and rows corresponding to 8.5k products. I've been looking at pandas module which can help me do this, but is there a better way? Thanks!

like image 475
harshal.c Avatar asked Nov 09 '15 15:11

harshal.c


2 Answers

You could use a dict of dicts and convert that into a pandas dataframe and also use numpy to do calculations. Your first key would be product and the inner dict would be the one you already have but it won't print in the format you suggested but all you would need to do is transpose it so for a quick example

import pandas as pd

d = {'Product1': {1:10, 45:15, 173:9}, 'Product2': {1:11, 100:50, 173:10}}
df = pd.DataFrame(d).T
print df

          1    45   100  173
Product1   10   15  NaN    9
Product2   11  NaN   50   10
like image 77
SirParselot Avatar answered Sep 23 '22 11:09

SirParselot


8.5k products and 270+ days I would recommend dataframe this way,

price_dic = {1: 10, 2: 11, 3: 12, 5: 15}

df = pd.DataFrame({'days': pd.Series(price_dic.keys(),index=range(len(price_dic))),'price': pd.Series(price_dic.values(),index=range(len(price_dic)))})

df['prod_name'] = "Knote"

df
Out[80]: 
   days  price prod_name
0     1     10     Knote
1     2     11     Knote
2     3     12     Knote
3     5     15     Knote

df['Date'] = pd.to_datetime("Feb. 1, 2015") + pd.to_timedelta(df.days,'D')

df
Out[82]: 
   days  price prod_name       Date
0     1     10     Knote 2015-02-02
1     2     11     Knote 2015-02-03
2     3     12     Knote 2015-02-04
3     5     15     Knote 2015-02-06

Update:

Treversing list and getting final Dataframe with all content,

Lets say you have prod list, price list and start date list like below, we could do,

product_list = [1001,1002,1003]

y_dict = [{1: 10, 2: 11, 3: 12, 5: 15},
            {1: 10, 3: 11, 6: 12, 8: 15},
            {1: 90, 2: 100, 7: 120, 9: 100}]

start_dt_list = ['Feb 05 2015','Feb 01 2015','Feb 06 2015']

fdf = pd.DataFrame(columns =['P_ID','Date','Price','Days'])

Out[73]: 
Empty DataFrame
Columns: [P_ID, Date, Price, Days]
Index: []

for pid,j ,st_dt in zip(product_list, y_dict,start_dt_list):
    df = pd.DataFrame({'P_ID' : pd.Series([pid]*len(j)) ,
                    'Date' : pd.Series([pd.to_datetime(st_dt)]*len(j)),
                    'Price': pd.Series(j.values(),index=range(len(j))),
                    'Days': pd.Series(j.keys(),index=range(len(j)))
                    })
    fdf = fdf.append(df,ignore_index=True)


fdf.head(2)
Out[75]: 
        Date  Days  P_ID  Price
0 2015-02-05     1  1001     10
1 2015-02-05     2  1001     11

fdf['Date'] = fdf['Date'] + pd.to_timedelta(fdf.Days,'D')

fdf
Out[77]: 
         Date  Days  P_ID  Price
0  2015-02-06     1  1001     10
1  2015-02-07     2  1001     11
2  2015-02-08     3  1001     12
3  2015-02-10     5  1001     15
4  2015-02-09     8  1002     15
5  2015-02-02     1  1002     10
6  2015-02-04     3  1002     11
7  2015-02-07     6  1002     12
8  2015-02-07     1  1003     90
9  2015-02-08     2  1003    100
10 2015-02-15     9  1003    100
11 2015-02-13     7  1003    120
like image 27
WoodChopper Avatar answered Sep 21 '22 11:09

WoodChopper