I've a amazon price data for around 8.5k products from a period of Feb. 1, 2015 to Oct. 31, 2015. Currently, it is in the form of a dictionary with key as number of days from base date and value as the new price starting that day. Example, here the price is $10 from day 1 and changes to $15 on 45th day, and then changes to $9 on 173rd day and doesn't change after that.
{1:10,
45:15,
.
.
.
173:9}
What is the best way to store such a timeseries for easy manipulation using python? I would like to perform a lot of aggregations and also would be querying a price on a particular date. Lastly, I would be performing some fixed effect regressions and am confused what would be best way to store this timeseries, so that my programming job becomes comparatively simpler. I could possibly store as table with 273 columns (each for a day) and rows corresponding to 8.5k products. I've been looking at pandas module which can help me do this, but is there a better way? Thanks!
You could use a dict of dicts and convert that into a pandas dataframe and also use numpy to do calculations. Your first key would be product and the inner dict would be the one you already have but it won't print in the format you suggested but all you would need to do is transpose it so for a quick example
import pandas as pd
d = {'Product1': {1:10, 45:15, 173:9}, 'Product2': {1:11, 100:50, 173:10}}
df = pd.DataFrame(d).T
print df
1 45 100 173
Product1 10 15 NaN 9
Product2 11 NaN 50 10
8.5k products and 270+ days I would recommend dataframe this way,
price_dic = {1: 10, 2: 11, 3: 12, 5: 15}
df = pd.DataFrame({'days': pd.Series(price_dic.keys(),index=range(len(price_dic))),'price': pd.Series(price_dic.values(),index=range(len(price_dic)))})
df['prod_name'] = "Knote"
df
Out[80]:
days price prod_name
0 1 10 Knote
1 2 11 Knote
2 3 12 Knote
3 5 15 Knote
df['Date'] = pd.to_datetime("Feb. 1, 2015") + pd.to_timedelta(df.days,'D')
df
Out[82]:
days price prod_name Date
0 1 10 Knote 2015-02-02
1 2 11 Knote 2015-02-03
2 3 12 Knote 2015-02-04
3 5 15 Knote 2015-02-06
Update:
Treversing list and getting final Dataframe with all content,
Lets say you have prod list, price list and start date list like below, we could do,
product_list = [1001,1002,1003]
y_dict = [{1: 10, 2: 11, 3: 12, 5: 15},
{1: 10, 3: 11, 6: 12, 8: 15},
{1: 90, 2: 100, 7: 120, 9: 100}]
start_dt_list = ['Feb 05 2015','Feb 01 2015','Feb 06 2015']
fdf = pd.DataFrame(columns =['P_ID','Date','Price','Days'])
Out[73]:
Empty DataFrame
Columns: [P_ID, Date, Price, Days]
Index: []
for pid,j ,st_dt in zip(product_list, y_dict,start_dt_list):
df = pd.DataFrame({'P_ID' : pd.Series([pid]*len(j)) ,
'Date' : pd.Series([pd.to_datetime(st_dt)]*len(j)),
'Price': pd.Series(j.values(),index=range(len(j))),
'Days': pd.Series(j.keys(),index=range(len(j)))
})
fdf = fdf.append(df,ignore_index=True)
fdf.head(2)
Out[75]:
Date Days P_ID Price
0 2015-02-05 1 1001 10
1 2015-02-05 2 1001 11
fdf['Date'] = fdf['Date'] + pd.to_timedelta(fdf.Days,'D')
fdf
Out[77]:
Date Days P_ID Price
0 2015-02-06 1 1001 10
1 2015-02-07 2 1001 11
2 2015-02-08 3 1001 12
3 2015-02-10 5 1001 15
4 2015-02-09 8 1002 15
5 2015-02-02 1 1002 10
6 2015-02-04 3 1002 11
7 2015-02-07 6 1002 12
8 2015-02-07 1 1003 90
9 2015-02-08 2 1003 100
10 2015-02-15 9 1003 100
11 2015-02-13 7 1003 120
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With