Using data from the Zillow research data site, mainly at the city level: the data structure is 6 columns containing city-related information, while the remaining 245 columns contain the monthly sale prices. I've used the code below to display a sample of the data:
import pandas as pd
from tabulate import tabulate

df = pd.read_csv("City_Zhvi_AllHomes.csv")
c = df.columns.tolist()
# the 6 city-info columns, the first monthly column, and the last one
cols = c[:7]
cols.append(c[-1])
print(tabulate(df[cols].iloc[23:29], headers='keys', tablefmt='orgtbl'))
The above code will print a sample as shown below:
| | RegionID | RegionName | State | Metro | CountyName | SizeRank | 1996-04 | 2016-08 |
|----+------------+---------------+---------+---------------+--------------+------------+-----------+-----------|
| 23 | 5976 | Milwaukee | WI | Milwaukee | Milwaukee | 24 | 68100 | 99500 |
| 24 | 7481 | Tucson | AZ | Tucson | Pima | 25 | 91500 | 153000 |
| 25 | 13373 | Portland | OR | Portland | Multnomah | 26 | 121100 | 390500 |
| 26 | 33225 | Oklahoma City | OK | Oklahoma City | Oklahoma | 27 | 64900 | 130500 |
| 27 | 40152 | Omaha | NE | Omaha | Douglas | 28 | 88900 | 143800 |
| 28 | 23429 | Albuquerque | NM | Albuquerque | Bernalillo | 29 | 115400 | 172000 |
Part of df is a time series; the trick here is to separate the time-dependent columns from the rest and use pandas' to_datetime and resample. Suppose we are only interested in summarising the sales for the years 1998-2000. That will enable us to select the columns:
# separate the time columns and convert their names to datetime
tdf = df[df.columns[6:]].rename(columns=pd.to_datetime)
# find the columns in the period 1998-2000
cols = tdf.columns
sel_cols = cols[(cols > '1997-12-31') & (cols < '2000')]
# select the columns, resample on the columns axis (axis=1),
# calculate the mean, and rename the columns the way we like
mdf = tdf[sel_cols].resample('6M', axis=1).mean().rename(
    columns=lambda x: '{:}${:}'.format(x.year, [1, 2][x.quarter > 2]))
# reattach non-time columns
mdf[df.columns[:6]] = df[df.columns[:6]]
print(tabulate(mdf[mdf.columns[0:9]].iloc[23:29],
               headers='keys', tablefmt='orgtbl'))
The above code will print a sample as shown below:
| | 1998$1 | 1998$2 | 1999$1 | 1999$2 | 2000$1 | RegionID | RegionName | State | Metro |
|----+----------+----------+----------+----------+----------+------------+---------------+---------+---------------|
| 23 | 71900 | 72483.3 | 72616.7 | 74266.7 | 75920 | 5976 | Milwaukee | WI | Milwaukee |
| 24 | 94200 | 95133.3 | 96533.3 | 99100 | 100600 | 7481 | Tucson | AZ | Tucson |
| 25 | 139000 | 141900 | 145233 | 148900 | 151980 | 13373 | Portland | OR | Portland |
| 26 | 68500 | 69616.7 | 72016.7 | 73616.7 | 74900 | 33225 | Oklahoma City | OK | Oklahoma City |
| 27 | 98200 | 99250 | 103367 | 109083 | 112160 | 40152 | Omaha | NE | Omaha |
| 28 | 121000 | 122050 | 122833 | 123633 | 124420 | 23429 | Albuquerque | NM | Albuquerque |
The question is: why does the last column of the resample result carry the year "2000", despite the selection using < '2000'?
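To make the behaviour easy to reproduce, here is a minimal sketch on a toy monthly series (my own illustration, not the Zillow data); the last bin label lands on 2000-01-31:

import pandas as pd
# toy monthly series covering the same span as the selected columns
idx = pd.date_range('1998-01-01', '1999-12-01', freq='MS')
s = pd.Series(range(len(idx)), index=idx)
print(s.resample('6M').mean().index)
# DatetimeIndex(['1998-01-31', '1998-07-31', '1999-01-31',
#                '1999-07-31', '2000-01-31'], dtype='datetime64[ns]', freq='6M')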
EDIT: Just for fun, I include a more "pandorable" method of doing the above:
import pandas as pd

housing = (pd.read_csv('City_Zhvi_AllHomes.csv', index_col=list(range(6)))
           .filter(regex='199[8-9]-[0-1][0-9]')
           .rename(columns=pd.to_datetime)
           .resample('2Q', closed='left', axis=1)
           .mean()
           .rename(columns=lambda x: str(x.to_period('2Q'))
                   .replace('Q', '$').replace('2', '1').replace('4', '2'))
           .reset_index())
This delivers the desired outcome; the printout of housing.iloc[23:27, 4:] is shown below (the string trick in the final rename is sketched after the table):
| | CountyName | SizeRank | 1998$1 | 1998$2 | 1999$1 | 1999$2 |
|----+--------------+------------+----------+----------+----------+----------|
| 23 | Milwaukee | 24 | 72366.7 | 72583.3 | 73916.7 | 75750 |
| 24 | Pima | 25 | 94883.3 | 96183.3 | 98783.3 | 100450 |
| 25 | Multnomah | 26 | 141167 | 144733 | 148183 | 151767 |
| 26 | Oklahoma | 27 | 69300 | 71550 | 73466.7 | 74766.7 |
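The final rename leans on the string form of the 2Q period labels; assuming they come out in '1998Q2'/'1998Q4' style (end quarter shown), the chained replaces map them to half-year labels. A small sketch of just the string step, with assumed input strings:

# assumed 2Q-period strings; the replaces turn Q2 -> $1 and Q4 -> $2
for p in ['1998Q2', '1998Q4', '1999Q2', '1999Q4']:
    print(p, '->', p.replace('Q', '$').replace('2', '1').replace('4', '2'))
# 1998Q2 -> 1998$1
# 1998Q4 -> 1998$2
# 1999Q2 -> 1999$1
# 1999Q4 -> 1999$2

Note the digit replaces only behave because '1998' and '1999' contain no 2s or 4s; a year like 2000 or 2004 would be mangled.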
Consider using the closed argument of pandas' resample, which decides which side of the bin interval is closed. Below uses closed='left', where the 6-month cut-offs fall on 6/30 and 12/31 instead of the 1/1 and 7/1 cut-offs that yielded the 2000 value:
mdf = tdf[sel_cols].T.resample('6M', closed='left').mean().T.rename(
columns=lambda x: '{:}${:}'.format(x.year, [1, 2][x.quarter > 2]))
mdf[df.columns[:6]] = df[df.columns[:6]]
print(mdf.head())
# 1998$1 1998$2 1999$1 1999$2 RegionID RegionName State Metro CountyName SizeRank
# 0 NaN NaN NaN NaN 6181 New York NY New York Queens 1
# 1 169183.333333 179166.666667 189116.666667 198466.666667 12447 Los Angeles CA Los Angeles-Long Beach-Anaheim Los Angeles 2
# 2 117700.000000 121666.666667 125550.000000 133000.000000 17426 Chicago IL Chicago Cook 3
# 3 50550.000000 50650.000000 51150.000000 51866.666667 13271 Philadelphia PA Philadelphia Philadelphia 4
# 4 97583.333333 101083.333333 104816.666667 108566.666667 40326 Phoenix AZ Phoenix Maricopa 5
print(mdf[mdf['Metro'].isin(['Milwaukee', 'Tucson', 'Portland', 'Oklahoma City', 'Omaha', 'Albuquerque'])].head())
# 1998$1 1998$2 1999$1 1999$2 RegionID RegionName State Metro CountyName SizeRank
# 23 72366.666667 72583.333333 73916.666667 75750.000000 5976 Milwaukee WI Milwaukee Milwaukee 24
# 24 94883.333333 96183.333333 98783.333333 100450.000000 7481 Tucson AZ Tucson Pima 25
# 25 141166.666667 144733.333333 148183.333333 151766.666667 13373 Portland OR Portland Multnomah 26
# 26 98950.000000 102450.000000 108016.666667 112116.666667 40152 Omaha NE Omaha Douglas 27
# 27 121816.666667 122666.666667 123550.000000 124333.333333 23429 Albuquerque NM Albuquerque Bernalillo 28
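For contrast, the same toy monthly series as in the reduced reproduction in the question (my own illustration) shows how closed='left' shifts the bin edges so the last label stays inside 1999:

import pandas as pd
# toy monthly series for 1998-01 .. 1999-12
idx = pd.date_range('1998-01-01', '1999-12-01', freq='MS')
s = pd.Series(range(len(idx)), index=idx)
print(s.resample('6M', closed='left').mean().index)
# DatetimeIndex(['1998-06-30', '1998-12-31', '1999-06-30', '1999-12-31'],
#               dtype='datetime64[ns]', freq='6M')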
By the way, consider reshaping the data with melt to long format, aggregating by half year, then using pivot_table to go back to wide format. Admittedly, performance is lower here, but it is arguably more readable (the Half_Year string concatenation is the main bottleneck), and you do get a long-form dataset for other aggregations and/or modeling:
import pandas as pd
import numpy as np
# MELT (WIDE --> LONG)
idcols = ['RegionID', 'RegionName', 'State', 'Metro']
mdf = pd.melt(df, id_vars=idcols + ['CountyName', 'SizeRank'], var_name='Year_Month', value_name='Sale_Amt').reset_index()
# CALCULATE HALF_YEAR STRING
mdf['Year_Month'] = pd.to_datetime(mdf['Year_Month'])
mdf['Half_Year'] = mdf['Year_Month'].dt.year.astype(str) + '$' + np.where(mdf['Year_Month'].dt.month <= 6, 1, 2).astype(str)
# FILTER DATASET BY DATE INTERVAL
mdf = mdf[mdf['Year_Month'].between('1998-01-01', '1999-12-31')]
# GROUP BY AGGREGATION OF HOUSE SALES
mdf = mdf.groupby(idcols + ['Half_Year'])['Sale_Amt'].mean().reset_index()
# PIVOT (LONG --> WIDE)
pvtdf = mdf.pivot_table(index=idcols, columns='Half_Year', values='Sale_Amt', aggfunc=sum).reset_index()
Output
metros = ['Milwaukee', 'Tucson', 'Portland', 'Oklahoma City', 'Omaha', 'Albuquerque']
print(pvtdf[(pvtdf['RegionName'].isin(metros)) & (pvtdf['Metro'].isin(metros))])
# Half_Year RegionID RegionName State Metro 1998$1 1998$2 1999$1 1999$2
# 430 5976 Milwaukee WI Milwaukee 72366.666667 72583.333333 73916.666667 75750.000000
# 680 7481 Tucson AZ Tucson 94883.333333 96183.333333 98783.333333 100450.000000
# 1584 13373 Portland OR Portland 141166.666667 144733.333333 148183.333333 151766.666667
# 2923 23429 Albuquerque NM Albuquerque 121816.666667 122666.666667 123550.000000 124333.333333
# 5473 40152 Omaha NE Omaha 98950.000000 102450.000000 108016.666667 112116.666667
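As a small illustration of that long-form payoff (my own sketch, assuming df and idcols from the code above and rebuilding the melted frame from the MELT step before the Half_Year groupby), any other aggregation is a single groupby away, e.g. a yearly mean per region:

# rebuild the long-form frame (same melt as above, before aggregation)
long_df = pd.melt(df, id_vars=idcols + ['CountyName', 'SizeRank'],
                  var_name='Year_Month', value_name='Sale_Amt')
long_df['Year_Month'] = pd.to_datetime(long_df['Year_Month'])
# hypothetical extra aggregation: mean sale price per region per year
yearly = (long_df.groupby(idcols + [long_df['Year_Month'].dt.year.rename('Year')])
                 ['Sale_Amt'].mean()
                 .reset_index())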
You can use groupby with the new column names in a custom format, because if you use resample by 6M it returns 2000 (it looks like a bug):
c = df.columns.tolist()
cols = c[:7]
cols.append(c[-1])
print (cols)
['RegionID', 'RegionName', 'State', 'Metro', 'CountyName', 'SizeRank', '1996-04', '2016-11']
# set all non-date columns as the index
tdf = df.set_index(df.columns.tolist()[:6])
tdf.columns = pd.to_datetime(tdf.columns).to_period('M')
# find the columns in the period 1998-2000
cols = tdf.columns
sel_cols = cols[(cols > '1997-12') & (cols < '2000')]
print (sel_cols)
PeriodIndex(['1998-01', '1998-02', '1998-03', '1998-04', '1998-05', '1998-06',
'1998-07', '1998-08', '1998-09', '1998-10', '1998-11', '1998-12',
'1999-01', '1999-02', '1999-03', '1999-04', '1999-05', '1999-06',
'1999-07', '1999-08', '1999-09', '1999-10', '1999-11', '1999-12'],
dtype='period[M]', freq='M')
# change the column name format
a = tdf[sel_cols].rename(columns=lambda x: '{:}${:}'.format(x.year, [1, 2][x.quarter > 2]))
print (a.columns)
Index(['1998$1', '1998$1', '1998$1', '1998$1', '1998$1', '1998$1', '1998$2',
'1998$2', '1998$2', '1998$2', '1998$2', '1998$2', '1999$1', '1999$1',
'1999$1', '1999$1', '1999$1', '1999$1', '1999$2', '1999$2', '1999$2',
'1999$2', '1999$2', '1999$2'],
dtype='object')
# group by the new column names; duplicate labels are averaged together
mdf = a.groupby(a.columns, axis=1).mean().reset_index()
print (mdf.head())
RegionID RegionName State Metro CountyName \
0 6181 New York NY New York Queens
1 12447 Los Angeles CA Los Angeles-Long Beach-Anaheim Los Angeles
2 17426 Chicago IL Chicago Cook
3 13271 Philadelphia PA Philadelphia Philadelphia
4 40326 Phoenix AZ Phoenix Maricopa
SizeRank 1998$1 1998$2 1999$1 1999$2
0 1 NaN NaN NaN NaN
1 2 169183.333333 179166.666667 189116.666667 198466.666667
2 3 117700.000000 121666.666667 125550.000000 133000.000000
3 4 50550.000000 50650.000000 51150.000000 51866.666667
4 5 97583.333333 101083.333333 104816.666667 108566.666667
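To make the mechanism explicit, here is a toy sketch (my own, not from the data above): on any frame with duplicate column labels, groupby on the labels with axis=1 averages the columns sharing a label.

import pandas as pd
# two columns per label; groupby(axis=1).mean() averages each pair
toy = pd.DataFrame([[1, 3, 10, 30]],
                   columns=['1998$1', '1998$1', '1998$2', '1998$2'])
print(toy.groupby(toy.columns, axis=1).mean())
#    1998$1  1998$2
# 0     2.0    20.0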