Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Compute daily climatology using pandas python

I am trying to use pandas to compute daily climatology. My code is:

import pandas as pd

dates      = pd.date_range('1950-01-01', '1953-12-31', freq='D')
rand_data  = [int(1000*random.random()) for i in xrange(len(dates))]
cum_data   = pd.Series(rand_data, index=dates)
cum_data.to_csv('test.csv', sep="\t")

cum_data is the data frame containing daily dates from 1st Jan 1950 to 31st Dec 1953. I want to create a new vector of length 365 with the first element containing the average of rand_data for January 1st for 1950, 1951, 1952 and 1953. And so on for the second element...

Any suggestions how I can do this using pandas?

like image 415
user308827 Avatar asked Jan 07 '14 13:01

user308827


3 Answers

You can groupby the day of the year, and the calculate the mean for these groups:

cum_data.groupby(cum_data.index.dayofyear).mean()

However, you have the be aware of leap years. This will cause problems with this approach. As alternative, you can also group by the month and the day:

In [13]: cum_data.groupby([cum_data.index.month, cum_data.index.day]).mean()
Out[13]:
1  1     462.25
   2     631.00
   3     615.50
   4     496.00
...
12  28    378.25
    29    427.75
    30    528.50
    31    678.50
Length: 366, dtype: float64
like image 66
joris Avatar answered Nov 14 '22 03:11

joris


Hoping it can be of any help, I want to post my solution to get a climatology series with the same index and length of the original time series.

I use joris' solution to get a "model climatology" of 365/366 elements, then I build my desired series taking values from this model climatology and time index from my original time series. This way, things like leap years are automatically taken care of.

#I start with my time series named 'serData'.
#I apply joris' solution to it, getting a 'model climatology' of length 365 or 366.
serClimModel = serData.groupby([serData.index.month, serData.index.day]).mean()

#Now I build the climatology series, taking values from serClimModel depending on the index of serData.
serClimatology = serClimModel[zip(serData.index.month, serData.index.day)]

#Now serClimatology has a time index like this: [1,1] ... [12,31].
#So, as a final step, I take as time index the one of serData.
serClimatology.index = serData.index
like image 34
DarioZapp Avatar answered Nov 14 '22 03:11

DarioZapp


@joris. Thanks. Your answer was just what I needed to use pandas to calculate daily climatologies, but you stopped short of the final step. Re-mapping the month,day index back to an index of day of the year for all years, including leap years, i.e. 1 thru 366. So I thought I'd share my solution for other users. 1950 thru 1953 is 4 years with one leap year, 1952. Note since random values are used each run will give different results.

...   
from datetime import date
doy = []
doy_mean = []
doy_size = []
for name, group in cum_data.groupby([cum_data.index.month, cum_data.index.day]):
  (mo, dy) = name
  # Note: can use any leap year here.
  yrday = (date(1952, mo, dy)).timetuple().tm_yday
  doy.append(yrday)
  doy_mean.append(group.mean())
  doy_size.append(group.count())
  # Note: useful climatology stats are also available via group.describe() returned as dict
  #desc = group.describe()
  # desc["mean"], desc["min"], desc["max"], std,quartiles, etc.

# we lose the counts here.
new_cum_data  = pd.Series(doy_mean, index=doy)
print new_cum_data.ix[366]
>> 634.5

pd_dict = {}
pd_dict["mean"] = doy_mean
pd_dict["size"] = doy_size
cum_data_df = pd.DataFrame(data=pd_dict, index=doy)

print cum_data_df.ix[366]
>> mean    634.5
>> size      4.0
>> Name: 366, dtype: float64
# and just to check Feb 29
print cum_data_df.ix[60]
>> mean    343
>> size      1
>> Name: 60, dtype: float64
like image 28
Eric Bridger Avatar answered Nov 14 '22 05:11

Eric Bridger