Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using Python's Pandas to find average values by bins

Tags:

python

pandas

bin

I just started using pandas to analyze groundwater well data over time.

My data in a text file looks like (site_no, date, well_level):

485438103132901 19800417    -7.1

485438103132901 19800506    -6.8

483622101085001 19790910    -6.7

485438103132901 19790731    -6.2

483845101112801 19801111    -5.37

484123101124601 19801111    -5.3

485438103132901 19770706    -4.98

I would like an output with average well levels binned by 5 year increments and with a count:

site_no   avg 1960-end1964  count    avg 1965-end1969  count    avg 1970-end1974 count

I am reading in the data with:

names = ['site_no','date','wtr_lvl']
df = pd.read_csv('D:\info.txt', sep='\t',names=names)

I can find the overall average by site with:

avg = df.groupby(['site_no'])['wtr_lvl'].mean().reset_index()

My crude bin attempts use:

a1 = df[df.date > 19600000]
a2 = a1[a1.date < 19650000]
avga2 = a2.groupby(['site_no'])['wtr_lvl'].mean()

My question: how can I join the results to display as desired? I tried merge, join, and append, but they do not allow for empty data frames (which happens). Also, I am sure there is a simple way to bin the data by the dates. Thanks.

like image 964
user3830166 Avatar asked Jul 11 '14 16:07

user3830166


People also ask

How do you find average value in pandas?

To get column average or mean from pandas DataFrame use either mean() and describe() method. The DataFrame. mean() method is used to return the mean of the values for the requested axis.

What does bins mean in pandas?

Binning also known as bucketing or discretization is a common data pre-processing technique used to group intervals of continuous data into “bins” or “buckets”.

How do you find the average of a series in Python?

To find the average of the numbers in a list in Python, we have multiple ways. The two main ways are using the Len() and Sum() in-built function and using the mean() function from the statistics module.

How do you find average and standard deviation in pandas?

In pandas, the std() function is used to find the standard Deviation of the series. The mean can be simply defined as the average of numbers. In pandas, the mean() function is used to find the mean of the series.


2 Answers

The most concise way is probably to convert this to a timeseris data and them downsample to get the means:

In [75]:

print df
                         ID  Level
1                                 
1980-04-17  485438103132901  -7.10
1980-05-06  485438103132901  -6.80
1979-09-10  483622101085001  -6.70
1979-07-31  485438103132901  -6.20
1980-11-11  483845101112801  -5.37
1980-11-11  484123101124601  -5.30
1977-07-06  485438103132901  -4.98
In [76]:

df.Level.resample('60M', how='mean') 
#also may consider different time alias: '5A', '5BA', '5AS', etc:
#see: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
Out[76]:
1
1977-07-31   -4.980
1982-07-31   -6.245
Freq: 60M, Name: Level, dtype: float64

Alternatively, you may use groupby together with cut:

In [99]:

print df.groupby(pd.cut(df.index.year, pd.date_range('1960', periods=5, freq='5A').year, include_lowest=True)).mean()
                        ID     Level
[1960, 1965]           NaN       NaN
(1965, 1970]           NaN       NaN
(1970, 1975]           NaN       NaN
(1975, 1980]  4.847632e+14 -6.064286

And by ID also:

In [100]:

print df.groupby(['ID', 
                  pd.cut(df.index.year, pd.date_range('1960', periods=5, freq='5A').year, include_lowest=True)]).mean()
                              Level
ID                                 
483622101085001 (1975, 1980]  -6.70
483845101112801 (1975, 1980]  -5.37
484123101124601 (1975, 1980]  -5.30
485438103132901 (1975, 1980]  -6.27
like image 194
CT Zhu Avatar answered Sep 28 '22 08:09

CT Zhu


so what i like to do is create a separate column with the rounded bin number:

    bin_width = 50000
    mult = 1. / bin_width
    df['bin'] = np.floor(ser * mult + .5) / mult

then, just group by the bins themselves

    df.groupby('bin').mean()

another note, you can do multiple truth evaluations in one go:

    df[(df.date > a) & (df.date < b)]
like image 31
acushner Avatar answered Sep 28 '22 08:09

acushner