I just started using pandas to analyze groundwater well data over time. My data in a text file looks like (site_no, date, well_level): <pre class="prettyprint"><code>485438103132901 19800417 -7.1 485438103132901 19800506 -6.8 483622101085001 19790910 -6.7 485438103132901 19790731 -6.2 483845101112801 19801111 -5.37 484123101124601 19801111 -5.3 485438103132901 19770706 -4.98 </code></pre> I would like an output with average well levels binned by 5 year increments and with a count: <pre class="prettyprint"><code>site_no avg 1960-end1964 count avg 1965-end1969 count avg 1970-end1974 count </code></pre> I am reading in the data with: <pre class="prettyprint"><code>names = ['site_no','date','wtr_lvl'] df = pd.read_csv('D:\info.txt', sep='\t',names=names) </code></pre> I can find the overall average by site with: <pre class="prettyprint"><code>avg = df.groupby(['site_no'])['wtr_lvl'].mean().reset_index() </code></pre> My crude bin attempts use: <pre class="prettyprint"><code>a1 = df[df.date > 19600000] a2 = a1[a1.date < 19650000] avga2 = a2.groupby(['site_no'])['wtr_lvl'].mean() </code></pre> My question: how can I join the results to display as desired? I tried merge, join, and append, but they do not allow for empty data frames (which happens). Also, I am sure there is a simple way to bin the data by the dates. Thanks.

The most concise way is probably to convert this to a <code>timeseris</code> data and them downsample to get the means: <pre class="prettyprint"><code>In [75]: print df ID Level 1 1980-04-17 485438103132901 -7.10 1980-05-06 485438103132901 -6.80 1979-09-10 483622101085001 -6.70 1979-07-31 485438103132901 -6.20 1980-11-11 483845101112801 -5.37 1980-11-11 484123101124601 -5.30 1977-07-06 485438103132901 -4.98 In [76]: df.Level.resample('60M', how='mean') #also may consider different time alias: '5A', '5BA', '5AS', etc: #see: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases Out[76]: 1 1977-07-31 -4.980 1982-07-31 -6.245 Freq: 60M, Name: Level, dtype: float64 </code></pre> Alternatively, you may use <code>groupby</code> together with <code>cut</code>: <pre class="prettyprint"><code>In [99]: print df.groupby(pd.cut(df.index.year, pd.date_range('1960', periods=5, freq='5A').year, include_lowest=True)).mean() ID Level [1960, 1965] NaN NaN (1965, 1970] NaN NaN (1970, 1975] NaN NaN (1975, 1980] 4.847632e+14 -6.064286 </code></pre> And by ID also: <pre class="prettyprint"><code>In [100]: print df.groupby(['ID', pd.cut(df.index.year, pd.date_range('1960', periods=5, freq='5A').year, include_lowest=True)]).mean() Level ID 483622101085001 (1975, 1980] -6.70 483845101112801 (1975, 1980] -5.37 484123101124601 (1975, 1980] -5.30 485438103132901 (1975, 1980] -6.27 </code></pre>

so what i like to do is create a separate column with the rounded bin number: <pre class="prettyprint"><code> bin_width = 50000 mult = 1. / bin_width df['bin'] = np.floor(ser * mult + .5) / mult </code></pre> then, just group by the bins themselves <pre class="prettyprint"><code> df.groupby('bin').mean() </code></pre> another note, you can do multiple truth evaluations in one go: <pre class="prettyprint"><code> df[(df.date > a) & (df.date < b)] </code></pre>

Using Python's Pandas to find average values by bins

I just started using pandas to analyze groundwater well data over time.

My data in a text file looks like (site_no, date, well_level):

485438103132901 19800417    -7.1

485438103132901 19800506    -6.8

483622101085001 19790910    -6.7

485438103132901 19790731    -6.2

483845101112801 19801111    -5.37

484123101124601 19801111    -5.3

485438103132901 19770706    -4.98

I would like an output with average well levels binned by 5 year increments and with a count:

site_no   avg 1960-end1964  count    avg 1965-end1969  count    avg 1970-end1974 count

I am reading in the data with:

names = ['site_no','date','wtr_lvl']
df = pd.read_csv('D:\info.txt', sep='\t',names=names)

I can find the overall average by site with:

avg = df.groupby(['site_no'])['wtr_lvl'].mean().reset_index()

My crude bin attempts use:

a1 = df[df.date > 19600000]
a2 = a1[a1.date < 19650000]
avga2 = a2.groupby(['site_no'])['wtr_lvl'].mean()

My question: how can I join the results to display as desired? I tried merge, join, and append, but they do not allow for empty data frames (which happens). Also, I am sure there is a simple way to bin the data by the dates. Thanks.

How do you find average value in pandas?

To get column average or mean from pandas DataFrame use either mean() and describe() method. The DataFrame. mean() method is used to return the mean of the values for the requested axis.

What does bins mean in pandas?

Binning also known as bucketing or discretization is a common data pre-processing technique used to group intervals of continuous data into “bins” or “buckets”.

How do you find the average of a series in Python?

To find the average of the numbers in a list in Python, we have multiple ways. The two main ways are using the Len() and Sum() in-built function and using the mean() function from the statistics module.

How do you find average and standard deviation in pandas?

In pandas, the std() function is used to find the standard Deviation of the series. The mean can be simply defined as the average of numbers. In pandas, the mean() function is used to find the mean of the series.

The most concise way is probably to convert this to a timeseris data and them downsample to get the means:

In [75]:

print df
                         ID  Level
1                                 
1980-04-17  485438103132901  -7.10
1980-05-06  485438103132901  -6.80
1979-09-10  483622101085001  -6.70
1979-07-31  485438103132901  -6.20
1980-11-11  483845101112801  -5.37
1980-11-11  484123101124601  -5.30
1977-07-06  485438103132901  -4.98
In [76]:

df.Level.resample('60M', how='mean') 
#also may consider different time alias: '5A', '5BA', '5AS', etc:
#see: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
Out[76]:
1
1977-07-31   -4.980
1982-07-31   -6.245
Freq: 60M, Name: Level, dtype: float64

Alternatively, you may use groupby together with cut:

In [99]:

print df.groupby(pd.cut(df.index.year, pd.date_range('1960', periods=5, freq='5A').year, include_lowest=True)).mean()
                        ID     Level
[1960, 1965]           NaN       NaN
(1965, 1970]           NaN       NaN
(1970, 1975]           NaN       NaN
(1975, 1980]  4.847632e+14 -6.064286

And by ID also:

In [100]:

print df.groupby(['ID', 
                  pd.cut(df.index.year, pd.date_range('1960', periods=5, freq='5A').year, include_lowest=True)]).mean()
                              Level
ID                                 
483622101085001 (1975, 1980]  -6.70
483845101112801 (1975, 1980]  -5.37
484123101124601 (1975, 1980]  -5.30
485438103132901 (1975, 1980]  -6.27

so what i like to do is create a separate column with the rounded bin number:

    bin_width = 50000
    mult = 1. / bin_width
    df['bin'] = np.floor(ser * mult + .5) / mult

then, just group by the bins themselves

    df.groupby('bin').mean()

another note, you can do multiple truth evaluations in one go:

    df[(df.date > a) & (df.date < b)]

Using Python's Pandas to find average values by bins

Tags:

python

pandas

bin

user3830166

People also ask

2 Answers

CT Zhu

acushner

Recent Activity

Donate For Us

Using Python's Pandas to find average values by bins

Tags:

python

pandas

bin

user3830166

People also ask

2 Answers

CT Zhu

acushner

Related questions

Recent Activity

Donate For Us