I just started using pandas to analyze groundwater well data over time.
My data in a text file looks like (site_no, date, well_level):
485438103132901 19800417 -7.1
485438103132901 19800506 -6.8
483622101085001 19790910 -6.7
485438103132901 19790731 -6.2
483845101112801 19801111 -5.37
484123101124601 19801111 -5.3
485438103132901 19770706 -4.98
I would like an output with average well levels binned by 5 year increments and with a count:
site_no avg 1960-end1964 count avg 1965-end1969 count avg 1970-end1974 count
I am reading in the data with:
names = ['site_no','date','wtr_lvl']
df = pd.read_csv('D:\info.txt', sep='\t',names=names)
I can find the overall average by site with:
avg = df.groupby(['site_no'])['wtr_lvl'].mean().reset_index()
My crude bin attempts use:
a1 = df[df.date > 19600000]
a2 = a1[a1.date < 19650000]
avga2 = a2.groupby(['site_no'])['wtr_lvl'].mean()
My question: how can I join the results to display as desired? I tried merge, join, and append, but they do not allow for empty data frames (which happens). Also, I am sure there is a simple way to bin the data by the dates. Thanks.
To get column average or mean from pandas DataFrame use either mean() and describe() method. The DataFrame. mean() method is used to return the mean of the values for the requested axis.
Binning also known as bucketing or discretization is a common data pre-processing technique used to group intervals of continuous data into “bins” or “buckets”.
To find the average of the numbers in a list in Python, we have multiple ways. The two main ways are using the Len() and Sum() in-built function and using the mean() function from the statistics module.
In pandas, the std() function is used to find the standard Deviation of the series. The mean can be simply defined as the average of numbers. In pandas, the mean() function is used to find the mean of the series.
The most concise way is probably to convert this to a timeseris
data and them downsample to get the means:
In [75]:
print df
ID Level
1
1980-04-17 485438103132901 -7.10
1980-05-06 485438103132901 -6.80
1979-09-10 483622101085001 -6.70
1979-07-31 485438103132901 -6.20
1980-11-11 483845101112801 -5.37
1980-11-11 484123101124601 -5.30
1977-07-06 485438103132901 -4.98
In [76]:
df.Level.resample('60M', how='mean')
#also may consider different time alias: '5A', '5BA', '5AS', etc:
#see: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
Out[76]:
1
1977-07-31 -4.980
1982-07-31 -6.245
Freq: 60M, Name: Level, dtype: float64
Alternatively, you may use groupby
together with cut
:
In [99]:
print df.groupby(pd.cut(df.index.year, pd.date_range('1960', periods=5, freq='5A').year, include_lowest=True)).mean()
ID Level
[1960, 1965] NaN NaN
(1965, 1970] NaN NaN
(1970, 1975] NaN NaN
(1975, 1980] 4.847632e+14 -6.064286
And by ID also:
In [100]:
print df.groupby(['ID',
pd.cut(df.index.year, pd.date_range('1960', periods=5, freq='5A').year, include_lowest=True)]).mean()
Level
ID
483622101085001 (1975, 1980] -6.70
483845101112801 (1975, 1980] -5.37
484123101124601 (1975, 1980] -5.30
485438103132901 (1975, 1980] -6.27
so what i like to do is create a separate column with the rounded bin number:
bin_width = 50000
mult = 1. / bin_width
df['bin'] = np.floor(ser * mult + .5) / mult
then, just group by the bins themselves
df.groupby('bin').mean()
another note, you can do multiple truth evaluations in one go:
df[(df.date > a) & (df.date < b)]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With