I have a CSV file which looks like this:
Date,Sentiment
2014-01-03,0.4
2014-01-04,-0.03
2014-01-09,0.0
2014-01-10,0.07
2014-01-12,0.0
2014-02-24,0.0
2014-02-25,0.0
2014-02-25,0.0
2014-02-26,0.0
2014-02-28,0.0
2014-03-01,0.1
2014-03-02,-0.5
2014-03-03,0.0
2014-03-08,-0.06
2014-03-11,-0.13
2014-03-22,0.0
2014-03-23,0.33
2014-03-23,0.3
2014-03-25,-0.14
2014-03-28,-0.25
etc
My goal is to aggregate the dates by month and calculate the average for each month. The data might not start on the 1st of a month or in January. The problem is that I have a lot of data spanning several years. So I would like to find the earliest date (month) and count months from there, computing the average for each one. For example:
Month count, average
1, 0.4 (<= the earliest month)
2, -0.3
3, 0.0
...
12, 0.1
13, -0.4 (<= new year but counting of month is continuing)
14, 0.3
I'm using Pandas to open the CSV:
data = pd.read_csv("pks.csv", sep=",")
so data['Date'] holds the dates and data['Sentiment'] holds the values. Any idea how to do it?
To average each month, sum all the values that fall in that month and divide by the number of values recorded for that month (not by the number of months).
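The sum-then-divide idea can be sketched in pandas as follows. This is only an illustration on four rows taken from the sample data; the names `month` and `totals` are made up for the example.

```python
import pandas as pd

# Four rows from the sample data (two in January, two in March)
df = pd.DataFrame({
    "Date": pd.to_datetime(["2014-01-03", "2014-01-04",
                            "2014-03-01", "2014-03-02"]),
    "Sentiment": [0.4, -0.03, 0.1, -0.5],
})

# Label every row with its calendar month (year included)
month = df["Date"].dt.to_period("M")

# Per month: total of the values and how many values there are
totals = df.groupby(month)["Sentiment"].agg(["sum", "count"])

# Average = sum of the month's values / number of values in that month
totals["average"] = totals["sum"] / totals["count"]
print(totals)
```

In practice `groupby(...).mean()` does the same division in one call; the explicit version is shown only to make the arithmetic visible.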
There are several ways to find the average of the numbers in a list in Python. The two main ones are combining the built-in sum() and len() functions, and using the mean() function from the statistics module.
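A minimal sketch of both approaches, using the five January values from the sample data as the list (the variable names are illustrative):

```python
from statistics import mean

# January 2014 sentiment values from the sample data
values = [0.4, -0.03, 0.0, 0.07, 0.0]

# Approach 1: built-in sum() divided by len()
avg_builtin = sum(values) / len(values)

# Approach 2: mean() from the statistics module
avg_stats = mean(values)

# Round before printing to hide floating-point noise
print(round(avg_builtin, 3), round(avg_stats, 3))
```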
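To calculate the mean of particular columns in a DataFrame, select them with a list of column names and call mean(); for a single column, pandas.Series.mean() returns one number. You can also get the mean of all numeric columns by calling mean(numeric_only=True) on the DataFrame itself.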
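As a rough illustration of these column-wise means, here is a small sketch. The 'Score' column is hypothetical and exists only so there is more than one numeric column to average.

```python
import pandas as pd

df = pd.DataFrame({
    "Sentiment": [0.4, -0.03, 0.0, 0.07, 0.0],
    "Score": [1, 2, 3, 4, 5],  # hypothetical second column
})

# Mean of a single column (Series.mean) returns a scalar
col_mean = df["Sentiment"].mean()

# Mean of a chosen list of columns returns a Series, one entry per column
subset_means = df[["Sentiment", "Score"]].mean()

# Mean of every numeric column in the DataFrame
all_means = df.mean(numeric_only=True)

print(col_mean)
print(subset_means)
```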
Probably the simplest approach is to use the resample method. First, when you read in your data, make sure you parse the dates and set the date column as your index (ignore the StringIO part and the header=0 argument; I am reading in your sample data from a multi-line string):
>>> df = pd.read_csv(StringIO(data), header=0, parse_dates=['Date'],
...                  index_col='Date')
>>> df
Sentiment
Date
2014-01-03 0.40
2014-01-04 -0.03
2014-01-09 0.00
2014-01-10 0.07
2014-01-12 0.00
2014-02-24 0.00
2014-02-25 0.00
2014-02-25 0.00
2014-02-26 0.00
2014-02-28 0.00
2014-03-01 0.10
2014-03-02 -0.50
2014-03-03 0.00
2014-03-08 -0.06
2014-03-11 -0.13
2014-03-22 0.00
2014-03-23 0.33
2014-03-23 0.30
2014-03-25 -0.14
2014-03-28 -0.25
>>> df.resample('M').mean()
Sentiment
2014-01-31 0.088
2014-02-28 0.000
2014-03-31 -0.035
And if you want a month counter, you can add it after your resample (the counter below starts at 0; use range(1, len(agg) + 1) to start at 1):
>>> agg = df.resample('M').mean()
>>> agg['cnt'] = range(len(agg))
>>> agg
Sentiment cnt
2014-01-31 0.088 0
2014-02-28 0.000 1
2014-03-31 -0.035 2
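Since the question asks for a counter that starts at 1 and keeps running across years, one possible sketch is to compute the number of months elapsed since the earliest date and group on that. The inline CSV below is a shortened, partly invented sample (the 2015 row is not in the original data), and `month_count` is an illustrative column name.

```python
from io import StringIO

import pandas as pd

# Shortened sample; the 2015 row is invented to show the year rollover
csv_text = """Date,Sentiment
2014-01-03,0.4
2014-01-04,-0.03
2014-03-01,0.1
2014-03-02,-0.5
2015-01-10,0.2
"""

df = pd.read_csv(StringIO(csv_text), parse_dates=["Date"])

# Months elapsed since the earliest date, with the earliest month numbered 1
first = df["Date"].min()
df["month_count"] = (
    (df["Date"].dt.year - first.year) * 12
    + (df["Date"].dt.month - first.month)
    + 1
)

# Average sentiment per running month number
monthly = df.groupby("month_count")["Sentiment"].mean()
print(monthly)
```

With this approach a month without any data (February here) is simply absent from the result, whereas resample would emit a NaN row for it; in both cases the numbering of later months is unaffected.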
You can also do this with the groupby method and a pd.Grouper object (group by month and then call the mean convenience method that is available with groupby). pd.TimeGrouper, which older pandas versions used for this, has since been removed.
>>> df.groupby(pd.Grouper(freq='M')).mean()
Sentiment
2014-01-31 0.088
2014-02-28 0.000
2014-03-31 -0.035
To get the monthly average values of a DataFrame that has daily data rows in a 'Sentiment' column, I would:
1. Convert the column with the dates, df['date'], into the index of the DataFrame df: df.set_index('date', inplace=True)
2. Convert the index dates into a month index: df.index.month
3. Calculate the mean for each month: df.groupby(df.index.month).Sentiment.mean()
I'll go slowly through each step here:
First import pandas and NumPy, as well as the datetime module:
import numpy as np
import pandas as pd
from datetime import datetime
Generate a column 'date' between 1/1/2018 and 3/05/2018, at weekly ('W') intervals, and a column 'Sentiment' with random integers from 0 to 99:
date_rng = pd.date_range(start='1/1/2018', end='3/05/2018', freq='W')
df = pd.DataFrame(date_rng, columns=['date'])
df['Sentiment']=np.random.randint(0,100,size=(len(date_rng)))
df has two columns, 'date' and 'Sentiment':
date Sentiment
0 2018-01-07 34
1 2018-01-14 32
2 2018-01-21 15
3 2018-01-28 0
4 2018-02-04 95
5 2018-02-11 53
6 2018-02-18 7
7 2018-02-25 35
8 2018-03-04 17
Set the 'date' column as the index of the DataFrame:
df.set_index('date', inplace=True)
df now has one column, 'Sentiment', and the index is 'date':
Sentiment
date
2018-01-07 34
2018-01-14 32
2018-01-21 15
2018-01-28 0
2018-02-04 95
2018-02-11 53
2018-02-18 7
2018-02-25 35
2018-03-04 17
Take the month number of each index entry, then group by it and compute the mean:
months = df.index.month
monthly_avg = df.groupby(months).Sentiment.mean()
monthly_avg is:
date
1 20.25
2 47.50
3 17.00
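One caveat when the data spans more than one year: grouping on df.index.month uses only the month number, so January 2018 and January 2019 land in the same group. A possible way around this, sketched below on made-up weekly data, is to group on a monthly period (year plus month) obtained with to_period('M'). The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Weekly dates spanning two calendar years (illustrative data)
date_rng = pd.date_range(start="2018-01-01", end="2019-03-05", freq="W")
df = pd.DataFrame({"Sentiment": np.arange(len(date_rng))}, index=date_rng)

# Month number only: the same month of different years is merged
by_month_number = df.groupby(df.index.month)["Sentiment"].mean()

# Year-month period: every calendar month stays separate
by_year_month = df.groupby(df.index.to_period("M"))["Sentiment"].mean()

print(len(by_month_number))  # 12 groups
print(len(by_year_month))    # 15 groups
```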