I'm struggling with something very basic: sorting a data frame based on a time format (month-year, or, “%B-%y” in this case). My goal is to calculate various monthly statistics, starting with sum.
The part of relevant part of the data frame looks like this * (This goes well and in accordance of my goal. I'm including it here to show where the problem could originate from)*:
> tmp09
Instrument AccountValue monthYear ExitTime
1 JPM 6997 april-07 2007-04-10
2 JPM 7261 mei-07 2007-05-29
3 JPM 7545 juli-07 2007-07-18
4 JPM 7614 juli-07 2007-07-19
5 JPM 7897 augustus-07 2007-08-22
10 JPM 7423 november-07 2007-11-02
11 KFT 6992 mei-07 2007-05-14
12 KFT 6944 mei-07 2007-05-21
13 KFT 7069 juli-07 2007-07-09
14 KFT 6919 juli-07 2007-07-16
# Order on the exit time, which corresponds with 'monthYear'
> tmp09.sorted <- tmp09[order(tmp09$ExitTime),]
> tmp09.sorted
Instrument AccountValue monthYear ExitTime
1 JPM 6997 april-07 2007-04-10
11 KFT 6992 mei-07 2007-05-14
12 KFT 6944 mei-07 2007-05-21
2 JPM 7261 mei-07 2007-05-29
13 KFT 7069 juli-07 2007-07-09
14 KFT 6919 juli-07 2007-07-16
3 JPM 7545 juli-07 2007-07-18
4 JPM 7614 juli-07 2007-07-19
5 JPM 7897 augustus-07 2007-08-22
10 JPM 7423 november-07 2007-11-02
So far, so good, and sorting based on ExitTime works. The trouble starts when I try to calculate the totals per month, followed by an attempt to sort this output:
# Calculate the total results per month
> Tmp09Totals <- tapply(tmp09.sorted$AccountValue, tmp09.sorted$monthYear, sum)
> Tmp09Totals <- data.frame(Tmp09Totals)
> Tmp09Totals
Tmp09Totals
april-07 6997
augustus-07 7897
juli-07 29147
mei-07 21197
november-07 7423
How can I sort this output in a chronological Way?
I've already tried (besides various attempts to convert the monthYear to another date format): order, sort, sort.list, sort_df, reshape, and calculating the sum based on tapply, lapply, sapply, aggregate. And even rewriting the rownames (by giving them a number from 1 to length (tmp09.sorted2$AccountValue
) didn't work. I also tried to give each month-year a different ID based on what I've learned in another question, but R also experienced difficulties in discriminating between the various month-year values.
The correct order of this output would be april-07,mei-07,juli-07,augustus07, november-07
:
apr-07 6997
mei-07 21197
jul-07 29147
aug-07 7897
nov-07 7423
It would be easier to have separate Month
and Year
factors, in the correct order, and use tapply
on the union of both variables, e.g.:
## The Month factor
tmp09 <- within(tmp09,
Month <- droplevels(factor(strftime(ExitTime, format = "%B"),
levels = month.name)))
## for @Jura25's locale, we can't use the in built English constant
## instead, we can use this solution, from ?month.name:
## format(ISOdate(2000, 1:12, 1), "%B"))
tmp09 <- within(tmp09,
Month <- droplevels(factor(strftime(ExitTime, format = "%B"),
levels = format(ISOdate(2000, 1:12, 1), "%B"))))
##
## And the Year factor
tmp09 <- within(tmp09, Year <- factor(strftime(ExitTime, format = "%Y")))
Which gives us (in my locale):
> head(tmp09)
Instrument AccountValue monthYear ExitTime Month Year
1 JPM 6997 april-07 2007-04-10 April 2007
2 JPM 7261 mei-07 2007-05-29 May 2007
3 JPM 7545 juli-07 2007-07-18 July 2007
4 JPM 7614 juli-07 2007-07-19 July 2007
5 JPM 7897 augustus-07 2007-08-22 August 2007
10 JPM 7423 november-07 2007-11-02 November 2007
Then use tapply
with both factors:
> with(tmp09, tapply(AccountValue, list(Month, Year), sum))
2007
April 6997
May 21197
July 29147
August 7897
November 7423
or via aggregate
:
> with(tmp09, aggregate(AccountValue, list(Month = Month, Year = Year), sum))
Month Year x
1 April 2007 6997
2 May 2007 21197
3 July 2007 29147
4 August 2007 7897
5 November 2007 7423
Try using the "yearmon"
class in zoo as it sorts appropriately. Below we create the sample DF
data frame and then we add a YearMonth
column of class "yearmon"
. Finally we perform our aggregation. The actual processing is just the last two lines (the other part is just to create the sample data frame).
Lines <- "Instrument AccountValue monthYear ExitTime
JPM 6997 april-07 2007-04-10
JPM 7261 mei-07 2007-05-29
JPM 7545 juli-07 2007-07-18
JPM 7614 juli-07 2007-07-19
JPM 7897 augustus-07 2007-08-22
JPM 7423 november-07 2007-11-02
KFT 6992 mei-07 2007-05-14
KFT 6944 mei-07 2007-05-21
KFT 7069 juli-07 2007-07-09
KFT 6919 juli-07 2007-07-16"
library(zoo)
DF <- read.table(textConnection(Lines), header = TRUE)
DF$YearMonth <- as.yearmon(DF$ExitTime)
aggregate(AccountValue ~ YearMonth + Instrument, DF, sum)
This gives the following:
> aggregate(AccountValue ~ YearMonth + Instrument, DF, sum)
YearMonth Instrument AccountValue
1 Apr 2007 JPM 6997
2 May 2007 JPM 7261
3 Jul 2007 JPM 15159
4 Aug 2007 JPM 7897
5 Nov 2007 JPM 7423
6 May 2007 KFT 13936
7 Jul 2007 KFT 13988
A slightly different approach and output uses read.zoo
directly. It produces one column per instrument and one row per year/month. We read in the columns assigning them appropriate classes using "NULL"
for the monthYear
column since we won't use that one. We also specify that the time index is the 3rd column of the remaining columns and that we want the input split into columns by the 1st column. FUN=as.yearmon
indicates that we want the time index to be converted from "Date"
class to "yearmon"
class and we aggregate everything using sum
.
z <- read.zoo(textConnection(Lines), header = TRUE, index = 3,
split = 1, colClasses = c("character", "numeric", "NULL", "Date"),
FUN = as.yearmon, aggregate = sum)
The resulting zoo object looks like this:
> z
JPM KFT
Apr 2007 6997 NA
May 2007 7261 13936
Jul 2007 15159 13988
Aug 2007 7897 NA
Nov 2007 7423 NA
We may prefer to keep it as a zoo object to take advantage of other functionality in zoo or we can convert it to a data frame like this: data.frame(Time = time(z), coredata(z))
which makes the time a separate column or as.data.frame(z)
which uses row names for the time. fortify.zoo()z)
also works.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With