
Adding a seasons column to data table based on month dates

Tags: r, data.table

I'm using data.table and want to add a new column, "Season", holding the season (e.g. summer, winter) that corresponds to each row's value in the "MonthName" column.

I'm wondering whether there is a more efficient way to add a season column to a data table based on month values.

These are the first 6 of 300,000 observations. Assume the table is called "dt".

    rrp         Year   Month Finyear hourminute AvgPriceByTOD MonthName
1: 35.27500     1999     1    1999      00:00      33.09037       Jan
2: 21.01167     1999     1    1999      00:00      33.09037       Jan
3: 25.28667     1999     2    1999      00:00      33.09037       Feb
4: 18.42334     1999     2    1999      00:00      33.09037       Feb
5: 16.67499     1999     2    1999      00:00      33.09037       Feb
6: 18.90001     1999     2    1999      00:00      33.09037       Feb

I have tried the following code:

dt[, Season := ifelse(MonthName == c("Jun", "Jul", "Aug"), "Winter",
                ifelse(MonthName == c("Dec", "Jan", "Feb"), "Summer",
                ifelse(MonthName == c("Sep", "Oct", "Nov"), "Spring",
                ifelse(MonthName == c("Mar", "Apr", "May"), "Autumn", NA))))]

Which returns:

    rrp         Year   Month Finyear hourminute AvgPriceByTOD MonthName Season
1: 35.27500     1999     1    1999      00:00      33.09037       Jan     NA
2: 21.01167     1999     1    1999      00:00      33.09037       Jan Summer
3: 25.28667     1999     2    1999      00:00      33.09037       Feb Summer
4: 18.42334     1999     2    1999      00:00      33.09037       Feb     NA
5: 16.67499     1999     2    1999      00:00      33.09037       Feb     NA
6: 18.90001     1999     2    1999      00:00      33.09037       Feb Summer

I also get these warnings:

Warning messages:
1: In MonthName == c("Jun", "Jul", "Aug") :
  longer object length is not a multiple of shorter object length
2: In MonthName == c("Dec", "Jan", "Feb") :
  longer object length is not a multiple of shorter object length
3: In MonthName == c("Sep", "Oct", "Nov") :
  longer object length is not a multiple of shorter object length
4: In MonthName == c("Mar", "Apr", "May") :
  longer object length is not a multiple of shorter object length 

Alongside this, for reasons I don't understand, some of the summer months are correctly assigned "Summer" but others are assigned NA. For example, rows 1 and 2 are both January and should both be "Summer", yet they get different values.

Thanks in advance!

Gin_Salmon asked Dec 08 '22 22:12


2 Answers

One pretty straightforward way is to use a lookup table to map month names to seasons:

# create a named vector where names are the month names and elements are seasons
seasons <- rep(c("winter","spring","summer","fall"), each = 3)
names(seasons) <- month.abb[c(6:12,1:5)] # thanks thelatemail for pointing out month.abb
seasons
#     Jun      Jul      Aug      Sep      Oct      Nov      Dec      Jan 
#"winter" "winter" "winter" "spring" "spring" "spring" "summer" "summer" 
#     Feb      Mar      Apr      May 
#"summer"   "fall"   "fall"   "fall" 

Use it:

dt[, season := seasons[MonthName]]
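One caveat worth checking before relying on this lookup: it matches on names only when the index is a character vector. The sketch below, which assumes a hypothetical month column stored as a factor (`month_fct` is not from the question), shows how indexing by a factor falls back to its integer codes, and how `as.character()` avoids that.

```r
# Month-to-season lookup, built the same way as in the answer above
seasons <- rep(c("winter", "spring", "summer", "fall"), each = 3)
names(seasons) <- month.abb[c(6:12, 1:5)]

# Hypothetical month column stored as a factor (an assumption for illustration)
month_fct <- factor(c("Jan", "Feb", "Jul"))

# Indexing with a factor uses its integer codes (1 to 3 here),
# so this selects the first three positions of `seasons`, not the named entries
by_code <- unname(seasons[month_fct])

# Converting to character first matches on names, as intended
by_name <- unname(seasons[as.character(month_fct)])

print(by_code)
print(by_name)
```

If `MonthName` in your table is a factor, `dt[, season := seasons[as.character(MonthName)]]` should behave like the character case.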

data:

library(data.table)
dt <- setDT(read.table(text="    rrp         Year   Month Finyear hourminute AvgPriceByTOD MonthName
1: 35.27500     1999     1    1999      00:00      33.09037       Jan
2: 21.01167     1999     1    1999      00:00      33.09037       Jan
3: 25.28667     1999     2    1999      00:00      33.09037       Feb
4: 18.42334     1999     2    1999      00:00      33.09037       Feb
5: 16.67499     1999     2    1999      00:00      33.09037       Feb
6: 18.90001     1999     2    1999      00:00      33.09037       Feb",
   header = TRUE, stringsAsFactors = FALSE))
Jota answered Jan 30 '23 20:01


It's a bit of typing, but the code is efficient:

dt[MonthName %in% c("Jun","Jul","Aug"), Season := "Winter"]
dt[MonthName %in% c("Dec","Jan","Feb"), Season := "Summer"]
dt[MonthName %in% c("Sep","Oct","Nov"), Season := "Spring"]
dt[MonthName %in% c("Mar","Apr","May"), Season := "Autumn"]

Here we are assigning by reference on a subset of the data.table.

I prefer this to a lot of nested ifelse() calls.
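Another flat alternative to nested ifelse() calls is data.table's `fcase()`. This is a swapped-in technique rather than the approach used above, and the sketch assumes data.table 1.13.0 or later; the toy table is made up for illustration.

```r
library(data.table)

# Toy data standing in for the question's table (values are made up)
dt <- data.table(MonthName = c("Jan", "Feb", "Apr", "Jul", "Oct", NA))

# fcase() takes condition/value pairs and uses the first condition that is TRUE;
# rows matching no condition get the default, which is NA
dt[, Season := fcase(
  MonthName %in% c("Jun", "Jul", "Aug"), "Winter",
  MonthName %in% c("Dec", "Jan", "Feb"), "Summer",
  MonthName %in% c("Sep", "Oct", "Nov"), "Spring",
  MonthName %in% c("Mar", "Apr", "May"), "Autumn"
)]

print(dt)
```

Rows with a missing or unrecognised month stay NA instead of silently falling into a season.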


If you want to check if a value is in a vector, you have to use %in%. See the different behaviour of:

myVec <- c("a","b","c")

"a" == myVec
# [1]  TRUE FALSE FALSE

"a" %in% myVec
# [1] TRUE
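The same difference explains the question's mixed "Summer"/NA results. As a small base R sketch using the six month names from the sample rows (the variable names here are illustrative), `==` recycles the shorter vector and compares position by position, whereas `%in%` tests membership for each element:

```r
# Six month names, as in the question's sample rows
month_name <- c("Jan", "Jan", "Feb", "Feb", "Feb", "Feb")
summer <- c("Dec", "Jan", "Feb")

# `==` recycles `summer` to length 6 and compares elementwise:
# Jan==Dec, Jan==Jan, Feb==Feb, Feb==Dec, Feb==Jan, Feb==Feb
elementwise <- month_name == summer

# %in% asks, for each element, whether it occurs anywhere in `summer`
membership <- month_name %in% summer

print(elementwise)
print(membership)
```

Only the positions that happen to line up with the recycled vector come out TRUE, which is why rows 1 and 2 in the question were treated differently. When the longer length is not a multiple of the shorter one, R also emits the "longer object length" warning shown in the question.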
SymbolixAU answered Jan 30 '23 21:01