Let's say I have a data frame:
df <- data.frame(City = c("NY", "NY", "NY", "NY", "NY", "LA", "LA", "LA", "LA"),
YearFrom = c("2001", "2003", "2002", "2006", "2008", "2004", "2005", "2005", "2002"),
YearTo = c(NA, "2005", NA, NA, "2009", NA, "2008", NA, NA))
where YearFrom is the year in which, e.g., a firm was established and YearTo is the year in which it was closed. If YearTo is NA, the firm is still operating.
I would like to calculate the number of firms for every year.
The table should look like this:
City  Year  Count
NY    2001  1
NY    2002  2
NY    2003  3
NY    2004  3
NY    2005  2
NY    2006  3
NY    2007  3
NY    2008  4
NY    2009  3
LA    2001  0
LA    2002  1
LA    2003  1
LA    2004  2
LA    2005  4
LA    2006  4
LA    2007  4
LA    2008  2
LA    2009  2
I would like to solve this with the dplyr or data.table package, but I can't figure out how.
First, to clean the data...
library(data.table)
# year() is provided by data.table, so the package must be loaded first
curr_year = as.integer(year(Sys.Date()))
setDT(df)
df[, YearTo := as.integer(as.character(YearTo)) ]
df[, YearFrom := as.integer(as.character(YearFrom)) ]
df[, quasiYearTo := YearTo ]
df[is.na(YearTo), quasiYearTo := curr_year ]
Then, a non-equi join:
df[CJ(City = City, Year = min(YearFrom):max(YearTo, na.rm=TRUE), unique=TRUE),
on=.(City, YearFrom <= Year, quasiYearTo > Year), allow.cartesian = TRUE,
.N
, by=.EACHI][, .(City, Year = YearFrom, N)]
City Year N
1: LA 2001 0
2: LA 2002 1
3: LA 2003 1
4: LA 2004 2
5: LA 2005 4
6: LA 2006 4
7: LA 2007 4
8: LA 2008 3
9: LA 2009 3
10: NY 2001 1
11: NY 2002 2
12: NY 2003 3
13: NY 2004 3
14: NY 2005 2
15: NY 2006 3
16: NY 2007 3
17: NY 2008 4
18: NY 2009 3
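As a sanity check on the join conditions, the same counts can be reproduced in base R by testing each city-year pair directly against the rule "YearFrom <= Year, and YearTo is either missing or greater than Year". This is only a minimal sketch and not part of the answer above: the names `firms`, `count_active` and `grid` are hypothetical, and the 2001:2009 range is an assumption chosen to match the output shown.

```r
# Rebuild the example data with integer years (NA = still operating)
firms <- data.frame(
  City     = c("NY", "NY", "NY", "NY", "NY", "LA", "LA", "LA", "LA"),
  YearFrom = c(2001L, 2003L, 2002L, 2006L, 2008L, 2004L, 2005L, 2005L, 2002L),
  YearTo   = c(NA, 2005L, NA, NA, 2009L, NA, 2008L, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of firms active in `city` during `year`.
# A firm counts from YearFrom up to, but not including, YearTo.
count_active <- function(city, year, data = firms) {
  active <- data$City == city &
    data$YearFrom <= year &
    (is.na(data$YearTo) | data$YearTo > year)
  sum(active)
}

# Every city-year combination in the assumed range
grid <- expand.grid(Year = 2001:2009, City = c("LA", "NY"),
                    stringsAsFactors = FALSE)
grid$N <- mapply(count_active, grid$City, grid$Year)
grid[, c("City", "Year", "N")]
```

Because every combination is evaluated explicitly, years with no active firm (such as LA in 2001) appear with a count of 0 without any extra completion step. This is slower than the non-equi join on large data, so it is best used only for checking.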
A shorter tidyverse solution.
library(dplyr)
library(tidyr)
library(purrr)
# First, some data prep
df <- mutate(df,
  YearFrom = as.numeric(as.character(YearFrom)), # Fix year coding
  YearTo = as.numeric(as.character(YearTo)),
  YearTo = coalesce(YearTo, max(c(YearFrom, YearTo), na.rm = TRUE))) # Replace NA with max
df %>%
  mutate(Years = map2(YearFrom, YearTo - 1, `:`)) %>% # Find all years
  unnest(Years) %>% # Spread over rows
  count(Years, City) %>% # Count them
  complete(City, Years, fill = list(n = 0)) # Add in zeros, if needed
Here is one answer using data.table. The data preparation is at the bottom.
# get list of businesses, one obs per year of operation
cityList <- lapply(seq_len(nrow(df)),
function(i) df[i, .(City, "Year"=seq(YearFrom, YearTo - 1))])
# combine to a single data.table
dfNew <- rbindlist(cityList)
# get counts
dfNew <- dfNew[, .(Count=.N), by=.(City, Year)]
Written as a single expression, this is
# get the counts
rbindlist(lapply(seq_len(nrow(df)),
function(i) df[i, .(City, "Year"=seq(YearFrom, YearTo - 1))]))[, .(Count=.N),
by=.(City, Year)]
Here, lapply runs through each row and constructs a data.table with the repeated city value as one column and the years of operation as a second column. YearTo is decremented so that the year of closure is not included. Note that in the data preparation, the missing values are set to 2018 so that the current year (2017) is included.
lapply returns a list of data.tables, which rbindlist combines into a single data.table. This data.table is then aggregated to city-year pairs, with the counts computed using .N.
Both versions return
City Year Count
1: NY 2001 1
2: NY 2002 2
3: NY 2003 3
4: NY 2004 3
5: NY 2005 2
6: NY 2006 3
7: NY 2007 3
...
26: LA 2012 3
27: LA 2013 3
28: LA 2014 3
29: LA 2015 3
30: LA 2016 3
31: LA 2017 3
32: LA 2002 1
33: LA 2003 1
Data
library(data.table)
setDT(df)
# convert string years to integers
df[, grep("Year", names(df), value=TRUE) :=
lapply(.SD, function(x) as.integer(as.character(x))), .SDcols=grep("Year", names(df))]
# replace NA values with 2018 (to include 2017 in count)
df[is.na(YearTo), YearTo := 2018]
This solution uses dplyr and tidyr.
library(dplyr)
library(tidyr)
df %>%
# Change YearFrom and YearTo to numeric
mutate(YearFrom = as.numeric(as.character(YearFrom)),
YearTo = as.numeric(as.character(YearTo))) %>%
# Replace NA with 2017 in YearTo
mutate(YearTo = ifelse(is.na(YearTo), 2017, YearTo)) %>%
# Subtract 1 from YearTo to exclude the year of cancellation
mutate(YearTo = YearTo - 1) %>%
# Group by row
rowwise() %>%
# Create a tbl for each row, expand the Year column based on YearFrom and YearTo
do(tibble(City = .$City, Year = seq(.$YearFrom, .$YearTo, by = 1))) %>%
ungroup() %>%
# Count the number of each City and Year
count(City, Year) %>%
# Rename the column n to Count
rename(Count = n) %>%
# Spread the data frame to expose the implicit missing value in LA, 2001
spread(Year, Count) %>%
# Gather the data frame back to long format so the missing value becomes an explicit NA
gather(Year, Count, - City) %>%
# Replace NA with 0 in Count
mutate(Count = ifelse(is.na(Count), 0L, Count)) %>%
# Arrange the data
arrange(desc(City), Year) %>%
# Filter the data until 2009
filter(Year <= 2009)