Number of firms per year using dplyr or data.table

Question

Let's say I have a data frame:

df <- data.frame(City = c("NY", "NY", "NY", "NY", "NY", "LA", "LA", "LA", "LA"),
                 YearFrom = c("2001", "2003", "2002", "2006", "2008", "2004", "2005", "2005", "2002"),
                 YearTo = c(NA, "2005", NA, NA, "2009", NA, "2008", NA, NA))

where YearFrom is the year a firm was established and YearTo is the year it was closed. If YearTo is NA, the firm is still operating.

I would like to calculate the number of firms operating in each city in every year.

The table should look like this:

City  Year  Count
NY    2001  1
NY    2002  2
NY    2003  3
NY    2004  3
NY    2005  2
NY    2006  3
NY    2007  3
NY    2008  4
NY    2009  3
LA    2001  0
LA    2002  1
LA    2003  1
LA    2004  2
LA    2005  4
LA    2006  4
LA    2007  4
LA    2008  2
LA    2009  2

I would like to solve this with the dplyr or data.table package, but I can't figure out how.

Frank · Accepted Answer

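First, clean the data:

library(data.table)
setDT(df)

# year() comes from data.table, so the package has to be loaded first
curr_year = as.integer(year(Sys.Date()))

# the years are stored as strings (or factors); convert them to integers
df[, YearTo := as.integer(as.character(YearTo)) ]
df[, YearFrom := as.integer(as.character(YearFrom)) ]

# treat firms that are still open as running until the current year
df[, quasiYearTo := YearTo ]
df[is.na(YearTo), quasiYearTo := curr_year ]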

Then, a non-equi join:

df[CJ(City = City, Year = min(YearFrom):max(YearTo, na.rm=TRUE), unique=TRUE), 
  on=.(City, YearFrom <= Year, quasiYearTo > Year), allow.cartesian = TRUE, 
  .N
, by=.EACHI][, .(City, Year = YearFrom, N)]

    City Year N
 1:   LA 2001 0
 2:   LA 2002 1
 3:   LA 2003 1
 4:   LA 2004 2
 5:   LA 2005 4
 6:   LA 2006 4
 7:   LA 2007 4
 8:   LA 2008 3
 9:   LA 2009 3
10:   NY 2001 1
11:   NY 2002 2
12:   NY 2003 3
13:   NY 2004 3
14:   NY 2005 2
15:   NY 2006 3
16:   NY 2007 3
17:   NY 2008 4
18:   NY 2009 3
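
To make the join above easier to follow, here is a minimal sketch of the same pattern on a toy table. The names firms, grid and counts and the three example firms are made up for illustration, and every firm is assumed to have a known closing year. CJ() builds one row per City and Year, and by = .EACHI evaluates .N once for each of those rows, so a grid row without any match comes back with N = 0.

```r
library(data.table)

# Hypothetical toy data: YearTo is the closing year and is not counted
firms <- data.table(
  City     = c("NY", "NY", "LA"),
  YearFrom = c(2001L, 2003L, 2002L),
  YearTo   = c(2004L, 2005L, 2004L)
)

# One row per City/Year combination; unique = TRUE drops the repeated "NY"
grid <- CJ(City = firms$City, Year = 2001:2004, unique = TRUE)

# For each grid row, count the firms with YearFrom <= Year < YearTo
counts <- firms[grid,
                on = .(City, YearFrom <= Year, YearTo > Year),
                allow.cartesian = TRUE,
                .N, by = .EACHI][, .(City, Year = YearFrom, N)]

print(counts)
```

After a non-equi join the columns named in on= carry the values from the grid, which is why YearFrom is renamed to Year in the last step.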

Axeman · Answer

A shorter tidyverse solution.

library(dplyr)
library(purrr)
library(tidyr)

# First, some data prep
df <- mutate(df,
    YearFrom = as.numeric(as.character(YearFrom)),     # Fix year coding
    YearTo = as.numeric(as.character(YearTo)),
    # Replace NA with max + 1, since YearTo itself is not counted below
    YearTo = coalesce(YearTo, max(c(YearFrom, YearTo), na.rm = TRUE) + 1))

df %>%
  mutate(Years = map2(YearFrom, YearTo - 1, `:`)) %>%  # Find all years of operation
  unnest(Years) %>%                                    # Spread over rows
  count(Years, City) %>%                               # Count them
  complete(City, Years, fill = list(n = 0L))           # Add in zeros, if needed
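
The last step is what produces the zero row for LA in 2001: count() only returns the City and Year combinations that actually occur, and complete() adds the missing ones. A small sketch of that step on its own, using a made-up counts table (the object names and values are illustrative, not taken from the question):

```r
library(tidyr)

# Hypothetical output of count(): LA has no row for 2001
counts <- data.frame(
  City  = c("NY", "NY", "LA"),
  Years = c(2001, 2002, 2002),
  n     = c(1L, 2L, 1L)
)

# Expand to every City/Years combination and fill the new rows with n = 0
completed <- complete(counts, City, Years, fill = list(n = 0L))

print(completed)
```

Note that complete() only knows about years that appear somewhere in the data, so a year in which no city has any firm would still be absent.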

lmo · Answer

Here is one answer using data.table. The data preparation is at the bottom.

# get list of businesses, one obs per year of operation
cityList <- lapply(seq_len(nrow(df)),
              function(i) df[i, .(City, "Year"=seq(YearFrom, YearTo - 1))])

# combine to a single data.table
dfNew <- rbindlist(cityList)

# get counts
dfNew <- dfNew[, .(Count=.N), by=.(City, Year)]

Written as a single expression, this is

# get the counts
rbindlist(lapply(seq_len(nrow(df)),
          function(i) df[i, .(City, "Year"=seq(YearFrom, YearTo - 1))]))[, .(Count=.N),
  by=.(City, Year)]

Here, lapply runs through each row and builds a data.table with the city repeated in one column and the years of operation in a second column. YearTo is decremented so that the year of closure is not counted. Note that in the data preparation, missing YearTo values are set to 2018 so that 2017 (the current year at the time of writing) is included.

lapply returns a list of data.tables which is combined into a single data.table via rbindlist. This data.table is aggregated to city-year pairs and counts are constructed using .N.

Both versions return

    City Year Count
 1:   NY 2001     1
 2:   NY 2002     2
 3:   NY 2003     3
 4:   NY 2004     3
 5:   NY 2005     2
 6:   NY 2006     3
 7:   NY 2007     3
  ...
26:   LA 2012     3
27:   LA 2013     3
28:   LA 2014     3
29:   LA 2015     3
30:   LA 2016     3
31:   LA 2017     3
32:   LA 2002     1
33:   LA 2003     1

data

library(data.table)
setDT(df)
# convert string years to integers
df[, grep("Year", names(df), value=TRUE) := 
   lapply(.SD, function(x) as.integer(as.character(x))), .SDcols=grep("Year", names(df))]
# replace NA values with 2018 (to include 2017 in count); 2018L keeps the column integer
df[is.na(YearTo), YearTo := 2018L]
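
The row expansion described above can also be sketched without an explicit lapply, by grouping on a row id so that data.table builds the year sequence once per firm. This is only an illustration under stated assumptions: the names firms, last_year, firm_id, firm_years and counts are made up here, and open firms are counted through 2009 (the last year in the question's table) rather than through 2017.

```r
library(data.table)

# Example data from the question, with the years already stored as integers
firms <- data.table(
  City     = c("NY", "NY", "NY", "NY", "NY", "LA", "LA", "LA", "LA"),
  YearFrom = c(2001L, 2003L, 2002L, 2006L, 2008L, 2004L, 2005L, 2005L, 2002L),
  YearTo   = c(NA, 2005L, NA, NA, 2009L, NA, 2008L, NA, NA)
)

# Assumption: firms that are still open are counted through last_year
last_year <- 2009L
firms[is.na(YearTo), YearTo := last_year + 1L]

# One row per firm and year of operation; the closing year is excluded
firm_years <- firms[, .(City = City, Year = seq(YearFrom, YearTo - 1L)),
                    by = .(firm_id = seq_len(nrow(firms)))]

# Aggregate to city-year counts
counts <- firm_years[, .(Count = .N), by = .(City, Year)]

print(counts[order(City, Year)])
```

As with the lapply version, city-year pairs without any firm (LA in 2001) do not appear in the result, and a firm whose YearTo equals its YearFrom would need special handling because seq() would then count backwards.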

www · Answer

This solution uses dplyr and tidyr.

library(dplyr)
library(tidyr)

df %>%
  # Convert YearFrom and YearTo to numeric
  mutate(YearFrom = as.numeric(as.character(YearFrom)), 
         YearTo = as.numeric(as.character(YearTo))) %>%
  # Replace NA in YearTo with 2017
  mutate(YearTo = ifelse(is.na(YearTo), 2017, YearTo)) %>%
  # Subtract 1 from YearTo to exclude the year of cancellation
  mutate(YearTo = YearTo - 1) %>%
  # Group by row
  rowwise() %>%
  # Create a tibble for each row, expanding Year from YearFrom to YearTo
  do(tibble(City = .$City, Year = seq(.$YearFrom, .$YearTo, by = 1))) %>%
  ungroup() %>%
  # Count the firms for each City and Year
  count(City, Year) %>%
  # Rename the column n to Count
  rename(Count = n) %>%
  # Spread the data frame to expose the implicitly missing value for LA, 2001
  spread(Year, Count) %>%
  # Gather back to long format; convert = TRUE turns Year back into a number
  gather(Year, Count, - City, convert = TRUE) %>%
  # Replace NA with 0 in Count
  mutate(Count = ifelse(is.na(Count), 0L, Count)) %>%
  # Arrange the data 
  arrange(desc(City), Year) %>%
  # Keep the years up to 2009
  filter(Year <= 2009)

Tags: r, data-manipulation, data.table, dplyr

Mislav
