Creating monthly data and expanding data

Tags:

I have a data frame and I want to create an unbalanced panel based on the following dataset.

 profile<- c('lehman', 'john','oliver','stephen','picasso')
 start_date<-   c(2008-01-01, 2008-02-02,2008-04-02,2008-09-02,2009-02-02)
 end_date <-   c (2009-12-31, 2009-12-31, 2009-12-31,2009-12-31,2009-12-31)
 df<- data.frame(profile,start_date,end_date)

I would like to create two columns tid and myear. Myear is basically the month year starting from start date and it keeps expanding based on months till the last end date. Then I need a tid which is coded as 01 for myear 01-2008 and 02 for 02-2008 ....so on 12-2009 as 24. Can someone suggest how it can be done? Here is the expected output.

 profile      start_date    end_date     tid   myear
 lehman       2008-01-01    2009-12-31   01   01-2008
 lehman       2008-01-01    2009-12-31   02   02-2008
 ...          ..            ..           ..
 lehman       2008-01-01    2009-12-31   24   12-2009
 john         2008-02-02    2009-12-31   02   02-2008
 john         2008-02-02    2009-12-31   03   03-2008
 ..           ..             ..          ..
 john         2008-02-02    2009-12-31   24   12-2009
 ...          ..            ...          ..
 picasso      2009-02-02    2009-12-31   14   02-2009
 picasso      2009-03-02    2009-12-31   15   03-2009     
 ...          ...           ...          ..

334

asked Jun 13 '17 13:06

user3570187

3 Answers

Here is an idea. First make sure your dates are as.Date (i.e. df[2:3] <- lapply(df[2:3], function(i) as.Date(i, format = '%Y-%m-%d')). Then create a list with the monthly sequence between start and end date. Count the lengths of that list and use them to expand your data frame. Add the sequence of dates as a new column and create tid based on each profile's length.

seq_lst <- lapply(Map(function(x, y) seq(x, y, by = 'months'), 
                      df$start_date, df$end_date), function(i) format(i, '%m-%Y'))

df <- df[rep(seq_len(nrow(df)), lengths(seq_lst)),]

df$myear <- unlist(seq_lst)

i1 <- setNames(seq(length(seq_lst[[1]])), seq_lst[[1]])
df$tid <- sprintf('%02d', i1[match(df$myear, names(i1))])

head(df)
#    profile start_date   end_date   myear tid
#1    lehman 2008-01-01 2009-12-31 01-2008  01
#1.1  lehman 2008-01-01 2009-12-31 02-2008  02
#1.2  lehman 2008-01-01 2009-12-31 03-2008  03
#1.3  lehman 2008-01-01 2009-12-31 04-2008  04
#1.4  lehman 2008-01-01 2009-12-31 05-2008  05
#1.5  lehman 2008-01-01 2009-12-31 06-2008  06

160

answered Oct 31 '22 14:10

Sotos

Here is another possible way to achieve the task. I am following your sample data. For all names in profile, you have the same end_date, which is the 31st of December, 2009. The earliest start_date is the 1st of January, 2008. These two things are in my assumptions for the following code. So if your data is different from the sample data, the following would not be good.

I tried to create the sequences of dates using do(). Since I used group_by(), start_date and end_date were repeated in the accordance of the length of myear. Here, I created a sequence of dates by month and transformed the dates to the format you specified, namely year and month (e.g., 01-2008). myear is, hence, in character. Once, this job was done, I created tid. No matter what, the ending number is 24 for all levels in profile. So I did simple math. You want to know how many rows exist for each level of profile. Let's have a look of picasso. The start_date is Feb, 2009, which is the 14th month counting from Jan 2008. So you have 11 rows for picaso, which means n() = 11. Hence, (1 + (24 - 11)):24 creates a numeric sequence beginning at 14 and ending at 24. I leave a part of the output below of you.

library(dplyr)

group_by(df, profile) %>%
do(data.frame(start_date = .$start_date,
              end_date = .$end_date,
              myear = format(seq(from = .$start_date, to = .$end_date, by = "months"),
                             "%m-%Y")
             )
   ) %>%
mutate(tid = (1 + (24 - n())):24)

#69 picasso 2009-02-02 2009-12-31 02-2009  14
#70 picasso 2009-02-02 2009-12-31 03-2009  15
#71 picasso 2009-02-02 2009-12-31 04-2009  16
#72 picasso 2009-02-02 2009-12-31 05-2009  17
#73 picasso 2009-02-02 2009-12-31 06-2009  18
#74 picasso 2009-02-02 2009-12-31 07-2009  19
#75 picasso 2009-02-02 2009-12-31 08-2009  20
#76 picasso 2009-02-02 2009-12-31 09-2009  21
#77 picasso 2009-02-02 2009-12-31 10-2009  22
#78 picasso 2009-02-02 2009-12-31 11-2009  23
#79 picasso 2009-02-02 2009-12-31 12-2009  24

DATA

structure(list(profile = structure(c(2L, 1L, 3L, 5L, 4L), .Label = c("john", 
"lehman", "oliver", "picasso", "stephen"), class = "factor"), 
start_date = structure(c(1199113200, 1201878000, 1207062000, 
1220281200, 1233500400), class = c("POSIXct", "POSIXt"), tzone = ""), 
end_date = structure(c(1262185200, 1262185200, 1262185200, 
1262185200, 1262185200), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("profile", 
"start_date", "end_date"), row.names = c(NA, -5L), class = "data.frame")

answered Oct 31 '22 13:10

jazzurro

This solution is based on functions from tidyverse, lubridate, and stringr.

Update

I misunderstood the definition of tid. Now the code should calculate tid as expected. tid shows the total number of records but the beginning of tid is the earliest month of the earliest year, while myear is month and year information combined together.

library(tidyverse)
library(lubridate)
library(stringr)    

df2 <- df %>%
  mutate(start_date = ymd(start_date), end_date = ymd(end_date)) %>%
  mutate(start_year = year(start_date), end_year = year(end_date),
         start_month = month(start_date), end_month = month(end_date)) %>%
  mutate(Year = map2(start_year, end_year, `:`)) %>%
  unnest() %>%
  group_by(profile) %>%
  mutate(first_year = ifelse(Year == min(Year), TRUE, FALSE),
         last_year = ifelse(Year == max(Year), TRUE, FALSE)) %>%
  mutate(start_month = ifelse(!first_year, 1, start_month),
         end_month = ifelse(!last_year, 12, end_month)) %>%
  mutate(Month = map2(start_month, end_month, `:`)) %>%
  unnest() %>%
  mutate(endid = n() + Month - 1) %>%
  mutate(tid = first(Month):first(endid)) %>%
  mutate(Multiple_Year = ifelse(length(unique(Year)) > 1, TRUE, FALSE)) %>%
  ungroup() %>%
  mutate(tid = ifelse(Year > min(Year) & !Multiple_Year,
                      tid + 12 * (Year - min(Year)), tid)) %>%
  mutate(tid = str_pad(tid, width = 2, pad = "0")) %>%
  mutate(Month = str_pad(Month, width = 2, pad = "0")) %>%
  mutate(myear = paste(Month, Year, sep = "-")) %>%
  select(profile, start_date, end_date, tid, myear)

Output

Now examine part of the output df2 to see if the code works as expected.

The first two rows of lehman

df2 %>%
  filter(profile %in% "lehman") %>%
  head(2)

    # A tibble: 2 x 5
  profile start_date   end_date   tid   myear
   <fctr>     <date>     <date> <chr>   <chr>
1  lehman 2008-01-01 2009-12-31    01 01-2008
2  lehman 2008-01-01 2009-12-31    02 02-2008

The last one row of lehman

df2 %>%
  filter(profile %in% "lehman") %>%
  tail(1)

# A tibble: 1 x 5
  profile start_date   end_date   tid   myear
   <fctr>     <date>     <date> <chr>   <chr>
1  lehman 2008-01-01 2009-12-31    24 12-2009

The first two rows of picasso

df2 %>% 
  filter(profile %in% "picasso") %>% 
  head(2)  

# A tibble: 2 x 5
  profile start_date   end_date   tid   myear
   <fctr>     <date>     <date> <chr>   <chr>
1 picasso 2009-02-02 2009-12-31    14 02-2009
2 picasso 2009-02-02 2009-12-31    15 03-2009

Data Preparation

profile <- c('lehman', 'john','oliver','stephen','picasso')
start_date <- c("2008-01-01", "2008-02-02", "2008-04-02", "2008-09-02", "2009-02-02")
end_date <- c("2009-12-31", "2009-12-31", "2009-12-31", "2009-12-31", "2009-12-31")
df <- data.frame(profile,start_date,end_date)

answered Oct 31 '22 12:10

www

Related questions
                            
                                Read column names as date format
                            
                                How can I maintain a color scheme across ggplots, while dropping unused levels in each plot?
                            
                                How to increase the size of the text in a Bayesian network plot with bnlearn in R
                            
                                R dplyr method to replace all empty factors with NA
                            
                                Adding multiple reactive plots and tables to Shiny app
                            
                                Group by aggregate dynamic column name matching
                            
                                Refering to a variable of the data frame passed in the 'data' parameter of ggplot function
                            
                                Speed up INSERT of 1 million+ rows into Postgres via R using COPY?
                            
                                How to plot a function family in ggplot2
                            
                                Print label on circle markers in leaflet in Rshiny
                            
                                How to do group matching in R?
                            
                                Running a GLM with a Gamma distribution, but data includes zeros
                            
                                Concatenating strings using group_by and summarise in r [duplicate]
                            
                                Why set.seed() affects sample() in R
                            
                                How add named element to R vector with name from a variable
                            
                                devtools equivalent of RStudio Build panel buttons
                            
                                Recommended way for variable scoping [duplicate]
                            
                                Replace random values in a column in a dataframe
                            
                                How to use biglm with more than 2^31 observations
                            
                                Print message into R markdown console while knitting

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With