Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Creating monthly data and expanding data

Tags:

r

dplyr

tidyr

I have a data frame and I want to create an unbalanced panel based on the following dataset.

 profile<- c('lehman', 'john','oliver','stephen','picasso')
 start_date<-   c(2008-01-01, 2008-02-02,2008-04-02,2008-09-02,2009-02-02)
 end_date <-   c (2009-12-31, 2009-12-31, 2009-12-31,2009-12-31,2009-12-31)
 df<- data.frame(profile,start_date,end_date)

I would like to create two columns tid and myear. Myear is basically the month year starting from start date and it keeps expanding based on months till the last end date. Then I need a tid which is coded as 01 for myear 01-2008 and 02 for 02-2008 ....so on 12-2009 as 24. Can someone suggest how it can be done? Here is the expected output.

 profile      start_date    end_date     tid   myear
 lehman       2008-01-01    2009-12-31   01   01-2008
 lehman       2008-01-01    2009-12-31   02   02-2008
 ...          ..            ..           ..
 lehman       2008-01-01    2009-12-31   24   12-2009
 john         2008-02-02    2009-12-31   02   02-2008
 john         2008-02-02    2009-12-31   03   03-2008
 ..           ..             ..          ..
 john         2008-02-02    2009-12-31   24   12-2009
 ...          ..            ...          ..
 picasso      2009-02-02    2009-12-31   14   02-2009
 picasso      2009-03-02    2009-12-31   15   03-2009     
 ...          ...           ...          ..
like image 334
user3570187 Avatar asked Jun 13 '17 13:06

user3570187


People also ask

How to expand the outline of a month in Excel?

As a result, all rows with Jan-21 and Feb-21 are collapsed and only the totals are displayed. If we want to expand the entire outline again, (1) click on Grand Total, then in the Ribbon, (2) go to the Data tab, and in the Outline section, (3) click on Show Detail. Now all data is visible again, and the Month outline is expanded.

How to expand the outline of jan-21 and feb-21 in Excel?

As a result, all rows with Jan-21 and Feb-21 are collapsed and only the totals are displayed. If we want to expand the entire outline again, (1) click on Grand Total, then in the Ribbon, (2) go to the Data tab, and in the Outline section, (3) click on Show Detail.

How to expand and collapse rows and columns in Excel?

In this article, you will learn how to expand and collapse rows or columns by grouping them in Excel and Google Sheets. Excel allows us to group and ungroup data, which enables us to expand or collapse rows and columns to better organize our spreadsheets. This is possible by grouping data manually or using the Auto Outline option.

Can a newly generated dataset improve a model within a range?

Let’s say, we set aside a specific carefully curated test dataset for final testing. Now if the newly generated dataset, combined with the original dataset is able to improve the model within a range of threashold, it would solve our purpose and would be useful enough.


3 Answers

Here is an idea. First make sure your dates are as.Date (i.e. df[2:3] <- lapply(df[2:3], function(i) as.Date(i, format = '%Y-%m-%d')). Then create a list with the monthly sequence between start and end date. Count the lengths of that list and use them to expand your data frame. Add the sequence of dates as a new column and create tid based on each profile's length.

seq_lst <- lapply(Map(function(x, y) seq(x, y, by = 'months'), 
                      df$start_date, df$end_date), function(i) format(i, '%m-%Y'))

df <- df[rep(seq_len(nrow(df)), lengths(seq_lst)),]

df$myear <- unlist(seq_lst)

i1 <- setNames(seq(length(seq_lst[[1]])), seq_lst[[1]])
df$tid <- sprintf('%02d', i1[match(df$myear, names(i1))])

head(df)
#    profile start_date   end_date   myear tid
#1    lehman 2008-01-01 2009-12-31 01-2008  01
#1.1  lehman 2008-01-01 2009-12-31 02-2008  02
#1.2  lehman 2008-01-01 2009-12-31 03-2008  03
#1.3  lehman 2008-01-01 2009-12-31 04-2008  04
#1.4  lehman 2008-01-01 2009-12-31 05-2008  05
#1.5  lehman 2008-01-01 2009-12-31 06-2008  06
like image 160
Sotos Avatar answered Oct 31 '22 14:10

Sotos


Here is another possible way to achieve the task. I am following your sample data. For all names in profile, you have the same end_date, which is the 31st of December, 2009. The earliest start_date is the 1st of January, 2008. These two things are in my assumptions for the following code. So if your data is different from the sample data, the following would not be good.

I tried to create the sequences of dates using do(). Since I used group_by(), start_date and end_date were repeated in the accordance of the length of myear. Here, I created a sequence of dates by month and transformed the dates to the format you specified, namely year and month (e.g., 01-2008). myear is, hence, in character. Once, this job was done, I created tid. No matter what, the ending number is 24 for all levels in profile. So I did simple math. You want to know how many rows exist for each level of profile. Let's have a look of picasso. The start_date is Feb, 2009, which is the 14th month counting from Jan 2008. So you have 11 rows for picaso, which means n() = 11. Hence, (1 + (24 - 11)):24 creates a numeric sequence beginning at 14 and ending at 24. I leave a part of the output below of you.

library(dplyr)

group_by(df, profile) %>%
do(data.frame(start_date = .$start_date,
              end_date = .$end_date,
              myear = format(seq(from = .$start_date, to = .$end_date, by = "months"),
                             "%m-%Y")
             )
   ) %>%
mutate(tid = (1 + (24 - n())):24)

#69 picasso 2009-02-02 2009-12-31 02-2009  14
#70 picasso 2009-02-02 2009-12-31 03-2009  15
#71 picasso 2009-02-02 2009-12-31 04-2009  16
#72 picasso 2009-02-02 2009-12-31 05-2009  17
#73 picasso 2009-02-02 2009-12-31 06-2009  18
#74 picasso 2009-02-02 2009-12-31 07-2009  19
#75 picasso 2009-02-02 2009-12-31 08-2009  20
#76 picasso 2009-02-02 2009-12-31 09-2009  21
#77 picasso 2009-02-02 2009-12-31 10-2009  22
#78 picasso 2009-02-02 2009-12-31 11-2009  23
#79 picasso 2009-02-02 2009-12-31 12-2009  24

DATA

structure(list(profile = structure(c(2L, 1L, 3L, 5L, 4L), .Label = c("john", 
"lehman", "oliver", "picasso", "stephen"), class = "factor"), 
start_date = structure(c(1199113200, 1201878000, 1207062000, 
1220281200, 1233500400), class = c("POSIXct", "POSIXt"), tzone = ""), 
end_date = structure(c(1262185200, 1262185200, 1262185200, 
1262185200, 1262185200), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("profile", 
"start_date", "end_date"), row.names = c(NA, -5L), class = "data.frame")
like image 23
jazzurro Avatar answered Oct 31 '22 13:10

jazzurro


This solution is based on functions from tidyverse, lubridate, and stringr.

Update

I misunderstood the definition of tid. Now the code should calculate tid as expected. tid shows the total number of records but the beginning of tid is the earliest month of the earliest year, while myear is month and year information combined together.

library(tidyverse)
library(lubridate)
library(stringr)    

df2 <- df %>%
  mutate(start_date = ymd(start_date), end_date = ymd(end_date)) %>%
  mutate(start_year = year(start_date), end_year = year(end_date),
         start_month = month(start_date), end_month = month(end_date)) %>%
  mutate(Year = map2(start_year, end_year, `:`)) %>%
  unnest() %>%
  group_by(profile) %>%
  mutate(first_year = ifelse(Year == min(Year), TRUE, FALSE),
         last_year = ifelse(Year == max(Year), TRUE, FALSE)) %>%
  mutate(start_month = ifelse(!first_year, 1, start_month),
         end_month = ifelse(!last_year, 12, end_month)) %>%
  mutate(Month = map2(start_month, end_month, `:`)) %>%
  unnest() %>%
  mutate(endid = n() + Month - 1) %>%
  mutate(tid = first(Month):first(endid)) %>%
  mutate(Multiple_Year = ifelse(length(unique(Year)) > 1, TRUE, FALSE)) %>%
  ungroup() %>%
  mutate(tid = ifelse(Year > min(Year) & !Multiple_Year,
                      tid + 12 * (Year - min(Year)), tid)) %>%
  mutate(tid = str_pad(tid, width = 2, pad = "0")) %>%
  mutate(Month = str_pad(Month, width = 2, pad = "0")) %>%
  mutate(myear = paste(Month, Year, sep = "-")) %>%
  select(profile, start_date, end_date, tid, myear)

Output

Now examine part of the output df2 to see if the code works as expected.

The first two rows of lehman

df2 %>%
  filter(profile %in% "lehman") %>%
  head(2)

    # A tibble: 2 x 5
  profile start_date   end_date   tid   myear
   <fctr>     <date>     <date> <chr>   <chr>
1  lehman 2008-01-01 2009-12-31    01 01-2008
2  lehman 2008-01-01 2009-12-31    02 02-2008

The last one row of lehman

df2 %>%
  filter(profile %in% "lehman") %>%
  tail(1)

# A tibble: 1 x 5
  profile start_date   end_date   tid   myear
   <fctr>     <date>     <date> <chr>   <chr>
1  lehman 2008-01-01 2009-12-31    24 12-2009

The first two rows of picasso

df2 %>% 
  filter(profile %in% "picasso") %>% 
  head(2)  

# A tibble: 2 x 5
  profile start_date   end_date   tid   myear
   <fctr>     <date>     <date> <chr>   <chr>
1 picasso 2009-02-02 2009-12-31    14 02-2009
2 picasso 2009-02-02 2009-12-31    15 03-2009

Data Preparation

profile <- c('lehman', 'john','oliver','stephen','picasso')
start_date <- c("2008-01-01", "2008-02-02", "2008-04-02", "2008-09-02", "2009-02-02")
end_date <- c("2009-12-31", "2009-12-31", "2009-12-31", "2009-12-31", "2009-12-31")
df <- data.frame(profile,start_date,end_date)
like image 30
www Avatar answered Oct 31 '22 12:10

www