library(data.table)
# Target string to convert
DATE_DATA <- c("2015-01-02;2015-01-07;2021-05-02;2019-02-05",
"2017-08-02;2000-01-22;2003-03-07;2017-10-09",
"2013-08-02;2022-06-02;2012-03-15")
# Dataset
DT <- data.table(NAME = c("JOE","MARY","PAUL"),DATE = c(DATE_DATA))
Expected result -- convert DATE column in a new column call "period" like below: split + sorted decreasing = F + unique year
# period
1: 2015,2019,2021
2: 2000,2003,2017
3: 2012,2013,2022
The approaches like below i have don't meet excepted result
# 1st approach -- RESULT : created column with class -- "list"
DT[,period:= lapply(strsplit(DT$DATE,";"),
function(x) sort(unique(str_sub(x,1,4)),
decreasing = FALSE))]
# 2nd approach -- RESULT : created column with class -- "character" but value
# turn to "c("xxx", "xxx", "xxx")" , not expected
# "xxx,xxx,xxx"
DT[,period:= as.character(paste(lapply(strsplit(DT$DATE,";"),
function(x) sort(unique(str_sub(x,1,4)),
decreasing = FALSE)),collapse = ","))]
What step did i missed ? Thanks in advance
For each DATE we can split the DATE column on ";", convert them into Date, extract the year using format, take the unique years and paste them together using toString.
DT$Period <- sapply(DT$DATE, function(x)
toString(sort(unique(format(as.Date(strsplit(x, ";")[[1]]), "%Y")))))
DT
# NAME DATE Period
#1: JOE 2015-01-02;2015-01-07;2021-05-02;2019-02-05 2015, 2019, 2021
#2: MARY 2017-08-02;2000-01-22;2003-03-07;2017-10-09 2000, 2003, 2017
#3: PAUL 2013-08-02;2022-06-02;2012-03-15 2012, 2013, 2022
We can reduce the as.Date and format step using the year function from the lubridate package which gives the same output.
library(lubridate)
DT$Period <- sapply(DT$DATE, function(x)
toString(sort(unique(year(strsplit(x, ";")[[1]])))))
I am not a data.table expert but I think the thing which you are missing from your attempt is the grouping (by) parameter because currently it is giving you unique year from the entire DATE column, you need to specify that you need unique year for each row separately which is mentioned in the by argument.
DT[,period:= paste(sapply(strsplit(DATE,";"),
function(x) sort(unique(substr(x,1,4)),)),collapse = ","), by = 1:nrow(DT)]
DT
# NAME DATE period
#1: JOE 2015-01-02;2015-01-07;2021-05-02;2019-02-05 2015,2019,2021
#2: MARY 2017-08-02;2000-01-22;2003-03-07;2017-10-09 2000,2003,2017
#3: PAUL 2013-08-02;2022-06-02;2012-03-15 2012,2013,2022
We can do this using gsub and scan
DT[, Period := toString(sort(unique(scan(text=gsub("-\\d+",
"", DATE), what = numeric(), sep=";")))), NAME]
DT
# NAME DATE Period
#1: JOE 2015-01-02;2015-01-07;2021-05-02;2019-02-05 2015, 2019, 2021
#2: MARY 2017-08-02;2000-01-22;2003-03-07;2017-10-09 2000, 2003, 2017
#3: PAUL 2013-08-02;2022-06-02;2012-03-15 2012, 2013, 2022
Or another option is tidyverse, where we reshape into 'long' format by splitting the 'DATE' at ;, grouped by 'NAME', summarise the 'Period' as the sorted year of the converted Date class (ymd), do a join with the original dataset and select the columns to the appropriate order (if needed)
library(tidyverse)
DT %>%
separate_rows(DATE, sep = ";") %>%
group_by(NAME) %>%
summarise(Period = toString(sort(unique(year(ymd(DATE)))))) %>%
right_join(DT) %>%
select(names(DT), everything())
# A tibble: 3 x 3
# NAME DATE Period
# <chr> <chr> <chr>
#1 JOE 2015-01-02;2015-01-07;2021-05-02;2019-02-05 2015, 2019, 2021
#2 MARY 2017-08-02;2000-01-22;2003-03-07;2017-10-09 2000, 2003, 2017
#3 PAUL 2013-08-02;2022-06-02;2012-03-15 2012, 2013, 2022
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With