I have a wine dataset with a column called "title", which contains each wine's title, including its vintage year. See the sample:
I want to extract only the year from each string (e.g. 2013), not the other numbers in the string (e.g. 2, 4).
I got to this part:
wine_tidy2$vintage_year <- as.list(str_extract_all(wine_tidy2$title, "[0-9]+"))
But how do I exclude the other numbers that are not part of the year?
I also want to append the result to a data frame. The code above adds the result to the data frame as a list; how can I add it as another column of integers instead?
Thank you.
You can use sub() or regexec() from base R to search for numbers that have 4 digits:
string <- c('R2 2013 Camp 4 Vineyard Grenache Blanc', 'Santa Ynez Valley 1999', 'dsdd 2015')
sub("^.*([0-9]{4}).*", "\\1", string)
unlist(regmatches(string, regexec("[0-9]{4}", string)))
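The two calls above behave differently when a title has no four-digit number, which is why a helper function is useful. A small sketch of that difference, using an example vector where the last title has no year (the variable names `sub_result` and `match_result` are only for illustration):

```r
string <- c("R2 2013 Camp 4 Vineyard Grenache Blanc", "Santa Ynez Valley 1999", "dsdd 15")

# sub() returns a non-matching string unchanged, so "dsdd 15" comes back as is
sub_result <- sub("^.*([0-9]{4}).*", "\\1", string)

# regmatches() yields character(0) for a non-match, so unlist() silently drops it
match_result <- unlist(regmatches(string, regexec("[0-9]{4}", string)))

length(sub_result)    # same length as the input
length(match_result)  # shorter than the input, so it cannot be assigned to a column directly
```

Neither result can be used as a column as is: one contains a non-year string and the other has lost a row.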
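For your case:
# create a helper function that returns the first four-digit number in each string
yearExtract <- function(string) {
  # regexec() finds the first match per element; regmatches() extracts it
  matches <- regmatches(string, regexec("[0-9]{4}", string))
  sapply(matches, function(x) {
    if (length(x) > 0) {
      as.integer(x)   # integer, since the question asks for an integer column
    } else {
      NA_integer_     # no four-digit number found
    }
  })
}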
# create data.frame
title <- c('R2 2013 Camp 4 Vineyard Grenache Blanc', 'Santa Ynez Valley 1999', 'dsdd 15')
distributor <- c('a', 'b', 'd')
wine_tidy2 <- data.frame(title, distributor)
wine_tidy2$vintage_year <- yearExtract(as.character(wine_tidy2$title))
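Note that "[0-9]{4}" also matches the first four digits of a longer number. If that matters for your data, a stricter variant could look like the sketch below. It assumes vintages fall between 1900 and 2099 and stand alone as a word; both are assumptions about the data, not something stated in the question, and `extract_year` is a hypothetical name.

```r
# Sketch: vectorised extraction of a standalone year.
# Assumption: years are 1900-2099 and are not part of a longer number.
extract_year <- function(x) {
  x <- as.character(x)
  out <- rep(NA_integer_, length(x))
  # regexpr() gives the position of the first match per element, or -1 if none
  m <- regexpr("\\b(19|20)[0-9]{2}\\b", x, perl = TRUE)
  hit <- !is.na(m) & m != -1
  # regmatches() returns only the matched substrings, in input order
  out[hit] <- as.integer(regmatches(x, m))
  out
}

titles <- c("R2 2013 Camp 4 Vineyard Grenache Blanc",
            "Santa Ynez Valley 1999",
            "dsdd 15",
            "Batch 20135 Reserve")
years <- extract_year(titles)
```

Titles without a qualifying year get NA, so the result always has one element per row and can be assigned straight to wine_tidy2$vintage_year.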