
Extract date from a given string in R

string<-c("Posted 69 months ago (7/4/2011)")
library(gsubfn)
strapplyc(string, "(.*)", simplify = TRUE)

I applied the function above, but it does not give the result I want.

I want to extract only the date part, i.e. 7/4/2011.

Avinash asked Apr 14 '17 05:04





1 Answer

Six solutions follow. The first shows how to fix the code in the question so that it gives the desired answer. The next two use the same strapplyc call with different regular expressions. The fourth shows how to do it with gsub, the fifth breaks the gsub into two sub calls, and the sixth uses read.table.

1) Escape parentheses The problem is that ( and ) have special meaning in regular expressions, so in the question's pattern they act as a capture group around .* instead of matching literal parentheses. You must escape them if you want to match them literally. Writing them as "[(]" and "[)]", as we do below (or as "\\(" and "\\)"), matches them literally. The inner parentheses define the capture group, since we don't want that group to include the literal parentheses themselves:

strapplyc(string, "[(](.*)[)]", simplify = TRUE)
## [1] "7/4/2011"
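
For readers who do not have gsubfn installed, the same capture-group idea can be sketched in base R. This is only an illustrative alternative, not one of the numbered solutions; it assumes string is defined as in the question, and m is a hypothetical variable name.

```r
# Base R sketch of the capture-group approach (no gsubfn needed).
string <- "Posted 69 months ago (7/4/2011)"

# regexec() records the position of the full match and of each capture group;
# regmatches() then extracts the corresponding substrings.
m <- regmatches(string, regexec("[(](.*)[)]", string))

# m[[1]][1] is the full match including the parentheses,
# m[[1]][2] is the first capture group, i.e. the text between them.
date_text <- m[[1]][2]
print(date_text)
```

The second element is used because the first one still contains the literal parentheses.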

2) Match content Another way to do it is to match the data itself rather than the surrounding parentheses. Here "\\d+" matches one or more digits:

strapplyc(string, "\\d+/\\d+/\\d+", simplify = TRUE)
## [1] "7/4/2011"

You could specify the number of digits if you want to be even more specific but it seems unnecessary here if the data looks similar to that in the question.
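
As a rough sketch of what specifying the number of digits could look like, the pattern below allows one or two digits for each of the first two fields and exactly four for the year. The exact digit counts are an assumption about the data, and the example uses base R's regexpr and regmatches rather than strapplyc.

```r
# Sketch: a stricter pattern with explicit digit counts (assumed format:
# one or two digits, slash, one or two digits, slash, four-digit year).
string <- "Posted 69 months ago (7/4/2011)"

pattern <- "[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}"

# regexpr() finds the first match; regmatches() extracts it.
date_text <- regmatches(string, regexpr(pattern, string))
print(date_text)
```

The leading "69" is not picked up because it is not followed by a slash.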

3) Match 8 or more digits and slashes Given that there are no other sequences of 8 or more characters consisting only of slashes and digits in the rest of the string, we could just pick that out:

strapplyc(string, "[0-9/]{8,}", simplify = TRUE)
## [1] "7/4/2011"

4) Remove text before and after Another way of doing it is to remove everything up to the ( and after the ) like this:

gsub(".*[(]|[)].*", "", string)
## [1] "7/4/2011"

5) sub This is the same as (4) except it breaks the gsub into two sub invocations, one removing everything up to ( and the other removing ) onwards. The regular expressions are therefore slightly simpler.

sub(".*\\(", "", sub("\\).*", "", string))
## [1] "7/4/2011"
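
To make the two steps easier to follow, the nested call can be unrolled as in the sketch below. The intermediate names step1 and step2 are hypothetical and only used for illustration.

```r
# Sketch: the nested sub() calls from (5), written out one step at a time.
string <- "Posted 69 months ago (7/4/2011)"

# Inner call: remove the closing parenthesis and everything after it.
step1 <- sub("\\).*", "", string)
print(step1)

# Outer call: remove everything up to and including the opening parenthesis.
step2 <- sub(".*\\(", "", step1)
print(step2)
```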

6) read.table This solution uses no regular expressions at all. It defines sep and comment.char in read.table so that the second column of the result of read.table is the required date or dates.

read.table(text = string, sep = "(", comment.char = ")", as.is = TRUE)$V2
## [1] "7/4/2011"
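
Since read.table treats each element of text as a separate line, the same call should also handle several strings at once. The sketch below assumes every string has the same "text (date)" shape; the second string is made-up sample data, and strings and dates are hypothetical names.

```r
# Sketch: the read.table approach applied to more than one string.
# The second element is invented sample data with the same layout.
strings <- c("Posted 69 months ago (7/4/2011)",
             "Posted 4 months ago (12/25/2016)")

# sep = "(" splits each line at the opening parenthesis;
# comment.char = ")" makes read.table ignore ")" and anything after it.
dates <- read.table(text = strings, sep = "(", comment.char = ")",
                    as.is = TRUE)$V2
print(dates)
```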

Note: You don't need c() when defining string, because a single character string is already a character vector of length 1:

string <- c("Posted 69 months ago (7/4/2011)")
string2 <- "Posted 69 months ago (7/4/2011)"
identical(string, string2)
## [1] TRUE
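
If an actual Date object is wanted rather than a character string, the extracted text can be passed to as.Date. The sketch below assumes the text is in month/day/year order, which the question does not state explicitly; if the data is day/month/year, the format would be "%d/%m/%Y" instead.

```r
# Sketch: convert the extracted text to a Date.
# Assumption: "7/4/2011" means month/day/year.
string <- "Posted 69 months ago (7/4/2011)"

# Any of the extraction approaches could be used; this is solution (4).
date_text <- gsub(".*[(]|[)].*", "", string)

d <- as.Date(date_text, format = "%m/%d/%Y")
print(d)

# Individual components can then be pulled out with format().
year <- format(d, "%Y")
print(year)
```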
G. Grothendieck answered Sep 20 '22 06:09
