Parsing date in Mon, DD, YYYY format using RegEx in R

Tags: date, regex, r

I am attempting to parse a date from a string of text. I'm assuming the best way to do this is regex, but I haven't quite found a solution that works.

First, I used a CSS selector to grab a date from a website.

date <-html_nodes(x=doc, css=".middleheadline+ .topnewsbar b") %>% html_text()

This produces:

[1] "\r\n        Washington,\r\n        Jan 5, 2011"

I want to extract the date itself (here, Jan 5, 2011) from this string. NOTE: the month can be any month, the day can be any day, and the year can be anything from 2011 to 2015, so I'm looking for an expression that can parse any date in the Mon D[D], YYYY format.

Here's one attempt:

date <-str_extract_all(string=date, pattern='[A-Z][a-z]{3,4} ([0-9]{1,2}), [0-9]{4}')

This produces character(0)

And another:

grep("[A-Z][a-z]{3,4} ([0-9]{1,2}), [0-9]{4}", date, value=TRUE)

which also produces character(0)

Any tips?

Rachel B. asked Aug 05 '15 15:08


3 Answers

You may also try strsplit(). Sometimes I prefer it over a mind-numbing regular expression.

test <- c("\r\n        Washington,\r\n        Jan 5, 2011",
    "\r\n        Boston,\r\n        Mar 15, 2015")

vapply(strsplit(test, ".*\n\\s+"), "[", "", 2)
# [1] "Jan 5, 2011"  "Mar 15, 2015"

as.Date(vapply(strsplit(test, ".*\n\\s+"), "[", "", 2), "%b %d, %Y")
# [1] "2011-01-05" "2015-03-15"
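
One caveat: `%b` is matched against the month abbreviations of the current locale, so the `as.Date()` call above can return `NA` in a non-English session. A rough locale-independent sketch, assuming English three-letter month abbreviations in the input (the names `dates`, `parts`, `m`, `iso`, and `parsed` are only illustrative):

```r
# Dates as already extracted by the strsplit() step above
dates <- c("Jan 5, 2011", "Mar 15, 2015")

# Capture month, day, and year as separate groups
parts <- regmatches(dates, regexec("([A-Za-z]{3}) ([0-9]{1,2}), ([0-9]{4})", dates))
m <- do.call(rbind, parts)  # columns: full match, month, day, year

# month.abb is base R's constant of English abbreviations, so the lookup
# does not depend on the session locale
iso <- sprintf("%s-%02d-%02d", m[, 4], match(m[, 2], month.abb), as.integer(m[, 3]))
parsed <- as.Date(iso)
parsed
# [1] "2011-01-05" "2015-03-15"
```

This trades the one-line `as.Date()` call for a few extra steps, so it is probably only worth it if the script has to run under different locales.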
Rich Scriven answered Oct 06 '22 00:10


You could try this:

date <-str_extract_all(string=date, pattern='\\w+\\s\\d+(st)?(nd)?(rd)?(th)?,\\s+\\d+')
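
For what it's worth, the pattern in the question fails because `[A-Z][a-z]{3,4}` requires a capital letter followed by three or four lowercase letters, while "Jan" has only two. Note also that `grep(..., value=TRUE)` returns the whole matching element rather than just the date part. A minimal base-R sketch, assuming the site always uses three-letter month abbreviations (the names `x`, `original`, `fixed`, and `found` are mine, not from the question):

```r
# Sample inputs in the same shape as the scraped string
x <- c("\r\n        Washington,\r\n        Jan 5, 2011",
       "\r\n        Boston,\r\n        Mar 15, 2015")

# Pattern from the question: [a-z]{3,4} demands 3-4 lowercase letters
# after the capital, so "Jan" (capital + 2 lowercase) never matches
original <- "[A-Z][a-z]{3,4} ([0-9]{1,2}), [0-9]{4}"
grepl(original, x)
# [1] FALSE FALSE

# Adjusted quantifier (assumes three-letter month abbreviations)
fixed <- "[A-Z][a-z]{2} [0-9]{1,2}, [0-9]{4}"

# regexpr() locates the first match in each string; regmatches() extracts it
found <- regmatches(x, regexpr(fixed, x))
found
# [1] "Jan 5, 2011"  "Mar 15, 2015"
```

If some months are spelled with four letters (e.g. "Sept"), `{2,3}` in place of `{2}` should cover both cases.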


teoreda answered Oct 06 '22 00:10


A function to convert the dates:

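make_dates <- function(x, date_format=TRUE, split="\n") {
  # Split each string on newlines and keep only the piece that looks like a date
  dates <- lapply(strsplit(x, split), function(x) {
    grep("\\w+ \\d+, \\d+", x, value=TRUE)})

  if(date_format) {
    # Strip all whitespace, then parse strings such as "Jan5,2011"
    strptime(gsub("\\s", "", dates), format="%b%d,%Y")
  } else {
    # Drop only the leading whitespace and return the date as text
    gsub(".*?(\\w.*)", "\\1", dates)
  }
}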

test <- c("\r\n        Washington,\r\n        Jan 5, 2011",
       "\r\n        Boston,\r\n        Mar 15, 2015")

make_dates(test)
#[1] "2011-01-05 EST" "2015-03-15 EDT"
make_dates(test, FALSE)
#[1] "Jan 5, 2011"  "Mar 15, 2015"
Pierre L answered Oct 06 '22 00:10