I am attempting to parse a date from a string of text. I'm assuming the best way to do this is regex, but I haven't quite found a solution that works.
First, I used a CSS selector to grab a date from a website.
date <-html_nodes(x=doc, css=".middleheadline+ .topnewsbar b") %>% html_text()
This produces:
[1] "\r\n Washington,\r\n Jan 5, 2011"
I want to extract the date itself (here, Jan 5, 2011) from this string. NOTE: the month can be any month, the date can be any date, and the year can be anything from 2011-2015, so I'm trying to find an expression that can generally parse a date in the Mon D[D], YYYY format.
Here's one attempt:
date <-str_extract_all(string=date, pattern='[A-Z][a-z]{3,4} ([0-9]{1,2}), [0-9]{4}')
This produces character(0)
And another:
grep("[A-Z][a-z]{3,4} ([0-9]{1,2}), [0-9]{4}", date, value=TRUE)
which also produces character(0)
Any tips?
You may also try strsplit(). Sometimes I prefer it over a mind-numbing regular expression.
test <- c("\r\n Washington,\r\n Jan 5, 2011",
"\r\n Boston,\r\n Mar 15, 2015")
vapply(strsplit(test, ".*\n\\s+"), "[", "", 2)
# [1] "Jan 5, 2011" "Mar 15, 2015"
as.Date(vapply(strsplit(test, ".*\n\\s+"), "[", "", 2), "%b %d, %Y")
# [1] "2011-01-05" "2015-03-15"
You could try this:
date <-str_extract_all(string=date, pattern='\\w+\\s\\d+(st)?(nd)?(rd)?(th)?,\\s+\\d+')
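For what it's worth, the attempts in the question likely return character(0) because [A-Z][a-z]{3,4} requires a capital letter followed by three or four lowercase letters, while an abbreviation like "Jan" has only two. Below is a minimal base R sketch of the corrected pattern. It assumes three-letter month abbreviations, single spaces between the parts, and an English locale for the as.Date() step; the variable names are only illustrative.

```r
# Same two sample strings used in the other answers
test <- c("\r\n Washington,\r\n Jan 5, 2011",
          "\r\n Boston,\r\n Mar 15, 2015")

# [A-Z][a-z]{2} matches a three-letter month abbreviation such as "Jan".
# The question's {3,4} needs four or five letters, so "Jan" never matched.
pattern <- "[A-Z][a-z]{2} [0-9]{1,2}, [0-9]{4}"

# regexpr() locates the first match in each string; regmatches() extracts it
dates <- regmatches(test, regexpr(pattern, test))
dates
# [1] "Jan 5, 2011"  "Mar 15, 2015"

# Convert to Date; "%b" assumes month abbreviations in an English locale
as.Date(dates, format = "%b %d, %Y")
```

Note that regmatches() silently drops strings with no match, so the result can be shorter than the input if some strings contain no date.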
A function to convert the dates:
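make_dates <- function(x, date_format=TRUE, split="\n") {
  # Split each string into lines and keep the line that looks like "Mon D, YYYY"
  dates <- lapply(strsplit(x, split), function(x) {
    grep("\\w+ \\d+, \\d+", x, value=TRUE)})
  if(date_format) {
    # Drop all whitespace, then parse as abbreviated month, day, year
    strptime(gsub("\\s", "", dates), format="%b%d,%Y")
  } else {
    # Return the date text, dropping everything before the first word character
    gsub(".*?(\\w.*)", "\\1", dates)
  }
}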
test <- c("\r\n Washington,\r\n Jan 5, 2011",
"\r\n Boston,\r\n Mar 15, 2015")
make_dates(test)
#[1] "2011-01-05 EST" "2015-03-15 EDT"
make_dates(test, FALSE)
#[1] "Jan 5, 2011" "Mar 15, 2015"