How can I temporarily change/specify the locale settings to be used for certain function calls (e.g. strptime())?
I just ran the following rvest demo:
demo("tripadvisor", package = "rvest")
When it comes to the part where the dates are scraped, I run into problems that are most likely caused by my locale settings: the dates are in a US format while I'm on a German locale:
require("rvest")
url <- "http://www.tripadvisor.com/Hotel_Review-g37209-d1762915-Reviews-JW_Marriott_Indianapolis-Indianapolis_Indiana.html"
reviews <- url %>%
  html() %>%
  html_nodes("#REVIEWS .innerBubble")
date <- reviews %>%
  html_node(".rating .ratingDate") %>%
  html_attr("title")
> date
[1] "December 9, 2014" "December 9, 2014" "December 8, 2014" "December 8, 2014"
[5] "December 6, 2014" "December 5, 2014" "December 5, 2014" "December 3, 2014"
[9] "December 3, 2014" "December 3, 2014"
Based on this output, I would use the following format: %B %e, %Y (or %B%e, %Y, depending on what "with a leading space for a single-digit number" actually means with respect to the leading space; see ?strptime).
Yet both fail:
strptime(date, "%B %e, %Y")
strptime(date, "%B%e, %Y")
I suppose it's due to the fact that %B expects the month names to be in German instead of English. From ?strptime:
Full month name in the current locale. (Also matches abbreviated name on input.)
Sys.setlocale() lets you change your locale settings. But it seems that it's not possible to do so after a function relying on locale settings has been called, i.e. you need to start a fresh R session for the locale change to take effect. This makes temporary changes a bit cumbersome. Any ideas how to work around this?
This is my locale:
> Sys.getlocale(category = "LC_ALL")
[1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"
When I change it before running strptime() for the first time, everything works just fine:
Sys.setlocale(category = "LC_ALL", locale = "us")
> strptime(date, "%B %e, %Y")
[1] "2014-12-09 CET" "2014-12-09 CET" "2014-12-08 CET" "2014-12-08 CET" "2014-12-06 CET"
[6] "2014-12-05 CET" "2014-12-05 CET" "2014-12-03 CET" "2014-12-03 CET" "2014-12-03 CET"
However, if I change it after having run strptime(), the change does not seem to be recognized:
> Sys.setlocale(category = "LC_ALL", locale = "German")
[1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"
> strptime(date, "%B %e, %Y")
[1] "2014-12-09 CET" "2014-12-09 CET" "2014-12-08 CET" "2014-12-08 CET" "2014-12-06 CET"
[6] "2014-12-05 CET" "2014-12-05 CET" "2014-12-03 CET" "2014-12-03 CET" "2014-12-03 CET"
This should actually result in a vector of NAs if the change back to a German locale had taken effect.
parse_date_time() from the lubridate package is what you are looking for. It has an explicit locale option for parsing strings according to a specific locale.
parse_date_time(date, orders = "B d, Y", locale = "us")
gives you:
[1] "2016-02-26 UTC" "2016-02-26 UTC" "2016-02-26 UTC" "2016-02-24 UTC" "2016-02-23 UTC" "2016-02-21 UTC"
[7] "2016-02-21 UTC" "2016-02-21 UTC" "2016-02-20 UTC" "2016-02-20 UTC"
Note that you give the parsing orders without the leading % that strptime() formats require.
You can also use readr::locale("en") inside readr::parse_date():
readr::parse_date(date, format = "%B %e, %Y",
                  # vector of strings to be interpreted as missing values:
                  na = c("", "NA"),
                  locale = readr::locale("en"),
                  # to trim leading and trailing whitespace:
                  trim_ws = TRUE)
From the docs: "The locale controls defaults that vary from place to place. The default locale is US-centric (like R), but you can use locale() to create your own locale that controls things like the default time zone, encoding, decimal mark, big mark, and day/month names."
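If you would rather stay in base R, a minimal sketch of a temporary switch could look like the following. The helper name parse_with_time_locale is hypothetical (it is not part of any package), and the sketch assumes that changing only LC_TIME is enough and that the "C" locale, which uses English month names, is acceptable. Whether the switch takes effect after strptime() has already been called may depend on your R version and platform, given the behaviour reported in the question.

```r
# Hypothetical helper (not from the thread): parse dates with LC_TIME
# switched only for the duration of the call.
parse_with_time_locale <- function(x, format, locale = "C") {
  old <- Sys.getlocale("LC_TIME")
  # Restore the previous LC_TIME on exit, even if parsing fails.
  on.exit(Sys.setlocale("LC_TIME", old), add = TRUE)
  Sys.setlocale("LC_TIME", locale)
  as.Date(strptime(x, format))
}

# Example input in the same shape as the scraped dates above.
date <- c("December 9, 2014", "December 3, 2014")
parsed <- parse_with_time_locale(date, "%B %e, %Y")
```

Only LC_TIME is touched here, so collation, number formatting and the like keep their German settings. The withr package offers a similar scoped switch via withr::with_locale(c(LC_TIME = "C"), ...), if you prefer not to write the on.exit() handling yourself.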