
Temporarily change locale settings

Tags:

r

locale

strptime

Actual question

How can I temporarily change/specify the locale settings to be used for certain function calls (e.g. strptime())?

Background

I just ran the following rvest demo:

demo("tripadvisor", package = "rvest")

When it comes to the part where the dates are scraped, I run into problems that are most likely caused by my locale settings. The dates are in a US format, while I am on a German locale:

require("rvest")
url <- "http://www.tripadvisor.com/Hotel_Review-g37209-d1762915-Reviews-JW_Marriott_Indianapolis-Indianapolis_Indiana.html"

reviews <- url %>%
  html() %>%
  html_nodes("#REVIEWS .innerBubble")

date <- reviews %>%
  html_node(".rating .ratingDate") %>%
  html_attr("title")
> date
 [1] "December 9, 2014" "December 9, 2014" "December 8, 2014" "December 8, 2014"
 [5] "December 6, 2014" "December 5, 2014" "December 5, 2014" "December 3, 2014"
 [9] "December 3, 2014" "December 3, 2014"

Based on this output, I would use the following format: %B %e, %Y (or %B%e, %Y, depending on what "with a leading space for a single-digit number" actually means with respect to the leading space; see ?strptime).

Yet both fail:

strptime(date, "%B %e, %Y")
strptime(date, "%B%e, %Y")

I suppose it's due to the fact that %B expects the month names to be in German instead of English:

Full month name in the current locale. (Also matches abbreviated name on input.)


EDIT

Sys.setlocale() lets you change your locale settings. However, it seems that this is not possible after a function relying on locale settings has been called, i.e., you need to start a fresh R session for the locale change to take effect. This makes temporary changes a bit cumbersome. Any ideas how to work around this?

This is my locale:

> Sys.getlocale(category = "LC_ALL")
[1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"

When I change it before running strptime() for the first time, everything works just fine:

Sys.setlocale(category = "LC_ALL", locale = "us")
> strptime(date, "%B %e, %Y")
 [1] "2014-12-09 CET" "2014-12-09 CET" "2014-12-08 CET" "2014-12-08 CET" "2014-12-06 CET"
 [6] "2014-12-05 CET" "2014-12-05 CET" "2014-12-03 CET" "2014-12-03 CET" "2014-12-03 CET"

However, if I change it after having run strptime(), the change does not seem to be recognized:

> Sys.setlocale(category = "LC_ALL", locale = "German")
[1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"
> strptime(date, "%B %e, %Y")
 [1] "2014-12-09 CET" "2014-12-09 CET" "2014-12-08 CET" "2014-12-08 CET" "2014-12-06 CET"
 [6] "2014-12-05 CET" "2014-12-05 CET" "2014-12-03 CET" "2014-12-03 CET" "2014-12-03 CET"

If the change back to the German locale had taken effect, this should have produced a vector of NAs.
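One way to scope a locale change to a single call is to wrap the call in a small helper that switches LC_TIME and restores the previous value on exit. The following is a minimal sketch rather than a confirmed fix for the behaviour described above: with_time_locale() is a hypothetical helper name, and the "C" locale is assumed only because it is available on every platform and provides English month names. Whether a change made after an earlier strptime() call takes effect may depend on the R version and platform.

```r
# Hypothetical helper (not part of base R): evaluate `expr` while LC_TIME is
# temporarily set to `locale`, then restore the previous setting.
with_time_locale <- function(locale, expr) {
  old <- Sys.getlocale("LC_TIME")
  # Restore the original LC_TIME even if `expr` throws an error.
  on.exit(Sys.setlocale("LC_TIME", old), add = TRUE)
  # Sys.setlocale() returns "" (with a warning) if the locale is unavailable.
  new <- suppressWarnings(Sys.setlocale("LC_TIME", locale))
  if (identical(new, "")) {
    stop("Locale not available on this system: ", locale)
  }
  # `expr` is a lazily evaluated promise, so it runs under the new locale.
  force(expr)
}

# Example data in the format from the question.
date <- c("December 9, 2014", "December 3, 2014")

# "C" is assumed here as a portable locale with English month names.
parsed <- with_time_locale("C", strptime(date, "%B %e, %Y"))
format(parsed, "%Y-%m-%d")
```

The withr package provides the same pattern ready-made, e.g. withr::with_locale(c(LC_TIME = "C"), strptime(date, "%B %e, %Y")).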

Rappster asked Dec 10 '14 10:12


2 Answers

parse_date_time() from the lubridate package is what you are looking for. It has an explicit locale option for parsing strings according to a specific locale.

parse_date_time(date, orders = "B d, Y", locale = "us")

gives you:

[1] "2016-02-26 UTC" "2016-02-26 UTC" "2016-02-26 UTC" "2016-02-24 UTC" "2016-02-23 UTC" "2016-02-21 UTC"
[7] "2016-02-21 UTC" "2016-02-21 UTC" "2016-02-20 UTC" "2016-02-20 UTC"

Note that you give the parsing format without the leading % that you would use in strptime().
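The locale passed to parse_date_time() is an operating-system locale name, so "us" is a Windows-style name and may not be available on other systems. As a self-contained sketch, the example below assumes the portable "C" locale (English month names) and uses sample input in the same format as the scraped dates instead of the live page.

```r
library(lubridate)

# Sample strings in the same format as the scraped review dates.
date <- c("December 9, 2014", "December 3, 2014")

# Orders are written without the leading "%".
# "C" is an assumed portable locale; substitute a locale installed on your system.
parsed <- parse_date_time(date, orders = "B d, Y", locale = "C")
format(parsed, "%Y-%m-%d")
```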

Felix answered Nov 04 '22 19:11


You can also use readr::locale("en") inside readr::parse_date():

readr::parse_date(date, format = "%B %e, %Y",
                  # vector of strings to be interpreted as missing values:
                  na = c("", "NA"),
                  locale = readr::locale("en"),
                  # trim leading and trailing whitespace:
                  trim_ws = TRUE)

From the docs: "The locale controls defaults that vary from place to place. The default locale is US-centric (like R), but you can use locale() to create your own locale that controls things like the default time zone, encoding, decimal mark, big mark, and day/month names."
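Because the readr locale is an argument of the call rather than a session setting, nothing needs to be restored afterwards. The sketch below illustrates this with sample strings (not the scraped data); the German example string and its format are assumptions added for illustration.

```r
library(readr)

# Sample strings in the same format as the scraped review dates.
date <- c("December 9, 2014", "December 3, 2014")

# Remember the session locale to show that readr leaves it untouched.
session_locale <- Sys.getlocale("LC_TIME")

# English month names are supplied by the readr locale object, per call.
parsed <- parse_date(date, format = "%B %e, %Y", locale = locale("en"))

# The same call can parse German month names by passing a different locale.
german <- parse_date("9. Dezember 2014", format = "%d. %B %Y",
                     locale = locale("de"))

parsed
german
identical(Sys.getlocale("LC_TIME"), session_locale)
```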

alisson_Soares answered Nov 04 '22 19:11