Related to: Convert upper case words to title case
Some code that uses strings fetched from the web doesn't behave as I expect. You can reproduce the issue by running the following:
library(xml2)
library(magrittr)
x <- xml2::read_html("https://poesie.webnet.fr/lesgrandsclassiques/Authors/B") %>%
gsub("^.*?<span>(Pierre-Jean de BÉRANGER)</span>.*$","\\1",.)
x # [1] "Pierre-Jean de BÉRANGER"
This string is identical to "Pierre-Jean de BÉRANGER" copied and pasted from the page source. However, the following behavior is very disturbing to me:
y <- "Pierre-Jean de BÉRANGER"
x == y # TRUE
identical(x, y) # TRUE
gsub("\\b([A-Z])(\\w+)\\b", "\\1\\L\\2", x, perl = TRUE) # [1] "Pierre-Jean de BÉRANGER"
gsub("\\b([A-Z])(\\w+)\\b", "\\1\\L\\2", y, perl = TRUE) # [1] "Pierre-Jean de Béranger"
grepl("\\bB\\w+", x, perl = TRUE) # FALSE
grepl("\\bB\\w+", y, perl = TRUE) # TRUE
grepl("\\bB\\w", x, perl = TRUE) # TRUE
grepl("\\bB\\w", y, perl = TRUE) # TRUE
If x and y are identical, how can these give different output? From ?identical:
The safe and reliable way to test two objects for being exactly equal
Edit:
Here's an observable difference:
Encoding(x) # "UTF-8"
Encoding(y) # "latin1"
I'm running R version 3.5.0 on Windows.
To overcome that problem, you need to make sure your pattern is Unicode-aware, so that \w can match all Unicode letters and digits and \b can match at Unicode word boundaries. That is possible with the PCRE verb (*UCP):
gsub("(*UCP)\\b([A-Z])(\\w+)\\b", "\\1\\L\\2", x, perl = TRUE)
#     ^^^^^^ the (*UCP) verb
To make it fully Unicode-aware, use \p{Lu} instead of [A-Z]:
gsub("(*UCP)\\b(\\p{Lu})(\\w+)\\b", "\\1\\L\\2", x, perl = TRUE)
Also, if you do not want to match digits and _, you may replace \w with \p{L} (any letter):
gsub("(*UCP)\\b(\\p{Lu})(\\p{L}+)\\b", "\\1\\L\\2", x, perl = TRUE)
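The last pattern above can be wrapped in a small function so it can be tried without fetching the web page. This is only a sketch: to_title_case is a hypothetical name, not a base R function, and the test string is built with a \u escape so that it is marked as UTF-8 regardless of how the script file is saved. Lowercasing of accented letters by \L may still depend on the platform's locale support, so no output is claimed for the accented case.

```r
# UTF-8 marked string, standing in for the value scraped from the page.
x <- "Pierre-Jean de B\u00c9RANGER"

# Hypothetical helper: keep the leading capital of each word and
# lowercase the remaining letters, using Unicode-aware classes.
to_title_case <- function(s) {
  gsub("(*UCP)\\b(\\p{Lu})(\\p{L}+)\\b", "\\1\\L\\2", s, perl = TRUE)
}

# Plain ASCII input behaves as usual.
to_title_case("JOHN SMITH")

# With (*UCP), \w also matches the accented letter after "B".
grepl("(*UCP)\\bB\\w+", x, perl = TRUE)

# The accented name from the question.
to_title_case(x)
```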
If you check out the source of the identical() function, you can see that when it compares character data (the CHARSXP elements of a character vector), it calls the internal helper function Seql(). That function converts the string values to UTF-8 prior to doing the comparison. Thus identical() isn't checking that the encoding is necessarily the same, just that the value embedded in the encoding is the same.
In a perfect world, the identical() function should have an ignore.encoding= option in addition to all the other properties you can ignore when doing a comparison.
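To see that behavior without the web page, here is a minimal sketch that builds the same text in two encodings and compares them. The helper identical_enc is hypothetical (it is not part of base R); it only illustrates what a stricter comparison that also looks at the encoding mark could look like.

```r
x <- "B\u00c9RANGER"                          # \u escape: marked as UTF-8
y <- iconv(x, from = "UTF-8", to = "latin1")  # same text, marked as latin1

identical(x, y)                        # TRUE: the encoding mark is ignored
Encoding(x)                            # "UTF-8"
Encoding(y)                            # "latin1"
identical(charToRaw(x), charToRaw(y))  # FALSE: the underlying bytes differ

# Hypothetical stricter check: same value AND same declared encoding.
identical_enc <- function(a, b) {
  identical(a, b) && identical(Encoding(a), Encoding(b))
}

identical_enc(x, y)  # FALSE
identical_enc(x, x)  # TRUE
```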
But in theory the strings should really behave in the same way. So I guess you could blame the "perl" version of the regex engine here for not properly dealing with encoding. The base regex engine doesn't seem to have this problem:
grepl("B\\w+", x)
# [1] TRUE
grepl("B\\w+", y)
# [1] TRUE
@MrFlick explained the reasons behind the issue very well, and @Wiktor-Stribiżew gave a great solution for using the perl regex engine with mixed encodings, which preserves the original encoding.
Looking at the workflow, I believe that in practice it is good to know at all times which encoding you are working with and, whenever it's acceptable, to harmonize everything at the import/fetching step or right after.
In the above case there's no reason not to harmonize the encoding right after the external data is retrieved, to avoid such bad surprises.
This can be done by running the following as a second step:
x <- iconv(x, from="UTF-8", to="latin1")
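The same idea can be sketched in the other direction, assuming UTF-8 is an acceptable common target (for example, when some strings contain characters that latin1 cannot represent). enc2utf8() is a base R function; the variable names x2 and y2 are just for illustration. Note that once everything is UTF-8, a pattern using \w on accented letters would still need the (*UCP) verb mentioned above; the point here is only that both strings now behave the same way.

```r
x <- "Pierre-Jean de B\u00c9RANGER"           # UTF-8, as if fetched from the web
y <- iconv(x, from = "UTF-8", to = "latin1")  # latin1, as if typed locally

# Harmonize both strings to UTF-8 right after import.
x2 <- enc2utf8(x)
y2 <- enc2utf8(y)

Encoding(c(x2, y2))                      # both "UTF-8"
identical(charToRaw(x2), charToRaw(y2))  # TRUE: same bytes now

# Whatever the perl engine returns, it now returns it for both strings.
pattern <- "\\bB\\w+"
grepl(pattern, x2, perl = TRUE) == grepl(pattern, y2, perl = TRUE)
```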