
Strings are identical (using `base::identical`) and yet behave differently with `grepl` / `gsub`

Tags: regex, r, encoding

Related to: Convert upper case words to title case

Some code that uses strings fetched from the web doesn't behave as I expect. You can reproduce the issue by running the following:

library(xml2)
library(magrittr)
x <- xml2::read_html("https://poesie.webnet.fr/lesgrandsclassiques/Authors/B") %>%
  gsub("^.*?<span>(Pierre-Jean de BÉRANGER)</span>.*$","\\1",.)
x # [1] "Pierre-Jean de BÉRANGER"

This string is identical to "Pierre-Jean de BÉRANGER" copied and pasted from the page source. However, I find the following behavior very disturbing:

y <- "Pierre-Jean de BÉRANGER"
x == y  # TRUE
identical(x, y) # TRUE
gsub("\\b([A-Z])(\\w+)\\b", "\\1\\L\\2", x, perl = TRUE) # [1] "Pierre-Jean de BÉRANGER"
gsub("\\b([A-Z])(\\w+)\\b", "\\1\\L\\2", y, perl = TRUE) # [1] "Pierre-Jean de Béranger"
grepl("\\bB\\w+", x, perl = TRUE) # FALSE
grepl("\\bB\\w+", y, perl = TRUE) # TRUE
grepl("\\bB\\w", x, perl = TRUE)  # TRUE
grepl("\\bB\\w", y, perl = TRUE)  # TRUE

If x and y are identical, how can these give different outputs?

From ?identical:

The safe and reliable way to test two objects for being exactly equal


Edit:

Here's an observable difference:

Encoding(x) # "UTF-8"
Encoding(y) # "latin1"

I'm running R version 3.5.0 on Windows.

Moody_Mudskipper asked Aug 15 '18 19:08


3 Answers

To overcome that problem, you need to make sure your pattern is Unicode-aware, so that \w can match all Unicode letters and digits and \b can match at Unicode word boundaries. That is possible by using the PCRE verb (*UCP):

gsub("(*UCP)\\b([A-Z])(\\w+)\\b", "\\1\\L\\2", x, perl = TRUE)
      ^^^^^^

To make it fully Unicode-aware, use \p{Lu} instead of [A-Z]:

gsub("(*UCP)\\b(\\p{Lu})(\\w+)\\b", "\\1\\L\\2", x, perl = TRUE)

Also, if you do not want to match digits and _, you may replace \w with \p{L} (any letter):

gsub("(*UCP)\\b(\\p{Lu})(\\p{L}+)\\b", "\\1\\L\\2", x, perl = TRUE)
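The patterns above can be tried without fetching the page. This is only a minimal sketch: x_utf8 is a hypothetical stand-in for the scraped string, built from a Latin-1 byte and converted with enc2utf8(). Results for patterns without (*UCP) can depend on the platform and R version, so only the Unicode-aware calls are shown.

```r
# Hypothetical stand-in for the scraped string: same text, marked as UTF-8.
y_latin1 <- "B\xc9RANGER"          # \xc9 is the Latin-1 byte for E-acute
Encoding(y_latin1) <- "latin1"
x_utf8 <- enc2utf8(y_latin1)

# With (*UCP), \w follows Unicode properties, so it can match the accented letter.
ucp_word <- grepl("(*UCP)^B\\w+$", x_utf8, perl = TRUE)

# \p{Lu} and \p{L} are Unicode property classes (uppercase letter, any letter).
ucp_letter <- grepl("(*UCP)^\\p{Lu}\\p{L}+$", x_utf8, perl = TRUE)

print(c(ucp_word = ucp_word, ucp_letter = ucp_letter))
```

Note that (*UCP) has to be the very first thing in the pattern, before any anchor.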
Wiktor Stribiżew answered Nov 20 '22 17:11


If you check out the source of the identical() function, you can see that when it's passed a STRSXP value (a character vector), it compares the elements (CHARSXP values) with the internal helper function Seql(). That function converts string values to UTF-8 prior to doing the comparison. Thus identical() isn't checking that the encoding is necessarily the same, just that the value embedded in the encoding is the same.
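A small sketch makes the difference visible at the byte level. The names x_utf8 and y_latin1 are hypothetical stand-ins for the x and y of the question, built locally so that no network access is needed.

```r
# Build the same text in two encodings (hypothetical stand-ins for x and y).
y_latin1 <- "B\xc9RANGER"          # \xc9 is the Latin-1 byte for E-acute
Encoding(y_latin1) <- "latin1"
x_utf8 <- enc2utf8(y_latin1)

# identical() compares the text, not the stored bytes.
same_value <- identical(x_utf8, y_latin1)
same_bytes <- identical(charToRaw(x_utf8), charToRaw(y_latin1))

print(Encoding(c(x_utf8, y_latin1)))
print(charToRaw(x_utf8))           # the accented letter takes two bytes in UTF-8
print(charToRaw(y_latin1))         # and one byte in latin1
print(c(same_value = same_value, same_bytes = same_bytes))
```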

In a perfect world, the identical() function should have an ignore.encoding= option in addition to all the other properties you can ignore when doing a comparison.
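Until such an option exists, a stricter check can be approximated by hand. The helper below, identical_strict(), is hypothetical (it is not part of base R) and only a sketch for character vectors: it requires identical() to hold and the declared encodings to match.

```r
# Hypothetical helper (not part of base R): identical() plus an encoding check.
identical_strict <- function(a, b) {
  stopifnot(is.character(a), is.character(b))  # Encoding() needs character vectors
  identical(a, b) && identical(Encoding(a), Encoding(b))
}

# Same text in two encodings (stand-ins for x and y from the question).
y_latin1 <- "B\xc9RANGER"          # \xc9 is the Latin-1 byte for E-acute
Encoding(y_latin1) <- "latin1"
x_utf8 <- enc2utf8(y_latin1)

print(identical(x_utf8, y_latin1))
print(identical_strict(x_utf8, y_latin1))
print(identical_strict(x_utf8, x_utf8))
```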

But in theory the strings should really behave in the same way. So I guess you could blame the "perl" version of the regex engine here for not properly dealing with encoding. The default regex engine (TRE, used when perl = FALSE) doesn't seem to have this problem:

grepl("B\\w+", x)
# [1] TRUE
grepl("B\\w+", y)
# [1] TRUE
MrFlick answered Nov 20 '22 15:11


@MrFlick explained the reasons behind the issue very well, and @Wiktor-Stribiżew gave a great solution for using the perl regex engine with mixed encodings, which preserves the original encoding.

Looking at the workflow, I believe that in practice it is good to know what encoding one is working with at all times and, whenever it's acceptable, to harmonize everything at the import/fetching step or right after.

In the above case there's no reason not to harmonize the encoding right after the external data is retrieved, to avoid such bad surprises.

This can be done as a second step by running:

x <- iconv(x, from="UTF-8", to="latin1")
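To check that the conversion did what was intended, the declared encoding and the raw bytes can be compared afterwards. This is a sketch with hypothetical stand-ins (x_utf8 for the fetched string, y_latin1 for the pasted one), so it runs without network access.

```r
# Stand-ins: y_latin1 plays the pasted string, x_utf8 the fetched one.
y_latin1 <- "B\xc9RANGER"          # \xc9 is the Latin-1 byte for E-acute
Encoding(y_latin1) <- "latin1"
x_utf8 <- enc2utf8(y_latin1)

# Harmonize right after import, converting UTF-8 to latin1.
x_harmonized <- iconv(x_utf8, from = "UTF-8", to = "latin1")

# After the conversion, both the declared encoding and the bytes agree.
print(Encoding(c(x_harmonized, y_latin1)))
print(identical(charToRaw(x_harmonized), charToRaw(y_latin1)))
```

Going the other way, with enc2utf8() on the latin1 string, would harmonize everything to UTF-8 instead. Which direction fits best depends on the rest of the workflow.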
Moody_Mudskipper answered Nov 20 '22 15:11