I run into problems matching tables where one dataframe contains special characters and the other doesn't. Example: Doña Ana County vs. Dona Ana County
Here is a script where you can reproduce the outputs:
library(tidyverse)
library(acs)
tbl_df(acs::fips.place) # contains "Do\xf1a Ana County"
tbl_df(tigris::fips_codes) # contains "Dona Ana County"
Example:
tbl_df(tigris::fips_codes) %>% filter(county == "Dona Ana County")
returns:
# A tibble: 1 x 5
state state_code state_name county_code county
<chr> <chr> <chr> <chr> <chr>
1 NM 35 New Mexico 013 Dona Ana County
Unfortunately, the following queries all return nothing:
tbl_df(acs::fips.place) %>% filter(COUNTY == "Do\xf1a Ana County")
tbl_df(acs::fips.place) %>% filter(COUNTY == "Doña Ana County")
tbl_df(acs::fips.place) %>% filter(COUNTY == "Dona Ana County")
# A tibble: 0 x 7
# ... with 7 variables: STATE <chr>, STATEFP <int>, PLACEFP <int>, PLACENAME <chr>, TYPE <chr>, FUNCSTAT <chr>, COUNTY <chr>
However, when opening the dataframe in R Studio, it shows:
Question 1: Why does the second query give no return, though "Do\xf1a Ana County" appears in the database?
Question 2: How can I convert all "special" characters such as ñ into n, or similar (UTF-8?)? Is there a library or snippet for that, or definition in the header, instead of defining rules for every character? I would have to do this anyways in order to match certain columns from both tables.
Thank you!
Use
tbl_df(acs::fips.place) %>% filter(COUNTY == "Do\\xf1a Ana County")
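In your dataset, what is actually stored is the literal text \xf1 (a backslash followed by xf1), which you have to type as Do\\xf1a in an R string.
You can check this in the R console, for instance:
acs::fips.place[grep("Ana", acs::fips.place$COUNTY), ]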
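To see why the comparisons in the question fail, it can help to look at the raw bytes. A small sketch with two hand-typed strings (they are not taken from the dataset): one holds the escape sequence as literal text, the other holds the single latin1 byte.

```r
escaped <- "Do\\xf1a"  # literal text: D, o, \, x, f, 1, a
byte    <- "Do\xf1a"   # D, o, the single byte 0xf1, a

nchar(escaped, type = "bytes")  # 7
nchar(byte, type = "bytes")     # 4

charToRaw(escaped)  # 44 6f 5c 78 66 31 61
charToRaw(byte)     # 44 6f f1 61
```

Running charToRaw() on the value from acs::fips.place shows which of the two cases you actually have.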
To convert between encodings, use iconv(x, from = "", to = ""),
or enc2utf8(x) and enc2native(x), which don't take a "from" argument.
In most cases to build a package you need to convert data to UTF-8 (I have to transcode all my French strings when building packages). Here I think it's latin1, but the \ has been escaped.
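# Single backslash here: \xf1 is one byte, which is ñ in latin1
x <- "Do\xf1a Ana County"
Encoding(x) <- "latin1"
charToRaw(x)
# [1] 44 6f f1 61 20 41 6e 61 20 43 6f 75 6e 74 79
# Transcode to UTF-8: ñ becomes the two bytes c3 b1
xx <- iconv(x, "latin1", "UTF-8")
charToRaw(xx)
# [1] 44 6f c3 b1 61 20 41 6e 61 20 43 6f 75 6e 74 79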
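If the column really stores the escape as literal text (a backslash followed by xf1), iconv alone will not help, because there is no 0xf1 byte to convert. A minimal base R sketch, assuming ñ is the only escaped character you need to handle; the variable names are only for illustration:

```r
# Literal text: backslash, x, f, 1 (typed as "\\xf1" in R source)
escaped <- "Do\\xf1a Ana County"

# fixed = TRUE matches those four characters literally, without regex rules;
# "\u00f1" is the character ñ
unescaped <- gsub("\\xf1", "\u00f1", escaped, fixed = TRUE)

# Compare against the name written with a real ñ
unescaped == "Do\u00f1a Ana County"
```

Other escaped characters would each need their own replacement, or a more general unescaping step.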
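Finally, if you need to clean up your output to get comparable strings, you can use this function (straight from my own encoding hell). Extend the two strings with any other characters you need, e.g. ñ and n.
to.plain <- function(s) {
  # If your console is in LATIN1, use these two lines instead:
  # old1 <- iconv("èéêëù", "UTF-8")
  # new1 <- iconv("eeeeu", "UTF-8")
  old1 <- "èéêëù"  # accented characters to replace
  new1 <- "eeeeu"  # plain replacements, matched position by position
  chartr(old1, new1, s)  # last expression, so the result is returned visibly
}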
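For Question 2, a generic alternative to listing every character by hand is transliteration. The sketch below assumes the stringi package is installed (it is a dependency of stringr, which tidyverse loads) and uses its stri_trans_general() with the "Latin-ASCII" transform. The two small data frames are made up for illustration and only stand in for the acs and tigris tables.

```r
library(stringi)

# Hypothetical stand-ins for the two tables being matched
a <- data.frame(COUNTY = c("Do\u00f1a Ana County", "Bernalillo County"),
                stringsAsFactors = FALSE)
b <- data.frame(county = c("Dona Ana County", "Bernalillo County"),
                county_code = c("013", "001"),
                stringsAsFactors = FALSE)

# Strip accents and other Latin diacritics to build an ASCII join key
a$key <- stri_trans_general(a$COUNTY, "Latin-ASCII")
b$key <- stri_trans_general(b$county, "Latin-ASCII")

# Join on the normalized key instead of the original names
matched <- merge(a, b, by = "key")
```

If stringi is not available, iconv(x, to = "ASCII//TRANSLIT") is a base R option, but its output depends on the platform's iconv implementation.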