Reading foreign characters

I have a database containing the names of Premiership footballers which I am reading into R (3.02), but am encountering difficulties when it comes to players with foreign characters in their names (umlauts, accents etc.). The code below illustrates this:

PlayerData<-read.table("C:\\Users\\Documents\\Players.csv",quote=NULL, dec = ".",sep=",", stringsAsFactors=F,header=T,fill=T,blank.lines.skip = TRUE)
Test<-PlayerData[c(33655:33656),] #names of the players here are "Cazorla" "Özil"

Test[Test$Player=="Cazorla",] #Outputs correct details
Test[Test$Player=="Ozil",] # Cannot find the data: '<0 rows> (or 0-length row.names)'

#Example of how the foreign character is treated:
[1] "Ã"
[1] "Ö"
[1] "
[1] "z

I have tried replacing the characters, as described here: R: Replacing foreign characters in a string. However, the accented characters in my example appear to be read as two separate characters, so I do not think that approach works.

I would be grateful for any suggestions or workarounds.

The file is available for download here.

1 Answer

EDIT: It seems that the file you provided uses a different encoding than your system's native one.

An (experimental) encoding detection done by the stri_enc_detect function from the stringi package gives:

library(stringi)
PlayerDataRaw <- stri_read_raw('~/Desktop/PLAYERS.csv')
stri_enc_detect(PlayerDataRaw)
## [[1]]
## [[1]]$Encoding
## [1] "ISO-8859-1" "ISO-8859-2" "ISO-8859-9" "IBM424_rtl"
## [[1]]$Language
## [1] "en" "ro" "tr" "he"
## [[1]]$Confidence
## [1] 0.25 0.14 0.09 0.02

So most likely the file is in ISO-8859-1, a.k.a. latin1. Luckily, R does not have to re-encode the input while reading this file; it can simply mark the strings with an encoding other than the default (native) one. You can load the file with:

PlayerData <- read.table('~/Desktop/PLAYERS.csv',
    quote=NULL, dec = ".", sep=",",
    stringsAsFactors=FALSE, header=TRUE, fill=TRUE,
    blank.lines.skip=TRUE, encoding='latin1')
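To illustrate what "marking" means here, the following is a minimal base R sketch. It does not use the actual file; the byte vector is a made-up stand-in for how "Özil" would be stored in an ISO-8859-1 file, and the variable names are only for illustration.

```r
# Bytes of "Özil" as they would appear in an ISO-8859-1 file:
# 0xD6 is "Ö" in latin1 (hypothetical stand-in for one field of the CSV).
x <- rawToChar(as.raw(c(0xd6, 0x7a, 0x69, 0x6c)))
Encoding(x)   # "unknown": R assumes the native encoding

# Declare the encoding. Only the marking changes, the bytes stay the same.
Encoding(x) <- "latin1"
identical(charToRaw(x), as.raw(c(0xd6, 0x7a, 0x69, 0x6c)))

# Re-encoding to UTF-8 is a separate, explicit step.
y <- enc2utf8(x)
charToRaw(y)  # c3 96 7a 69 6c
```

Passing encoding='latin1' to read.table is meant to apply the same kind of marking to the strings it reads, which is why no conversion of the file is needed.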

Now you may access individual characters correctly, e.g. with the stri_sub function:

Test <- PlayerData[c(33655:33656),]
Test
##           T          Away H.A    Home  Player Year
## 33655 33654 CrystalPalace   1 Arsenal Cazorla 2013
## 33656 33655 CrystalPalace   1 Arsenal    Özil 2013

stri_sub(Test$Player, 1, length=1)
## [1] "C" "Ö"
stri_sub(Test$Player, 2, length=1)
## [1] "a" "z"

As for comparing strings, here is the result of an equality test with accented characters "flattened":

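# recent stringi releases do the collator-based test in stri_cmp_equiv;
# older releases accepted these options in stri_cmp_eq
stri_cmp_equiv("Özil", "Ozil", opts_collator=stri_opts_collator(strength=1))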
## [1] TRUE
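The same kind of comparison can be used to subset a data frame by an unaccented name. This is only a sketch: the players data frame is a hypothetical two-row stand-in for Test, and it assumes a stringi release that provides stri_cmp_equiv with an opts_collator argument.

```r
library(stringi)

# Hypothetical two-row stand-in for the Test data frame.
players <- data.frame(Player = c("Cazorla", "\u00d6zil"),
                      Year = c(2013, 2013),
                      stringsAsFactors = FALSE)

# Primary strength ignores accent and case differences. The "en" locale
# is fixed explicitly because some locales treat "Ö" as a separate letter.
coll <- stri_opts_collator(locale = "en", strength = 1)
hits <- players[stri_cmp_equiv(players$Player, "Ozil", opts_collator = coll), ]
hits$Player
```

Note that at this strength "ozil" would match as well, since case differences are also ignored.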

You may also get rid of accent characters by using iconv's transliterator (I am not sure whether it is available on Windows, though).

iconv(Test$Player, 'latin1', 'ASCII//TRANSLIT')
## [1] "Cazorla" "Ozil"

Or with a very powerful transliterator from the stringi package (stringi version >= 0.2-2):

stri_trans_general(Test$Player, 'Latin-ASCII')
## [1] "Cazorla" "Ozil"
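If you need to look players up by an unaccented name repeatedly, one option is to keep the transliterated names in a separate column and leave the original spelling intact. A minimal sketch, with a hypothetical players data frame and a made-up column name PlayerAscii:

```r
library(stringi)

# Hypothetical stand-in for the Test data frame.
players <- data.frame(Player = c("Cazorla", "\u00d6zil"),
                      stringsAsFactors = FALSE)

# Keep the original names and add an ASCII-only lookup key next to them.
players$PlayerAscii <- stri_trans_general(players$Player, "Latin-ASCII")

# An ordinary == comparison now finds the row.
players[players$PlayerAscii == "Ozil", ]
```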
