
Reading foreign characters

I have a database containing the names of Premiership footballers, which I am reading into R (3.0.2), but I am running into difficulties with players who have foreign characters in their names (umlauts, accents, etc.). The code below illustrates this:

PlayerData <- read.table("C:\\Users\\Documents\\Players.csv", quote=NULL, dec=".", sep=",", stringsAsFactors=FALSE, header=TRUE, fill=TRUE, blank.lines.skip=TRUE)
Test<-PlayerData[c(33655:33656),] #names of the players here are "Cazorla" "Özil"

Test[Test$Player=="Cazorla",] #Outputs correct details
Test[Test$Player=="Ozil",] # Cannot find data: '<0 rows> (or 0-length row.names)'

#Example of how the foreign character is treated:
substr("Özil",1,1)
[1] "Ã"
substr("Özil",1,2)
[1] "Ö"
substr("Özil",2,2)
[1] "
substr("Özil",2,3)
[1] "z

I have tried replacing the characters, as described in "R: Replacing foreign characters in a string", but since the accented characters in my example appear to be read as two separate characters, I do not think that approach works.

I would be grateful for any suggestions or workarounds.

The file is available for download here.

Pash101 asked Apr 18 '14 11:04



1 Answer

EDIT: It seems that the file you provided uses a different encoding than your system's native one.

An (experimental) encoding detection done by the stri_enc_detect function from the stringi package gives:

library('stringi')
PlayerDataRaw <- stri_read_raw('~/Desktop/PLAYERS.csv')
stri_enc_detect(PlayerDataRaw)
## [[1]]
## [[1]]$Encoding
## [1] "ISO-8859-1" "ISO-8859-2" "ISO-8859-9" "IBM424_rtl"
## 
## [[1]]$Language
## [1] "en" "ro" "tr" "he"
## 
## [[1]]$Confidence
## [1] 0.25 0.14 0.09 0.02

So the file is most likely in ISO-8859-1, a.k.a. latin1. Luckily, R does not have to re-encode the input while reading this file; it can simply mark the strings with an encoding other than the default (native) one. You can load the file with:

PlayerData<-read.table('~/Desktop/PLAYERS.csv',
    quote=NULL, dec = ".", sep=",", 
    stringsAsFactors=FALSE, header=TRUE, fill=TRUE,
    blank.lines.skip=TRUE, encoding='latin1')
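The difference between marking and re-encoding can be seen on a single string. The following is a minimal base R sketch rather than part of the original answer; the bytes are the ISO-8859-1 encoding of "Özil", and the variable names are made up for illustration.

```r
# Build "Özil" from its ISO-8859-1 bytes (0xD6 is "Ö" in latin1).
x <- rawToChar(as.raw(c(0xd6, 0x7a, 0x69, 0x6c)))
Encoding(x)               # "unknown": R assumes the native encoding

# Marking: only the declared encoding changes, the bytes stay the same.
Encoding(x) <- "latin1"
charToRaw(x)              # d6 7a 69 6c

# Re-encoding: the bytes themselves are converted to UTF-8.
y <- enc2utf8(x)
charToRaw(y)              # c3 96 7a 69 6c
Encoding(y)               # "UTF-8"
```

The encoding argument of read.table does the marking step for every string it reads, which is why no conversion of the file is needed.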

Now you may access individual characters correctly, e.g. with the stri_sub function:

Test<-PlayerData[c(33655:33656),]
Test
##           T          Away H.A    Home  Player Year
## 33655 33654 CrystalPalace   1 Arsenal Cazorla 2013
## 33656 33655 CrystalPalace   1 Arsenal    Özil 2013

stri_sub(Test$Player, 1, length=1)
## [1] "C" "Ö"
stri_sub(Test$Player, 2, length=1)
## [1] "a" "z"
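For contrast, one plausible explanation (an assumption, not something confirmed in the question) for an accented letter showing up as two characters is that its two UTF-8 bytes get interpreted in a single-byte encoding. A small base R sketch of that effect, with illustrative variable names:

```r
# "Ö" takes two bytes in UTF-8: c3 96.
utf8_bytes <- charToRaw(enc2utf8("\u00d6zil"))
utf8_bytes                                 # c3 96 7a 69 6c

# Pretend those bytes are latin1: every byte now counts as one character.
misread <- rawToChar(utf8_bytes)
Encoding(misread) <- "latin1"

# Convert to UTF-8 so that counting does not depend on the locale.
as_utf8 <- enc2utf8(misread)
nchar(as_utf8, type = "chars")             # 5 instead of 4
first_char <- substr(as_utf8, 1, 1)        # "Ã", the first of the two bytes

# The correctly encoded string has four characters.
nchar("\u00d6zil", type = "chars")         # 4
```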

As for comparing strings, here is the result of a test for string equality with accented characters "flattened":

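# In current stringi releases the collator-based equivalence test is stri_cmp_equiv;
# the release used at the time accepted collator options in stri_cmp_eq.
stri_cmp_equiv("Özil", "Ozil", opts_collator=stri_opts_collator(strength=1))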
## [1] TRUE

You may also get rid of accented characters with iconv's transliterator (I am not sure whether it is available on Windows, though):

iconv(Test$Player, 'latin1', 'ASCII//TRANSLIT')
## [1] "Cazorla" "Ozil"

Or with a very powerful transliterator from the stringi package (stringi version >= 0.2-2):

stri_trans_general(Test$Player, 'Latin-ASCII')
## [1] "Cazorla" "Ozil"
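If neither transliterator is available, a base R fallback is an explicit character mapping with chartr. This is only a sketch: the mapping below is a hypothetical, deliberately tiny table, and the vector names are made up for illustration.

```r
# Hypothetical mapping; extend both strings in parallel, one character each.
accented <- "\u00d6\u00f6\u00c9\u00e9\u00dc\u00fc"   # Ö ö É é Ü ü
plain    <- "OoEeUu"

players <- c("Cazorla", "\u00d6zil")
flattened <- chartr(accented, plain, players)
flattened                                  # "Cazorla" "Ozil"

# The flattened names can then be matched with plain ASCII input.
players[flattened == "Ozil"]
```

Since chartr maps one character to exactly one character, it cannot expand letters such as "ß" into "ss"; the transliterators above handle such cases more generally.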
gagolews answered Sep 19 '22 00:09