
How to remove special characters and spaces from a character variable in R

Tags: regex, r

I have a small problem in R with a character variable. The variable in my data frame looks like this:

X1
ANGLO AUTOMOTRIZ S.A. MATRIZ
AUTOMOTORES Y ANEXOS / AYASA
ECUA - AUTO S.A. MATRIZ
METROCAR S.A. 10 DE AGOSTO
MOSUMI LA "Y"

I want a new variable with the characters . / - " removed and all spaces dropped, so that each string collapses into a single word, like this:

X2
ANGLOAUTOMOTRIZSAMATRIZ
AUTOMOTORESYANEXOSAYASA
ECUAAUTOSAMATRIZ
METROCARSA10DEAGOSTO
MOSUMILAY

Is it possible to do this in R? Thanks.

Duck asked Sep 06 '13 14:09


2 Answers

Try gsub...

gsub( "\\.|/|\\-|\"|\\s" , "" , df$X1 )
#[1] "ANGLOAUTOMOTRIZSAMATRIZ" "AUTOMOTORESYANEXOSAYASA" "ECUAAUTOSAMATRIZ"       
#[4] "METROCARSA10DEAGOSTO"    "MOSUMILAY"  
  • \\. - match a literal .
  • | - OR separator
  • / - match a / (no escaping needed)
  • \\- - match a literal -
  • \" - match a literal "
  • \\s - match a whitespace
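
The alternation can also be written as a single bracket expression, which avoids most of the escaping. This is a minimal sketch, not part of the original answer; the vector `x` is simply the sample data from the question, and the names `alt` and `cls` are only for illustration.

```r
# Sample data copied from the question.
x <- c("ANGLO AUTOMOTRIZ S.A. MATRIZ", "AUTOMOTORES Y ANEXOS / AYASA",
       "ECUA - AUTO S.A. MATRIZ", "METROCAR S.A. 10 DE AGOSTO",
       "MOSUMI LA \"Y\"")

# Alternation pattern as used in the answer.
alt <- gsub("\\.|/|\\-|\"|\\s", "", x)

# The same set of characters as one bracket expression.
# Inside [...], "." and "/" are literal, and "-" is literal when placed last.
cls <- gsub("[./\"[:space:]-]", "", x)

# Both patterns should give the same result on this data.
identical(alt, cls)
```

The bracket form is easier to extend: adding another character to strip means adding it inside the brackets rather than adding another `|` branch.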

gsub replaces every match in each string (sub would replace only the first), and it is vectorised, so you can pass the whole column at once. The second argument is the replacement, which here is the empty string "", so every matched character is deleted.
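
To see the difference between replacing the first match and replacing every match, and to store the result as the new column X2 that the question asks for, here is a small sketch. The data frame `df` is rebuilt from the question's sample data, which is an assumption about how the data is actually stored.

```r
# Rebuild the question's data frame (assumed layout: one character column X1).
df <- data.frame(
  X1 = c("ANGLO AUTOMOTRIZ S.A. MATRIZ", "AUTOMOTORES Y ANEXOS / AYASA",
         "ECUA - AUTO S.A. MATRIZ", "METROCAR S.A. 10 DE AGOSTO",
         "MOSUMI LA \"Y\""),
  stringsAsFactors = FALSE
)

pattern <- "\\.|/|\\-|\"|\\s"

# sub() removes only the first match (here, the first space).
first_only <- sub(pattern, "", df$X1[1])

# gsub() removes every match in the string.
all_matches <- gsub(pattern, "", df$X1[1])

# gsub() is vectorised, so the whole column can be cleaned in one call.
df$X2 <- gsub(pattern, "", df$X1)
df
```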

Simon O'Hanlon answered Sep 18 '22 11:09


Since you are also dealing with accented characters, I can think of two options:

  1. Get rid of the accented characters entirely.
  2. Use iconv to attempt to "transliterate" the accented characters to ASCII characters.

Here are both. For both examples, I'm using the following sample text:

Z <- c("ANGLO AUTOMOTRIZ S.A. MATRIZ", "AUTOMOTORES Y ANEXOS / AYASA",
"ECUA - AUTO S.A. MATRIZ", "METROCAR S.A. 10 DE AGOSTO", "MOSUMI LA \"Y\"",
"distribuir contenidos", "proponer autoevaluaciones", "como buzón de actividades")

Option 1: Note that the accented "ó" is dropped in the last item.

gsub("[^[:ascii:]]|[[:punct:]]|[[:space:]]", "", Z, perl=TRUE)
# [1] "ANGLOAUTOMOTRIZSAMATRIZ"  "AUTOMOTORESYANEXOSAYASA"  "ECUAAUTOSAMATRIZ"        
# [4] "METROCARSA10DEAGOSTO"     "MOSUMILAY"                "distribuircontenidos"    
# [7] "proponerautoevaluaciones" "comobuzndeactividades"   

Option 2: Note that the "ó" has been converted to "o"

gsub("[[:punct:]]|[[:space:]]", "", iconv(Z, to = "ASCII//TRANSLIT"))
# [1] "ANGLOAUTOMOTRIZSAMATRIZ"  "AUTOMOTORESYANEXOSAYASA"  "ECUAAUTOSAMATRIZ"        
# [4] "METROCARSA10DEAGOSTO"     "MOSUMILAY"                "distribuircontenidos"    
# [7] "proponerautoevaluaciones" "comobuzondeactividades"  

Notes:

  • For convenience, I've decided to just use the character classes [[:punct:]] and [[:space:]].
  • For the first option, you need perl = TRUE to recognize the [[:ascii:]] character class.
  • The ^ in option 1 means "not", so you can read the pattern as "find anything that is not an ASCII character, or that is a space, or that is a punctuation mark, and replace it with nothing".
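
The two options can be wrapped in small helper functions if the cleaning has to be applied in several places. This is only a sketch: the names `keep_alnum` and `clean_translit` are hypothetical, and `keep_alnum` uses a single negated class (keep ASCII letters and digits, drop everything else), which is a slightly stricter stand-in for option 1 rather than the exact pattern shown there. What `ASCII//TRANSLIT` produces depends on the platform's iconv implementation, so no output is shown for that call.

```r
# Hypothetical helper: drop everything that is not an ASCII letter or digit.
# perl = TRUE makes the A-Z, a-z and 0-9 ranges independent of the locale.
keep_alnum <- function(x) gsub("[^A-Za-z0-9]", "", x, perl = TRUE)

# Hypothetical helper wrapping option 2: transliterate first, then strip
# punctuation and whitespace. The transliteration result is platform dependent.
clean_translit <- function(x) {
  gsub("[[:punct:][:space:]]", "", iconv(x, to = "ASCII//TRANSLIT"))
}

# "\u00f3" is the accented o, written as an escape so the script is
# independent of the encoding of the source file.
z <- c("ECUA - AUTO S.A. MATRIZ", "MOSUMI LA \"Y\"",
       "como buz\u00f3n de actividades")

stripped <- keep_alnum(z)            # accented character is dropped
transliterated <- clean_translit(z)  # accented character is mapped if iconv supports it
stripped
transliterated
```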

A5C1D2H2I1M1N2O1R2T1 answered Sep 22 '22 11:09