I have a small problem in R with a character variable. The variable in my data frame looks like this:
X1
ANGLO AUTOMOTRIZ S.A. MATRIZ
AUTOMOTORES Y ANEXOS / AYASA
ECUA - AUTO S.A. MATRIZ
METROCAR S.A. 10 DE AGOSTO
MOSUMI LA "Y"
I want a new variable without the characters . / - " and with the remaining characters joined into a single string without spaces, like this:
X2
ANGLOAUTOMOTRIZSAMATRIZ
AUTOMOTORESYANEXOSAYASA
ECUAAUTOSAMATRIZ
METROCARSA10DEAGOSTO
MOSUMILAY
Is it possible to do this in R? Thanks.
Try gsub
...
gsub( "\\.|/|\\-|\"|\\s" , "" , df$X1 )
#[1] "ANGLOAUTOMOTRIZSAMATRIZ" "AUTOMOTORESYANEXOSAYASA" "ECUAAUTOSAMATRIZ"
#[4] "METROCARSA10DEAGOSTO" "MOSUMILAY"
\\. - match a literal .
| - OR separator
/ - match a / (no escaping needed)
\\- - match a literal -
\" - match a literal "
\\s - match a whitespace character
gsub replaces every match in each string, not just the first, and it is vectorised, so you can pass the whole column at once. The second argument is the replacement value, which in this case is "", so every matched character is replaced with nothing.
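To get the new variable the question asks for, the result can be assigned to a column. The following is a minimal sketch, assuming the data frame is called df as in the call above. The bracket-expression pattern and the name X2_alt are my own illustration and not part of the original answer.

```r
# Sample data frame mirroring the question (names df and X1 follow the thread).
df <- data.frame(
  X1 = c("ANGLO AUTOMOTRIZ S.A. MATRIZ",
         "AUTOMOTORES Y ANEXOS / AYASA",
         "ECUA - AUTO S.A. MATRIZ",
         "METROCAR S.A. 10 DE AGOSTO",
         "MOSUMI LA \"Y\""),
  stringsAsFactors = FALSE
)

# Same pattern as in the answer, stored in a new column X2.
df$X2 <- gsub("\\.|/|\\-|\"|\\s", "", df$X1)

# An alternative spelling using one bracket expression (an assumption of this
# sketch): the hyphen is placed last so it is taken literally.
X2_alt <- gsub("[./\"[:space:]-]", "", df$X1)
identical(df$X2, X2_alt)
# [1] TRUE

# sub() replaces only the first match; gsub() replaces all of them.
sub("\\s", "", "A B C")
# [1] "AB C"
gsub("\\s", "", "A B C")
# [1] "ABC"
```

The alternation form and the bracket form should give the same result on this data; which one reads better is a matter of taste.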
Since you are also dealing with accented characters, I can think of two options:
1. Remove the non-ASCII characters entirely.
2. Use iconv to attempt to "transliterate" the accented characters to ASCII characters.
Here are both. For both examples, I'm using the following sample text:
Z <- c("ANGLO AUTOMOTRIZ S.A. MATRIZ", "AUTOMOTORES Y ANEXOS / AYASA",
"ECUA - AUTO S.A. MATRIZ", "METROCAR S.A. 10 DE AGOSTO", "MOSUMI LA \"Y\"",
"distribuir contenidos", "proponer autoevaluaciones", "como buzón de actividades")
Option 1: Note that the accented "ó" is dropped in the last item.
gsub("[^[:ascii:]]|[[:punct:]]|[[:space:]]", "", Z, perl=TRUE)
# [1] "ANGLOAUTOMOTRIZSAMATRIZ" "AUTOMOTORESYANEXOSAYASA" "ECUAAUTOSAMATRIZ"
# [4] "METROCARSA10DEAGOSTO" "MOSUMILAY" "distribuircontenidos"
# [7] "proponerautoevaluaciones" "comobuzndeactividades"
Option 2: Note that the "ó" has been converted to "o"
gsub("[[:punct:]]|[[:space:]]", "", iconv(Z, to = "ASCII//TRANSLIT"))
# [1] "ANGLOAUTOMOTRIZSAMATRIZ" "AUTOMOTORESYANEXOSAYASA" "ECUAAUTOSAMATRIZ"
# [4] "METROCARSA10DEAGOSTO" "MOSUMILAY" "distribuircontenidos"
# [7] "proponerautoevaluaciones" "comobuzondeactividades"
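Before choosing between the two options, it may help to check which elements contain non-ASCII characters at all. This is a small sketch using grepl with the same negated class as option 1. The shortened vector and the names has_non_ascii and cleaned are hypothetical, introduced only for illustration.

```r
# Shortened sample vector; "\u00f3" is the accented "ó", written as an escape
# so the example does not depend on the encoding of the source file.
Z <- c("METROCAR S.A. 10 DE AGOSTO", "como buz\u00f3n de actividades")

# TRUE where a string contains at least one non-ASCII character.
has_non_ascii <- grepl("[^[:ascii:]]", Z, perl = TRUE)
has_non_ascii
# [1] FALSE  TRUE

# Option 1 applied to these strings: the accented character is dropped.
cleaned <- gsub("[^[:ascii:]]|[[:punct:]]|[[:space:]]", "", Z, perl = TRUE)
cleaned
# [1] "METROCARSA10DEAGOSTO"  "comobuzndeactividades"
```

As far as I know, the output of iconv with "ASCII//TRANSLIT" can vary between platforms, so a check like this can show which rows are worth inspecting after the conversion.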
Notes:
Both options use the POSIX character classes [[:punct:]] and [[:space:]].
perl = TRUE is needed to recognize the [[:ascii:]] character class.
^ in option 1 means "not" (so you can read it as "find anything that is not an ASCII character, that is a space, or that is a punctuation mark, and replace it with nothing").