
Extract first letter in each word in R

Tags:

regex

r

I had a data.frame with some categorical variables. Let's suppose sentences is one of these variables:

sentences <- c("Direito à participação e ao controle social",
               "Direito a ser ouvido pelo governo e representantes", 
               "Direito aos serviços públicos",
               "Direito de acesso à informação")

For each value, I would like to extract just the first letter of each word, ignoring words with 4 letters or fewer (e, de, à, a, aos, ser, pelo). My goal is to create acronym variables. I expect the following result:

[1] "DPCS", "DOGR", "DSP", "DAI"

I tried to subset with stringr, using a regex pattern found here:

library(stringr)
pattern <- "^(\b[A-Z]\w*\s*)+$"
str_subset(str_to_upper(sentences), pattern)

But I got an error when creating the pattern object:

Error: '\w'  is an escape sequence not recognized in the string beginning with ""^(\b[A-Z]\w"

What am I doing wrong?

Thanks in advance for any help.

Bruno Pinheiro asked Dec 03 '22 11:12


2 Answers

You can use gsub to delete all the unwanted characters and keep only the ones you want. If words of 3 or more letters should still contribute a letter, the result is:

gsub('\\b(\\pL)\\pL{2,}|.', '\\U\\1', sentences, perl = TRUE)
[1] "DPCS"   "DSOPGR" "DASP"   "DAI"  

But if we ignore the words you indicated (4 letters or fewer), it would be:

gsub('\\b(\\pL)\\pL{4,}|.', '\\U\\1', sentences, perl = TRUE)
[1] "DPCS" "DOGR" "DSP"  "DAI"  
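On the error in the question, here is a hedged side note. In an R string literal the backslash is itself an escape character, so every regex backslash has to be doubled; "\w" is not a valid string escape, which is what the parser complains about. The sketch below shows the doubled form and, assuming R 4.0.0 or later, the equivalent raw-string form. Fixing the escapes only makes the pattern parse: str_subset would still just filter whole strings rather than build acronyms, which is why gsub is used above.

```r
# In an R string literal every regex backslash must be written twice.
# pattern <- "^(\b[A-Z]\w*\s*)+$"    # does not parse: "\w" is not a string escape
#                                    # (and "\b" would silently become a backspace)
pattern <- "^(\\b[A-Z]\\w*\\s*)+$"   # parses; the regex engine sees ^(\b[A-Z]\w*\s*)+$

# Assumption: R >= 4.0.0, where raw strings take backslashes literally.
pattern_raw <- r"(^(\b[A-Z]\w*\s*)+$)"

same_pattern <- identical(pattern, pattern_raw)                     # TRUE
matches_caps <- grepl(pattern, "DIREITO DE ACESSO", perl = TRUE)    # TRUE
```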
KU99 answered Dec 16 '22 02:12


@Onyambu's answer is great, though as a regular expression beginner, it took me a long time to understand it well enough to modify it for my own needs.

Here is my understanding of gsub('\\b(\\pL)\\pL{4,}|.', '\\U\\1', sentences, perl = TRUE). I am posting it in the hope that it helps others.

Background information:

  • \\b matches a word boundary
  • \\pL matches any kind of letter from any language
  • {4,} is an occurrence indicator (quantifier)
    • {m}: The preceding item is matched exactly m times.
    • {m,}: The preceding item is matched m or more times.
    • {m,n}: The preceding item is matched at least m times, but not more than n times.
  • | is the alternation (OR) operator
  • . represents any one character except newline.
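The pieces listed above can be tried in isolation. This is a small sketch with made-up test words (the variable names are illustrative, not from the original answer); the accented word is written with \u escapes only so that the example does not depend on the file encoding.

```r
words <- c("de", "pelo", "governo")

# Quantifiers applied to \pL (anchored so the whole word must match)
exactly_two  <- grepl("^\\pL{2}$",   words, perl = TRUE)  # only "de"
four_or_more <- grepl("^\\pL{4,}$",  words, perl = TRUE)  # "pelo" and "governo"
two_to_four  <- grepl("^\\pL{2,4}$", words, perl = TRUE)  # "de" and "pelo"

# \pL also covers accented letters, unlike the ASCII class [A-Za-z]
accented      <- "a\u00e7\u00e3o"                              # "ação"
unicode_class <- grepl("^\\pL+$", accented, perl = TRUE)       # TRUE
ascii_class   <- grepl("^[A-Za-z]+$", accented, perl = TRUE)   # FALSE

# Alternation: at each position the left branch is tried first, then "."
txt    <- "a ser b"
pieces <- regmatches(txt, gregexpr("ser|.", txt, perl = TRUE))[[1]]
# pieces is "a" " " "ser" " " "b"
```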

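\\U\\1 in the replacement text reinserts the text captured by the pattern and converts it to upper case: \\1 refers to the first capturing group, and \\U (available only with perl = TRUE) upper-cases what follows it. Note that parentheses () create a numbered capturing group in the pattern.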
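To see the replacement text at work on its own, here is a minimal sketch on made-up inputs. The last line follows the pattern of the example in R's own gsub help page, where \\L switches the rest of the replacement to lower case.

```r
# \\1 alone reinserts the captured first letter unchanged
first_plain <- sub("(\\pL)\\pL*", "\\1", "direito", perl = TRUE)      # "d"

# \\U upper-cases everything that follows it in the replacement
first_upper <- sub("(\\pL)\\pL*", "\\U\\1", "direito", perl = TRUE)   # "D"

# \\U and \\L combined: upper-case group 1, lower-case group 2
title_case <- gsub("(\\w)(\\w*)", "\\U\\1\\L\\2",
                   "direito de acesso", perl = TRUE)                  # "Direito De Acesso"
```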

With all the background knowledge, the interpretation of the command is

  • replace each word matching \\b(\\pL)\\pL{4,} (5 or more letters) with its first letter, in upper case
  • replace any other character, matched by ., with "" because nothing is captured by group 1 in that branch
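The two steps above can be made visible by listing what the pattern matches before anything is replaced. This is a sketch using one of the sentences from the question; make_acronym and its min_letters argument are hypothetical names introduced here for illustration, not part of the original answer.

```r
x <- "Direito a ser ouvido pelo governo e representantes"
pattern <- "\\b(\\pL)\\pL{4,}|."

# Every match is either a whole long word (left branch) or one character (".")
m <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
long_words <- m[nchar(m) > 1]
# long_words is "Direito" "ouvido" "governo" "representantes"

# gsub keeps the first letter of the long words and drops everything else
acronym <- gsub(pattern, "\\U\\1", x, perl = TRUE)   # "DOGR"

# Hypothetical helper: min_letters is the shortest word length that counts
make_acronym <- function(x, min_letters = 5) {
  # one letter is captured, so the rest of the word needs min_letters - 1 more
  p <- sprintf("\\b(\\pL)\\pL{%d,}|.", min_letters - 1)
  gsub(p, "\\U\\1", x, perl = TRUE)
}

five_plus  <- make_acronym(x)      # "DOGR"
three_plus <- make_acronym(x, 3)   # "DSOPGR"
```

Changing min_letters is the only modification needed to move between the two variants shown in the first answer.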

Here are two great resources where I learned this background:

  • https://www.regular-expressions.info/rlanguage.html
  • https://www3.ntu.edu.sg/home/ehchua/programming/howto/Regexe.html
jxshen answered Dec 16 '22 01:12