Extract first letter in each word in R

Question

I have a data.frame with some categorical variables. Suppose sentences is one of these variables:

sentences <- c("Direito à participação e ao controle social",
               "Direito a ser ouvido pelo governo e representantes", 
               "Direito aos serviços públicos",
               "Direito de acesso à informação")

For each value, I would like to extract just the first letter of each word, ignoring words that have 4 letters or fewer (e, de, à, a, aos, ser, pelo). My goal is to create acronym variables. I expect the following result:

[1] "DPCS", "DOGR", "DSP", "DAI"

I tried to subset by pattern using stringr, with a regex pattern I found here:

library(stringr)
pattern <- "^(\b[A-Z]\w*\s*)+$"
str_subset(str_to_upper(sentences), pattern)

But I got an error when creating the pattern object:

Error: '\w'  is an escape sequence not recognized in the string beginning with ""^(\b[A-Z]\w"

What am I doing wrong?

Thanks in advance for any help.

KU99 · Accepted Answer

You can use gsub to delete all the unwanted characters and keep only the ones you want. From the expected output, it seems you are still using letters from words that are 3 characters long:

gsub('\\b(\\pL)\\pL{2,}|.', '\\U\\1', sentences, perl = TRUE)
[1] "DPCS"   "DSOPGR" "DASP"   "DAI"

But if we were to ignore the words you indicated then it would be:

gsub('\\b(\\pL)\\pL{4,}|.', '\\U\\1', sentences, perl = TRUE)
[1] "DPCS" "DOGR" "DSP"  "DAI"
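The error in the question comes from R's string parser rather than from the regex engine: in an ordinary R string literal every regex backslash has to be doubled, so \b is written "\\b". As a minimal sketch, the second call can be wrapped in a small function; make_acronym and its min_extra argument are hypothetical names introduced here for illustration, not part of any package.

```r
sentences <- c("Direito à participação e ao controle social",
               "Direito a ser ouvido pelo governo e representantes",
               "Direito aos serviços públicos",
               "Direito de acesso à informação")

# Hypothetical helper: keep the upper-cased first letter of every word that has
# at least `min_extra` letters after its first one, and drop everything else.
make_acronym <- function(x, min_extra = 4) {
  # Backslashes are doubled because this is a regular R string literal.
  pattern <- sprintf("\\b(\\pL)\\pL{%d,}|.", min_extra)
  gsub(pattern, "\\U\\1", x, perl = TRUE)
}

make_acronym(sentences)
# [1] "DPCS" "DOGR" "DSP"  "DAI"
```

Calling make_acronym(sentences, min_extra = 2) reproduces the first variant, which also keeps three- and four-letter words. In R 4.0.0 or later a raw string such as r"(\b(\pL)\pL{4,}|.)" avoids the doubling altogether.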

jxshen · Answer

@Onyambu's answer is great, but as a regular-expression beginner it took me a long time to understand it well enough to adapt it to my own needs.

Here is my understanding of gsub('\\b(\\pL)\\pL{4,}|.', '\\U\\1', sentences, perl = TRUE), posted in the hope that it helps others.

Background information:

\b matches a word boundary.
\pL matches any kind of letter from any language.
{4,} is a quantifier (occurrence indicator):
- {m}: The preceding item is matched exactly m times.
- {m,}: The preceding item is matched m or more times, i.e., m+
- {m,n}: The preceding item is matched at least m times, but not more than n times.
| is the alternation (OR) operator.
. matches any single character except a newline.
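To make the quantifiers concrete, here is a small sketch that tests a few words against whole-word versions of the pattern. The word list and the variable names are chosen only for illustration; the ^ and $ anchors are added so that each test applies to the entire word.

```r
words <- c("a", "aos", "pelo", "governo")

# {4,}: the first letter plus at least four more, so five letters or longer
five_plus <- grepl("^\\pL\\pL{4,}$", words, perl = TRUE)

# {2,}: the first letter plus at least two more, so three letters or longer
three_plus <- grepl("^\\pL\\pL{2,}$", words, perl = TRUE)

# {2,3}: the first letter plus two or three more, so exactly three or four letters
three_or_four <- grepl("^\\pL\\pL{2,3}$", words, perl = TRUE)

data.frame(words, five_plus, three_plus, three_or_four)
```

Only "governo" passes the {4,} test, which is why the four-letter word "pelo" is left out of the acronym.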

\U\1 in the replacement text reinserts the text captured by group 1 and converts it to upper case. Note that parentheses () in the pattern create a numbered capturing group.
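The effect of \U can be seen in isolation with a simpler pattern. This is only an illustrative sketch on a made-up input; note that \U is available in gsub only when perl = TRUE.

```r
# Group 1 captures the first letter; \\pL* consumes the rest of the word.
# \\U upper-cases everything that follows it in the replacement, here just \\1.
upper_initials <- gsub("(\\pL)\\pL*", "\\U\\1", "controle social", perl = TRUE)
upper_initials
# [1] "C S"

# Without \\U the captured letter is reinserted unchanged.
plain_initials <- gsub("(\\pL)\\pL*", "\\1", "controle social", perl = TRUE)
plain_initials
# [1] "c s"
```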

With all the background knowledge, the interpretation of the command is

replace each word matching \b(\pL)\pL{4,} (that is, a word of five or more letters) with its capitalized first letter
replace any character not matching the above pattern with "", because group 1 captures nothing when the . alternative matches
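One way to check this reading is to list the pieces the pattern matches before anything is replaced. The sketch below uses a short ASCII example; the variable names are illustrative.

```r
x <- "Direito de acesso"
pattern <- "\\b(\\pL)\\pL{4,}|."

# Every successive match of the pattern, in order
pieces <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
pieces
# [1] "Direito" " "       "d"       "e"       " "       "acesso"

# Long words become their capitalized first letter; single characters vanish.
acronym <- gsub(pattern, "\\U\\1", x, perl = TRUE)
acronym
# [1] "DA"
```

The long words are matched whole by the first alternative, while the short word "de" and the spaces are consumed one character at a time by the . alternative and replaced with nothing.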

Here are two great places where I learned this background:

https://www.regular-expressions.info/rlanguage.html
https://www3.ntu.edu.sg/home/ehchua/programming/howto/Regexe.html

Tags: regex, r

Bruno Pinheiro
