Detecting two consecutive "Proper Case" words in a string using R

Tags: r, text-mining

I've been scratching my head about this one for a while now. I'm doing some text mining in R and want to classify names, places and organisations that are made up of multiple words. For the purposes of this task I'm only looking at consecutive words in the string that begin with capital letters.

Example String:

origString <- 'The current president of the United States is Donald Trump'

Is there a way of finding words starting with a capital letter within this string and grouping them together to return something like this?

newString <- 'The current president of the UnitedStates is DonaldTrump'

Any help you can give would be greatly appreciated.

Reuben Kandiah asked Mar 17 '26 00:03


1 Answer

The following solution works when the capitalized words come in groups of two:

origString <- 'The current president of the United States is Donald Trump'
gsub('([A-Z]\\w*?)\\s+([A-Z]\\w*)', '\\1\\2', origString)

Output:

[1] "The current president of the UnitedStates is DonaldTrump"
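To see why this pattern is limited to pairs, here is a minimal sketch: each match consumes two capitalized words, so the third word in a run has no partner left to merge with. The input sentence and variable names are made up for illustration, and `perl = TRUE` is added here only to pin down the regex engine; it is not part of the call above.

```r
# The pairwise pattern from the answer above.
pair_pattern <- "([A-Z]\\w*?)\\s+([A-Z]\\w*)"

# Illustrative input (an assumption): a run of three capitalized words.
three_words <- "She moved to New York City last year"

# perl = TRUE is an added assumption to fix the engine to PCRE.
merged <- gsub(pair_pattern, "\\1\\2", three_words, perl = TRUE)
print(merged)
# [1] "She moved to NewYork City last year"
```

"New York" is merged, but "City" is left on its own because the match that consumed "York" has already moved past it.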

Update:

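The following script should work for any number of clustered capitalized words. It requires a workaround because the regex flavor that gsub() uses, even in Perl mode, does not support variable-length lookbehinds. The strategy is instead to append a temporary $MARK$ token to every capitalized word, remove the whitespace wherever that token is followed by whitespace and another capital letter (a fixed-length lookbehind suffices for this), and then strip the tokens. This removes the whitespace between all capitalized words that appear in groups of two or more.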

origString <- 'The current president of the United States Donald Trump'
temp <- gsub('([A-Z]\\w*)', '\\1\\$MARK\\$', origString)
output <- gsub('(?<=\\$MARK\\$)\\s+(?=[A-Z])', '', temp, perl=TRUE)
output <- gsub('\\$MARK\\$', '', output)
output

[1] "The current president of the UnitedStatesDonaldTrump"
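The three steps above could be wrapped in a small helper so they can be applied to a whole character vector. This is only a sketch: the function name `merge_capitalized` is hypothetical, `perl = TRUE` is used on the first step as well (an assumption, to keep `[A-Z]` independent of the locale), and it assumes the input never contains the literal text `$MARK$`.

```r
# Hypothetical helper wrapping the marker strategy; not part of base R.
merge_capitalized <- function(x) {
  # Step 1: append a temporary token to every word starting with a capital.
  tagged <- gsub("([A-Z]\\w*)", "\\1\\$MARK\\$", x, perl = TRUE)
  # Step 2: drop whitespace that sits between a token and a capital letter.
  # The lookbehind is fixed-length (six characters), which PCRE accepts.
  joined <- gsub("(?<=\\$MARK\\$)\\s+(?=[A-Z])", "", tagged, perl = TRUE)
  # Step 3: remove the tokens; fixed = TRUE treats "$MARK$" as plain text.
  gsub("$MARK$", "", joined, fixed = TRUE)
}

inputs <- c(
  "The current president of the United States is Donald Trump",
  "The current president of the United States Donald Trump"
)
result <- merge_capitalized(inputs)
print(result)
# [1] "The current president of the UnitedStates is DonaldTrump"
# [2] "The current president of the UnitedStatesDonaldTrump"
```

Because gsub() is vectorized, the helper handles each element independently, so it can be applied directly to a column of documents.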

Tim Biegeleisen answered Mar 19 '26 14:03