
Extract and count common word-pairs from character vector

How can someone find frequent pairs of adjacent words in a character vector? Using the crude data set, for example, some common pairs are "crude oil", "oil market", and "million barrels".

The code for the small example below tries to identify frequent terms and then, using a positive lookahead assertion, count how many times those frequent terms are followed immediately by a frequent term. But the attempt crashed and burned.

Any guidance would be appreciated as to how to create a data frame that shows in the first column ("Pairs") the common pairs and in the second column ("Count") the number of times they appeared in the text.

library(qdap)
library(tm)
library(stringr)

# from the crude data set, create a text file from the first three documents, then clean it

text <- c(crude[[1]][1], crude[[2]][1], crude[[3]][1])
text <- tolower(text)
text <- tm::removeNumbers(text)
text <- str_replace_all(text, "  ", "") # replace double spaces with single space
text <- str_replace_all(text, pattern = "[[:punct:]]", " ")
text <- removeWords(text, stopwords(kind = "SMART"))

# pick the top 10 individual words by frequency, since they will likely form the most common pairs
freq.terms <- head(freq_terms(text.var = text), 10) 

# create a pattern from the top words for the regex expression below
freq.terms.pat <- str_c(freq.terms$WORD, collapse = "|")

# match frequent terms that are followed by a frequent term
pairs <- str_extract_all(string = text, pattern = "freq.terms.pat(?= freq.terms.pat)")

Here is where the effort falters.
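
(Presumably the pattern needs to be built by pasting freq.terms.pat into the regex rather than quoting its name, something like the untested sketch below, but even so I am not sure this is the right approach.)

# hypothetical construction of the lookahead pattern -- not working code, just the intent
pat <- paste0("\\b(", freq.terms.pat, ")(?=\\s(", freq.terms.pat, ")\\b)")
pairs <- str_extract_all(string = text, pattern = pat)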

Not knowing Java or Python, these related questions (Java count word pairs, Python count word pairs) did not help me, but they may be useful references for others.

Thank you.

lawyeR asked Sep 27 '22

2 Answers

First, modify your initial text list from:

text <- c(crude[[1]][1], crude[[2]][2], crude[[3]][3])

to:

text <- c(crude[[1]][1], crude[[2]][1], crude[[3]][1])

Then, you can go on with your text cleaning (note that your method will create ill-formed words like "oilcanadian", but it will suffice for the example at hand):

text <- tolower(text)
text <- tm::removeNumbers(text)
text <- str_replace_all(text, "  ", "") 
text <- str_replace_all(text, pattern = "[[:punct:]]", " ")
text <- removeWords(text, stopwords(kind = "SMART"))

Build a new Corpus:

v <- Corpus(VectorSource(text))

Create a bigram tokenizer function:

BigramTokenizer <- function(x) { 
  unlist(
    lapply(ngrams(words(x), 2), paste, collapse = " "), 
    use.names = FALSE
  ) 
}
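
To see what the tokenizer is doing: ngrams() comes from the NLP package (attached with tm) and splits a word vector into overlapping pairs, which paste() then joins. For example:

ngrams(c("crude", "oil", "prices"), 2)
# returns list(c("crude", "oil"), c("oil", "prices")),
# so the tokenizer produces strings such as "crude oil" and "oil prices"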

Create your TermDocumentMatrix using the control parameter tokenize:

tdm <- TermDocumentMatrix(v, control = list(tokenize = BigramTokenizer))
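
As a side note, if you only want the bigrams occurring at least a given number of times, without per-document counts, tm's findFreqTerms() should also work on this tdm:

# bigrams appearing at least twice across the three documents
findFreqTerms(tdm, lowfreq = 2)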

Now that you have your new tdm, to get your desired output, you could do:

library(dplyr)
data.frame(inspect(tdm)) %>% 
  add_rownames() %>% 
  mutate(total = rowSums(.[,-1])) %>% 
  arrange(desc(total))

Which gives:

#Source: local data frame [272 x 5]
#
#             rowname X1 X2 X3 total
#1          crude oil  2  0  1     3
#2            mln bpd  0  3  0     3
#3         oil prices  0  3  0     3
#4       cut contract  2  0  0     2
#5        demand opec  0  2  0     2
#6        dlrs barrel  2  0  0     2
#7    effective today  1  0  1     2
#8  emergency meeting  0  2  0     2
#9      oil companies  1  1  0     2
#10      oil industry  0  2  0     2
#..               ... .. .. ..   ...
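
Depending on your tm version, inspect() may only print the matrix rather than return it; a rough equivalent of the summary above that avoids this (a sketch, not the answerer's exact code) is:

m <- as.matrix(tdm)                                   # bigram x document count matrix
data.frame(bigram = rownames(m), total = rowSums(m)) %>%
  arrange(desc(total)) %>%
  head()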
Steven Beaupré answered Oct 18 '22

One idea here is to create a new corpus with bigrams:

A bigram or digram is every sequence of two adjacent elements in a string of tokens

A recursive function to extract bigrams:

# recursively pair each word with the next one, joining each pair with "_"
bigram <- function(xs) {
  if (length(xs) >= 2)
    c(paste(xs[seq(2)], collapse = '_'), bigram(tail(xs, -1)))
}
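
A quick check of the helper:

bigram(c("crude", "oil", "prices"))
# [1] "crude_oil"  "oil_prices"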

Then apply it to the crude data from the tm package. (I did some text cleaning here, but this step depends on the text.)

res <- unlist(lapply(crude, function(x){
  x <- tm::removeNumbers(tolower(x))
  x <- gsub('\n|[[:punct:]]', ' ', x)
  x <- gsub('  +', '', x)
  ## after cleaning, compute the bigram frequencies with table()
  freqs <- table(bigram(strsplit(x, " ")[[1]]))
  freqs[freqs > 1]
}))


 as.data.frame(tail(sort(res),5))
                          tail(sort(res), 5)
reut-00022.xml.hold_a                      3
reut-00022.xml.in_the                      3
reut-00011.xml.of_the                      4
reut-00022.xml.a_futures                   4
reut-00010.xml.abdul_aziz                  5

The bigrams "abdul aziz" and "a futures" are the most common. You should reclean the data to remove (of, the,..). But this should be a good start.

Edit after OP comments:

In case you want bigram frequencies over the whole corpus, one idea is to compute the bigrams in the loop and then compute the frequencies on the loop result. I take the opportunity to add better text cleaning.

res <- unlist(lapply(crude, function(x){
  x <- removeNumbers(tolower(x))
  x <- removeWords(x, words = c("the", "of"))
  x <- removePunctuation(x)
  x <- gsub('\n|[[:punct:]]', ' ', x)
  x <- gsub('  +', '', x)
  ## after cleaning, split into words and build the bigrams
  ## (the frequencies are computed outside the loop with table)
  words <- strsplit(x, " ")[[1]]
  bigram(words[nchar(words) > 2])
}))

library(data.table)
xx <- as.data.frame(table(res))
setDT(xx)[order(Freq)]


#                 res Freq
#    1: abdulaziz_bin    1
#    2:  ability_hold    1
#    3:  ability_keep    1
#    4:  ability_sell    1
#    5:    able_hedge    1
# ---                   
# 2177:    last_month    6
# 2178:     crude_oil    7
# 2179:  oil_minister    7
# 2180:     world_oil    7
# 2181:    oil_prices   14
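
If you want the two-column Pairs/Count format asked for in the question, one possible final step (a sketch using the same res) is:

out <- as.data.frame(table(res), stringsAsFactors = FALSE)
names(out) <- c("Pairs", "Count")
out$Pairs <- gsub("_", " ", out$Pairs)
head(out[order(-out$Count), ])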
agstudy answered Oct 18 '22