Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

efficient algorithm for finding co occurrence matrix of phrases

I have a list L of around 40,000 phrases and a document of around 10 million words. what I want to check is which pair of these phrases co occur within a window of 4 words. For example, consider L=["brown fox","lazy dog"]. The document contains the words "a quick brown fox jumps over the lazy dog". I want to see, how many times brown fox and lazy dog appears within an window of four words and store that in a file. I have following code for doing this:

content=open("d.txt","r").read().replace("\n"," ");
for i in range(len(L)):
 for j in range(i+1,len(L)):
  wr=L[i]+"\W+(?:\w+\W+){1,4}"+L[j]
  wrev=L[j]+"\W+(?:\w+\W+){1,4}"+L[i]
  phrasecoccur=len(re.findall(wr, content))+len(re.findall(wrev,content))
  if (phrasecoccur>0):
    f.write(L[i]+", "+L[j]+", "+str(phrasecoccur)+"\n")

Essentially, for each pair of phrases in the list L, I am checking in the document content that how many times these phrases appear within an window of 4 words. However, this method is computationally inefficient when the list L is pretty large, like 40K elements. Is there a better way of doing this?

like image 659
rivu Avatar asked Dec 07 '25 06:12

rivu


1 Answers

You could use something similar to the Aho-Corasick string matching algorithm. Build the state machine from your list of phrases. Then start feeding words into the state machine. Whenever a match occurs, the state machine will tell you which phrase matched and at what word number. So your output would be something like:

"brown fox", 3
"lazy dog", 8
etc.

You can either capture all of the output and post-process it, or you can process the matches as they're found.

It takes a little time to build the state machine (a few seconds for 40,000 phrases), but after that it's linear in the number of input tokens, number of phrases, and number of matches.

I used something similar to match 50 million YouTube video titles against the several million song titles and artist names in the MusicBrainz database. Worked great. And very fast.

like image 54
Jim Mischel Avatar answered Dec 08 '25 18:12

Jim Mischel



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!