How to find common phrases in a large body of text

I'm working on a project at the moment where I need to pick out the most common phrases in a huge body of text. For example, say we have three sentences like the following:

The dog jumped over the woman.
The dog jumped into the car.
The dog jumped up the stairs.

From the above example I would want to extract "the dog jumped", as it is the most common phrase in the text. At first I thought, "oh, let's use a directed graph [with repeated nodes]":

[Image: directed graph, http://img.skitch.com/20091218-81ii2femnfgfipd9jtdg32m74f.png]

EDIT: Apologies, I made a mistake while making this diagram: "over", "into" and "up" should all link back to "the".

I was going to maintain a count of how many times a word occurred in each node object ("the" would be 6; "dog" and "jumped", 3; etc.). Among many other problems, the main one came up when we add a few more examples like these (please ignore the bad grammar :-)):

Dog jumped up and down.
Dog jumped like no dog had ever jumped before.
Dog jumped happily.

We now have a problem: "dog" would start a new root node (at the same level as "the"), so we would not identify "dog jumped" as now being the most common phrase. So now I am thinking I could use an undirected graph to map the relationships between all the words and eventually pick out the common phrases, but I'm not sure how this would work either, as you lose the important relationship of order between the words.
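
One way to keep word order without building a graph is to count word n-grams (contiguous runs of words) directly. The following is only a minimal sketch in Python, not something from the original post: the helper names `tokenize` and `count_ngrams`, the lowercasing, and the 2-to-4-word phrase range are all assumptions chosen for illustration.

```python
from collections import Counter
import re

def tokenize(sentence):
    """Lowercase a sentence and split it into words, dropping punctuation."""
    return re.findall(r"[a-z']+", sentence.lower())

def count_ngrams(sentences, min_n=2, max_n=4):
    """Count every contiguous run of min_n..max_n words within each sentence."""
    counts = Counter()
    for sentence in sentences:
        words = tokenize(sentence)
        for n in range(min_n, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts

# The six example sentences from the question.
sentences = [
    "The dog jumped over the woman.",
    "The dog jumped into the car.",
    "The dog jumped up the stairs.",
    "Dog jumped up and down.",
    "Dog jumped like no dog had ever jumped before.",
    "Dog jumped happily.",
]

counts = count_ngrams(sentences)
phrase, count = counts.most_common(1)[0]
print(" ".join(phrase), count)  # prints: dog jumped 6
```

Because every starting position is counted, "dog jumped" is found whether or not the sentence begins with "the". For a genuinely huge corpus the counter itself can grow large, so pruning rare n-grams or capping `max_n` would probably be needed.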

So does anyone have any general ideas on how to identify common phrases in a large body of text, and what data structure I should use?

Thanks, Ben

asked Dec 18 '09 15:12

benmcredmond

1 Answer

Check out this related question: "What techniques/tools are there for discovering common phrases in chunks of text?" This is also related to the longest common substring problem.
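
To illustrate the longest common substring connection, here is a minimal sketch in Python that applies the standard dynamic-programming approach at the word level to two sentences. The function name `longest_common_phrase` and the simple whitespace tokenization are assumptions for illustration; it compares only two inputs, so it is not a full solution for a large corpus.

```python
def longest_common_phrase(words_a, words_b):
    """Return the longest run of consecutive words shared by both lists."""
    best_len = 0   # length of the best shared run found so far
    best_end = 0   # index in words_a just past the end of that run
    # prev[j] = length of the shared run ending at words_a[i-2], words_b[j-1]
    prev = [0] * (len(words_b) + 1)
    for i in range(1, len(words_a) + 1):
        curr = [0] * (len(words_b) + 1)
        for j in range(1, len(words_b) + 1):
            if words_a[i - 1] == words_b[j - 1]:
                # Extend the run that ended at the previous word pair.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return words_a[best_end - best_len:best_end]

a = "the dog jumped over the woman".split()
b = "the dog jumped into the car".split()
print(longest_common_phrase(a, b))  # prints: ['the', 'dog', 'jumped']
```

This runs in time proportional to the product of the two sentence lengths, which is fine for sentence pairs but is the reason suffix-based structures are usually suggested for whole documents.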

I've posted this before, but I use R for all of my data-mining tasks and it's well suited to this kind of analysis. In particular, look at the tm package. Here are some relevant links:

Paper about the package in the Journal of Statistical Software: http://www.jstatsoft.org/v25/i05/paper. The paper includes a nice example of an analysis of the R-devel mailing list (https://stat.ethz.ch/pipermail/r-devel/) newsgroup postings from 2006.
Package homepage: http://cran.r-project.org/web/packages/tm/index.html
Look at the introductory vignette: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf

More generally, there are a large number of text mining packages listed in the Natural Language Processing task view on CRAN.

answered Oct 08 '22 05:10

Shane
