Algorithms to detect phrases and keywords from text

Tags:

I have around 100 megabytes of text, without any markup, divided to approximately 10,000 entries. I would like to automatically generate a 'tag' list. The problem is that there are word groups (i.e. phrases) that only make sense when they are grouped together.

If I just count the words, I get a large number of really common words (is, the, for, in, am, etc.). I have counted the words and the number of other words that are before and after it, but now I really cannot figure out what to do next The information relating to the 2 and 3 word phrases is present, but how do I extract this data?

521

asked Oct 29 '09 13:10

Kimvais

2 Answers

Before anything, try to preserve the info about "boundaries" which comes in the input text.
(if such info has not readily be lost, your question implies that maybe the tokenization has readily been done)
During the tokenization (word parsing, in this case) process, look for patterns that may define expression boundaries (such as punctuation, particularly periods, and also multiple LF/CR separation, use these. Also words like "the", can often be used as boundaries. Such expression boundaries are typically "negative", in a sense that they separate two token instances which are sure to not be included in the same expression. A few positive boundaries are quotes, particularly double quotes. This type of info may be useful to filter-out some of the n-grams (see next paragraph). Also word sequencces such as "for example" or "in lieu of" or "need to" can be used as expression boundaries as well (but using such info is edging on using "priors" which I discuss later).

Without using external data (other than the input text), you can have a relative success with this by running statistics on the text's digrams and trigrams (sequence of 2 and 3 consecutive words). Then [most] the sequences with a significant (*) number of instances will likely be the type of "expression/phrases" you are looking for.
This somewhat crude method will yield a few false positive, but on the whole may be workable. Having filtered the n-grams known to cross "boundaries" as hinted in the first paragraph, may help significantly because in natural languages sentence ending and sentence starts tend to draw from a limited subset of the message space and hence produce combinations of token that may appear to be statistically well represented, but which are typically not semantically related.

Better methods (possibly more expensive, processing-wise, and design/investment-wise), will make the use of extra "priors" relevant to the domain and/or national languages of the input text.

POS (Part-Of-Speech) tagging is quite useful, in several ways (provides additional, more objective expression boundaries, and also "noise" words classes, for example all articles, even when used in the context of entities are typically of little in tag clouds such that the OP wants to produce.
Dictionaries, lexicons and the like can be quite useful too. In particular, these which identify "entities" (aka instances in WordNet lingo) and their alternative forms. Entities are very important for tag clouds (though they are not the only class of words found in them), and by identifying them, it is also possible to normalize them (the many different expressions which can be used for say,"Senator T. Kennedy"), hence eliminate duplicates, but also increase the frequency of the underlying entities.
if the corpus is structured as a document collection, it may be useful to use various tricks related to the TF (Term Frequency) and IDF (Inverse Document Frequency)

[Sorry, gotta go, for now (plus would like more detail from your specific goals etc.). I'll try and provide more detail and pointes later]

[BTW, I want to plug here Jonathan Feinberg and Dervin Thunk responses from this post, as they provide excellent pointers, in terms of methods and tools for the kind of task at hand. In particular, NTLK and Python-at-large provide an excellent framework for experimenting]

142

answered Sep 18 '22 23:09

mjv

I'd start with a wonderful chapter, by Peter Norvig, in the O'Reilly book Beautiful Data. He provides the ngram data you'll need, along with beautiful Python code (which may solve your problems as-is, or with some modification) on his personal web site.

answered Sep 22 '22 23:09

Jonathan Feinberg

Related questions
                            
                                how to get started with TopCoder to update/develop algorithm skills?
                            
                                Algorithm to cover maximal number of points with one circle of given radius
                            
                                what is the fastest way to find the gcd of n numbers?
                            
                                how to calculate exact foot step count using accelerometer in android?
                            
                                What is the fastest way to find the closest point to a given point?
                            
                                Tennis match scheduling
                            
                                FSharp runs my algorithm slower than Python
                            
                                What is the meaning of "from distinct vertex chains" in this nearest neighbor algorithm?
                            
                                Google Interview: Arrangement of Blocks
                            
                                Design patterns for converting recursive algorithms to iterative ones
                            
                                minimum number of steps to reduce number to 1
                            
                                base of logarithms in time-complexity algorithms
                            
                                Algorithm to generate random 2D polygon
                            
                                Algorithm for detecting "clusters" of dots [closed]
                            
                                An intuitive understanding of heapsort?
                            
                                O(n) algorithm to find the median of n² implicit numbers
                            
                                What is the best way to check the strength of a password?
                            
                                Why isn't smoothsort more common? [closed]
                            
                                What is the Big O analysis of this algorithm?
                            
                                Mesh generation from points with x, y and z coordinates

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Algorithms to detect phrases and keywords from text

Tags:

algorithm

text-processing

nlp

Kimvais

People also ask

2 Answers

mjv

Jonathan Feinberg

Recent Activity

Donate For Us