Word frequency algorithm for natural language processing

Tags: algorithm, nlp, word-frequency

Without getting a degree in information retrieval, I'd like to know if there exist any algorithms for counting how often words occur in a given body of text. The goal is to get a "general feel" for what people are saying across a set of textual comments, along the lines of Wordle.

What I'd like:

ignore articles, pronouns, etc ('a', 'an', 'the', 'him', 'them' etc)
preserve proper nouns
ignore hyphenation, except for the soft kind

Reaching for the stars, these would be peachy:

handling stemming & plurals (e.g. like, likes, liked, liking match the same result)
grouping of adjectives (adverbs, etc) with their subjects ("great service" as opposed to "great", "service")

I've attempted some basic stuff using WordNet, but I'm just tweaking things blindly and hoping it works for my specific data. Something more generic would be great.


asked Sep 18 '08 06:09

Mark McDonald

2 Answers

You'll need not one, but several nice algorithms, along the lines of the following.

ignoring pronouns is done via a stoplist.
preserving proper nouns? You mean detecting named entities, like Hoover Dam, and saying "it's one word", or compound nouns, like programming language? I'll give you a hint: that's a tough one, but there exist libraries for both. Look for NER (named entity recognition) and lexical chunking. OpenNLP is a Java toolkit that does both.
ignoring hyphenation? You mean, like at line breaks? Use regular expressions and verify the resulting word via dictionary lookup.
handling plurals/stemming: you can look into the Snowball stemmer. It does the trick nicely.
"grouping" adjectives with their nouns is generally a task of shallow parsing. But if you are looking specifically for qualitative adjectives (good, bad, shitty, amazing...) you may be interested in sentiment analysis. LingPipe does this, and a lot more.
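To make the stoplist and hyphenation points a bit more concrete, here is a minimal Python sketch using only the standard library. The tiny `STOPWORDS` set, the stand-in `DICTIONARY`, and the function names are all made up for illustration; a real setup would load a proper stoplist and word list. Note that it lowercases everything, so it does not preserve proper nouns.

```python
import re
from collections import Counter

# Tiny illustrative stoplist (assumption); a real one would be much longer.
STOPWORDS = {"a", "an", "the", "him", "them", "it", "i", "and", "was", "is", "of", "to"}

# Stand-in for a real dictionary used to verify re-joined words (assumption).
DICTIONARY = {"service", "friendly"}

def join_line_break_hyphens(text, dictionary):
    # Re-join a word split by a hyphen at a line break, but only if the
    # joined form is a known word; otherwise leave the text untouched.
    def repl(match):
        joined = match.group(1) + match.group(2)
        return joined if joined.lower() in dictionary else match.group(0)
    return re.sub(r"(\w+)-\s*\n\s*(\w+)", repl, text)

def word_counts(text, stopwords=STOPWORDS, dictionary=DICTIONARY):
    text = join_line_break_hyphens(text, dictionary)
    # Words made of letters, optionally with inner apostrophes or hyphens.
    tokens = re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text)
    return Counter(t.lower() for t in tokens if t.lower() not in stopwords)

comments = "The ser-\nvice was great. Great staff and friendly ser-\nvice."
counts = word_counts(comments)
print(counts.most_common(3))
```

The dictionary check is what keeps a genuine hyphenated compound that happens to sit at a line break from being glued together by mistake.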

I'm sorry, I know you said you wanted to KISS, but unfortunately, your demands aren't that easy to meet. Nevertheless, there exist tools for all of this, and you should be able to just tie them together and not have to perform any task yourself, if you don't want to. If you want to perform a task yourself, I suggest you look at stemming, it's the easiest of all.
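If you do want to try stemming by hand, a deliberately naive suffix stripper could look like the sketch below. This is not the Snowball or Porter algorithm, just an illustration of the idea; the suffix list and the `min_stem` cutoff are arbitrary assumptions and will mangle plenty of real words.

```python
from collections import Counter

# Suffixes are tried in this order; the list is an arbitrary assumption.
SUFFIXES = ("ing", "ed", "es", "s", "e")

def naive_stem(word, min_stem=3):
    # Strip the first matching suffix, as long as at least
    # min_stem characters remain afterwards.
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

words = ["like", "likes", "liked", "liking", "service", "services"]
stem_counts = Counter(naive_stem(w) for w in words)
print(stem_counts)
```

The stems are not real words ("lik", "servic"), which is normal for stemmers: the point is only that related forms end up in the same bucket.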

If you go with Java, combine Lucene with the OpenNLP toolkit. You will get very good results, as Lucene already has a stemmer built in and a lot of tutorials. The OpenNLP toolkit, on the other hand, is poorly documented, but you won't need too much out of it. You might also be interested in NLTK, written in Python.

I would suggest you drop your last requirement, as it involves shallow parsing and will definitely not improve your results.

Ah, by the way: the exact term for that document-term-frequency thing you were looking for is tf-idf. It's pretty much the best way to score how frequent a term is within a document relative to the whole collection. In order to do it properly, you won't get around using multidimensional vector matrices.
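For a rough feel of what tf-idf computes, here is a small standard-library sketch. There are many tf-idf variants; this one assumes term frequency normalized by document length and a plain `log(N / df)` inverse document frequency, and the function name and sample documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    # documents: list of token lists.
    # Returns one {term: weight} dict per document.
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in documents:
        tf = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        })
    return weights

docs = [
    "great service great food".split(),
    "slow service".split(),
    "great location".split(),
]
scores = tf_idf(docs)
for term, weight in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(term, round(weight, 3))
```

A term that occurs in every document gets weight zero under this variant, which is how the very common words drop out even without a stoplist.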

... Yes, I know. After taking a seminar on IR, my respect for Google was even greater. After doing some stuff in IR, my respect for them fell just as quickly, though.

answered Sep 21 '22 07:09

Aleksandar Dimitrov

Welcome to the world of NLP ^_^

All you need is a little basic knowledge and some tools.

There are already tools that will tell you if a word in a sentence is a noun, adjective or verb. They are called part-of-speech taggers. Typically, they take plaintext English as input, and output the word, its base form, and the part-of-speech. Here is the output of a popular UNIX part-of-speech tagger on the first sentence of your post:

    $ echo "Without getting a degree in information retrieval, I'd like to know if there exists any algorithms for counting the frequency that words occur in a given body of text." | tree-tagger-english
    # Word        POS       surface form
    Without       IN        without
    getting       VVG       get
    a             DT        a
    degree        NN        degree
    in            IN        in
    information   NN        information
    retrieval     NN        retrieval
    ,             ,         ,
    I             PP        I
    'd            MD        will
    like          VV        like
    to            TO        to
    know          VV        know
    if            IN        if
    there         EX        there
    exists        VVZ       exist
    any           DT        any
    algorithms    NNS       algorithm
    for           IN        for
    counting      VVG       count
    the           DT        the
    frequency     NN        frequency
    that          IN/that   that
    words         NNS       word
    occur         VVP       occur
    in            IN        in
    a             DT        a
    given         VVN       give
    body          NN        body
    of            IN        of
    text          NN        text
    .             SENT      .

As you can see, it identified "algorithms" as being the plural form (NNS) of "algorithm" and "exists" as being a conjugation (VVZ) of "exist". It also identified "a" and "the" as determiners (DT), another word for articles. The POS tagger also tokenized the punctuation.

To do everything but the last point on your list, you just need to run the text through a POS tagger, filter out the categories that don't interest you (determiners, pronouns, etc.) and count the frequencies of the base forms of the words.
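As a rough sketch of that filter-and-count step, the Python below takes tagger output as whitespace-separated "word, POS, base form" lines like the ones shown earlier. The function name and the set of tag prefixes to keep are assumptions for illustration; adjust the prefixes to whatever tagset your tagger actually emits.

```python
from collections import Counter

# Tag prefixes to keep (assumption): nouns, proper nouns, verbs, adjectives.
KEEP_PREFIXES = ("NN", "NP", "VV", "JJ")

def count_base_forms(tagged_lines, keep_prefixes=KEEP_PREFIXES):
    counts = Counter()
    for line in tagged_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        word, pos, base = parts
        if pos.startswith(keep_prefixes):
            counts[base] += 1  # count the base form, not the surface word
    return counts

# A few lines copied from the tagger output above.
tagged = [
    "there EX there",
    "exists VVZ exist",
    "any DT any",
    "algorithms NNS algorithm",
    "the DT the",
    "words NNS word",
    "occur VVP occur",
    "I PP I",
]
print(count_base_forms(tagged))
```

Base forms are counted as the tagger returns them rather than lowercased, so capitalized proper nouns stay distinct from ordinary words.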

Here are some popular POS taggers:

TreeTagger (binary only: Linux, Solaris, OS X)
GENIA Tagger (C++: compile it yourself)
Stanford POS Tagger (Java)

To do the last thing on your list, you need more than just word-level information. An easy way to start is by counting sequences of words rather than just words themselves. These are called n-grams. A good place to start is UNIX for Poets. If you are willing to invest in a book on NLP, I would recommend Foundations of Statistical Natural Language Processing.
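Counting n-grams needs very little code. Here is a minimal Python sketch with a made-up sample sentence; the `ngrams` helper is written out by hand for illustration rather than taken from any library.

```python
from collections import Counter

def ngrams(tokens, n):
    # All runs of n consecutive tokens, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "great service and great food but not great service today".split()
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(1))
```

Here ("great", "service") shows up as a unit, which is the kind of grouping the question asks for, although plain n-grams know nothing about which word is the adjective and which is the noun.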

answered Sep 21 '22 07:09

underspecified
