Without getting a degree in information retrieval, I'd like to know if there exists any algorithms for counting the frequency that words occur in a given body of text. The goal is to get a "general feel" of what people are saying over a set of textual comments. Along the lines of Wordle.
What I'd like:
Reaching for the stars, these would be peachy:
I've attempted some basic stuff using Wordnet but I'm just tweaking things blindly and hoping it works for my specific data. Something more generic would be great.
Term frequency (TF) is how often a word appears in a document, divided by the total number of words in that document: TF(t) = (number of times term t appears in the document) / (total number of terms in the document).
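As a rough illustration of that formula, here is a minimal Python sketch. The function name `term_frequencies` and the tokenizing regex are my own assumptions for the example (a "term" is taken to be a lowercase run of letters or apostrophes); a real tokenizer would be more careful.

```python
import re
from collections import Counter

def term_frequencies(text):
    """Return {term: count / total_terms} for a single document."""
    # Assumption: a "term" is a lowercase run of letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {term: n / total for term, n in counts.items()}

tf = term_frequencies("the cat sat on the mat")
print(tf["the"])  # "the" is 2 of the 6 tokens
```

Because the counts are divided by the document length, the values are comparable between a short comment and a long one.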
We highlighted concepts such as simple similarity metrics, text normalization, vectorization, word embeddings, and popular algorithms for NLP (Naive Bayes and LSTM). All of these are essential for NLP, and you should be aware of them if you are starting to learn the field or need a general idea of NLP.
Natural language processing (NLP) algorithms support computers by simulating the human ability to understand language data, including unstructured text data. The 500 most used words in the English language have an average of 23 different meanings.
You'll need not one, but several nice algorithms, along the lines of the following.
I'm sorry, I know you said you wanted to KISS, but unfortunately, your demands aren't that easy to meet. Nevertheless, there exist tools for all of this, and you should be able to just tie them together and not have to perform any task yourself, if you don't want to. If you want to perform a task yourself, I suggest you look at stemming; it's the easiest of all.
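To give a feel for what stemming does, here is a deliberately naive suffix stripper in Python. It is not the Porter algorithm or what Lucene ships; the suffix list, the `min_stem` threshold, and the name `naive_stem` are all assumptions made up for this sketch.

```python
# Toy suffix stripper: NOT the Porter algorithm, only an illustration.
# Assumption: suffixes are tried in this order, longest first.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem=4):
    """Strip one common English suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        # Only strip when the remaining stem is reasonably long.
        if word.endswith(suffix) and len(stem) >= min_stem:
            return stem
    return word

for w in ["counting", "algorithms", "exists", "words", "this"]:
    print(w, "->", naive_stem(w))
```

A stripper this crude will mangle plenty of words ("class" loses its final "s", for instance), which is the reason to reach for an existing stemmer once you have seen the idea.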
If you go with Java, combine Lucene with the OpenNLP toolkit. You will get very good results, as Lucene already has a stemmer built in and plenty of tutorials. The OpenNLP toolkit, on the other hand, is poorly documented, but you won't need too much out of it. You might also be interested in NLTK, written in Python.
I would suggest dropping your last requirement, as it involves shallow parsing and will definitely not improve your results.
Ah, by the way, the exact term for that document-term-frequency thing you were looking for is tf-idf. It's pretty much the best way to look for document frequency for terms. In order to do it properly, you won't get around using multidimensional vector matrices.
... Yes, I know. After taking a seminar on IR, my respect for Google was even greater. After doing some stuff in IR, my respect for them fell just as quickly, though.
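For a small set of comments you can get quite far with plain dictionaries. Below is a minimal sketch of one common tf-idf variant, with tf as count divided by document length and idf as log(N / df), no smoothing. Libraries typically use smoothed or otherwise adjusted formulas, so treat the exact weighting, the `tokenize` regex, and the sample comments as assumptions for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Assumption: terms are lowercase runs of letters or apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def tf_idf(documents):
    """Return one {term: score} dict per document (unsmoothed tf-idf)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (n / total) * math.log(n_docs / df[term])
            for term, n in counts.items()
        })
    return scores

# Made-up comments, purely for demonstration.
comments = [
    "the battery life is great",
    "the screen is great",
    "the battery died after a week",
]
scores = tf_idf(comments)
for doc_scores in scores:
    best = sorted(doc_scores, key=doc_scores.get, reverse=True)[:2]
    print(best)
```

A word like "the" that occurs in every comment gets an idf of log(1) = 0, so it drops out without needing a stop-word list, while words that single out one comment float to the top.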
Welcome to the world of NLP ^_^
All you need is a little basic knowledge and some tools.
There are already tools that will tell you if a word in a sentence is a noun, adjective or verb. They are called part-of-speech taggers. Typically, they take plaintext English as input, and output the word, its base form, and the part-of-speech. Here is the output of a popular UNIX part-of-speech tagger on the first sentence of your post:
    $ echo "Without getting a degree in information retrieval, I'd like to know if there exists any algorithms for counting the frequency that words occur in a given body of text." | tree-tagger-english
    # Word       POS      surface form
    Without      IN       without
    getting      VVG      get
    a            DT       a
    degree       NN       degree
    in           IN       in
    information  NN       information
    retrieval    NN       retrieval
    ,            ,        ,
    I            PP       I
    'd           MD       will
    like         VV       like
    to           TO       to
    know         VV       know
    if           IN       if
    there        EX       there
    exists       VVZ      exist
    any          DT       any
    algorithms   NNS      algorithm
    for          IN       for
    counting     VVG      count
    the          DT       the
    frequency    NN       frequency
    that         IN/that  that
    words        NNS      word
    occur        VVP      occur
    in           IN       in
    a            DT       a
    given        VVN      give
    body         NN       body
    of           IN       of
    text         NN       text
    .            SENT     .
As you can see, it identified "algorithms" as the plural form (NNS) of "algorithm" and "exists" as a conjugation (VVZ) of "exist." It also identified "a" and "the" as determiners (DT), another word for articles. The POS tagger also tokenized the punctuation.
To do everything but the last point on your list, you just need to run the text through a POS tagger, filter out the categories that don't interest you (determiners, pronouns, etc.) and count the frequencies of the base forms of the words.
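The filter-and-count step could look roughly like the Python sketch below. It assumes the tagger output is available as text with three whitespace-separated columns (word, tag, base form), as in the output above. The choice of tag prefixes to keep and the name `count_base_forms` are my own assumptions; adjust the prefixes to the categories you care about.

```python
from collections import Counter

# Assumption: tags with these prefixes (common nouns, main verbs,
# adjectives) are the interesting ones; everything else is dropped.
KEEP_PREFIXES = ("NN", "VV", "JJ")

def count_base_forms(tagger_output):
    """Count base forms of the words whose tag passes the filter."""
    counts = Counter()
    for line in tagger_output.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank lines, headers, anything malformed
        word, tag, lemma = parts
        if not tag.startswith(KEEP_PREFIXES):
            continue
        if lemma == "<unknown>":
            # Fall back to the word itself when no base form is given.
            lemma = word.lower()
        counts[lemma] += 1
    return counts

sample = """\
any DT any
algorithms NNS algorithm
for IN for
counting VVG count
the DT the
frequency NN frequency
that IN/that that
words NNS word
occur VVP occur
"""
counts = count_base_forms(sample)
print(counts.most_common())
```

Counting the base form rather than the word as typed is what makes "algorithm" and "algorithms" land in the same bucket.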
Here are some popular POS taggers:
TreeTagger (binary only: Linux, Solaris, OS-X)
GENIA Tagger (C++: compile it yourself)
Stanford POS Tagger (Java)
To do the last thing on your list, you need more than just word-level information. An easy way to start is by counting sequences of words rather than just words themselves. These are called n-grams. A good place to start is UNIX for Poets. If you are willing to invest in a book on NLP, I would recommend Foundations of Statistical Natural Language Processing.
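Counting n-grams needs nothing beyond the standard library. Here is a minimal Python sketch; the tokenizing regex and the sample sentence are assumptions for the example, and note that this version happily builds n-grams across sentence boundaries, which you may want to prevent on real data.

```python
import re
from collections import Counter

def ngram_counts(text, n=2):
    """Count every run of n consecutive tokens in the text."""
    # Assumption: tokens are lowercase runs of letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    # zip over n shifted copies of the token list yields the n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)

bigrams = ngram_counts(
    "the battery life is great and the battery life is long", n=2
)
print(bigrams.most_common(3))
```

Raising `n` to 3 gives trigrams, at the cost of sparser counts, so on a small set of comments bigrams are usually the more useful starting point.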