Correlating word proximity

Question

Let's say I have a text transcript of a dialogue lasting approximately 1 hour. I want to know which words occur in close proximity to one another. What type of statistical technique would I use to determine which words are clustered together and how close they are to one another?

I suspect some sort of cluster analysis or PCA.

jayunit100 · Accepted Answer

To determine word proximity, you will have to build a graph:

each word is a vertex (or "node"), and
each word is joined by an edge to the word on its left and the word on its right

So "I like dogs" would have 2 edges and 3 vertices.
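
As a rough illustration of the graph described above, here is a minimal Python sketch. The function name `build_cooccurrence_graph`, the whitespace tokenization, and the lowercasing are assumptions made for this example, not part of the answer; a real transcript would likely need punctuation handling and perhaps a wider window than immediate neighbours.

```python
from collections import Counter

def build_cooccurrence_graph(text):
    """Return (vertices, edges) for a text.

    vertices: set of distinct words.
    edges: Counter mapping an unordered word pair to the number of
           times the two words appeared next to each other.
    """
    # Assumption: lowercase and split on whitespace; no punctuation handling.
    words = text.lower().split()
    vertices = set(words)
    edges = Counter()
    for left, right in zip(words, words[1:]):
        if left != right:  # skip self-loops such as "very very"
            # Sort the pair so (a, b) and (b, a) count as the same edge.
            edges[tuple(sorted((left, right)))] += 1
    return vertices, edges

vertices, edges = build_cooccurrence_graph("I like dogs")
print(len(vertices), "vertices,", len(edges), "edges")
```

Keeping a count on each edge, rather than just recording that it exists, is what later lets you say how often two words appear together.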

Now, the next step will be to decide based on this model what your definition of "close" is.

This is where the statistics comes in.

To determine "groups" of correlated words:

1. MCL clustering - This will give you a number of clusters of words which, according to the algorithm, have high odds of being seen together.
2. k-means clustering - This will give you "k" groups of words, where you choose k in advance.
3. Thresholding - This is the most reliable and intuitive method. Take a small subset of data that you understand (for example, a paragraph from a news clip or article you have read), run your method to generate a graph, and visualize the graph using a tool such as graphviz or cytoscape. Once you can see the relatedness, you can count how many edges are generally found between words that clearly cluster together. You might find that, for example, two words that cluster together will have an edge for every 5 instances. Use this as a cutoff and write your own graph analysis script which outputs word pairs that have at least 1 edge for every 5 instances of the word in your graph.
4. Evaluating 3 by ROC curves - You can titrate the value of your cutoff higher and higher until you have very few "clusters". If you then run your algorithm against a paragraph with known, expected results (created by a human who already knows which words should be reported as correlated), you can evaluate how well your algorithm performs using a receiver operating characteristic (ROC) curve, which compares the correlated-words output to a precalculated gold standard at each cutoff.
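
The thresholding and ROC steps can be sketched in Python as follows. This is only one possible reading of the approach: the score used here (edge count divided by the frequency of the more common word in the pair), the helper names `pair_scores` and `roc_points`, the sample sentence, and the one-pair gold standard are all assumptions invented for illustration.

```python
from collections import Counter

def pair_scores(words):
    """Score each adjacent word pair as edges per instance of a word.

    Assumption: the denominator is the count of the more frequent word
    in the pair, so a score of 0.2 means "1 edge for every 5 instances".
    """
    word_counts = Counter(words)
    edge_counts = Counter()
    for left, right in zip(words, words[1:]):
        if left != right:
            edge_counts[tuple(sorted((left, right)))] += 1
    return {
        pair: n / max(word_counts[pair[0]], word_counts[pair[1]])
        for pair, n in edge_counts.items()
    }

def roc_points(scores, gold_pairs, cutoffs):
    """For each cutoff, return (cutoff, true positive rate, false positive rate).

    gold_pairs is a human-made set of pairs that should be reported.
    Only pairs that share at least one edge are considered candidates.
    """
    positives = [p for p in scores if p in gold_pairs]
    negatives = [p for p in scores if p not in gold_pairs]
    points = []
    for cutoff in cutoffs:
        reported = {p for p, s in scores.items() if s >= cutoff}
        tpr = sum(p in reported for p in positives) / len(positives) if positives else 0.0
        fpr = sum(p in reported for p in negatives) / len(negatives) if negatives else 0.0
        points.append((cutoff, tpr, fpr))
    return points

# Hypothetical sample text and gold standard, for illustration only.
words = "new york is big and new york is busy but york is old".split()
scores = pair_scores(words)
gold = {("new", "york")}  # pairs are stored in sorted order
points = roc_points(scores, gold, [0.2, 0.5, 0.9])
for cutoff, tpr, fpr in points:
    print(f"cutoff={cutoff} TPR={tpr:.2f} FPR={fpr:.2f}")
```

Plotting the false positive rate against the true positive rate for each cutoff gives the ROC curve; the cutoff you keep is the one whose trade-off between missed pairs and spurious pairs suits your data.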

Tags:

text

algorithm

statistics

cluster-analysis

Tyler Rinker
