
Correlating word proximity

Let's say I have a text transcript of a dialogue over a period of approx. 1 hour. I want to know which words occur in close proximity to one another. What type of statistical technique would I use to determine which words cluster together and how close they are to one another?

I suspect some sort of cluster analysis or PCA.

Tyler Rinker asked Oct 10 '22 16:10


1 Answer

To determine word proximity, you will have to build a graph in which:

  1. each word is a vertex (or "node"), and
  2. each edge connects a word to the word immediately to its left or right.

So "I like dogs" would have 3 vertices and 2 edges.
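
The graph described above can be sketched in a few lines of Python. This is only a minimal illustration, not a definitive implementation: the function names `tokenize` and `build_word_graph` are made up for this example, the tokenizer is a crude regex, and treating edges as undirected (and skipping a word that directly repeats itself) is an assumption on my part.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (crude regex)."""
    return re.findall(r"[a-z']+", text.lower())

def build_word_graph(text):
    """Build an adjacency graph over the words of `text`.

    Returns (vertices, edges):
      vertices: Counter mapping each word to how often it occurs.
      edges:    Counter mapping an alphabetically sorted word pair to
                how often the two words appear directly next to each other.
    """
    tokens = tokenize(text)
    vertices = Counter(tokens)
    edges = Counter()
    # Each pair of neighbouring tokens contributes one (undirected) edge.
    for left, right in zip(tokens, tokens[1:]):
        if left != right:  # assumption: ignore a word repeated back to back
            edges[tuple(sorted((left, right)))] += 1
    return vertices, edges

vertices, edges = build_word_graph("I like dogs")
print(len(vertices), "vertices,", len(edges), "edges")
```

Keeping counts on both the vertices and the edges matters later, because any definition of "close" will need to compare how often two words are neighbours with how often each word appears at all.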

The next step is to decide, based on this model, what your definition of "close" is.

This is where the statistics come in.

To determine "groups" of correlated words, you have a few options:

  1. MCL (Markov Cluster) clustering - This will give you a number of clusters of words that are likely to be seen together. You do not choose the number of clusters in advance.

  2. k-means clustering - This will partition the words into "k" groups, where you choose k up front.

  3. Thresholding - This is the most reliable and intuitive method. Take a small subset of data that you understand (for example, a paragraph from a news clip or article you have read), run your method on it to generate a graph, and visualize the graph using a tool such as Graphviz or Cytoscape. Once you can see the relatedness, count how many edges are generally found between words that clearly cluster together. You might find, for example, that two words that cluster together share one edge for every 5 instances of the word. Use this as a cutoff and write your own graph analysis script that outputs the word pairs with at least 1 edge for every 5 instances of the word in your graph.

    1. Evaluating method 3 with ROC curves. You can raise the value of your cutoff higher and higher until you have very few "clusters". If you then run your algorithm against a paragraph with known, expected results (created by a human who already knows which words should be reported as correlated), you can evaluate the performance of your algorithm using a receiver operating characteristic (ROC) curve, which compares the correlated-words output to the precalculated gold standard at each cutoff.
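
To make the thresholding and ROC idea concrete, here is a rough sketch. Everything in it is illustrative: the function names `related_pairs` and `roc_point` are hypothetical, the counts and the gold standard are toy values invented for the example, and I am assuming that "instances of the word" means the occurrence count of the more frequent word in the pair (you may prefer a different normalization).

```python
from collections import Counter

def related_pairs(vertices, edges, cutoff=0.2):
    """Return word pairs with at least `cutoff` edges per instance.

    cutoff=0.2 corresponds to "1 edge for every 5 instances".
    Assumption: instances are counted for the more frequent word.
    """
    selected = set()
    for (a, b), n_edges in edges.items():
        instances = max(vertices[a], vertices[b])
        if n_edges / instances >= cutoff:
            selected.add((a, b))
    return selected

def roc_point(predicted, gold, candidates):
    """Return (true positive rate, false positive rate) for one cutoff.

    candidates: every pair that has at least one edge in the graph.
    gold:       pairs a human marked as genuinely correlated.
    """
    positives = gold & candidates
    negatives = candidates - gold
    tp = len(predicted & positives)
    fp = len(predicted & negatives)
    tpr = tp / len(positives) if positives else 0.0
    fpr = fp / len(negatives) if negatives else 0.0
    return tpr, fpr

# Toy, hand-made counts (hypothetical, not real data).
vertices = Counter({"machine": 10, "learning": 10, "the": 40, "model": 8, "cat": 5})
edges = Counter({
    ("learning", "machine"): 8,
    ("machine", "the"): 4,
    ("model", "the"): 6,
    ("cat", "the"): 1,
})
gold = {("learning", "machine")}

# Sweep the cutoff; each (fpr, tpr) pair is one point on the ROC curve.
for cutoff in (0.05, 0.2, 0.9):
    predicted = related_pairs(vertices, edges, cutoff)
    tpr, fpr = roc_point(predicted, gold, set(edges))
    print(f"cutoff={cutoff}: tpr={tpr:.2f} fpr={fpr:.2f} pairs={sorted(predicted)}")
```

Plotting the false positive rate against the true positive rate over many cutoffs gives the ROC curve; a cutoff near the top-left corner keeps most of the gold-standard pairs while reporting few spurious ones.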
jayunit100 answered Oct 16 '22 22:10