Let's say I have a text transcript of a dialogue lasting approx. 1 hour. I want to know which words occur in close proximity to one another. What type of statistical technique would I use to determine which words cluster together and how close they are to one another?
I suspect some sort of cluster analysis or PCA.
To determine word proximity, you will have to build a graph: each word is a vertex, and two words that appear next to each other are joined by an edge.
So "I like dogs" would have 2 edges and 3 vertices.
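As a rough sketch of that graph, assuming words are split on anything that is not a letter or apostrophe and that edges are undirected, the vertices and edges can be counted with the standard library alone. The function name `build_word_graph` is made up for this illustration.

```python
import re
from collections import Counter

def build_word_graph(text):
    """Return (vertices, edges): word counts and adjacent-pair counts."""
    # Assumption: lowercase everything and treat runs of letters as words.
    words = re.findall(r"[a-z']+", text.lower())
    vertices = Counter(words)
    # An undirected edge joins each pair of neighbouring words;
    # sorting the pair makes (a, b) and (b, a) the same edge.
    edges = Counter(tuple(sorted(pair)) for pair in zip(words, words[1:]))
    return vertices, edges

vertices, edges = build_word_graph("I like dogs")
print(len(vertices), "vertices,", len(edges), "edges")  # 3 vertices, 2 edges
```

The edge counts are what the later steps work with: the more often two words sit next to each other, the higher the count on their edge.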
Now, the next step will be to decide based on this model what your definition of "close" is.
This is where the statistics comes in.
To determine "groups" of correlated words, there are a few options:
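MCL (Markov Cluster Algorithm) clustering - This partitions the graph into clusters of words that have high odds of being seen together. The number of clusters is determined by the algorithm rather than chosen up front.
k-means clustering - This will give you "k" groups of words, where you choose k in advance.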
Thresholding - This is the most reliable and intuitive method. Take a small subset of data that you understand (for example, a paragraph from a news clip or article you have read), build the graph for it, and visualize the graph using a tool such as graphviz or cytoscape. Once you can see the relatedness, count how many edges are generally found between words that clearly cluster together. You might find, for example, that two words that cluster together have one edge for every 5 instances of the word. Use this as a cutoff and write your own graph analysis script which outputs the word pairs that have at least 1 edge for every 5 instances of the word in your graph.
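A minimal sketch of such a thresholding script is below. It makes several assumptions: edges join directly adjacent words, the edge count is compared against the more frequent of the two words, and the function name `related_pairs` and the sample text are invented for illustration. The 1-in-5 cutoff is only a default and should be replaced by whatever you observe in your own data.

```python
import re
from collections import Counter

def related_pairs(text, ratio=1 / 5):
    """Return (word_a, word_b, edge_count) for pairs that pass the cutoff."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Undirected edges between neighbouring words, ignoring self-loops.
    edges = Counter(
        tuple(sorted(p)) for p in zip(words, words[1:]) if p[0] != p[1]
    )
    result = []
    for (a, b), n in edges.items():
        # Assumption: measure edges against the more frequent word of the pair.
        if n / max(counts[a], counts[b]) >= ratio:
            result.append((a, b, n))
    # Strongest pairs first.
    return sorted(result, key=lambda r: -r[2])

sample = "the dog barked. the dog ran. the cat sat. the dog slept. a bird sang."
result = related_pairs(sample, ratio=0.5)
for a, b, n in result:
    print(a, b, n)
```

Note that words which occur only once pass any ratio cutoff trivially, so on a real transcript you may also want to ignore pairs whose words are too rare.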