Let's say I have harvested the posts from a forum. Then I removed all the usernames and signatures, so that now I only know what post was in which thread but not who posted what, or even how many authors there are (though clearly the number of authors cannot be greater than the number of texts).
I want to use a Markov model (look at which words/letters follow which ones) to figure out how many people used this forum, and which posts were written by the same person. To vastly simplify, perhaps one person tends to say "he were" while another person tends to say "he was" - I'm talking about model that works with this sort of basic logic.
Note how there are some obvious issues with the data: Some posts may be very short (one word answers). They may be repetitive (quoting each other or using popular forum catchphrases). The individual texts are not very long.
One could suspect that it would be rare for a person to make consecutive posts or that it is likely that people are more likely to post in threads they have already posted in. Exploiting this is optional.
Let's assume the posts are plaintexts and have no markup, and that everyone on the forum uses English.
I would like to obtain a distance matrix for all texts T_i
such that D_ij
is the probability that text T_i
and text T_j
are written by the same author, based on word/character pattern. I am planning to use this distance matrix to cluster the texts, and ask questions such as "What other texts were authored by the person who authored this text?"
How would I actually go about implementing this? Do I need a hidden MM? If so, what is the hidden state? I understand how to train an MM on a text and then generate a similar text (eg. generated Alice in the Wonderland) but after I train a frequency tree, how do I check a text with it to get the probability that it was generated by that tree? Should I look at letters, or words when building the tree?
My advice is put aside the business about the distance matrix and think first about a probabilistic model P(text | author). Constructing that model is that hard part of your work; once yo have it, you can compute P(author | text) via Bayes' rule. Don't put the cart before the horse: the model might or might not involve distance metrics or matrices of various kinds, but don't worry about that, just let it fall out of the model.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With