I have a feature set of the 500 most frequently occurring unigrams from a corpus of emails. I have been using it to classify emails with C5.0, based on the presence or absence of each of the words in a test email.
Now I need to calculate the perplexity of the terms in the feature set and use this to classify emails. Does anyone have experience with language modelling and know how I would go about calculating the perplexity of the model? Any help would be great!
I should add that I am aware of tools that can do this for me automatically, SRILM or the CMU LM toolkit for instance, but I would rather build this myself from the ground up as it's part of my final year project! I just need a hint on how to get started... perhaps a link to "The idiot's guide to perplexity calculation and classification using perplexity"!!
Thanks a lot!!
This CMU course exercise seems to have what you want. Yes, they recommend you use SRILM, but see the "Language Model" section -- it points to a book chapter, a tutorial from Microsoft Research and a presentation for that tutorial.
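If you want to roll your own, here is a minimal sketch of one way to start, not necessarily what the course material does. It assumes one unigram model per class (say spam and ham), add-one smoothing, and a single `<unk>` bucket for any word outside your 500-term feature set. Perplexity of a token sequence is exp(-(1/N) * sum of log p(w_i)), and the email goes to the class whose model gives the lowest perplexity. The function names (`train_unigram`, `perplexity`, `classify`) and the toy data are made up for illustration.

```python
import math
from collections import Counter

UNK = "<unk>"  # assumed bucket for words outside the feature set


def train_unigram(tokens, vocab):
    """Add-one smoothed unigram probabilities over vocab plus UNK."""
    counts = Counter(t if t in vocab else UNK for t in tokens)
    total = sum(counts.values())
    size = len(vocab) + 1  # +1 for the UNK bucket
    return {w: (counts[w] + 1) / (total + size) for w in list(vocab) + [UNK]}


def perplexity(model, tokens):
    """exp of the average negative log-probability per token."""
    if not tokens:
        raise ValueError("cannot compute perplexity of an empty sequence")
    log_sum = sum(math.log(model.get(t, model[UNK])) for t in tokens)
    return math.exp(-log_sum / len(tokens))


def classify(models, tokens):
    """Return the label whose model has the lowest perplexity on tokens."""
    return min(models, key=lambda label: perplexity(models[label], tokens))


# Toy usage with a 4-word "feature set" (hypothetical data)
vocab = {"free", "money", "meeting", "report"}
models = {
    "spam": train_unigram("free money free money now".split(), vocab),
    "ham": train_unigram("meeting report meeting tomorrow".split(), vocab),
}
print(classify(models, "free money".split()))
```

Summing logs rather than multiplying probabilities avoids underflow on long emails. A unigram model ignores word order, so the next step would be bigrams with a better smoothing scheme, which is what the tutorial covers.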
Hope this helps!