
Calculating the perplexity of a language model for email classification

I have a feature set of the 500 most frequently occurring unigrams from a corpus of emails. I have been using this to classify emails with C5.0, based on the presence or absence of each of the words in any test email.

Now I need to calculate the perplexity of the terms in the feature set and use this to classify emails. Does anyone have experience in language modelling and know how I would go about calculating the perplexity of the model? Any help would be great!

I should add that I am aware of tools that can do this for me automatically, SRILM or the CMU LM toolkit for instance, but I would rather build this myself from the ground up as it's part of my final year project! I just need a hint on how to get started... perhaps a link to "The idiot's guide to perplexity calculation and classification using perplexity"!!

Thanks a lot!!

B. Bowles asked Mar 21 '11 15:03



1 Answer

This CMU course exercise seems to have what you want. Yes, they recommend you use SRILM, but see the "Language Model" section -- it points to a book chapter, a tutorial from Microsoft Research and a presentation for that tutorial.
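If you want to build it from the ground up, here is a minimal sketch of one common approach, not necessarily what the linked material does. It assumes you train one unigram model per class (say spam and ham) over your feature set, with add-one smoothing and a single `<unk>` bucket for out-of-vocabulary words, and then assign a test email to the class whose model gives it the lowest perplexity, where perplexity is exp(-(1/N) * sum of log p(w_i)) over the N tokens. The function names, the whitespace tokenizer, and the toy data are all made up for illustration.

```python
import math
from collections import Counter

UNK = "<unk>"  # hypothetical bucket for words outside the feature set


def train_unigram(docs, vocab):
    """Add-one-smoothed unigram probabilities over vocab plus an UNK bucket."""
    counts = Counter()
    for doc in docs:
        for tok in doc.lower().split():  # naive whitespace tokenizer (assumption)
            counts[tok if tok in vocab else UNK] += 1
    total = sum(counts.values())
    size = len(vocab) + 1  # +1 for the UNK bucket
    probs = {w: (counts[w] + 1) / (total + size) for w in vocab}
    probs[UNK] = (counts[UNK] + 1) / (total + size)
    return probs


def perplexity(text, probs):
    """exp of the average negative log-probability per token."""
    tokens = text.lower().split()
    if not tokens:
        return float("inf")  # no evidence at all: treat as maximally surprising
    log_sum = sum(math.log(probs.get(t, probs[UNK])) for t in tokens)
    return math.exp(-log_sum / len(tokens))


def classify(text, models):
    """Pick the class label whose model is least 'surprised' by the text."""
    return min(models, key=lambda label: perplexity(text, models[label]))


# Toy data, invented purely for illustration.
vocab = {"free", "money", "meeting", "report", "win", "tomorrow"}
models = {
    "spam": train_unigram(["win free money", "free money free win"], vocab),
    "ham": train_unigram(["meeting tomorrow report",
                          "report meeting tomorrow please"], vocab),
}

print(classify("free money now", models))
print(classify("meeting report tomorrow", models))
```

A handy sanity check: under a uniform model over k outcomes the perplexity comes out as exactly k, so lower perplexity means the model finds the email less surprising. With a real corpus you would probably want a better tokenizer and a smoothing method other than add-one, but the structure stays the same.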

Hope this helps!

michel-slm answered Oct 24 '22 11:10
