Calculating the perplexity of a language model for email classification

Question

I have a feature set of the 500 most frequently occurring unigrams from a corpus of emails. I have been using it to classify emails with C5.0, based on the presence or absence of each of those words in a test email.

Now I need to calculate the perplexity of the terms in the feature set and use it to classify emails. Does anyone have experience with language modelling and know how I would go about calculating the perplexity of the model? Any help would be great!

I should add that I am aware of tools that can do this for me automatically, SRILM/CMU-LMtoolkit for instance, but I would rather build this myself from the ground up, as it's part of my final-year project! I just need a hint on how to get started... perhaps a link to "The Idiot's Guide to Perplexity Calculation and Classification Using Perplexity"!!

Thanks a lot!!

michel-slm · Accepted Answer

This CMU course exercise seems to have what you want. Yes, they recommend you use SRILM, but see the "Language Model" section -- it points to a book chapter, a tutorial from Microsoft Research and a presentation for that tutorial.
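To give a rough idea of what building it from the ground up might look like, here is a minimal sketch rather than anything taken from the linked material. It assumes one unigram model per class, add-one (Laplace) smoothing, and a single "<unk>" bucket for words outside the 500-word feature set; all function names are made up for illustration. Perplexity of a token sequence w_1..w_N under a model p is 2 ** (-(1/N) * sum(log2 p(w_i))), and a test email is assigned to the class whose model gives it the lowest perplexity.

```python
import math
from collections import Counter

UNK = "<unk>"  # assumed bucket for any word outside the feature set


def train_unigram(docs, vocab):
    """Add-one-smoothed unigram probabilities over vocab plus UNK.

    docs is a list of token lists belonging to one class.
    """
    words = set(vocab) | {UNK}
    counts = Counter()
    for tokens in docs:
        for tok in tokens:
            counts[tok if tok in words else UNK] += 1
    total = sum(counts.values())
    # Add 1 to every count so that no word has zero probability.
    return {w: (counts[w] + 1) / (total + len(words)) for w in words}


def perplexity(model, tokens):
    """2 ** (average negative log2 probability per token)."""
    if not tokens:
        raise ValueError("cannot compute perplexity of an empty email")
    log_sum = sum(math.log2(model.get(tok, model[UNK])) for tok in tokens)
    return 2 ** (-log_sum / len(tokens))


def classify(models, tokens):
    """Return the label whose model is least 'surprised' by the tokens."""
    return min(models, key=lambda label: perplexity(models[label], tokens))


# Toy usage with a four-word feature set (made-up data).
vocab = {"free", "money", "meeting", "report"}
models = {
    "spam": train_unigram([["free", "money", "free"], ["money", "now"]], vocab),
    "ham": train_unigram([["meeting", "report"], ["report", "attached", "meeting"]], vocab),
}
print(classify(models, ["free", "money"]))
```

The smoothing matters: without it, a single word that one class never saw gets probability zero and the perplexity becomes infinite. Higher-order n-grams and better smoothing schemes follow the same pattern, only with conditional probabilities in place of the unigram ones.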

Hope this helps!


Tags:

java

email

perl

classification

B. Bowles

