Python NLTK - Identifying text interest / topic

Question

I am trying to build a model that identifies the interest category / topic of supplied text. For example:

"Enjoyed playing a game of football earlier."

would resolve to a top level category like:

"Sport".

I'm not sure what the correct terminology is for what I am trying to achieve, so Google hasn't turned up any libraries that might help. With that in mind, my approach would be something like:

Extract features from text. Use tagging to classify each feature / identify names / places. Would probably use NLTK for this, or Topia.
Run a Naive Bayes classifier for each interest category ("Sport", "Video Games", "Politics" etc.) and get a relevancy % for each category.
Identify which category has the highest relevancy % and categorise the text accordingly.

My approach would likely involve having individual corpora for each interest category and I'm sure the accuracy would be fairly miserable - I understand it will never be that accurate.
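
For illustration, the approach above could be sketched in plain Python roughly as follows. This is only a minimal sketch under stated assumptions, not a tested solution: the CORPORA dictionary and its sentences are made-up toy data, the names tokenize, train and classify are hypothetical helpers, and the tagging step is replaced by simple word counts to keep it short.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpora: one small list of example texts per category.
CORPORA = {
    "Sport": [
        "Enjoyed playing a game of football earlier.",
        "The team won the match after a late goal.",
        "Went for a run and then watched the tennis final.",
    ],
    "Video Games": [
        "Finished the final level of the game on my console.",
        "The new console release has great multiplayer modes.",
        "Spent the evening playing an online shooter with friends.",
    ],
    "Politics": [
        "The government announced a new tax policy today.",
        "Voters head to the polls for the election tomorrow.",
        "The minister defended the policy in parliament.",
    ],
}


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(corpora):
    """Count words per category; return (word_counts, doc_counts, vocab)."""
    word_counts = {cat: Counter() for cat in corpora}
    doc_counts = {cat: len(docs) for cat, docs in corpora.items()}
    vocab = set()
    for cat, docs in corpora.items():
        for doc in docs:
            tokens = tokenize(doc)
            word_counts[cat].update(tokens)
            vocab.update(tokens)
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return (best_category, {category: probability}) for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    tokens = [t for t in tokenize(text) if t in vocab]  # skip unseen words
    log_scores = {}
    for cat, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[cat] / total_docs)  # category prior
        for tok in tokens:
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log((counts[tok] + 1) / (total_words + len(vocab)))
        log_scores[cat] = score
    # Turn log scores into probabilities that sum to 1.
    top = max(log_scores.values())
    weights = {cat: math.exp(s - top) for cat, s in log_scores.items()}
    norm = sum(weights.values())
    probs = {cat: w / norm for cat, w in weights.items()}
    return max(probs, key=probs.get), probs


model = train(CORPORA)
best, probs = classify("We played football and scored a goal", model)
print(best, probs)
```

Note that this trains one multi-class model over all categories rather than a separate classifier per category, so the "relevancy %" for each category comes out of a single pass.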

I'm generally looking for advice on the viability of what I am trying to accomplish, but the crux of my question is: a) is my approach correct? b) are there any libraries / resources that may be of assistance?

ChrisP · Accepted Answer

You seem to know a lot of the right terminology. Try searching for "document classification." That is the general problem you are trying to solve. A classifier trained on a representative corpus will be more accurate than you think.

(a) There is no one correct approach. The approach you outline will work, however.
(b) Scikit Learn is a wonderful library for this sort of work.
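
As a rough illustration of (b), a document classifier in scikit-learn can be only a few lines. This is a minimal sketch, assuming a tiny made-up training set (train_texts and train_labels are invented here) and a hypothetical helper named categorise; a real model would need a much larger labelled corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data, for illustration only.
train_texts = [
    "The football team won the match with a late goal",
    "She enjoyed playing tennis and football at the weekend",
    "The striker scored twice in the cup final",
    "The new console game has a great multiplayer mode",
    "He finished the last level of the video game",
    "The studio released a patch for the online shooter",
    "The government announced a new tax policy",
    "Voters go to the polls in the general election",
    "The minister defended the bill in parliament",
]
train_labels = ["Sport"] * 3 + ["Video Games"] * 3 + ["Politics"] * 3

# Bag-of-words counts fed into a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_texts, train_labels)


def categorise(text):
    """Return {category: probability} for a single piece of text."""
    probs = model.predict_proba([text])[0]
    return {str(c): float(p) for c, p in zip(model.classes_, probs)}


scores = categorise("Enjoyed playing a game of football earlier.")
print(max(scores, key=scores.get), scores)
```

Swapping CountVectorizer for TfidfVectorizer, or MultinomialNB for another classifier, is a one-line change, which makes it easy to compare approaches on your own data.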

There is plenty of other information, including tutorials, online about this topic:

This Naive Bayesian Classifier on GitHub probably already does most of what you want to accomplish.
This NLTK tutorial explains the topic in depth.
If you really want to get into it, I am sure a Google Scholar search will turn up thousands of academic articles in computer science and linguistics about exactly this topic.

jonnydedwards · Answer

You should check out Latent Dirichlet Allocation; it will give you categories without labels. As always, Ed Chen's blog is a good start.

Tags: python, machine-learning, classification, nltk

Hanpan