
Python NLTK - Identifying text interest / topic

I am trying to build a model that identifies the interest category / topic of supplied text. For example:

"Enjoyed playing a game of football earlier."

would resolve to a top level category like:

"Sport".

I'm not sure what the correct terminology is for what I am trying to achieve here so Google hasn't turned up any libraries that may be able to help. With that in mind, my approach would be something like:

  1. Extract features from text. Use tagging to classify each feature / identify names / places. Would probably use NLTK for this, or Topia.
  2. Run a Naive Bayes classifier for each interest category ("Sport", "Video Games", "Politics" etc.) and get a relevancy % for each category.
  3. Identify which category has the highest relevancy % and categorise the text accordingly.

My approach would likely involve having individual corpora for each interest category and I'm sure the accuracy would be fairly miserable - I understand it will never be that accurate.
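To make the idea concrete, here is a minimal sketch of the steps above in plain Python. It is not a definitive implementation: the per-category corpora are tiny made-up examples, the names `CORPORA`, `tokenize`, `train`, `score` and `classify` are hypothetical, and it uses a single multinomial Naive Bayes model with add-one smoothing over all categories rather than one separate classifier per category.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpora: a few example sentences per interest category.
CORPORA = {
    "Sport": [
        "great game of football at the weekend",
        "the team won the tennis match",
        "playing football and running every morning",
    ],
    "Video Games": [
        "finished the new console game last night",
        "the multiplayer level was hard to beat",
        "bought a new controller for the console",
    ],
    "Politics": [
        "the election result surprised the government",
        "parliament voted on the new policy",
        "the minister gave a speech about the election",
    ],
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(corpora):
    """Count words per category and collect the shared vocabulary."""
    word_counts = {cat: Counter() for cat in corpora}
    doc_counts = {}
    vocab = set()
    for cat, docs in corpora.items():
        doc_counts[cat] = len(docs)
        for doc in docs:
            tokens = tokenize(doc)
            word_counts[cat].update(tokens)
            vocab.update(tokens)
    return word_counts, doc_counts, vocab


def score(text, model):
    """Return a normalised relevancy score (0..1) for every category."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    # Words never seen in training carry no information, so skip them.
    tokens = [t for t in tokenize(text) if t in vocab]
    log_scores = {}
    for cat, counts in word_counts.items():
        total_words = sum(counts.values())
        log_p = math.log(doc_counts[cat] / total_docs)  # category prior
        for t in tokens:
            # Add-one (Laplace) smoothing avoids zero probabilities.
            log_p += math.log((counts[t] + 1) / (total_words + len(vocab)))
        log_scores[cat] = log_p
    # Convert log scores to probabilities that sum to 1.
    highest = max(log_scores.values())
    exp_scores = {c: math.exp(s - highest) for c, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {c: v / norm for c, v in exp_scores.items()}


def classify(text, model):
    """Pick the category with the highest relevancy score."""
    scores = score(text, model)
    return max(scores, key=scores.get), scores


model = train(CORPORA)
label, scores = classify("Enjoyed playing a game of football earlier.", model)
print(label, scores)
```

With real data, each category would need far more than three sentences, and the tokenizer could be swapped for the tagging step described in point 1.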

Generally looking for some advice on the viability of what I am trying to accomplish, but the crux of my question: a) is my approach correct? b) are there any libraries / resources that may be of assistance?

Hanpan asked Dec 26 '22 00:12


2 Answers

You seem to know a lot of the right terminology. Try searching for "document classification." That is the general problem you are trying to solve. A classifier trained on a representative corpus will be more accurate than you think.

  • (a) There is no one correct approach. The approach you outline will work, however.
  • (b) scikit-learn is a wonderful library for this sort of work.
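As a rough illustration of what this could look like with scikit-learn, here is a minimal sketch using a bag-of-words `CountVectorizer` feeding a `MultinomialNB` classifier. The training sentences and labels are made up for the example; a real classifier would need a much larger, representative corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data: (text, category) pairs.
train_texts = [
    "great game of football at the weekend",
    "the team won the tennis match",
    "playing football and running every morning",
    "finished the new console game last night",
    "the multiplayer level was hard to beat",
    "bought a new controller for the console",
    "the election result surprised the government",
    "parliament voted on the new policy",
    "the minister gave a speech about the election",
]
train_labels = ["Sport"] * 3 + ["Video Games"] * 3 + ["Politics"] * 3

# Word counts as features, multinomial Naive Bayes as the classifier.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)

text = "Enjoyed playing a game of football earlier."
predicted = clf.predict([text])[0]

# Per-category probabilities, i.e. the "relevancy %" for each category.
probabilities = dict(zip(clf.classes_, clf.predict_proba([text])[0]))

print(predicted)
print(probabilities)
```

The pipeline keeps vectorising and classifying in one object, so new text can be passed straight to `predict` or `predict_proba`.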

There is plenty of other information, including tutorials, online about this topic:

  • This Naive Bayesian Classifier on GitHub probably already does most of what you want to accomplish.
  • This NLTK tutorial explains the topic in depth.
  • If you really want to get into it, I am sure a Google Scholar search will turn up thousands of academic articles in computer science and linguistics about exactly this topic.
ChrisP answered Dec 28 '22 23:12


You should check out Latent Dirichlet Allocation (LDA). It will give you categories without needing labels. As always, Ed Chen's blog is a good start.

jonnydedwards answered Dec 28 '22 23:12