
Text Classification into Categories

I am working on a text classification problem: I am trying to classify a collection of words into categories. Yes, there are plenty of libraries available for classification, so please don't answer if you are only going to suggest using them.

Let me explain what I want to implement, using an example.

List of Words:

  1. java
  2. programming
  3. language
  4. c-sharp

List of Categories:

  1. java
  2. c-sharp

Here we train the set as follows:

  1. java maps to category 1 (java)
  2. programming maps to category 1 (java)
  3. programming maps to category 2 (c-sharp)
  4. language maps to category 1 (java)
  5. language maps to category 2 (c-sharp)
  6. c-sharp maps to category 2 (c-sharp)

Now we have a phrase, "The best java programming book". From this phrase, the following words match our list of words:

  1. java
  2. programming

"programming" has two mapped categories, "java" and "c-sharp", so it is a common word.

"java" is mapped to category "java" only.

So the matching category for the phrase is "java".
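
The scheme described above can be sketched roughly as follows, in Python with only the standard library. The names `TRAINING`, `build_index`, and `classify` are made up for illustration. The tokenization and the rule for breaking ties are assumptions, since the description does not specify them.

```python
import re

# Training pairs from the example above: (word, category).
TRAINING = [
    ("java", "java"),
    ("programming", "java"),
    ("programming", "c-sharp"),
    ("language", "java"),
    ("language", "c-sharp"),
    ("c-sharp", "c-sharp"),
]

def build_index(pairs):
    """Map each word to the set of categories it was trained with."""
    index = {}
    for word, category in pairs:
        index.setdefault(word, set()).add(category)
    return index

def classify(phrase, index):
    """Count votes from words mapped to exactly one category.

    Words mapped to several categories are treated as common and ignored.
    Returns None when no distinctive word is found.
    """
    votes = {}
    # Assumed tokenization: lowercase runs of letters, digits, '#', '+', '-'.
    for token in re.findall(r"[a-z0-9#+-]+", phrase.lower()):
        categories = index.get(token, set())
        if len(categories) == 1:
            (category,) = categories
            votes[category] = votes.get(category, 0) + 1
    if not votes:
        return None
    # Assumed tie-break: the category that received a vote first wins.
    return max(votes, key=votes.get)

index = build_index(TRAINING)
result = classify("The best java programming book", index)
```

One limitation this makes visible: a phrase that contains only common words, such as "a programming language", gets no category at all.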

This is what came to my mind. Is this solution fine? Can it be implemented? What are your suggestions? Is there anything I am missing, any flaws, etc.?

Ajay Jadeja asked Nov 15 '11 12:11


1 Answer

Of course this can be implemented. If you train a Naive Bayes classifier or a linear SVM on the right dataset (titles of Java and C# programming books, I guess), it should learn to associate the term "Java" with Java, "C#" and ".NET" with C#, and "programming" with both. That is, a Naive Bayes classifier would likely learn roughly equal probabilities of Java and C# for common terms like "programming" if the dataset is evenly divided between the two categories.
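
To make this concrete without a library, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing, using only the Python standard library. The training titles in `TITLES` are hypothetical toy data, and the function names `train`, `log_scores`, and `predict` are chosen only for illustration. Skipping words never seen in training is one possible design choice, not the only one.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training titles; a real dataset would be far larger.
TITLES = [
    ("java programming language", "java"),
    ("java programming", "java"),
    ("c-sharp programming language", "c-sharp"),
    ("c-sharp programming", "c-sharp"),
]

def train(examples):
    """Collect per-category word counts, document counts, and the vocabulary."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab

def log_scores(text, model):
    """Return log P(category) + sum of log P(word | category) per category."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # assumption: ignore words never seen in training
            # Add-one smoothing keeps unseen word/category pairs above zero.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return scores

def predict(text, model):
    """Pick the category with the highest log score."""
    scores = log_scores(text, model)
    return max(scores, key=scores.get)

model = train(TITLES)
prediction = predict("The best java programming book", model)
```

With this evenly divided toy data, "programming" on its own scores identically for both categories, so the decision for the example phrase rests on "java".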

Fred Foo answered Sep 30 '22 17:09