Text classification/categorization algorithm [closed]

My objective is to [semi]automatically assign texts to different categories. There's a set of user-defined categories and a set of texts for each category. The ideal algorithm should be able to learn from a human-defined classification and then classify new texts automatically. Can anybody suggest such an algorithm and perhaps a .NET library that implements it?

Max asked Aug 27 '10 13:08




1 Answer

Doing this is not trivial. The obvious first step is to build a dictionary that maps certain keywords to categories, so that finding a keyword in a text suggests a certain category.
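As a rough illustration of the keyword dictionary idea, here is a minimal Python sketch. The `KEYWORDS` entries and the `suggest_categories` name are made up for this example; a real dictionary would come from the user-defined categories.

```python
import re
from collections import Counter

# Hypothetical keyword-to-category dictionary; the entries are illustrative only.
KEYWORDS = {
    "guitar": "music",
    "concert": "music",
    "compiler": "programming",
    "database": "programming",
}

def suggest_categories(text):
    """Return the categories whose keywords occur in the text, most hits first."""
    words = re.findall(r"[a-z]+", text.lower())
    hits = Counter(KEYWORDS[w] for w in words if w in KEYWORDS)
    return [category for category, _ in hits.most_common()]

print(suggest_categories("The concert featured a guitar solo and a database joke"))
```

This only matches exact word forms and ignores context.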

Yet, in natural language text, the keywords would usually not appear in their stem form. You would need some morphology tools to find the stem form and look it up in the dictionary.
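To show what reducing words to a stem form before the lookup might look like, here is a deliberately crude Python sketch. The suffix list, the `crude_stem` function, and the `STEM_TO_CATEGORY` dictionary are assumptions made for illustration. Real morphology tools (for example a Porter-style stemmer or a lemmatizer) handle far more cases.

```python
import re

# Illustrative English suffixes, tried in this order; not a real stemming algorithm.
SUFFIXES = ("ing", "es", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical dictionary keyed by stem form.
STEM_TO_CATEGORY = {"guitar": "music", "compiler": "programming"}

def categories_for(text):
    """Stem every word in the text and collect the categories of matching stems."""
    words = re.findall(r"[a-z]+", text.lower())
    stems = (crude_stem(w) for w in words)
    return {STEM_TO_CATEGORY[s] for s in stems if s in STEM_TO_CATEGORY}

print(categories_for("Two guitars and three compilers"))
```

A stripper this simple produces wrong stems for many words ("databases" becomes "databas"), which is one reason proper morphology tools are needed.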

But then somebody could write something like: "This article is not about ...". This would introduce the need for syntactic and semantic analysis.

And then you would find that certain keywords can be used in several categories: "band" could be used in music, technology, or even handicrafts. You would therefore need an ontology and statistical or other methods to weigh the probability of each candidate category when the choice is not definite.
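One common statistical method for weighing category probabilities is a multinomial naive Bayes classifier, which learns word frequencies from texts a human has already categorized. The sketch below is only one possible approach, not the only one, and the class name, training sentences, and category labels are all invented for illustration. It uses add-one (Laplace) smoothing so that unseen words do not zero out a category.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayesTextClassifier:
    def __init__(self):
        self.doc_counts = Counter()              # training texts per category
        self.word_counts = defaultdict(Counter)  # word frequencies per category
        self.vocabulary = set()

    def train(self, text, category):
        self.doc_counts[category] += 1
        for word in tokenize(text):
            self.word_counts[category][word] += 1
            self.vocabulary.add(word)

    def scores(self, text):
        """Log prior plus smoothed log likelihood of the text, per category."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        words = tokenize(text)
        result = {}
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)
            for word in words:
                # Add-one smoothing keeps unseen words from giving log(0).
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            result[category] = score
        return result

    def classify(self, text):
        """Return the highest-scoring category; requires at least one trained text."""
        scores = self.scores(text)
        return max(scores, key=scores.get)

# Made-up training data in which "band" appears in both categories.
clf = NaiveBayesTextClassifier()
clf.train("the band played a loud concert", "music")
clf.train("the guitar player joined the band", "music")
clf.train("replace the rubber band on the drive belt", "technology")
clf.train("the machine needs a new belt and motor", "technology")
print(clf.classify("guitar concert"))
print(clf.classify("rubber belt motor"))
```

Because "band" occurs in both categories, it contributes little on its own; the surrounding words decide the outcome. With realistic data you would want many more training texts per category.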

Some of the keywords might not even be easy to fit into an ontology: is a mathematician closer to a programmer or a gardener? But you said in your question that the categories are defined by humans, so they could also help build the ontology.

Have a look at computational linguistics here and on Wikipedia for further study.

Now, the narrower the field your texts come from, the more structured they are, and the smaller the vocabulary, the easier the problem becomes.

Again, some keywords for further study: morphology, syntax analysis, semantics, ontology, computational linguistics, indexing, keywording.

Ralph M. Rickenbach answered Oct 09 '22 06:10