Text preprocessing for text classification using fastText

What text preprocessing produces the best results for supervised text classification using fastText?

The official documentation shows only a simple preprocessing step consisting of lower-casing and separating punctuation. Would classic preprocessing such as lemmatization, stopword removal, or masking numbers help?
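For reference, the preprocessing mentioned in the documentation can be sketched in a few lines. This is an illustrative plain-Python/regex version (an assumption about the implementation, not the tutorial's exact code):

```python
import re

def simple_preprocess(text):
    # Lower-case the text and put spaces around common punctuation,
    # similar in spirit to the normalization shown in the fastText tutorial.
    text = text.lower()
    text = re.sub(r"([.!?,'/()])", r" \1 ", text)
    # Collapse any repeated whitespace introduced by the substitution.
    return " ".join(text.split())

print(simple_preprocess("Which baking dish is best?"))
# prints: which baking dish is best ?
```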

Gino asked Apr 12 '26 08:04

1 Answer

There is no general answer. It very much depends on the task you are trying to solve, how much data you have, and what language the text is in. Usually, if you have enough data, the simple tokenization you described is all you need.

Lemmatization: Because fastText computes word embeddings from embeddings of character n-grams, it should cover most morphology in most (at least European) languages, unless your data is very small. In that case, lemmatization might help.
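The morphology point can be illustrated by extracting the character n-grams fastText uses (it wraps each word in `<`/`>` boundary markers and takes n-grams of length 3 to 6 by default): inflected forms of the same stem share many subwords, so their embeddings are already close. A minimal sketch of that extraction:

```python
def char_ngrams(word, n_min=3, n_max=6):
    # fastText wraps each word in boundary markers before extracting
    # character n-grams (default lengths 3..6).
    w = "<" + word + ">"
    return {w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)}

# Inflected forms of "run" share a sizeable subset of n-grams,
# so fastText ties their representations together.
shared = char_ngrams("running") & char_ngrams("runner")
print(sorted(shared))
```

This overlap is why explicit lemmatization is usually redundant for fastText on morphologically regular languages.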

Removing stopwords: It depends on the task. If the task is based on grammar/syntax, you should definitely not remove the stopwords, because they form the grammar. If the task depends more on lexical semantics, removing stopwords should help. If your training data is large enough, the model should learn uninformative stopword embeddings that do not influence the classification anyway.
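If you do decide to filter stopwords, the step is a simple token filter. The list below is a tiny illustrative one; a real pipeline would use a fuller list (e.g. from NLTK or spaCy), which is an assumption on my part, not something the answer prescribes:

```python
# Tiny illustrative English stopword list (assumption; use a real
# resource such as NLTK's stopword corpus in practice).
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in"}

def remove_stopwords(tokens):
    # Drop any token found in the stopword set, keep the rest in order.
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords("the cat is on the mat".split()))
# prints: ['cat', 'on', 'mat']
```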

Masking numbers: If you are sure that your task does not benefit from knowing the numbers, you can mask them out. Usually the problem is that numbers do not appear frequently enough in the training data to learn appropriate weights/embeddings for them. This is less of an issue in fastText, which composes their embeddings from the embeddings of their substrings; they will most likely end up uninformative and not influence the classification.
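Masking itself is a one-line regex substitution. A minimal sketch, assuming a generic `<num>` placeholder token (the token name is my choice, not a fastText convention):

```python
import re

def mask_numbers(text, token="<num>"):
    # Replace runs of digits (optionally with a decimal part)
    # by a single placeholder token.
    return re.sub(r"\d+(?:\.\d+)?", token, text)

print(mask_numbers("ordered 2 items for 19.99 dollars"))
# prints: ordered <num> items for <num> dollars
```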

Jindřich answered Apr 14 '26 21:04