 

Comparison between fastText and LDA

Hi, last week Facebook announced fastText, which is a way to categorize words into buckets. Latent Dirichlet Allocation (LDA) is another way to do topic modeling. Has anyone compared the pros and cons of these two?

I haven't tried fastText, but here are a few pros and cons of LDA based on my experience.

Pros

  1. Iterative model, with support for Apache Spark.

  2. Takes in a corpus of documents and does topic modeling.

  3. Not only finds out what a document is talking about, but also finds related documents.

  4. The Apache Spark community is continuously contributing to it: it first worked in MLlib and now in the ML library (see the sketch after this list).
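
For illustration, here is a minimal sketch of such a pipeline on Spark's ML library (PySpark); the file name docs.txt, the number of topics and the other parameter values are assumptions for the example, not part of the original workflow.

# Minimal LDA pipeline sketch with PySpark's ml library (illustrative values)
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("lda-example").getOrCreate()

# One document per line in docs.txt (assumed layout)
df = spark.read.text("docs.txt").withColumnRenamed("value", "text")

tokens = Tokenizer(inputCol="text", outputCol="words").transform(df)
filtered = StopWordsRemover(inputCol="words", outputCol="filtered").transform(tokens)
vectorized = CountVectorizer(inputCol="filtered", outputCol="features",
                             vocabSize=10000).fit(filtered).transform(filtered)

lda = LDA(k=10, maxIter=20)                     # 10 topics, illustrative settings
model = lda.fit(vectorized)

model.describeTopics(5).show()                  # top 5 term indices per topic
docs_with_topics = model.transform(vectorized)  # per-document topic distribution,
                                                # usable for finding related documents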

Cons

  1. Stop words need to be defined well, and they have to be related to the context of the documents. For example, "document" is a word with a high frequency of appearance and may top the chart of recommended topics, but it may or may not be relevant, so we need to add it to the stop-word list (a sketch of this follows the example topic below).

  2. Sometimes the classification is irrelevant. In the example below it is hard to infer what this bucket is talking about.

Topic:

  1. Term:discipline

  2. Term:disciplines

  3. Term:notestable

  4. Term:winning

  5. Term:pathways

  6. Term:chapterclosingtable

  7. Term:metaprograms

  8. Term:breakthroughs

  9. Term:distinctions

  10. Term:rescue
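
As a small illustration of the stop-word point above (con 1), Spark's StopWordsRemover accepts a custom list, so corpus-specific terms can be appended to the default English stop words; the extra words below are just examples, not a recommended list.

from pyspark.ml.feature import StopWordsRemover

# Extend the default English stop words with corpus-specific terms (examples only)
custom_stop_words = StopWordsRemover.loadDefaultStopWords("english") + ["document", "chapter"]
remover = StopWordsRemover(inputCol="words", outputCol="filtered",
                           stopWords=custom_stop_words)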

If anyone has done research on fastText, can you please share what you learned?

asked Aug 22 '16 by Nabs



1 Answer

fastText offers more than topic modelling: it is a tool for generating word embeddings and for text classification using a shallow neural network. The authors state that its performance is comparable with much more complex "deep learning" algorithms, while the training time is significantly lower.

Pros:

=> It is extremely easy to train your own fastText model:

$ ./fasttext skipgram -input data.txt -output model

Just provide your input and output files and the architecture to be used, and that's all. If you wish to customize your model a bit, fastText provides the option to change the hyper-parameters as well.
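
For example, assuming the official Python bindings for fastText (pip install fasttext), the same skip-gram training with a few hyper-parameters overridden could look like this; the values shown are only illustrative, not recommendations:

import fasttext

# Skip-gram model with some hyper-parameters overridden (illustrative values)
model = fasttext.train_unsupervised(
    "data.txt",
    model="skipgram",
    dim=100,    # embedding dimension
    epoch=10,   # passes over the training data
    lr=0.05,    # learning rate
    minn=3,     # smallest character n-gram
    maxn=6,     # largest character n-gram
)
model.save_model("model.bin")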

=> While generating word vectors, fastText takes into account sub-parts of words called character n-grams, so that similar words have similar vectors even if they happen to occur in different contexts. For example, "supervised", "supervise" and "supervisor" are all assigned similar vectors.

=> A previously trained model can be used to compute word vectors for out-of-vocabulary words. This one is my favorite. Even if the vocabulary of your corpus is finite, you can get a vector for almost any word that exists in the world.
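
A short sketch of both of the above points, again assuming the Python bindings; the cosine helper and the made-up word are just for illustration:

import numpy as np
import fasttext

model = fasttext.load_model("model.bin")

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Morphologically related words end up close because they share character n-grams
print(cosine(model.get_word_vector("supervised"), model.get_word_vector("supervisor")))

# A vector is still produced for a word never seen in training (built from its n-grams)
oov_vector = model.get_word_vector("supervisedly")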

=> fastText also provides the option to generate vectors for paragraphs or sentences. Similar documents can then be found by comparing their vectors.
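
A hedged sketch of comparing two documents through their sentence vectors, with the same assumptions as above:

import numpy as np
import fasttext

model = fasttext.load_model("model.bin")

d1 = model.get_sentence_vector("fastText trains word embeddings very quickly")
d2 = model.get_sentence_vector("word vectors can be trained fast with fastText")

# Cosine similarity between document vectors; higher means more related
sim = float(np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2)))
print(sim)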

=> The option to predict likely labels for a piece of text has been included too.
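
For the label-prediction option, a minimal supervised sketch (file name and labels are assumptions; fastText expects training lines prefixed with __label__):

import fasttext

# train.txt lines look like: "__label__sports the match went to penalties"
classifier = fasttext.train_supervised("train.txt", epoch=25, lr=0.5)

# Top-2 most likely labels with their probabilities
print(classifier.predict("the striker scored twice in the final", k=2))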

=> Pre-trained word vectors for about 90 languages trained on Wikipedia are available in the official repo.

Cons:

=> As fastText is command-line based, I struggled while incorporating it into my project; this might not be an issue for others, though.

=> No in-built method to find similar words or paragraphs (a possible workaround is sketched below).
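
If you do need similar-word lookup, one workaround is to compute it yourself over the model's vocabulary; a rough sketch assuming the Python bindings (for large vocabularies an approximate nearest-neighbour index would be more practical):

import numpy as np
import fasttext

model = fasttext.load_model("model.bin")
words = model.get_words()
vectors = np.array([model.get_word_vector(w) for w in words])
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

def most_similar(word, topn=10):
    q = model.get_word_vector(word)
    q /= np.linalg.norm(q)
    scores = vectors @ q
    order = np.argsort(-scores)
    return [(words[i], float(scores[i])) for i in order if words[i] != word][:topn]

print(most_similar("discipline"))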

For those who wish to read more, here are the links to the official research papers:

1) https://arxiv.org/pdf/1607.04606.pdf

2) https://arxiv.org/pdf/1607.01759.pdf

And link to the official repo:

https://github.com/facebookresearch/fastText

answered Oct 19 '22 by Aanchal1103