
What's the disadvantage of LDA for short texts?


I am trying to understand why Latent Dirichlet Allocation (LDA) performs poorly in short-text environments like Twitter. I've read the paper 'A biterm topic model for short text', but I still do not understand what "the sparsity of word co-occurrences" means.

From my point of view, the generative part of LDA is reasonable for any kind of text; what causes poor results on short texts is the sampling procedure. My guess is that LDA samples a topic for a word based on two things: (1) the topics of the other words in the same document, and (2) the topic assignments of other occurrences of this word. Since part (1) of a short text cannot reflect the document's true topic distribution, each word ends up with a poor topic assignment.

If you have any ideas about this question, please feel free to post them and help me understand.

asked Apr 22 '15 by Shuguang Zhu



2 Answers

Probabilistic models such as LDA exploit statistical inference to discover latent patterns of data. In short, they infer model parameters from observations. For instance, there is a black box containing many balls with different colors. You draw some balls out from the box and then infer the distributions of colors of the balls. That is a typical process of statistical inference. The accuracy of statistical inference depends on the number of your observations.
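The ball-drawing analogy above can be sketched in a few lines. This is a toy simulation with made-up numbers, not LDA itself: we estimate a hidden color distribution from draws and see that more observations give a better estimate.

```python
# Toy sketch of statistical inference: estimate a hidden color
# distribution from draws out of the "black box".
import random
from collections import Counter

TRUE_DIST = {"red": 0.5, "green": 0.3, "blue": 0.2}  # hidden box contents

def estimate(n_draws, seed=0):
    """Draw n_draws balls and return the empirical color frequencies."""
    rng = random.Random(seed)
    colors = list(TRUE_DIST)
    weights = [TRUE_DIST[c] for c in colors]
    draws = rng.choices(colors, weights=weights, k=n_draws)
    counts = Counter(draws)
    return {c: counts[c] / n_draws for c in colors}

few = estimate(10)     # like a short document: a noisy estimate
many = estimate(1000)  # like a long document: close to the true distribution
print(few)
print(many)
```

With only 10 draws the empirical frequencies can be far from the true ones; with 1000 they settle close to 0.5/0.3/0.2. The same logic applies to inferring a document's topic mixture from its words.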

Now consider the problem of LDA over short texts. LDA models a document as a mixture of topics, and each word is then drawn from one of its topics. You can imagine a black box containing tons of words generated from such a model. Now you see a short document with only a few words. The observations are obviously too few to infer the parameters. This is the data-sparsity problem we mentioned.
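One way to see the sparsity concretely is to count within-document word co-occurrences: the number of word pairs grows quadratically with document length, so a tweet yields very few co-occurrence observations. The texts below are hypothetical.

```python
# Illustration of "sparsity of word co-occurrences": a document with n
# distinct words contributes C(n, 2) co-occurring word pairs.
from itertools import combinations

def cooccurrence_pairs(doc):
    """Return the set of unordered word pairs occurring in one document."""
    words = sorted(set(doc.lower().split()))
    return set(combinations(words, 2))

tweet = "lda struggles on short text"               # 5 distinct words
article = " ".join(f"word{i}" for i in range(100))  # 100 distinct words

print(len(cooccurrence_pairs(tweet)))    # C(5, 2) = 10 pairs
print(len(cooccurrence_pairs(article)))  # C(100, 2) = 4950 pairs
```

A 100-word article provides almost 500 times as many co-occurrence observations as a 5-word tweet, which is the gap the biterm model tries to work around.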

Actually, besides the lack of observations, the problem also comes from the over-complexity of the model: a more flexible model generally requires more observations to infer. The Biterm Topic Model tries to make topic inference easier by reducing the model complexity. First, it models the whole corpus as a mixture of topics, since inferring the topic mixture over the corpus is easier than inferring it over a short document. Second, it supposes each biterm is drawn from a topic; inferring the topic of a biterm is also easier than inferring the topic of a single word in LDA, since more context is available.
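The biterm idea can be sketched as follows, with made-up documents: a biterm is an unordered word pair co-occurring in a short text, and BTM pools the biterms from the whole corpus rather than modeling each short document separately.

```python
# Sketch of biterm extraction: each short document contributes its
# unordered word pairs, and BTM pools them across the whole corpus.
from itertools import combinations

def biterms(doc):
    """Return the list of unordered word pairs (biterms) in one document."""
    words = doc.lower().split()
    return [tuple(sorted(pair)) for pair in combinations(words, 2)]

corpus = [
    "apple releases new phone",
    "new phone camera review",
    "stock market hits record",
]

# Per-document evidence is tiny, but the pooled biterm set is richer.
pooled = [b for doc in corpus for b in biterms(doc)]
print(len(biterms(corpus[0])), len(pooled))  # 6 per document vs 18 pooled
```

Topic inference then runs over the pooled biterms, so no single short document has to support a topic-mixture estimate on its own.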

I hope the explanation makes sense to you. Thanks for mentioning our paper.

answered Oct 18 '22 by Xiaohui Yan


Doing a bit of digging, Hong and Davison (2010) turned up as a good example of these models not working well for classifying tweets. Unfortunately, they don't really give much insight into why.

I suspect there are two reasons LDA doesn't work well for short documents.

First of all, when working with smaller documents, the extra topic layer doesn't add anything to the classification, and what doesn't help probably hurts. If you have really short documents, like tweets, it's really hard to break documents into topics. There isn't much room for anything but one topic in a tweet, after all. Since the topic layer can't contribute much to the classification, it makes room for error to arise in the system.

Second, linguistically, Twitter users prefer to strip off "unnecessary fluff" when tweeting. When working with full documents, there are features--words, word collocations, etc.--that are probably specific, common, and often repeated within a genre. When tweeting, though, these common elements get dropped first, because what's interesting, new, and more perplexing is what remains when the fluff is removed.

For example, let's look at my own tweets because I believe in shameless self-promotion:

Progressbar.py is a fun little package, though I don't get a chance to use it too often. it even does ETAs for you https://pypi.python.org/pypi/progressbar …

From a capitalist perspective, the social sciences exist so idiot engineers don't waste money on building **** no one needs.

Abstract enough to be reusable, specific enough to be useful.

The first is about Python. If you're parsing the URLs, you'll get that--and the .py would give it to you too. However, in a more expressive medium, I'd probably have put the word "Python" in somewhere. The second is programming-related as well, but a bit more on the business end. It doesn't once mention anything specific to programming, though. The last one is programming-related too, but ties more into the art of programming, expressing a sort of double bind programmers face while coding. Feature-wise, it is as difficult as the second.
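The URL point above is easy to check: with a naive whitespace tokenizer, the word "python" never surfaces in the first tweet, but splitting on punctuation (so URLs break apart) recovers it. Both tokenizers here are simplistic sketches, not a recommended pipeline.

```python
# Compare two crude tokenizers on the first example tweet: only the one
# that splits URLs on punctuation exposes the word "python" as a feature.
import re

tweet = ("Progressbar.py is a fun little package, though I don't get a "
         "chance to use it too often. it even does ETAs for you "
         "https://pypi.python.org/pypi/progressbar")

whitespace_tokens = tweet.lower().split()
url_aware_tokens = re.split(r"[^a-z0-9]+", tweet.lower())

print("python" in whitespace_tokens)  # False: hidden inside the URL token
print("python" in url_aware_tokens)   # True: the URL gives it away
```

This is exactly the kind of feature a classifier misses when the author leaves the obvious genre markers implicit.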

In both of those last two examples, had I not been writing a microblog post, these would have immediately been followed up with examples that would have been very useful to a classifier, or themselves included more data. Twitter doesn't have room for that kind of stuff, though, and the content that would typify the genre a tweet belongs to is stripped out.

So, in the end, we have two problems. The length is a problem for LDA, because the topics add an extra, unnecessary degree of freedom, and the tweets are a problem for any classifier, because features typically useful in classification get selectively removed by the authors.

answered Oct 18 '22 by Dan