
What's the disadvantage of LDA for short texts?


I am trying to understand why Latent Dirichlet Allocation (LDA) performs poorly in short-text environments like Twitter. I've read the paper 'A biterm topic model for short text', but I still do not understand what "the sparsity of word co-occurrences" means.

From my point of view, the generative part of LDA is reasonable for any kind of text; what causes poor results on short texts is the sampling procedure. My guess is that LDA samples a topic for a word based on two things: (1) the topics of the other words in the same document, and (2) the topic assignments of other occurrences of this word. Since part (1) of a short text cannot reflect the document's true topic distribution, each word ends up with a poor topic assignment.

If you have any ideas about this question, please feel free to post them and help me understand.

asked Apr 22 '15 by Shuguang Zhu



2 Answers

Probabilistic models such as LDA exploit statistical inference to discover latent patterns of data. In short, they infer model parameters from observations. For instance, there is a black box containing many balls with different colors. You draw some balls out from the box and then infer the distributions of colors of the balls. That is a typical process of statistical inference. The accuracy of statistical inference depends on the number of your observations.
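The ball-drawing analogy above can be sketched in a few lines. This is a toy simulation with made-up numbers, not LDA itself: we estimate a hidden color distribution from draws and see that more observations give a better estimate.

```python
# Toy sketch of statistical inference: estimate a hidden color
# distribution from draws out of the "black box".
import random
from collections import Counter

TRUE_DIST = {"red": 0.5, "green": 0.3, "blue": 0.2}  # hidden box contents

def estimate(n_draws, seed=0):
    """Draw n_draws balls and return the empirical color frequencies."""
    rng = random.Random(seed)
    colors = list(TRUE_DIST)
    weights = [TRUE_DIST[c] for c in colors]
    draws = rng.choices(colors, weights=weights, k=n_draws)
    counts = Counter(draws)
    return {c: counts[c] / n_draws for c in colors}

few = estimate(10)     # like a short document: a noisy estimate
many = estimate(1000)  # like a long document: close to the true distribution
print(few)
print(many)
```

With only 10 draws the empirical frequencies can be far from the true ones; with 1000 they settle close to 0.5/0.3/0.2. The same logic applies to inferring a document's topic mixture from its words.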

Now consider the problem of LDA over short texts. LDA models a document as a mixture of topics, and each word is then drawn from one of its topics. You can imagine a black box containing tons of words generated from such a model. Now you see a short document with only a few words. The observations are obviously too few to infer the parameters. This is the data-sparsity problem we mentioned.
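One way to see the sparsity concretely is to count within-document word co-occurrences: the number of word pairs grows quadratically with document length, so a tweet yields very few co-occurrence observations. The texts below are hypothetical.

```python
# Illustration of "sparsity of word co-occurrences": a document with n
# distinct words contributes C(n, 2) co-occurring word pairs.
from itertools import combinations

def cooccurrence_pairs(doc):
    """Return the set of unordered word pairs occurring in one document."""
    words = sorted(set(doc.lower().split()))
    return set(combinations(words, 2))

tweet = "lda struggles on short text"               # 5 distinct words
article = " ".join(f"word{i}" for i in range(100))  # 100 distinct words

print(len(cooccurrence_pairs(tweet)))    # C(5, 2) = 10 pairs
print(len(cooccurrence_pairs(article)))  # C(100, 2) = 4950 pairs
```

A 100-word article provides almost 500 times as many co-occurrence observations as a 5-word tweet, which is the gap the biterm model tries to work around.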

Actually, besides the lack of observations, the problem also comes from the over-complexity of the model: a more flexible model generally requires more observations to infer. The Biterm Topic Model tries to make topic inference easier by reducing the model complexity. First, it models the whole corpus as a mixture of topics, since inferring the topic mixture over the corpus is easier than inferring it over a short document. Second, it supposes each biterm is drawn from a topic; inferring the topic of a biterm is also easier than inferring the topic of a single word in LDA, since more context is available.
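The biterm idea can be sketched as follows, with made-up documents: a biterm is an unordered word pair co-occurring in a short text, and BTM pools the biterms from the whole corpus rather than modeling each short document separately.

```python
# Sketch of biterm extraction: each short document contributes its
# unordered word pairs, and BTM pools them across the whole corpus.
from itertools import combinations

def biterms(doc):
    """Return the list of unordered word pairs (biterms) in one document."""
    words = doc.lower().split()
    return [tuple(sorted(pair)) for pair in combinations(words, 2)]

corpus = [
    "apple releases new phone",
    "new phone camera review",
    "stock market hits record",
]

# Per-document evidence is tiny, but the pooled biterm set is richer.
pooled = [b for doc in corpus for b in biterms(doc)]
print(len(biterms(corpus[0])), len(pooled))  # 6 per document vs 18 pooled
```

Topic inference then runs over the pooled biterms, so no single short document has to support a topic-mixture estimate on its own.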

I hope the explanation makes sense to you. Thanks for mentioning our paper.

answered Oct 18 '22 by Xiaohui Yan


Doing a bit of digging, Hong and Davison (2010) turned up as a good example of these models not working well for classifying tweets. Unfortunately, they don't really give much insight into why.

I suspect there are two reasons LDA doesn't work well for short documents.

First of all, when working with smaller documents, the extra topic layer doesn't add anything to the classification, and what doesn't help probably hurts. If you have really short documents, like tweets, it's really hard to break documents into topics. There isn't much room for anything but one topic in a tweet, after all. Since the topic layer can't contribute much to the classification, it makes room for error to arise in the system.

Second, linguistically, Twitter users prefer to strip off "unnecessary fluff" when tweeting. When working with full documents, there are features--words, word collocations, etc.--that are probably specific, common, and often repeated within a genre. When tweeting, though, these common elements get dropped first, because what's interesting, new, and more perplexing is what remains when the fluff is removed.

For example, let's look at my own tweets because I believe in shameless self-promotion:

Progressbar.py is a fun little package, though I don't get a chance to use it too often. it even does ETAs for you https://pypi.python.org/pypi/progressbar …

From a capitalist perspective, the social sciences exist so idiot engineers don't waste money on building **** no one needs.

Abstract enough to be reusable, specific enough to be useful.

The first is about Python. If you're parsing the URLs, you'll get that--and the .py would give it to you too. However, in a more expressive medium, I'd probably have put the word "Python" in somewhere. The second is programming-related as well, but a bit more on the business end. It doesn't once mention anything specific to programming, though. The last one is programming-related too, but ties more into the art of programming, expressing a sort of double bind programmers face while coding. Feature-wise, it is as difficult as the second.
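The URL point above is easy to check: with a naive whitespace tokenizer, the word "python" never surfaces in the first tweet, but splitting on punctuation (so URLs break apart) recovers it. Both tokenizers here are simplistic sketches, not a recommended pipeline.

```python
# Compare two crude tokenizers on the first example tweet: only the one
# that splits URLs on punctuation exposes the word "python" as a feature.
import re

tweet = ("Progressbar.py is a fun little package, though I don't get a "
         "chance to use it too often. it even does ETAs for you "
         "https://pypi.python.org/pypi/progressbar")

whitespace_tokens = tweet.lower().split()
url_aware_tokens = re.split(r"[^a-z0-9]+", tweet.lower())

print("python" in whitespace_tokens)  # False: hidden inside the URL token
print("python" in url_aware_tokens)   # True: the URL gives it away
```

This is exactly the kind of feature a classifier misses when the author leaves the obvious genre markers implicit.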

In both of those last two examples, had I not been writing a microblog post, these would have immediately been followed up with examples that would have been very useful to a classifier, or themselves included more data. Twitter doesn't have room for that kind of stuff, though, and the content that would typify the genre a tweet belongs to is stripped out.

So, in the end, we have two problems. The length is a problem for LDA, because the topics add an extra, unnecessary degree of freedom, and the tweets are a problem for any classifier, because features typically useful in classification get selectively removed by the authors.

answered Oct 18 '22 by Dan