
Good algorithm to find themes in tweets ranked by follower counts?

I'm new to data mining and experimenting a bit.

Let's say I have N Twitter users, and I want to find the overall themes they're writing about (based on their tweets).
Then I want to give each theme a higher weight if the user writing about it has more followers.

Then I want to merge themes that are similar enough, while still retaining the weighting by follower count.

So basically, I want a list of "important" themes ranked by authority (the user's follower count).

For instance, something like news.google.com, but with the ranking based on the follower counts of the users responsible for each theme.

I'd prefer something in Python, since that's the language I'm most familiar with.

Any ideas?

Thanks

EDIT: Here's a good example of what I'm trying to do (but with different data): http://www.facebook.com/notes/facebook-data-team/whats-on-your-mind/477517358858

Basically, analyzing various kinds of data and their correlation to each other: word categories and each person's age, or word categories and friend count, as in this example.

Where would I begin to solve this and generate such graphs?

James asked Oct 14 '22 18:10


1 Answer

Generally speaking, R has some packages aimed specifically at text mining and data mining, offering a wide range of techniques. I don't know of comparable packages in Python, but that doesn't mean they don't exist. Either way, I wouldn't implement it all myself; it's a bit more complicated than it looks at first sight.

Some things you have to consider:

  • Define "theme": Is that the tags they use? Do you group tags? Do you have a small, limited set of themes, or is the set unlimited?
  • Define "general theme": Is that the most used theme? How do you deal with ties? If a user writes about 10 themes in roughly equal amounts, what then?
  • Define "weight": Is that equal to the number of followers? The square root? Some category?
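
To make the "weight" question concrete, here is a minimal sketch in Python (the language the question asks for) of the three options mentioned above. The function names and the bucket thresholds are assumptions chosen purely for illustration; nothing here comes from an existing package.

```python
import math


def weight_linear(followers: int) -> float:
    """Weight equal to the raw follower count."""
    return float(followers)


def weight_sqrt(followers: int) -> float:
    """Square-root weight: dampens the influence of very large accounts."""
    return math.sqrt(followers)


def weight_bucket(followers: int) -> float:
    """Categorical weight. The thresholds below are hypothetical."""
    if followers < 1_000:
        return 1.0
    if followers < 100_000:
        return 2.0
    return 3.0


if __name__ == "__main__":
    for n in (500, 10_000, 2_000_000):
        print(n, weight_linear(n), weight_sqrt(n), weight_bucket(n))
```

With a linear weight, a single account with millions of followers will probably dominate the ranking; the square root or a few categories flatten that out. Which one is right depends on what you mean by "authority".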

If you have a general idea about this, you can start using the tm package in R to extract all the information in a workable format. The package is based on matrices and metadata objects. These allow you to get weighted frequencies for the different themes, provided you have defined what you consider a theme. You can also use different weighting functions to obtain what you want. The manual is here. But please also visit crossvalidated.com for extra guidance if you're not sure about what you're doing. This is actually more a question about data mining than about programming.
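
As a rough illustration of what "weighted frequencies" could look like in Python, the sketch below uses only the standard library and is not a port of tm. It assumes that a "theme" is simply a single word, that each user's word counts are normalised to relative frequencies, and that the weight is the square root of the follower count. The sample data, the stopword list, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical sample data: one (follower_count, tweets) pair per user.
USERS = [
    (10_000, ["python data mining is fun", "mining tweets with python"]),
    (100, ["football tonight", "great football match"]),
]

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"is", "with", "the", "a", "of"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def weighted_term_scores(users, weight=math.sqrt):
    """Sum each user's relative term frequencies, scaled by weight(followers)."""
    scores = Counter()
    for followers, tweets in users:
        counts = Counter(tok for tweet in tweets for tok in tokenize(tweet))
        total = sum(counts.values())
        if total == 0:
            continue  # user has no usable tokens
        w = weight(followers)
        for term, n in counts.items():
            scores[term] += w * n / total
    return scores


scores = weighted_term_scores(USERS)

if __name__ == "__main__":
    for term, score in scores.most_common(5):
        print(f"{term}: {score:.2f}")
```

Merging similar themes (synonyms, hashtag variants, and so on) is not covered here; that is where the definition of "theme" from the list above really starts to matter.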

Joris Meys answered Oct 31 '22 14:10