
How to make an R tm corpus of 100 million tweets?

I want to make a text corpus of 100 million tweets using R’s distributed computing tm package (called tm.plugin.dc). The tweets are stored in a large MySQL table on my laptop. My laptop is old, so I am using a Hadoop cluster that I set up on Amazon EC2.

The tm.plugin.dc documentation from CRAN says that only DirSource is currently supported. The documentation seems to suggest that DirSource allows only one document per file. I need the corpus to treat each tweet as a document. I have 100 million tweets -- does this mean I need to make 100 million files on my old laptop? That seems excessive. Is there a better way?

What I have tried so far:

  1. Make a file dump of the MySQL table as a single (massive) .sql file. Upload the file to S3. Transfer the file from S3 to the cluster. Import the file into Hive using Cloudera’s Sqoop tool. Now what? I can’t figure out how to make DirSource work with Hive.

  2. Make each tweet an XML file on my laptop. But how? My computer is old and can’t do this well. ... If I could get past that, then I would: Upload all 100 million XML files to a folder in Amazon’s S3. Copy the S3 folder to the Hadoop cluster. Point DirSource to the folder.

user554481 asked May 05 '13 19:05


2 Answers

Wouldn't it be easier and more reasonable to make one huge HDFS file containing all 100 million tweets and then process them with the standard R tm package?

This approach seems more natural to me, since HDFS was developed for big files and distributed environments, while R is a great analytical tool but has no (or only limited) parallelism. Your approach looks like using tools for something they were not designed for.
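As a rough illustration of the single-file idea, here is a minimal sketch in R. It assumes the tweets have been exported to a plain-text file with one tweet per line (so any newlines inside a tweet were stripped on export), and it reads from the local file system rather than from HDFS. The function name `read_tweet_chunks`, the chunk size, and the demo file are hypothetical and not part of tm. The tm calls used (`VCorpus` and `VectorSource`) treat each element of a character vector as one document, so no per-tweet files are needed.

```r
# Read a one-tweet-per-line file in chunks so the whole file never has to
# be held in memory at once. `handle_chunk` is called with each chunk
# (a character vector, one element per tweet). Returns the tweet count.
read_tweet_chunks <- function(path, chunk_size = 100000L, handle_chunk) {
  con <- file(path, open = "r", encoding = "UTF-8")
  on.exit(close(con))
  total <- 0L
  repeat {
    lines <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(lines) == 0L) break
    handle_chunk(lines)
    total <- total + length(lines)
  }
  total
}

# Tiny demo file standing in for the real export (hypothetical data).
demo_path <- tempfile(fileext = ".txt")
writeLines(c("first tweet", "second tweet", "third tweet"), demo_path)

chunk_sizes <- integer(0)
n_tweets <- read_tweet_chunks(demo_path, chunk_size = 2L,
  handle_chunk = function(lines) {
    chunk_sizes <<- c(chunk_sizes, length(lines))
    # Only build a corpus if tm is installed.
    if (requireNamespace("tm", quietly = TRUE)) {
      # One document per tweet in this chunk.
      corpus <- tm::VCorpus(tm::VectorSource(lines))
      # Per-chunk processing would go here,
      # e.g. tm::DocumentTermMatrix(corpus).
    }
  })
```

Processing chunk by chunk keeps memory use bounded, but the per-chunk results (for example term counts) still have to be combined afterwards, which this sketch does not attempt.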

xhudik answered Sep 22 '22 02:09


I would strongly recommend reading this Quora thread: http://www.quora.com/How-can-R-and-Hadoop-be-used-together
It should give you the necessary insight into your problem.

Siva Karthikeyan answered Sep 22 '22 02:09