Big Data Process and Analysis in R

Tags:

r

bigdata

I know this is not a new concept by any stretch in R, and I have browsed the High Performance and Parallel Computing Task View. With that said, I am asking this question from a point of ignorance, as I have no formal training in computer science and am entirely self-taught.

Recently I collected data from the Twitter Streaming API, and the raw JSON currently sits in a 10 GB text file. I know there have been great strides in adapting R to handle big data, so how would you go about this problem? Here are just a handful of the tasks I am looking to do:

  1. Read and process the data into a data frame
  2. Basic descriptive analysis, including text mining (frequent terms, etc.)
  3. Plotting

Is it possible to do this entirely in R, or will I have to write some Python to parse the data and throw it into a database, so that I can take random samples small enough to fit into R?

Simply put, any tips or pointers you can provide will be greatly appreciated. Again, I won't take offense if you describe solutions at a third-grade level.

Thanks in advance.

asked Dec 01 '11 by Btibert3

People also ask

How do you process large data in R?

The data.table package can be used to load and explore a large dataset in R more efficiently. Its fread function, for example, can read large flat files much more quickly than the comparable base R functions.
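For illustration, a minimal sketch of fread; "tweets.csv" is a hypothetical file name, and fread expects a flat delimited file rather than raw JSON:

    # A minimal sketch, assuming the tweets have already been flattened
    # into a delimited file; "tweets.csv" is a hypothetical name.
    library(data.table)

    dt <- fread("tweets.csv")  # much faster than read.csv on large files
    str(dt)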

What is big data in data analytics using R?

Big Data analytics is the process of examining large and complex data sets that often exceed the computational capabilities. R is a leading programming language of data science, consisting of powerful functions to tackle all problems related to Big Data processing.

Can R handle large data sets?

As a rule of thumb: data sets with up to about one million records can easily be processed with standard R. Data sets with roughly one million to one billion records can also be processed in R, but require some additional effort.
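As one illustration of that "additional effort", base R can stream a large file in chunks so that only a slice of it is in memory at a time; the file name below is hypothetical:

    # Process a 10 GB file in 100,000-line chunks rather than all at once.
    con <- file("tweets.json", open = "r")
    n_tweets <- 0L
    while (length(chunk <- readLines(con, n = 100000L)) > 0) {
      # ...filter, parse, or aggregate the chunk here...
      n_tweets <- n_tweets + length(chunk)
    }
    close(con)
    n_tweets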

Can I use R in Hadoop?

Using R on Hadoop provides a highly scalable data analytics platform that can grow with the size of the dataset. Integrating Hadoop with R lets data scientists run R in parallel over large datasets, which matters because R's in-memory data science libraries cannot work on a dataset larger than available memory.


1 Answer

If you need to operate on the entire 10GB file at once, then I second @Chase's point about getting a larger, possibly cloud-based computer.

(The Twitter Streaming API returns a pretty rich object: a single 140-character tweet can weigh a couple of KB once its metadata is included. You might reduce memory overhead by preprocessing the data outside of R to extract only the content you need, such as the author name and tweet text.)
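If you would rather do that slimming-down step in R itself, the jsonlite package (not mentioned in the answer, so treat this as one possible approach) can stream the file in pages. This sketch assumes one JSON tweet object per line, with user$screen_name and text fields, and a hypothetical file name:

    library(jsonlite)

    pages <- list()
    stream_in(file("tweets.json"), pagesize = 5000, handler = function(df) {
      # keep only the author name and tweet text from each page
      pages[[length(pages) + 1]] <<- data.frame(author = df$user$screen_name,
                                                text   = df$text,
                                                stringsAsFactors = FALSE)
    })
    tweets <- do.call(rbind, pages)  # the slimmed result fits in memory

The raw 10 GB never has to fit in memory this way; only the two extracted columns do.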

On the other hand, if your analysis is amenable to segmenting the data -- for example, if you first want to group the tweets by author, date/time, and so on -- you could consider using Hadoop to drive R.

Granted, Hadoop will incur some overhead (both cluster setup and learning about the underlying MapReduce model); but if you plan to do a lot of big-data work, you probably want Hadoop in your toolbox anyway.

A couple of pointers:

  • an example in chapter 7 of Parallel R shows how to set up R and Hadoop for large-scale tweet analysis. The example uses the RHIPE package, but the concepts apply to any Hadoop/MapReduce work (see the sketch after this list).

  • you can also get a Hadoop cluster via AWS/EC2. Check out Elastic MapReduce for an on-demand cluster, or use Whirr if you need more control over your Hadoop deployment.
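To make the MapReduce model concrete, here is a hedged sketch of a term-count job using the rmr2 package (an RHadoop alternative to the RHIPE package named above, swapped in for illustration); it assumes a working Hadoop installation with rmr2 configured:

    # Count term frequencies across tweet texts with Hadoop via rmr2.
    library(rmr2)
    # rmr.options(backend = "local")  # useful for testing without a cluster

    tweets <- to.dfs(c("big data in r", "r and hadoop", "big data"))

    counts <- mapreduce(
      input  = tweets,
      map    = function(k, v) {
        words <- unlist(strsplit(v, "\\s+"))
        keyval(words, rep(1L, length(words)))   # emit (word, 1) pairs
      },
      reduce = function(word, ones) keyval(word, sum(ones))
    )

    from.dfs(counts)

The same map/reduce structure carries over to RHIPE; only the plumbing differs.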

answered Oct 16 '22 by qethanm