Big Data Process and Analysis in R

Tags:

r

bigdata

I know this is not a new concept by any stretch in R, and I have browsed the High Performance and Parallel Computing Task View. With that said, I am asking this question from a point of ignorance, as I have no formal training in computer science and am entirely self-taught.

Recently I collected data from the Twitter Streaming API, and the raw JSON currently sits in a 10 GB text file. I know there have been great strides in adapting R to handle big data, so how would you go about this problem? Here are just a handful of the tasks I am looking to do:

  1. Read and process the data into a data frame
  2. Basic descriptive analysis, including text mining (frequent terms, etc.)
  3. Plotting

Is it possible to do this entirely in R, or will I have to write some Python to parse the data and throw it into a database, so that I can take random samples small enough to fit into R?

Simply put, any tips or pointers you can provide will be greatly appreciated. Again, I won't take offense if you describe solutions at a third-grade level.

Thanks in advance.

asked Dec 01 '11 by Btibert3

People also ask

How do you process large data in R?

The data.table package can be used to load and explore a large dataset in R more efficiently. Its fread function, for example, can read large flat files much more quickly than the comparable base R functions.
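For illustration, a minimal sketch of fread; "tweets.csv" is a hypothetical file name, and fread expects a flat delimited file rather than raw JSON:

    # A minimal sketch, assuming the tweets have already been flattened
    # into a delimited file; "tweets.csv" is a hypothetical name.
    library(data.table)

    dt <- fread("tweets.csv")  # much faster than read.csv on large files
    str(dt)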

What is big data in data analytics using R?

Big Data analytics is the process of examining large and complex data sets that often exceed the computational capabilities. R is a leading programming language of data science, consisting of powerful functions to tackle all problems related to Big Data processing.

Can R handle large data sets?

As a rule of thumb: data sets with up to about one million records can easily be processed with standard R. Data sets with roughly one million to one billion records can also be processed in R, but require some additional effort.
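As one illustration of that "additional effort", base R can stream a large file in chunks so that only a slice of it is in memory at a time; the file name below is hypothetical:

    # Process a 10 GB file in 100,000-line chunks rather than all at once.
    con <- file("tweets.json", open = "r")
    n_tweets <- 0L
    while (length(chunk <- readLines(con, n = 100000L)) > 0) {
      # ...filter, parse, or aggregate the chunk here...
      n_tweets <- n_tweets + length(chunk)
    }
    close(con)
    n_tweets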

Can I use R in Hadoop?

Using R on Hadoop provides a highly scalable data analytics platform that can grow with the size of the dataset. Integrating Hadoop with R lets data scientists run R in parallel over large datasets, which matters because R's in-memory data science libraries cannot work on a dataset larger than available memory.


1 Answer

If you need to operate on the entire 10GB file at once, then I second @Chase's point about getting a larger, possibly cloud-based computer.

(The Twitter Streaming API returns a pretty rich object: a single 140-character tweet can weigh a couple of KB once its metadata is included. You might reduce memory overhead by preprocessing the data outside of R to extract only the content you need, such as the author name and tweet text.)
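If you would rather do that slimming-down step in R itself, the jsonlite package (not mentioned in the answer, so treat this as one possible approach) can stream the file in pages. This sketch assumes one JSON tweet object per line, with user$screen_name and text fields, and a hypothetical file name:

    library(jsonlite)

    pages <- list()
    stream_in(file("tweets.json"), pagesize = 5000, handler = function(df) {
      # keep only the author name and tweet text from each page
      pages[[length(pages) + 1]] <<- data.frame(author = df$user$screen_name,
                                                text   = df$text,
                                                stringsAsFactors = FALSE)
    })
    tweets <- do.call(rbind, pages)  # the slimmed result fits in memory

The raw 10 GB never has to fit in memory this way; only the two extracted columns do.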

On the other hand, if your analysis is amenable to segmenting the data -- for example, if you first want to group the tweets by author, date/time, and so on -- you could consider using Hadoop to drive R.

Granted, Hadoop will incur some overhead (both cluster setup and learning about the underlying MapReduce model); but if you plan to do a lot of big-data work, you probably want Hadoop in your toolbox anyway.

A couple of pointers:

  • an example in chapter 7 of Parallel R shows how to set up R and Hadoop for large-scale tweet analysis. The example uses the RHIPE package, but the concepts apply to any Hadoop/MapReduce work (see the sketch after this list).

  • you can also get a Hadoop cluster via AWS/EC2. Check out Elastic MapReduce for an on-demand cluster, or use Whirr if you need more control over your Hadoop deployment.
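To make the MapReduce model concrete, here is a hedged sketch of a term-count job using the rmr2 package (an RHadoop alternative to the RHIPE package named above, swapped in for illustration); it assumes a working Hadoop installation with rmr2 configured:

    # Count term frequencies across tweet texts with Hadoop via rmr2.
    library(rmr2)
    # rmr.options(backend = "local")  # useful for testing without a cluster

    tweets <- to.dfs(c("big data in r", "r and hadoop", "big data"))

    counts <- mapreduce(
      input  = tweets,
      map    = function(k, v) {
        words <- unlist(strsplit(v, "\\s+"))
        keyval(words, rep(1L, length(words)))   # emit (word, 1) pairs
      },
      reduce = function(word, ones) keyval(word, sum(ones))
    )

    from.dfs(counts)

The same map/reduce structure carries over to RHIPE; only the plumbing differs.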

answered Oct 16 '22 by qethanm