Best use of R and SQL if restricted to a local machine

I’m trying to improve my workflow, and am hoping the community can provide insights since I am relatively new to “big data”.

I typically download several dataframes from public sources which can be related to one another. After pulling several dataframes, I do various data-processing steps (e.g., filtering, sorting, aggregating, custom calculations) before doing regression analysis and/or other inferential statistics on my final set of data.

Specifically, which strategy would you recommend:

  1. Downloading all of the data as separate files from the web onto my local machine, then using R to process it directly (as I have been doing). The potential problem I see with this is that everything is in R’s working environment, which may slow the process and crash my machine.

or

  2. Downloading all of the data as separate files from the web onto my local machine, creating a database on the local machine with sqldf, and using follow-up sqldf queries to pull and aggregate information from that database before doing the final data analysis in R. The potential problem I see with this is that a database made up of a handful of tables/dataframes and created on my local machine with sqldf is larger than simply saving several individual .csv files.

I’m pretty familiar with statistical techniques, but I admittedly have several knowledge gaps when it comes to database management and server operations. I’ve become familiar with the nuts and bolts of SQL, as a language, and I know how to use sqldf with dataframes running in the R working environment. However, I frankly do not know what advantage that offers over just learning how to use the base R functions to filter, sort, and aggregate data. Also, I’ve read a few webpages about the hype of pairing SQL Server with R, but I’m unsure if this is a good option for me since I run everything locally.
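
To make the comparison concrete, here is a minimal sketch of what I mean, using the built-in mtcars dataset purely for illustration: the same filter-and-aggregate step written once with sqldf and once with base R.

library(sqldf)

# sqldf: SQL syntax over an ordinary data frame (SQLite runs behind the scenes)
sql_result <- sqldf("SELECT cyl, AVG(mpg) AS mean_mpg
                     FROM mtcars
                     WHERE hp > 100
                     GROUP BY cyl")

# Base R: the same filter and aggregation without any SQL
base_result <- aggregate(mpg ~ cyl, data = subset(mtcars, hp > 100), FUN = mean)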

Any tips for this newbie on how to improve my data processing and analytics via combining R with some implementation of SQL?

Thank you in advance!

Asked by Jayden.Cameron, Mar 30 '17




3 Answers

Given that you are looking for a "workflow best practices", there's a high premium that should be placed on both reproducibility and transparency. Since your goal is data analysis and not data collection or database management, there's not a strong reason to be creating your own databases and a custom database would likely make your workflow and analysis less transparent. In short, if you don't need to build a database, don't.

It sounds like your workflow is as follows:

  1. Download data from public sources (ideally .csv or a similarly nice format)
  2. Clean and process the data
  3. Run analysis on the (potentially linked) cleaned data

I would recommend dividing your workflow into two distinct steps:

1. Download and Clean Data

If your files are all .csv (or other regular delimited files), then you just need the data.table package for this step. You can write a single R script to download, clean, and save the data you need. A minimal example is below:

# Download data
library(data.table)
salary_data <- fread('https://data.phila.gov/api/views/25gh-t2gp/rows.csv')

# Clean data (only looking at City Council salaries)
cleaned_data <- salary_data[Department == 'CITY COUNCIL']

# Saving cleaned data
save(cleaned_data, file = 'my_file_name.rda', compress = TRUE)

Ideally, you would only have to run this file once to generate the dataset you actually perform your statistical analysis on. If you decide to clean or process your data differently, just revisit this file, make the appropriate changes, and rerun it. I would recommend having one script for each file that you are downloading so that it is easy to see how you are processing the raw data straight from the source (transparency). Simply having this file satisfies reproducibility.

2. Statistical Analysis

If you need to combine your datasets, data.table provides a fast and transparent way of doing this. Simply load in your cleaned individual datasets, identify the key that you will use to merge them, and then merge them. Then run your analysis on the merged dataset. Below is an example of this functionality:

# dt1 has salary information for 10 people and dt2 
# has the number of kids for the same 10 people
library(data.table)
dt1 <- data.table(id = 1:10, salary = sample(0:100000, 10))
dt2 <- data.table(id = 1:10, kids = sample(0:5, 10, replace = TRUE))
save(dt1, file = 'dt1.rda', compress = TRUE)
save(dt2, file = 'dt2.rda', compress = TRUE)

# Loading and merging data
load(file = 'dt1.rda')
load(file = 'dt2.rda')
setkey(dt1, id)
setkey(dt2, id)
merged_dt <- merge(dt1, dt2)

# Doing regression analysis on merged data
reg <- lm(salary ~ kids, data = merged_dt)

This makes the merging procedure and subsequent analysis transparent and reproducible.

Summary

This procedure ensures that your data sources, data cleaning/processing, and analysis are well-documented, transparent, and reproducible. Furthermore, this procedure scales with your computer. If you do not need to build a database, then don't.

What if the data is too big for my computer? If you need more memory or storage, just run the code you've already written on a dedicated server or an Amazon Web Services machine.

What if the data is too big for a dedicated server? Chances are the data is already stored in an actual database, and the only parts of the workflow that change are the data download and (potentially) some of the processing, which become a SQL query against that database (most likely using the DBI package, which runs SQL queries from R). The result set should then be small enough to work with locally or on a dedicated server.
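
For illustration, here is a minimal sketch of that pattern with DBI; the SQLite file and the employees table are hypothetical stand-ins for whatever database actually hosts the data:

library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), 'my_database.sqlite')

# Push the filtering and aggregation to the database so that only a small
# result set comes back into R
salary_summary <- dbGetQuery(con, "
  SELECT department, AVG(salary) AS mean_salary
  FROM employees
  GROUP BY department
")

dbDisconnect(con)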

What if my data is too big for that? Then you should probably look into heavier-duty big data frameworks such as Hadoop.

Supplemental Note: If your data is not in a regular delimited format (such as an Excel, SAS, or Stata file), then I would recommend using the base download.file() function coupled with the tidyverse's readers for these formats, such as readxl for Excel and haven for SAS and Stata, which do a fantastic job with these less pleasant but common files:

library(readxl)  # installed alongside the tidyverse, but must be loaded explicitly

# Download the raw Excel file; mode = 'wb' keeps binary files intact on Windows
download.file(url = 'http://www.nyc.gov/html/tlc/downloads/excel/current_medallion_drivers.xls',
              destfile = 'taxi_data_excel.xls', mode = 'wb')
taxi_data <- read_excel('taxi_data_excel.xls')

Then do your cleaning as usual.

Answered by Mallick Hossain, Oct 08 '22


First things first: sqldf is not a database; it is a package that allows you to manipulate a data.frame object with SQL syntax. To be precise, it uses SQLite as a back-end, but you should not think of the sqldf package as a database.

sqldf is a good and convenient package. In some cases it can be efficient as well, but efficiency is not its primary goal. I recommend considering the data.table package instead: it is designed for performance, and its speed may surprise you.
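
As a rough illustration, assuming the built-in mtcars dataset, here is the same aggregation written first with sqldf and then with data.table:

library(sqldf)
library(data.table)

# sqldf: convenient SQL syntax, with SQLite doing the work behind the scenes
sqldf('SELECT cyl, AVG(mpg) AS mean_mpg FROM mtcars GROUP BY cyl')

# data.table: the same result using its optimised in-memory syntax
dt <- as.data.table(mtcars)
dt[, .(mean_mpg = mean(mpg)), by = cyl]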

The first and main piece of advice for choosing a strategy is: respect the trade-off. Deploying an actual SQL database alongside R can give you a huge advantage, but it adds significant overhead to the development process. It all depends on the scope of the project. There are no general rules, but I can offer some rules of thumb.

  • By default, I would avoid involving a SQL database unless the project has specific requirements that call for SQL.

  • If the bottleneck is RAM and R only needs the aggregated data, then you should seriously consider a SQL database. For example, MySQL will take care of paging, caching, and multi-threading, which can be important advantages.

  • If the data structures of the different sources differ significantly, then SQL adds extra overhead, because you will have to manage the structure both in R and in SQL; try to avoid that. On the other hand, if there are many sources with the same structure, a database will give you a good improvement.

  • If you only need to process the source data once, then working with files is fine. But if you need to run the process repeatedly and keep all the outputs, changes, versions, and so on, then a database becomes a necessity.

This is just my humble opinion.

Answered by Vyga, Oct 08 '22


It very much depends on the infrastructure of your environment, but in the world of "Big Data" I would recommend using both because each has advantages that are difficult to give up.

Most data cleansing and manipulation tasks can be performed on either platform, some at the cost of performance and some at the cost of resources.

In-Memory: R's environment lives mostly in RAM, which is much faster but not always necessary; if you have a 100 GB dataset, loading it into RAM is infeasible. Most databases now offer in-memory tables, so if there are specific tables you need faster access to, you can always load just those into RAM.

Indexes & Partitions: It's much easier to query data that has been efficiently indexed and partitioned in a database than to work through CSV files. Most exploratory analysis is done on partitions or clusters of the data, and giving this up is a big performance sacrifice.
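
As a sketch of what this looks like from R, assuming an open DBI connection con and a hypothetical trips table (SQLite-flavoured SQL):

library(DBI)

# Index the column used for filtering; subsequent queries on trip_date can use it
dbExecute(con, 'CREATE INDEX IF NOT EXISTS idx_trips_date ON trips (trip_date)')

# This filter can now use the index rather than scanning the whole table
dbGetQuery(con, "SELECT COUNT(*) AS n FROM trips WHERE trip_date >= '2017-01-01'")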

Downloading and Storing: In R it is very easy to write a script that downloads data and uploads it to a database. In a database, the data can be stored for fast access and compressed efficiently for performance and scalability.
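
A minimal sketch of such a script, with a hypothetical URL, file, and table name:

library(data.table)
library(DBI)
library(RSQLite)

# Download the raw file and read it into R
download.file('https://example.com/data.csv', destfile = 'data.csv')
raw_data <- fread('data.csv')

# Store it in the database for fast, repeatable access later
con <- dbConnect(RSQLite::SQLite(), 'my_database.sqlite')
dbWriteTable(con, 'raw_data', raw_data, overwrite = TRUE)
dbDisconnect(con)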

Table Views: There are many datasets, or basic manipulations of datasets, that you will want to keep for later use. In a database you can define table views that join and manipulate data across any number of tables; to obtain the same result in R, you would have to load all the relevant tables and redo the merges and manipulations every time you wanted that data.
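
A minimal sketch of defining a view through DBI, reusing the hypothetical connection con and made-up salaries and kids tables:

library(DBI)

# The view stores the join logic once; it can then be queried like a table
dbExecute(con, '
  CREATE VIEW IF NOT EXISTS salary_with_kids AS
  SELECT s.id, s.salary, k.kids
  FROM salaries AS s
  JOIN kids AS k ON s.id = k.id
')

dbGetQuery(con, 'SELECT * FROM salary_with_kids WHERE kids > 2')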

Analysis: This is what R was built for. Many databases make it impossible to perform even the most basic statistical analysis, so I would leave all of the statistical analysis to R.

I'm sure there are many more advantages and disadvantages that could be compared between R and a database. Again, if you are working on small amounts of data for fun, you can use R all the way. Otherwise, use both: it's easier, faster, and a lot more comfortable.

Answered by AYR, Oct 08 '22