I’m trying to improve my workflow, and am hoping the community can provide insights since I am relatively new to “big data”.
I typically download several dataframes from public sources which can be related to one another. After pulling several dataframes, I do various data-processing steps (e.g., filtering, sorting, aggregating, custom calculations) before doing regression analysis and/or other inferential statistics on my final set of data.
Specifically, which strategy would you recommend:
1. Saving the raw downloads as individual .csv files and doing all of the filtering, sorting, aggregating, and analysis in R, or
2. Creating a database, made up of a handful of tables/dataframes, on my local machine with sqldf, and using follow-up queries with sqldf to pull and aggregate information from that database before doing the final data analysis in R?
The potential problem I see with the second option is that the database would be larger in size than simply saving several individual .csv files.
I’m pretty familiar with statistical techniques, but I admittedly have several knowledge gaps when it comes to database management and server operations. I’ve become familiar with the nuts and bolts of SQL as a language, and I know how to use sqldf with dataframes in the R working environment. However, I frankly do not know what advantage that offers over just learning how to use the base R functions to filter, sort, and aggregate data. Also, I’ve read a few webpages about the hype of pairing SQL Server with R, but I’m unsure if this is a good option for me since I run everything locally.
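For illustration, here is the kind of thing I mean, on a made-up data frame (the column names are just placeholders):
library(sqldf)
# Toy stand-in for one of my downloaded dataframes
df <- data.frame(department = c('A', 'A', 'B'), salary = c(50, 60, 70))
# Filtering and aggregating with sqldf
sqldf("SELECT department, AVG(salary) AS mean_salary
       FROM df
       WHERE salary > 55
       GROUP BY department")
# The same filter-and-aggregate with base R
aggregate(salary ~ department, data = df[df$salary > 55, ], FUN = mean)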
Any tips for this newbie on how to improve my data processing and analytics via combining R with some implementation of SQL?
Thank you in advance!
Not only can you easily retrieve data from SQL sources for analysis and visualisation in R, but you can also use SQL to create, clean, filter, query and otherwise manipulate datasets within R, using a wide choice of relational databases.
Key benefits of R: R makes performing common data analysis tasks such as loading data, transforming, manipulating, aggregating, charting and sharing your analyses very easy, and the workflow is much more seamless than in SQL.
This suggests that, in the end, you should focus more on R or Python than on SQL. One thing to remember is that SQL is a big first step toward some more complex languages (Python, R, JavaScript, etc.). Once you understand how a computer thinks, it is easy to learn a new programming language to analyze your data.
That said, R can connect easily to many relational databases such as MySQL, Oracle, and SQL Server, and fetch records from them as a data frame. Once the data is available in the R environment, it becomes a normal R data set and can be manipulated or analyzed using all the powerful packages and functions.
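As a minimal sketch of what that looks like, using an in-memory SQLite database through the DBI package as a stand-in for a real MySQL/Oracle/SQL Server connection, with a made-up table:
library(DBI)
# In-memory SQLite database standing in for a real database server
con <- dbConnect(RSQLite::SQLite(), ':memory:')
dbWriteTable(con, 'salaries', data.frame(id = 1:3, salary = c(50000, 60000, 70000)))
# Records come back as an ordinary R data frame
result <- dbGetQuery(con, 'SELECT * FROM salaries WHERE salary > 55000')
dbDisconnect(con)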
Given that you are looking for workflow best practices, there's a high premium that should be placed on both reproducibility and transparency. Since your goal is data analysis and not data collection or database management, there's not a strong reason to create your own database, and a custom database would likely make your workflow and analysis less transparent. In short: if you don't need to build a database, don't.
It sounds like your workflow is as follows:
1. Download data from several public sources.
2. Process the data (filtering, sorting, aggregating, custom calculations).
3. Run regression analysis and/or other inferential statistics on the final dataset.
I would recommend dividing your workflow into two distinct steps:
1. Downloading and cleaning the data.
2. Analyzing the cleaned data.
If your files are all .csv (or other regular delimited files), then you just need the data.table package for this step. You can write a single R script to download, clean, and save the data you need. A minimal example is below:
# Download data
library(data.table)
salary_data <- fread('https://data.phila.gov/api/views/25gh-t2gp/rows.csv')
# Clean data (only looking at City Council salaries)
cleaned_data <- salary_data[Department == 'CITY COUNCIL']
# Saving cleaned data
save(cleaned_data, file = 'my_file_name.rda', compress = TRUE)
Ideally, you would only have to run this file once to generate the dataset you actually perform your statistical analysis on. If you decide to clean or process your data differently, just revisit this file, make the appropriate changes, and rerun it. I would recommend having one script for each file that you are downloading so that it is easy to see how you are processing the raw data straight from the source (transparency). Simply having this file satisfies reproducibility.
If you need to combine your datasets, data.table provides a fast and transparent way of doing this. Simply load in your cleaned individual datasets, identify the key that you will use to merge them, and then merge them. Then run your analysis on the merged dataset. Below is an example of this functionality:
# dt1 has salary information for 10 people and dt2
# has the number of kids for the same 10 people
library(data.table)
dt1 <- data.table(id = 1:10, salary = sample(0:100000, 10))
dt2 <- data.table(id = 1:10, kids = sample(0:5, 10, replace = TRUE))
save(dt1, file = 'dt1.rda', compress = TRUE)
save(dt2, file = 'dt2.rda', compress = TRUE)
# Loading and merging data
load(file = 'dt1.rda')
load(file = 'dt2.rda')
setkey(dt1, id)
setkey(dt2, id)
merged_dt <- merge(dt1, dt2)
# Doing regression analysis on merged data
reg <- lm(salary ~ kids, data = merged_dt)
This makes the merging procedure and subsequent analysis transparent and reproducible.
This procedure ensures that your data sources, data cleaning/processing, and analysis are well-documented, transparent, and reproducible. Furthermore, this procedure scales with your computer. If you do not need to build a database, then don't.
What if the data is too big for my computer? If you need more space, just run the code you've already written on a dedicated server or an Amazon Web Services machine.
What if the data is too big for a dedicated server? Chances are the data is stored in an actual database, and the only piece of the workflow that changes is the downloading: your data download and (potentially) some of the processing become a SQL query to the database (most likely using the DBI package, which runs SQL queries from within R). The result of that query should then be small enough to work with locally or on a dedicated server.
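As a rough sketch of that change, only the downloading step of the earlier script is different; the connection details, table name, and column name here are hypothetical:
library(DBI)
# Hypothetical connection to the remote database that holds the raw data
con <- dbConnect(RPostgres::Postgres(), host = 'db.example.com',
                 dbname = 'city_data', user = 'me', password = 'secret')
# Let the database do the filtering so only the rows you need come back
cleaned_data <- dbGetQuery(con,
  "SELECT * FROM employee_salaries WHERE department = 'CITY COUNCIL'")
dbDisconnect(con)
# The rest of the workflow (saving and analysis) is unchanged
save(cleaned_data, file = 'my_file_name.rda', compress = TRUE)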
What if my data is too big for that? Then you should probably look into heavier-duty big-data tools such as Hadoop.
Supplemental Note: If your data is not in a regular delimited format (for example an Excel, SAS, or Stata file), then I would recommend using the download.file() function coupled with the readxl and haven packages from the tidyverse (which have a fantastic ability to read these less pleasant, but common, files):
library(readxl)  # installed with the tidyverse, but must be loaded explicitly

download.file(url = 'http://www.nyc.gov/html/tlc/downloads/excel/current_medallion_drivers.xls',
              destfile = 'taxi_data_excel.xls', mode = 'wb')
taxi_data <- read_excel('taxi_data_excel.xls')
Then do your cleaning as usual.
First things first: sqldf is not a database, it is a package that allows you to manipulate a data.frame object with SQL syntax. Well, to be precise, it uses SQLite in the back-end, but you shouldn't think of the sqldf package as a database.
sqldf is a good and convenient package. In some cases it can be efficient as well, but efficiency is not its primary goal. I recommend you consider the data.table package instead: it is designed for efficiency, and its performance might surprise you.
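As a small illustration of the difference in style, here is the same group-wise mean computed with both packages on a toy data frame:
library(sqldf)
library(data.table)
df <- data.frame(dep = c('A', 'A', 'B'), salary = c(50, 60, 70))
# sqldf: SQL syntax, evaluated by SQLite behind the scenes
sqldf('SELECT dep, AVG(salary) AS mean_salary FROM df GROUP BY dep')
# data.table: the same aggregation, designed for speed on large in-memory tables
dt <- as.data.table(df)
dt[, .(mean_salary = mean(salary)), by = dep]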
The first and main piece of advice in choosing a strategy is this: respect the trade-offs! Deploying an actual SQL database alongside R might give you a huge advantage, but it adds significant overhead to the development process. It all depends on the scope of the project. There are no general rules, but I can try to indicate some rules of thumb.
By default I would try to avoid involving a SQL database, unless I face project-specific arguments for SQL.
If the bottleneck is RAM and R is needed only for aggregated data, then you should seriously consider using a SQL database. For example, MySQL will take care of paging, caching and multi-threading for you - these can be important arguments.
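For example, a minimal sketch of that pattern, where MySQL does the heavy aggregation and R only receives one row per group (the connection details and table are hypothetical):
library(DBI)
# Hypothetical connection to a MySQL server via the RMariaDB driver
con <- dbConnect(RMariaDB::MariaDB(), dbname = 'public_data',
                 user = 'me', password = 'secret')
# The database scans the big table; R only receives the small summary
summary_by_dep <- dbGetQuery(con, "
  SELECT department, AVG(salary) AS mean_salary, COUNT(*) AS n
  FROM employee_salaries
  GROUP BY department")
dbDisconnect(con)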
If the data structures of the different sources differ significantly, then using SQL will add extra overhead, because you will have to manage the structure both in R and in SQL - try to avoid that. On the other hand, if there are many sources with the same data structure, then a database will give you a good improvement.
If you only need to process the source data once, then dealing with files is OK. But if you need to run the processing repeatedly and save all the outputs, changes, versions and so on, then a database becomes a necessity.
This is just my humble opinion.
It very much depends on the infrastructure of your environment, but in the world of "Big Data" I would recommend using both because each has advantages that are difficult to give up.
Most data cleansing and manipulation tasks can be performed on both platforms, some at the cost of performance and some at the cost of resources.
In-Memory: R's environment lives mostly in RAM, which is much faster but not always feasible. If you have a 100 GB dataset, loading it into RAM would be infeasible. Most databases have already introduced in-memory tables, so if there are specific tables you need faster access to, you can always load them into RAM.
Indexes & Partitions: It's much easier to query data that has been efficiently indexed and partitioned in a database than it is to work through CSV files. Most exploratory analysis is done on partitions or clusters of the data, and giving this up is a huge performance trade-off.
Downloading and Storing: In R it is very easy to write a script for downloading data and uploading it to a database. In a database the data can be stored for fast access and compressed efficiently for performance and scalability.
Table Views: There are many datasets, or basic manipulations of datasets, that you will want to store for later use. In a database you can make use of table views, which can join and manipulate data across any number of tables (see the sketch after this list). To obtain the same result in R, you would have to load all the relevant tables and perform the merges and manipulations every time you wanted to access the same data.
Analysis: This is what R was built for. Many databases make it difficult or impossible to perform even basic statistical analysis, so I would leave all of the statistical analysis in R.
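As a sketch of the table-view point above (using a local SQLite file through DBI; the tables mirror the dt1/dt2 example from the earlier answer and are hypothetical):
library(DBI)
con <- dbConnect(RSQLite::SQLite(), 'analysis.sqlite')   # hypothetical local database file
# Define the join once as a view...
dbExecute(con, "CREATE VIEW IF NOT EXISTS salary_kids AS
                SELECT s.id, s.salary, k.kids
                FROM salaries s JOIN kids k ON s.id = k.id")
# ...then any later session can pull the prepared result straight into R
merged <- dbGetQuery(con, 'SELECT * FROM salary_kids')
dbDisconnect(con)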
I'm sure there are many more advantages and disadvantages that could be compared between R and databases. Again, if you are working on small amounts of data for fun, you can use R all the way. Otherwise, use both. It's easier, faster, and a lot more comfortable.