When to use R, when to use SQL?

My Question:

At what point is it beneficial to stop increasing the complexity of an SQL statement in favor of the data subsetting functionality in R (e.g., merge, *apply, maply, dlply, etc.)?

On one hand, SQL's join is easier than selecting all contents of each table and using the R merge function to join them. Also, doing the conditional selects in SQL would reduce the amount of data that has to be imported to R; but the speed difference is not significant.

On the other hand, a big join with a complex where clause becomes less easy to understand than the R syntax.

Below I have some untested code for illustrative purposes. I am asking this question before having working code, and the answer to my question doesn't require working code (although this is always appreciated). The "most elegant approach", "fewest lines", or "amazing implementation of X" are always welcome, but what I am particularly interested in is the "most sensible / practical / canonical / based on first principles" rationale.

I am interested in the general answer of which steps should use a SQL where clause and which steps would be easier to accomplish using R.

Illustration:

Database description

There are three tables: a, ab, and b. Tables a and b each have a primary key id. They have a many-to-many relationship that is represented by a lookup table, ab, which contains fields ab.a_id and ab.b_id that join to a.id and b.id, respectively. Both a and b have a time field, and a has a group field.

Goal:

Here is a minimal example of the join and subsetting that I want to do:

(MySQL naming of elements, e.g. a.id is equivalent to a$id in R)

Join tables a and b using ab, appending multiple values of b.time associated with each a.id as a new column:

    select a.time, b.time, a.id, b.id
    from a
    join ab on a.id = ab.a_id
    join b on b.id = ab.b_id

and then append b.time for distinct values of b.id.

I don't need repeated values of b.time; I only need a value of b.max. For the repeated values of b.time joined to each a.id, b.max is the value of b.time closest to but not greater than a.time:
```
b.max <- max(b.time[b.time < a.time])
```
Append the value dt <- a.time - b.max to the table, for example, in R:
```
x.dt <- a.time - b.max
```
Then, for each distinct value in a.group, select which.min(x.dt).

asked Mar 20 '12 21:03

David LeBauer

1 Answer

I usually do the data manipulations in SQL until the data I want is in a single table, and then, I do the rest in R. Only when there is a performance issue do I start to move some of the computations to the database. This is already what you are doing.
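As a rough illustration of that split, the sketch below does the join and the row filter in SQL and leaves the per-row and per-group arithmetic to the client. It is only a sketch under stated assumptions: Python's built-in sqlite3 module with an in-memory database stands in for R and MySQL so that it is self-contained, and the rows in a, b, and ab are made-up toy data.

```python
import sqlite3

# Toy data (hypothetical): tables a, b and the lookup table ab from the question.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE a  (id INTEGER PRIMARY KEY, time INTEGER, "group" TEXT);
CREATE TABLE b  (id INTEGER PRIMARY KEY, time INTEGER);
CREATE TABLE ab (a_id INTEGER, b_id INTEGER);
INSERT INTO a  VALUES (1, 10, 'x'), (2, 20, 'x'), (3, 30, 'y');
INSERT INTO b  VALUES (1, 4), (2, 8), (3, 19), (4, 35);
INSERT INTO ab VALUES (1, 1), (1, 2), (2, 2), (2, 3), (3, 3), (3, 4);
""")

# Step 1 (SQL): join and filter so that only the needed rows leave the database.
rows = con.execute("""
    SELECT a.id, a."group", a.time, b.time
    FROM a
    JOIN ab ON a.id = ab.a_id
    JOIN b  ON b.id = ab.b_id
    WHERE b.time < a.time
""").fetchall()

# Step 2 (client side): b.max per a.id, then dt = a.time - b.max.
b_max = {}
info = {}
for a_id, grp, a_time, b_time in rows:
    b_max[a_id] = max(b_time, b_max.get(a_id, b_time))
    info[a_id] = (grp, a_time)
dt = {a_id: info[a_id][1] - b_max[a_id] for a_id in b_max}

# Step 3 (client side): for each group, keep the a.id with the smallest dt.
best = {}
for a_id, d in dt.items():
    grp = info[a_id][0]
    if grp not in best or d < best[grp][1]:
        best[grp] = (a_id, d)

print(best)
```

The database returns a single joined table, and everything after that is ordinary client code that could be moved into SQL later if performance requires it.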

Computations involving timestamps often become unreadable in SQL (the "analytic functions", similar to ddply, are supposed to simplify this, but I think they are not available in MySQL).

However, your example can probably be written entirely in SQL as follows (not tested).

    -- Join the tables and compute the maximum
    CREATE VIEW t1 AS
    SELECT a.id      AS a_id,
           a.`group` AS a_group,
           b.id      AS b_id,
           a.time    AS a_time,
           a.time - MAX(b.time) AS dt
    FROM   a, b, ab
    WHERE  a.id = ab.a_id AND b.id = ab.b_id
    AND    b.time < a.time
    GROUP  BY a.id, a.`group`, b.id;

    -- Extract the desired rows
    CREATE VIEW t2 AS
    SELECT t1.*
    FROM   t1, (SELECT a_group, MIN(dt) AS min_dt FROM t1 GROUP BY a_group) X
    WHERE  t1.a_group = X.a_group
    AND    t1.dt = X.min_dt;
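One way to sanity-check the two-view approach is to run it on toy data. The sketch below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module with an in-memory SQLite database as a stand-in for MySQL, and the rows are made up. In this variant the subquery in t2 is grouped by a_group and matched on the minimum dt, and group is quoted because it is a reserved word.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Toy data (hypothetical) for tables a, b and the lookup table ab.
CREATE TABLE a  (id INTEGER PRIMARY KEY, time INTEGER, "group" TEXT);
CREATE TABLE b  (id INTEGER PRIMARY KEY, time INTEGER);
CREATE TABLE ab (a_id INTEGER, b_id INTEGER);
INSERT INTO a  VALUES (1, 10, 'x'), (2, 20, 'x'), (3, 30, 'y');
INSERT INTO b  VALUES (1, 4), (2, 8), (3, 19), (4, 35);
INSERT INTO ab VALUES (1, 1), (1, 2), (2, 2), (2, 3), (3, 3), (3, 4);

-- Per (a, b) pair: gap between a.time and an earlier b.time.
CREATE VIEW t1 AS
SELECT a.id      AS a_id,
       a."group" AS a_group,
       b.id      AS b_id,
       a.time    AS a_time,
       a.time - MAX(b.time) AS dt
FROM   a, b, ab
WHERE  a.id = ab.a_id AND b.id = ab.b_id
AND    b.time < a.time
GROUP  BY a.id, a."group", b.id;

-- Per group: keep the row(s) with the smallest gap.
CREATE VIEW t2 AS
SELECT t1.*
FROM   t1,
       (SELECT a_group, MIN(dt) AS min_dt FROM t1 GROUP BY a_group) X
WHERE  t1.a_group = X.a_group
AND    t1.dt = X.min_dt;
""")

result = con.execute(
    "SELECT a_group, a_id, b_id, dt FROM t2 ORDER BY a_group"
).fetchall()
print(result)
```

With this data, group x keeps the pair (a.id = 2, b.id = 3) with dt = 1, and group y keeps (a.id = 3, b.id = 3) with dt = 11. Ties on the minimum dt within a group would return more than one row.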

answered Oct 21 '22 13:10

Vincent Zoonekynd

Tags:

sql

database

r

data.table