Relational database versus R/Python data frames

I was exposed to tables and data structures in R before I encountered RDBMSs and other database systems. It is quite elegant in R/Python to create tables and lists from structured data (.csv or other formats) and then manipulate the data programmatically.

Last year, I attended a course in Database management and learnt all about structured and unstructured databases. I also noticed that it is the norm to feed data from multiple sources of data into databases rather than directly use them in R (for convenience and discipline?).

For research purposes, R seems to suffice for joining, appending, or even complicated data manipulations.

The question that keeps arising is: when should I use R directly, with commands such as read.csv, and when should I create a database and query its tables through the R-SQL interface?

For instance, suppose I have multi-source data, such as: (a) person-level information (age, gender, smoking habits), (b) outcome variables (such as surveys taken by participants in real time), (c) covariate information (environment characteristics), (d) treatment input (occurrence of an event that modifies the outcome, i.e. the survey response), and (e) time and space information of participants taking the survey.

How should I approach data collection and processing in this case? There may be standard industry procedures, but I am asking here to understand the feasible and optimal approaches that individuals and small groups of researchers can adopt.

asked May 14 '15 18:05

KarthikS

1 Answer

What you're describing when you say "that it is the norm to feed data from multiple sources of data into databases" sounds more specifically like a data warehouse. Databases are used for many reasons, and in plenty of situations they will hold data from one source - for instance, a database used as the data store of a transactional system will often only hold the data needed to run that system, and the data produced by that system.

The process you're describing is commonly called Extract, Transform, Load (ETL), and you might find looking up information about ETL and data warehousing helpful if you decide to go in the direction of combining your data prior to working with it in R.
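To make the three ETL steps concrete, here is a minimal sketch using only Python's standard library (`csv` and `sqlite3`). Everything in it is an assumption for illustration: the two inline CSV strings stand in for separate source files, and the table and column names (`person`, `survey`, `smoker`, `score`) are hypothetical, loosely modelled on the person-level and survey data mentioned in the question. A real setup would likely use a server database and scheduled loads rather than an in-memory SQLite file.

```python
import csv
import io
import sqlite3

# Hypothetical stand-ins for two separate source files.
PERSONS_CSV = """person_id,age,gender,smoker
1,34,F,yes
2,51,M,no
"""
SURVEYS_CSV = """person_id,taken_at,score
1,2015-05-01T09:00,7
1,2015-05-02T09:00,5
2,2015-05-01T10:30,8
"""


def extract(text):
    """Extract: parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def transform_person(row):
    """Transform: cast types and encode yes/no as 1/0."""
    smoker = 1 if row["smoker"].strip().lower() == "yes" else 0
    return (int(row["person_id"]), int(row["age"]), row["gender"], smoker)


def transform_survey(row):
    """Transform: cast the id and score to integers."""
    return (int(row["person_id"]), row["taken_at"], int(row["score"]))


def load(conn):
    """Load: create the tables and insert the transformed rows."""
    conn.execute(
        "CREATE TABLE person (person_id INTEGER PRIMARY KEY, "
        "age INTEGER, gender TEXT, smoker INTEGER)"
    )
    conn.execute(
        "CREATE TABLE survey (person_id INTEGER REFERENCES person(person_id), "
        "taken_at TEXT, score INTEGER)"
    )
    conn.executemany(
        "INSERT INTO person VALUES (?, ?, ?, ?)",
        [transform_person(r) for r in extract(PERSONS_CSV)],
    )
    conn.executemany(
        "INSERT INTO survey VALUES (?, ?, ?)",
        [transform_survey(r) for r in extract(SURVEYS_CSV)],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn)

# Once loaded, the combination of sources is a query rather than custom code.
rows = conn.execute(
    "SELECT p.person_id, p.smoker, AVG(s.score) "
    "FROM person p JOIN survey s ON s.person_id = p.person_id "
    "GROUP BY p.person_id, p.smoker ORDER BY p.person_id"
).fetchall()
print(rows)
```

The result of the final query could then be pulled into R or a Python data frame for the actual analysis, so the database handles the combining and the analysis code stays focused on the analysis.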

I can't tell you which you should choose, or the optimal way of accomplishing it, because it will vary in different situations and might even come down to opinion. What I can tell you are some of the reasons why people create data warehouses, and you can decide for yourself whether it might be useful in your situation:

A data warehouse can provide a central location to hold combined data. This means that people do not need to combine the data themselves each time they need to use that specific combination of data. Unlike something like a simple one-off report or extract of combined data, it should provide some flexibility, letting people obtain the combined set of data they need for a specific task. Very often, in enterprise situations, multiple things are then run on top of the same combined set of data - multidimensional data analysis tools (cubes), reports, data mining, etc.

Some of the benefits of this might include:

Individuals saving time when they otherwise would have needed to combine the data themselves.
If the data which needs to be combined is complex, or some people do not have proficiency at handling that part of the process, then there is less risk of data being combined incorrectly; you can be sure that different pieces of work have used the same source data.
If the data suffers from data quality issues, you resolve this once in the data warehouse, rather than working around it or resolving it repeatedly in code.
If new data is constantly being received, collection and integration of this into the data warehouse can be carried out automatically.
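As a rough illustration of the data quality point, the sketch below resolves an inconsistency once, in a database view, so that every piece of work reads the same cleaned data. It uses Python's built-in `sqlite3` module; the table `raw_person`, the view `person_clean`, and the messy gender codes are all made up for the example, and a view is only one of several ways to do this (cleaning during the load step is another).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_person (person_id INTEGER, gender TEXT);
INSERT INTO raw_person VALUES (1, 'F'), (2, ' male'), (3, 'Female'), (4, 'M');

-- The fix for inconsistent gender codes lives here, in one place.
CREATE VIEW person_clean AS
SELECT person_id,
       CASE UPPER(SUBSTR(TRIM(gender), 1, 1))
            WHEN 'F' THEN 'F'
            WHEN 'M' THEN 'M'
            ELSE NULL
       END AS gender
FROM raw_person;
""")


def count_by_gender(conn):
    """One analysis: counts per gender, read from the cleaned view."""
    return conn.execute(
        "SELECT gender, COUNT(*) FROM person_clean "
        "GROUP BY gender ORDER BY gender"
    ).fetchall()


def ids_for(conn, gender):
    """A second analysis: ids for one gender, using the same cleaned view."""
    cursor = conn.execute(
        "SELECT person_id FROM person_clean WHERE gender = ? ORDER BY person_id",
        (gender,),
    )
    return [row[0] for row in cursor]


print(count_by_gender(conn))
print(ids_for(conn, "F"))
```

Neither function contains any cleaning logic of its own, so if the rule for interpreting the raw codes changes, only the view needs to be updated.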

Like I say, I can't decide for you whether this is a useful direction or not - as with any decision of this kind you'll need to weigh up the costs of implementing such a solution against the benefits, and both will be specific to your individual case. But hopefully this answers your core question of why someone might choose to do this work in a database instead of in their code, and gives you a starting point to work from.

answered Oct 22 '22 00:10

Jo Douglass

Tags:

database

dataframe

database-design

data-processing

data-collection
