
Relational database versus R/Python data frames

I was exposed to the world of tables and data structures in R before I encountered RDBMSs and other database systems. It is quite elegant in R/Python to create tables and lists from structured data (.csv or other formats) and then manipulate the data programmatically.

Last year, I attended a course in database management and learnt all about structured and unstructured databases. I also noticed that it is the norm to feed data from multiple sources of data into databases rather than using it directly in R (for convenience and discipline?).

For research purposes, R seems to suffice for joining, appending, or even complicated data manipulations.

The question that keeps arising is: when should one use R directly, with commands such as read.csv, and when should one create a database and query its tables through the R-SQL interface?

For instance, suppose I have multi-source data such as (a) person-level information (age, gender, smoking habits), (b) outcome variables (such as surveys taken by participants in real time), (c) covariate information (environment characteristics), (d) treatment input (occurrence of an event that modifies the outcome, i.e. the survey response), and (e) time and space information of participants taking the survey.

How should I approach data collection and processing in this case? There may be standard industry procedures, but I ask here to understand the feasible and optimal approaches that individuals and small groups of researchers can adopt.

KarthikS, asked May 14 '15 18:05



1 Answer

What you're describing when you say "that it is the norm to feed data from multiple sources of data into databases" sounds more specifically like a data warehouse. Databases are used for many reasons, and in plenty of situations they will hold data from one source - for instance, a database used as the data store of a transactional system will often only hold the data needed to run that system, and the data produced by that system.

The process you're describing is commonly called Extract, Transform, Load (ETL), and you might find looking up information about ETL and data warehousing helpful if you decide to go in the direction of combining your data prior to working with it in R.
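As a rough illustration of what an ETL step can look like at small scale, here is a minimal sketch using only Python's standard library (`csv` and `sqlite3`). The CSV content, the `person` table, and its column names are made up for the example and loosely follow the person-level data in the question. A real pipeline would read from files or other sources and would need its own cleaning rules.

```python
import csv
import io
import sqlite3

# Hypothetical raw extract; in practice this would be read from a .csv file.
RAW_PERSONS = """person_id,age,gender,smoker
1,34,F,yes
2,51,m,no
3,,F,no
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: cast types and normalise inconsistent codes."""
    cleaned = []
    for row in rows:
        age = int(row["age"]) if row["age"] else None  # empty age becomes NULL
        gender = row["gender"].upper()                 # 'm' and 'M' become 'M'
        smoker = int(row["smoker"] == "yes")           # store yes/no as 1/0
        cleaned.append((int(row["person_id"]), age, gender, smoker))
    return cleaned


def load(conn, records):
    """Load: write the cleaned records into a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS person ("
        "person_id INTEGER PRIMARY KEY, age INTEGER, gender TEXT, smoker INTEGER)"
    )
    conn.executemany("INSERT INTO person VALUES (?, ?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")  # an on-disk path would make the data persistent
load(conn, transform(extract(RAW_PERSONS)))

rows = conn.execute(
    "SELECT person_id, age, gender, smoker FROM person ORDER BY person_id"
).fetchall()
print(rows)
```

The point of the sketch is only that the cleaning rules live in one place and run once at load time. Analysis code then reads the already-cleaned table, for example through an R or Python SQL interface.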

I can't tell you which you should choose, or the optimal way of accomplishing it, because it will vary in different situations and might even come down to opinion. What I can tell you are some of the reasons why people create data warehouses, and you can decide for yourself whether it might be useful in your situation:

A data warehouse can provide a central location to hold combined data. This means that people do not need to combine the data themselves each time they need to use that specific combination of data. Unlike something like a simple one-off report or extract of combined data, it should provide some flexibility, letting people obtain the combined set of data they need for a specific task. Very often, in enterprise situations, multiple things are then run on top of the same combined set of data - multidimensional data analysis tools (cubes), reports, data mining, etc.

Some of the benefits of this might include:

  • Individuals saving time when they otherwise would have needed to combine the data themselves.
  • If the data which needs to be combined is complex, or some people do not have proficiency at handling that part of the process, then there is less risk of data being combined incorrectly; you can be sure that different pieces of work have used the same source data.
  • If the data suffers from data quality issues, you resolve this once in the data warehouse, rather than working around it or resolving it repeatedly in code.
  • If new data is constantly being received, collection and integration of this into the data warehouse can be carried out automatically.
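To make the "combine once, reuse many times" idea concrete, here is a small sketch with Python's built-in `sqlite3`. The tables (`person`, `survey_response`, `treatment`), the view name `response_combined`, and all the values are hypothetical and modelled on the kinds of data listed in the question. The sketch also assumes at most one treatment event per person. The join logic is defined once as a view, and two separate analyses query that same view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, age INTEGER, smoker INTEGER);
CREATE TABLE survey_response (person_id INTEGER, taken_at TEXT, score REAL);
CREATE TABLE treatment (person_id INTEGER, occurred_at TEXT);

-- Hypothetical sample data.
INSERT INTO person VALUES (1, 34, 1), (2, 51, 0);
INSERT INTO survey_response VALUES
    (1, '2015-05-01', 3.0),
    (1, '2015-05-10', 5.0),
    (2, '2015-05-02', 4.0);
INSERT INTO treatment VALUES (1, '2015-05-05');

-- The combination logic is written once, here, instead of in every script.
-- ISO-8601 date strings compare correctly as text.
CREATE VIEW response_combined AS
SELECT s.person_id, p.age, p.smoker, s.taken_at, s.score,
       CASE WHEN t.occurred_at IS NOT NULL AND s.taken_at >= t.occurred_at
            THEN 1 ELSE 0 END AS after_treatment
FROM survey_response AS s
JOIN person AS p ON p.person_id = s.person_id
LEFT JOIN treatment AS t ON t.person_id = s.person_id;
""")

# Analysis 1: mean survey score before/without versus after treatment.
mean_by_treatment = conn.execute(
    "SELECT after_treatment, AVG(score) FROM response_combined "
    "GROUP BY after_treatment ORDER BY after_treatment"
).fetchall()

# Analysis 2: number of responses by smoking status, from the same view.
count_by_smoker = conn.execute(
    "SELECT smoker, COUNT(*) FROM response_combined "
    "GROUP BY smoker ORDER BY smoker"
).fetchall()

print(mean_by_treatment)
print(count_by_smoker)
```

Both queries rely on the same definition of `after_treatment`, so two pieces of work cannot silently disagree about how the sources were joined. Whether that is worth the setup cost for a small research group is the trade-off the list above describes.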

Like I say, I can't decide for you whether this is a useful direction or not - as with any decision of this kind you'll need to weigh up the costs of implementing such a solution against the benefits, and both will be specific to your individual case. But hopefully this answers your core question of why someone might choose to do this work in a database instead of in their code, and gives you a starting point to work from.

Jo Douglass, answered Oct 22 '22 00:10