
Set up large database in MySQL for analysis in R

Tags: mysql, macos, r

I have reached the limit of RAM in analyzing large datasets in R. I think my next step is to import these data into a MySQL database and use the RMySQL package. Largely because I don't know database lingo, I haven't been able to get beyond installing MySQL, despite hours of Googling and RSeeking. (I am running MySQL and MySQL Workbench on Mac OS X 10.6, but can also run Ubuntu 10.04.)

Is there a good reference on how to get started with this usage? At this point I don't want to do any sort of relational databasing. I just want to import .csv files into a local MySQL database and do the subsetting with RMySQL.

I appreciate any pointers (including "You're way off base!"), as I'm new to R and newer to large datasets. This one is around 80 MB.

asked Jul 27 '10 03:07 by Richard Herron




1 Answer

The documentation for RMySQL is pretty good - but it does assume that you know the basics of SQL. These are:

  1. creating a database
  2. creating a table
  3. getting data into the table
  4. getting data out of the table

Step 1 is easy: in the MySQL console, simply run "create database DBNAME;". Alternatively, use mysqladmin from the command line, or one of the MySQL admin GUIs.

Step 2 is a little more difficult, since you have to specify the table fields and their type. This will depend on the contents of your CSV (or other delimited) file. A simple example would look something like:

use DBNAME;
create table mydata(
  id INT(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,
  height FLOAT(5,2)
);

This creates a table with 2 fields: id, which is the primary key (so it has to be unique) and auto-increments as new records are added; and height, which here is specified as a float (a numeric type) with 5 digits in total, 2 of them after the decimal point (e.g. 100.27). It's important that you understand data types.
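Steps 1 and 2 can also be kept in a file so that the table can be recreated later. The following is a minimal sketch, assuming a POSIX shell; the file name setup.sql is an arbitrary choice, DBNAME is the same placeholder as above, and the IF NOT EXISTS clauses are an addition so the script can be rerun safely.

```shell
# Write the Step 1 and Step 2 statements to a file (setup.sql is a hypothetical name).
cat > setup.sql <<'EOF'
CREATE DATABASE IF NOT EXISTS DBNAME;
USE DBNAME;
CREATE TABLE IF NOT EXISTS mydata (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  height FLOAT(5,2)  -- 5 digits in total, 2 after the decimal point, so 100.27 fits
);
EOF

# Show what was written.
cat setup.sql
```

The file could then be run with `mysql -u DBUSERNAME -p < setup.sql`, which prompts for the password.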

Step 3 - there are various ways to import data into a table. One of the easiest is to use the mysqlimport utility, which loads a file into the table whose name matches the file name (ignoring any extension). In the example above, assuming that your data are in a file with the same name as the table (mydata), with no header row, where each line begins with a tab character (an empty id field) followed by the height value, this would work:

mysqlimport -u DBUSERNAME -pDBPASSWORD DBNAME mydata
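Since the question starts from .csv files, the data usually need a little preparation first. The following is a rough sketch, assuming a POSIX shell with awk and a simple CSV (a header row, no quoted fields, no embedded commas); heights.csv and its contents are made up for illustration. Rather than relying on an empty leading field, it writes only the height values, so the column has to be named explicitly at import time.

```shell
# Sample CSV with a header row (hypothetical data standing in for the real file).
printf 'height\n100.27\n48.50\n72.00\n' > heights.csv

# Drop the header row and keep the first comma-separated field.
# The output is named after the table, because mysqlimport strips the
# extension from the file name to decide which table to load.
awk -F',' 'NR > 1 { print $1 }' heights.csv > mydata.txt

# Show the prepared file.
cat mydata.txt
```

The result could then be loaded with `mysqlimport --local --columns=height -u DBUSERNAME -p DBNAME mydata.txt`, assuming the server permits local imports (local_infile enabled). Listing only height leaves id to be filled in by AUTO_INCREMENT. mysqlimport also accepts --fields-terminated-by and --ignore-lines, which may make the conversion unnecessary for a well-behaved CSV.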

Step 4 - requires that you know how to run MySQL queries. Again, a simple example:

select * from mydata where height > 50;

This means "fetch all rows (id + height) from the table mydata where height is more than 50".

Once you have mastered those basics, you can move to more complex examples such as creating 2 or more tables and running queries that join data from each.

Then - you can turn to the RMySQL manual. In RMySQL, you set up the database connection, then use SQL query syntax to return rows from the table as a data frame. So it really is important that you get the SQL part - the RMySQL part is easy.
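A minimal sketch of that workflow, written out from the shell as a small R script so that it can be run non-interactively. The file name query.R and the host "localhost" are assumptions, and the connection details are the same placeholders used earlier; dbConnect, dbGetQuery and dbDisconnect are the standard RMySQL/DBI calls.

```shell
# Write a short R script (query.R is a hypothetical file name).
cat > query.R <<'EOF'
library(RMySQL)

# Connection details are placeholders; replace them with your own.
con <- dbConnect(MySQL(), user = "DBUSERNAME", password = "DBPASSWORD",
                 dbname = "DBNAME", host = "localhost")

# The SQL runs on the server; only the matching rows come back as a data frame.
tall <- dbGetQuery(con, "select * from mydata where height > 50")
str(tall)

dbDisconnect(con)
EOF

# Show what was written.
cat query.R
```

With R and the RMySQL package installed and the MySQL server running, `Rscript query.R` would print the structure of the returned data frame. Because the subsetting happens in the query, only the selected rows need to fit in R's memory.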

There are heaps of MySQL and SQL tutorials on the web, including the "official" tutorial at the MySQL website. Just Google search "mysql tutorial".

Personally, I don't consider 80 MB to be a large dataset at all; I'm surprised that this is causing a RAM issue and I'm sure that native R functions can handle it quite easily. But it's good to learn new skills such as SQL, even if you don't need them for this problem.

answered Oct 24 '22 03:10 by neilfws