MySQL Performance issues with large amounts of data

I have a software project that I am working on at work that has been driving me crazy. Here's our problem: we have a series of data contacts that need to be logged every second. Each record needs to include time, bearing (an array of 360-1080 bytes), range, and a few other fields. Our system also needs to be able to store this data for up to 30 days. In practice there can be up to 100 different contacts, so at a maximum there can be anywhere from around 150,000,000 to about 1,000,000,000 data points over 30 days.

I'm trying to think of the best method for storing all of this data and retrieving it later on. My first thought was to use an RDBMS like MySQL. Being an embedded C/C++ programmer, I have very little experience working with MySQL on data sets this large. I've dabbled with it on small datasets, but nothing nearly this big. I generated the schema below for the tables that will store some of the data:

CREATE TABLE IF NOT EXISTS `HEADER_TABLE` (
  `header_id` tinyint(3) unsigned NOT NULL auto_increment,
  `sensor` varchar(10) NOT NULL,
  `bytes` smallint(5) unsigned NOT NULL,
  PRIMARY KEY  (`header_id`),
  UNIQUE KEY `header_id_UNIQUE` (`header_id`),
  UNIQUE KEY `sensor_UNIQUE` (`sensor`)
) ENGINE=MyISAM AUTO_INCREMENT=0 DEFAULT CHARSET=latin1;

CREATE TABLE IF NOT EXISTS `RAW_DATA_TABLE` (
  `internal_id` bigint(20) NOT NULL auto_increment,
  `time_sec` bigint(20) unsigned NOT NULL,
  `time_nsec` bigint(20) unsigned NOT NULL,
  `transverse` bit(1) NOT NULL default b'0',
  `data` varbinary(1080) NOT NULL,
  PRIMARY KEY  (`internal_id`,`time_sec`,`time_nsec`),
  UNIQUE KEY `internal_id_UNIQUE` (`internal_id`),
  KEY `time` (`time_sec`),
  KEY `internal_id` (`internal_id`)
) ENGINE=MyISAM AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;

CREATE TABLE IF NOT EXISTS `rel_RASTER_TABLE` (
  `internal_id` bigint(20) NOT NULL auto_increment,
  `raster_id` int(10) unsigned NOT NULL,
  `time_sec` bigint(20) unsigned NOT NULL,
  `time_nsec` bigint(20) unsigned NOT NULL,
  `header_id` tinyint(3) unsigned NOT NULL,
  `data_id` bigint(20) unsigned NOT NULL,
  PRIMARY KEY  (`internal_id`, `raster_id`,`time_sec`,`time_nsec`),
  KEY `raster_id` (`raster_id`),
  KEY `time` (`time_sec`),
  KEY `data` (`data_id`)
) ENGINE=MyISAM AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;

The header table only contains 10 rows and is static. It just tells which sensor the raw data came from and the number of bytes output by that type of sensor. The RAW_DATA_TABLE essentially stores the raw bearing data (an array of 360-1080 bytes, representing up to three samples per degree). The rel_RASTER_TABLE holds metadata for the RAW_DATA_TABLE; there can be multiple contacts that refer to the same raw data row. The data_id in rel_RASTER_TABLE points to the internal_id of some row in the RAW_DATA_TABLE; I did this to decrease the number of writes needed.
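To make that relationship concrete, here is a rough sketch of the write sequence (every literal value below is made up for illustration):

-- Hypothetical write sequence: the raw bearing array is written once,
-- then several contacts (raster_ids) reference it through data_id.
INSERT INTO RAW_DATA_TABLE (time_sec, time_nsec, transverse, data)
VALUES (1315849228, 500000000, b'0', x'0A0B0C');  -- real payload is 360-1080 bytes

-- Suppose the row above received internal_id = 42 (LAST_INSERT_ID()).
-- Two contacts now share that single raw row:
INSERT INTO rel_RASTER_TABLE (raster_id, time_sec, time_nsec, header_id, data_id)
VALUES (2952704, 1315849228, 500000000, 3, 42),
       (2952705, 1315849228, 500000000, 3, 42);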

Obviously, as you can probably tell, I'm having performance issues when reading from and deleting from this database. An operator of our software can see real-time data as it comes across and can also go into reconstruction mode and overlay a data range from the past, the past week for example. Our backend logging server grabs the history rows and sends them to a display via a CORBA interface. While all of this is happening, I have a worker thread that deletes 1,000 rows at a time for data older than 30 days. This is there in case a session runs longer than 30 days, which can happen.
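Roughly, that age-off pass looks like the following (a sketch only; the cutoff literal is invented, and in the real code it is computed in the C++ worker thread as "now minus 30 days"):

-- Hypothetical chunked age-off: delete at most 1000 rows per pass so the
-- real-time consumer thread isn't starved for too long.
DELETE FROM rel_RASTER_TABLE
WHERE time_sec < 1313257228
ORDER BY time_sec
LIMIT 1000;

DELETE FROM RAW_DATA_TABLE
WHERE time_sec < 1313257228
ORDER BY time_sec
LIMIT 1000;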

The system we currently have implemented works well for smaller sets of data, but not for large ones. Our select and delete statements can take upwards of 2 minutes to return results, which completely kills the performance of our real-time consumer thread. I suspect we're not designing our schemas correctly, picking the wrong keys, not optimizing our SQL queries correctly, or some combination of the three. Our writes don't seem to be affected unless the other operations take too long to run.

Here is an example SQL query we use to get history data:

SELECT 
  rel_RASTER_TABLE.time_sec, 
  rel_RASTER_TABLE.time_nsec, 
  RAW_DATA_TABLE.transverse, 
  HEADER_TABLE.bytes, 
  RAW_DATA_TABLE.data 
FROM 
  RASTER_DB.HEADER_TABLE, 
  RASTER_DB.RAW_DATA_TABLE, 
  RASTER_DB.rel_RASTER_TABLE 
WHERE 
  rel_RASTER_TABLE.raster_id = 2952704 AND 
  rel_RASTER_TABLE.time_sec >= 1315849228 AND 
  rel_RASTER_TABLE.time_sec <= 1315935628 AND 
  rel_RASTER_TABLE.data_id = RAW_DATA_TABLE.internal_id AND 
  rel_RASTER_TABLE.header_id = HEADER_TABLE.header_id;

I apologize in advance for this being such a long question, but I've tapped out other resources and this is my last resort, so I figured I'd try to be as descriptive as possible. Do you guys see any way I can improve upon our design at first glance? Or any way we can optimize our select and delete statements for such large data sets? We're currently running RHEL as the OS and unfortunately can't change our hardware configuration on the server (4 GB RAM, quad core). We're using C/C++ and the MySQL API. ANY speed improvements would be EXTREMELY beneficial. If you need me to clarify anything, please let me know. Thanks!

EDIT: BTW, if you can't provide specific help, maybe you can link me to some excellent tutorials you've come across for optimizing SQL queries, schema design, or MySQL tuning?

asked Sep 16 '11 by demarr


2 Answers

The first thing you could try is de-normalizing the data. On a data set of that size, doing a join, even if you have indexes, is going to require very intense computation. Turn those three tables into one table. Sure, there will be duplicate data, but without joins it will be much easier to work with.

Second, see if you can get a machine with enough memory to fit the whole table in memory. It doesn't cost much ($1,000 or less) for a machine with 24 GB of RAM. I'm not sure if that will hold your entire data set, but it will help tremendously. Get an SSD as well. For anything that isn't stored in memory, an SSD should help you access it at high speed.

Third, look into other data storage technologies such as BigTable that are designed to deal with very large data sets.
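A minimal sketch of what that single denormalized table could look like, reusing the column names and types from the question (the table and index names here are made up):

-- Hypothetical denormalized layout: one row per contact sample, no joins needed.
CREATE TABLE IF NOT EXISTS `CONTACT_SAMPLES` (
  `internal_id` bigint(20) unsigned NOT NULL auto_increment,
  `raster_id` int(10) unsigned NOT NULL,
  `time_sec` bigint(20) unsigned NOT NULL,
  `time_nsec` bigint(20) unsigned NOT NULL,
  `sensor` varchar(10) NOT NULL,            -- folded in from HEADER_TABLE
  `bytes` smallint(5) unsigned NOT NULL,    -- folded in from HEADER_TABLE
  `transverse` bit(1) NOT NULL default b'0',
  `data` varbinary(1080) NOT NULL,          -- duplicated across contacts sharing a sample
  PRIMARY KEY (`internal_id`),
  KEY `raster_time` (`raster_id`,`time_sec`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;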

answered by Kibbee


I would say partitioning is an absolute must in a case like this:

  • large amount of data
  • new data coming in continuously
  • implicit: old data getting deleted continuously.

Check out the partitioning documentation for MySQL.

Looking at your select statement (which filters on time), I'd say partition on the time column.

Of course you might want to add a few indexes based on the queries you run most frequently.
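For example (assuming MySQL 5.1+), a range-partitioning sketch on time_sec for rel_RASTER_TABLE could look like this; the day boundaries and partition names are purely illustrative. Note that MySQL requires the partitioning column to appear in every unique key of the table, which rel_RASTER_TABLE's primary key already satisfies:

-- Hypothetical: partition rel_RASTER_TABLE by day on time_sec.
-- Boundary values are illustrative epoch seconds; a new partition would be
-- added (or the MAXVALUE partition reorganized) ahead of each new day.
ALTER TABLE rel_RASTER_TABLE
PARTITION BY RANGE (time_sec) (
  PARTITION p0   VALUES LESS THAN (1315872000),
  PARTITION p1   VALUES LESS THAN (1315958400),
  PARTITION p2   VALUES LESS THAN (1316044800),
  PARTITION pMax VALUES LESS THAN MAXVALUE
);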

--edit--

I see that many have suggested indexes. My experience has been that having an index on a table with a really large number of rows either kills performance (eventually) or requires a lot of resources (CPU, memory, ...) to keep the indexes up to date. So although I also suggest adding indexes, note that it's absolutely useless unless you partition the table first. Finally, follow symcbean's advice (optimize your indexes in number and keys) when you add indexes.

--edit end--

A quickie on partitioning if you're new to it.

  • Usually a single table translates to a single data file. A partitioned table translates to one file per partition.
  • Advantages
    • insertions are faster because each row is physically written to a smaller file (partition).
    • deleting a large number of rows usually translates to dropping a partition, which is much, much cheaper than 'DELETE FROM xxx WHERE time > 100 AND time < 200' (see the sketch after this list);
    • queries with a WHERE clause on the key by which the table is partitioned are much faster.
    • Index building is faster.
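
Under a scheme like the one sketched above, the 30-day age-off from the question becomes a near-instant metadata operation instead of a long row-by-row delete:

-- Hypothetical age-off: dropping the oldest day's partition removes all of its
-- rows at once, instead of deleting them 1000 at a time.
ALTER TABLE rel_RASTER_TABLE DROP PARTITION p0;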
answered by Kashyap