MySQL Performance issues with large amounts of data

I have a software project that I am working on at work that has been driving me crazy. Here's our problem: we have a series of data contacts that need to be logged every second. Each record needs to include time, bearing (an array of 360-1080 bytes), range, and a few other fields. Our system also needs to be able to store this data for up to 30 days. In practice there can be up to 100 different contacts, so at a maximum there can be anywhere from around 150,000,000 to about 1,000,000,000 data points over 30 days.

I'm trying to think of the best method for storing all of this data and retrieving it later on. My first thought was to use an RDBMS like MySQL. Being an embedded C/C++ programmer, I have very little experience working with MySQL on data sets this large. I've dabbled with it on small datasets, but nothing nearly this big. I generated the schema below for the tables that will store some of the data:

CREATE TABLE IF NOT EXISTS `HEADER_TABLE` (
  `header_id` tinyint(3) unsigned NOT NULL auto_increment,
  `sensor` varchar(10) NOT NULL,
  `bytes` smallint(5) unsigned NOT NULL,
  PRIMARY KEY  (`header_id`),
  UNIQUE KEY `header_id_UNIQUE` (`header_id`),
  UNIQUE KEY `sensor_UNIQUE` (`sensor`)
) ENGINE=MyISAM AUTO_INCREMENT=0 DEFAULT CHARSET=latin1;

CREATE TABLE IF NOT EXISTS `RAW_DATA_TABLE` (
  `internal_id` bigint(20) NOT NULL auto_increment,
  `time_sec` bigint(20) unsigned NOT NULL,
  `time_nsec` bigint(20) unsigned NOT NULL,
  `transverse` bit(1) NOT NULL default b'0',
  `data` varbinary(1080) NOT NULL,
  PRIMARY KEY  (`internal_id`,`time_sec`,`time_nsec`),
  UNIQUE KEY `internal_id_UNIQUE` (`internal_id`),
  KEY `time` (`time_sec`),
  KEY `internal_id` (`internal_id`)
) ENGINE=MyISAM AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;

CREATE TABLE IF NOT EXISTS `rel_RASTER_TABLE` (
  `internal_id` bigint(20) NOT NULL auto_increment,
  `raster_id` int(10) unsigned NOT NULL,
  `time_sec` bigint(20) unsigned NOT NULL,
  `time_nsec` bigint(20) unsigned NOT NULL,
  `header_id` tinyint(3) unsigned NOT NULL,
  `data_id` bigint(20) unsigned NOT NULL,
  PRIMARY KEY  (`internal_id`, `raster_id`,`time_sec`,`time_nsec`),
  KEY `raster_id` (`raster_id`),
  KEY `time` (`time_sec`),
  KEY `data` (`data_id`)
) ENGINE=MyISAM AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;

The header table only contains 10 rows and is static. It just tells which sensor the raw data came from and the number of bytes output by that type of sensor. The RAW_DATA_TABLE essentially stores the raw bearing data (an array of 360-1080 bytes, representing up to three samples per degree). The rel_RASTER_TABLE holds metadata for the RAW_DATA_TABLE; there can be multiple contacts that refer to the same raw data row. The data_id in rel_RASTER_TABLE points to the internal_id of some row in the RAW_DATA_TABLE; I did this to decrease the number of writes needed.
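To make that relationship concrete, here is a rough sketch of the write sequence (every literal value below is made up for illustration):

-- Hypothetical write sequence: the raw bearing array is written once,
-- then several contacts (raster_ids) reference it through data_id.
INSERT INTO RAW_DATA_TABLE (time_sec, time_nsec, transverse, data)
VALUES (1315849228, 500000000, b'0', x'0A0B0C');  -- real payload is 360-1080 bytes

-- Suppose the row above received internal_id = 42 (LAST_INSERT_ID()).
-- Two contacts now share that single raw row:
INSERT INTO rel_RASTER_TABLE (raster_id, time_sec, time_nsec, header_id, data_id)
VALUES (2952704, 1315849228, 500000000, 3, 42),
       (2952705, 1315849228, 500000000, 3, 42);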

Obviously, as you can probably tell, I'm having performance issues when reading from and deleting from this database. An operator of our software can see real-time data as it comes across and can also go into reconstruction mode and overlay a data range from the past, the past week for example. Our backend logging server grabs the history rows and sends them to a display via a CORBA interface. While all of this is happening, I have a worker thread that deletes 1,000 rows at a time for data older than 30 days. This is there in case a session runs longer than 30 days, which can happen.
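Roughly, that age-off pass looks like the following (a sketch only; the cutoff literal is invented, and in the real code it is computed in the C++ worker thread as "now minus 30 days"):

-- Hypothetical chunked age-off: delete at most 1000 rows per pass so the
-- real-time consumer thread isn't starved for too long.
DELETE FROM rel_RASTER_TABLE
WHERE time_sec < 1313257228
ORDER BY time_sec
LIMIT 1000;

DELETE FROM RAW_DATA_TABLE
WHERE time_sec < 1313257228
ORDER BY time_sec
LIMIT 1000;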

The system we currently have implemented works well for smaller sets of data, but not for large ones. Our select and delete statements can take upwards of 2 minutes to return results, which completely kills the performance of our real-time consumer thread. I suspect we're not designing our schemas correctly, picking the wrong keys, not optimizing our SQL queries correctly, or some combination of the three. Our writes don't seem to be affected unless the other operations take too long to run.

Here is an example SQL query we use to get history data:

SELECT 
  rel_RASTER_TABLE.time_sec, 
  rel_RASTER_TABLE.time_nsec, 
  RAW_DATA_TABLE.transverse, 
  HEADER_TABLE.bytes, 
  RAW_DATA_TABLE.data 
FROM 
  RASTER_DB.HEADER_TABLE, 
  RASTER_DB.RAW_DATA_TABLE, 
  RASTER_DB.rel_RASTER_TABLE 
WHERE 
  rel_RASTER_TABLE.raster_id = 2952704 AND 
  rel_RASTER_TABLE.time_sec >= 1315849228 AND 
  rel_RASTER_TABLE.time_sec <= 1315935628 AND 
  rel_RASTER_TABLE.data_id = RAW_DATA_TABLE.internal_id AND 
  rel_RASTER_TABLE.header_id = HEADER_TABLE.header_id;

I apologize in advance for this being such a long question, but I've tapped out other resources and this is my last resort, so I figured I'd try to be as descriptive as possible. Do you guys see any way I can improve upon our design at first glance? Or any way we can optimize our select and delete statements for such large data sets? We're currently running RHEL as the OS and unfortunately can't change our hardware configuration on the server (4 GB RAM, quad core). We're using C/C++ and the MySQL API. ANY speed improvements would be EXTREMELY beneficial. If you need me to clarify anything, please let me know. Thanks!

EDIT: BTW, if you can't provide specific help, maybe you can link me to some excellent tutorials you've come across for optimizing SQL queries, schema design, or MySQL tuning?

asked Sep 16 '11 by demarr


2 Answers

The first thing you could try is de-normalizing the data. On a data set of that size, doing a join, even if you have indexes, is going to require very intense computation. Turn those three tables into one table. Sure, there will be duplicate data, but without joins it will be much easier to work with.

Second, see if you can get a machine with enough memory to fit the whole table in memory. It doesn't cost much ($1,000 or less) for a machine with 24 GB of RAM. I'm not sure if that will hold your entire data set, but it will help tremendously. Get an SSD as well. For anything that isn't stored in memory, an SSD should help you access it at high speed.

Third, look into other data storage technologies such as BigTable that are designed to deal with very large data sets.
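A minimal sketch of what that single denormalized table could look like, reusing the column names and types from the question (the table and index names here are made up):

-- Hypothetical denormalized layout: one row per contact sample, no joins needed.
CREATE TABLE IF NOT EXISTS `CONTACT_SAMPLES` (
  `internal_id` bigint(20) unsigned NOT NULL auto_increment,
  `raster_id` int(10) unsigned NOT NULL,
  `time_sec` bigint(20) unsigned NOT NULL,
  `time_nsec` bigint(20) unsigned NOT NULL,
  `sensor` varchar(10) NOT NULL,            -- folded in from HEADER_TABLE
  `bytes` smallint(5) unsigned NOT NULL,    -- folded in from HEADER_TABLE
  `transverse` bit(1) NOT NULL default b'0',
  `data` varbinary(1080) NOT NULL,          -- duplicated across contacts sharing a sample
  PRIMARY KEY (`internal_id`),
  KEY `raster_time` (`raster_id`,`time_sec`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;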

answered by Kibbee


I would say partitioning is an absolute must in a case like this:

  • large amount of data
  • new data coming in continuously
  • implicit: old data getting deleted continuously.

Check out the partitioning documentation for MySQL.

Looking at your select statement (which filters on time), I'd say partition on the time column.

Of course you might want to add a few indexes based on the queries you run most frequently.
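For example (assuming MySQL 5.1+), a range-partitioning sketch on time_sec for rel_RASTER_TABLE could look like this; the day boundaries and partition names are purely illustrative. Note that MySQL requires the partitioning column to appear in every unique key of the table, which rel_RASTER_TABLE's primary key already satisfies:

-- Hypothetical: partition rel_RASTER_TABLE by day on time_sec.
-- Boundary values are illustrative epoch seconds; a new partition would be
-- added (or the MAXVALUE partition reorganized) ahead of each new day.
ALTER TABLE rel_RASTER_TABLE
PARTITION BY RANGE (time_sec) (
  PARTITION p0   VALUES LESS THAN (1315872000),
  PARTITION p1   VALUES LESS THAN (1315958400),
  PARTITION p2   VALUES LESS THAN (1316044800),
  PARTITION pMax VALUES LESS THAN MAXVALUE
);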

--edit--

I see that many have suggested indexes. My experience has been that having an index on a table with a really large number of rows either kills performance (eventually) or requires a lot of resources (CPU, memory, ...) to keep the indexes up to date. So although I also suggest adding indexes, note that it's absolutely useless unless you partition the table first. Finally, follow symcbean's advice (optimize your indexes in number and keys) when you add indexes.

--edit end--

A quickie on partitioning if you're new to it.

  • Usually a single table translates to a single data file. A partitioned table translates to one file per partition.
  • Advantages
    • insertions are faster because each row is physically written to a smaller file (partition).
    • deleting a large number of rows usually translates to dropping a partition, which is much, much cheaper than 'DELETE FROM xxx WHERE time > 100 AND time < 200' (see the sketch after this list);
    • queries with a WHERE clause on the key by which the table is partitioned are much faster.
    • Index building is faster.
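
Under a scheme like the one sketched above, the 30-day age-off from the question becomes a near-instant metadata operation instead of a long row-by-row delete:

-- Hypothetical age-off: dropping the oldest day's partition removes all of its
-- rows at once, instead of deleting them 1000 at a time.
ALTER TABLE rel_RASTER_TABLE DROP PARTITION p0;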
answered by Kashyap