What's a good way to structure a 100M record table for fast ad-hoc queries?

The scenario is quite simple: there are about 100M records in a table with 10 columns (a kind of analytics data), and I need to be able to run queries on any combination of those 10 columns. For example something like this:

  • how many records with a = 3 && b > 100 are there in past 3 months?

Basically all of the queries are going to be of the form "how many records with attributes X are there in time interval Y", where X can be any combination of those 10 columns.
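In SQL terms, the bullet above maps to something like the following (the table and column names here are just placeholders, with MySQL-style date arithmetic assumed):

    -- Hypothetical table: events(a INT, b INT, ..., created_at DATETIME)
    SELECT COUNT(*)
    FROM events
    WHERE a = 3
      AND b > 100
      AND created_at >= NOW() - INTERVAL 3 MONTH;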

The data will keep coming in; it is not just a fixed set of 100M records, but a table that keeps growing over time.

Since the column selection can be completely random, creating indexes for popular combinations is most likely not possible.
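Just to put a number on it: with 10 filterable columns there are 2^10 - 1 = 1023 possible column combinations, and covering only the two- and three-column combinations with composite indexes would already mean maintaining 165 indexes (45 + 120) on a table that is constantly being written to.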

The question has two parts:

  • How should I structure this in a SQL database to make the queries as fast as possible, and what are some general steps I can take to improve performance?
  • Is there any kind of NoSQL database that is optimized for this kind of search? The only one I can think of is ElasticSearch, but I'm not sure it would perform very well on such a large data set.
asked Apr 27 '12 by Jakub Arnold


1 Answer

Without indexes, your options for tuning an RDBMS to support this kind of processing are severely limited: basically you need massive parallelism and super-fast kit. But clearly you're not storing relational data, so an RDBMS is the wrong fit.

If you pursue the parallel route, the industry standard is Hadoop. You can still use SQL-style queries through Hive.
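As a very rough sketch of what that looks like in practice (all names here are invented for illustration, and exact syntax depends on your Hive version), you could land the events in a table partitioned by day, so the time-interval predicate prunes partitions instead of scanning the whole history:

    -- Hive sketch (hypothetical names)
    CREATE TABLE events (
      a INT,
      b INT,
      -- ... remaining attribute columns ...
      created_at TIMESTAMP
    )
    PARTITIONED BY (event_date STRING);

    -- The ad-hoc counts stay plain SQL; only the matching
    -- daily partitions are scanned.
    SELECT COUNT(*)
    FROM events
    WHERE event_date >= '2012-01-27'   -- i.e. the past 3 months
      AND a = 3
      AND b > 100;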

Another NoSQL option would be to consider a columnar database. These are an alternative way of organising data for analytics without using cubes, and they are good at loading data fast. Vectorwise is the latest player in the arena; I haven't used it personally, but somebody at last night's LondonData meetup was raving to me about it. Check it out.

Of course, moving away from SQL databases - in whatever direction you go - will incur a steep learning curve.

answered Oct 12 '22 by APC