
SQL Does Insert Order of Rows Matter?

I am a DB novice exploring Data Warehousing for the first time. I've completed the process of copying over a large volume of data from one of our production systems (MS SQL Server 2012) to our data warehouse (MySQL).

The problem I encountered was that the software/hardware resources I had available for the ETL process were not robust enough to copy over all of the data in my larger tables using a single query (program would run out of memory and crash). To get around this, I broke those tables into 12 chunks by adding a where clause using the modulo operator on the table id, since that was fast and easy to write:

SELECT * FROM table WHERE table.tableID % 12 = 0;
SELECT * FROM table WHERE table.tableID % 12 = 1;
SELECT * FROM table WHERE table.tableID % 12 = 2;
etc.

What I'm wondering now is whether this is going to impact my data warehouse's performance relative to the original database. In the old database, data was inserted roughly chronologically, and obviously that won't be the case for the new data warehouse.
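To make the resulting insert order concrete, here is a minimal Python sketch of the chunking scheme described above. The helper names `chunk_queries` and `insertion_order` are hypothetical, and the sketch assumes each chunk returns its rows in ascending id order, which the original queries do not guarantee because they have no ORDER BY.

```python
def chunk_queries(table, id_column, n_chunks):
    """Build one SELECT per residue class of id_column modulo n_chunks."""
    return [
        f"SELECT * FROM {table} WHERE {table}.{id_column} % {n_chunks} = {r};"
        for r in range(n_chunks)
    ]


def insertion_order(ids, n_chunks):
    """Order in which ids reach the warehouse when chunks are loaded one by one.

    Assumption: rows inside a chunk arrive in ascending id order.
    """
    return [i for r in range(n_chunks) for i in ids if i % n_chunks == r]


queries = chunk_queries("table", "tableID", 12)
ids = list(range(1, 25))           # 24 illustrative ids, not real data
order = insertion_order(ids, 12)   # starts 12, 24, 1, 13, ...
```

Every id lands in exactly one chunk, so nothing is lost or duplicated, but neighbouring ids end up far apart in the load sequence.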

I don't know enough about how DB engines actually store data to know if this is a problem. I have all the same indexes on the data warehouse as existed on the original table, but I don't know if DB engines will actually rearrange the data in memory according to the index to make reads faster.

Have I gotten myself into trouble by copying and inserting the data this way?

busoni34 asked Jun 16 '26 19:06



2 Answers

This probably won't make a difference. Databases can generally only take advantage of ordering within a table when a clustered index (of some sort) is declared. If you have one declared, then the data will be ordered on the data pages, regardless of the insertion order. If you don't have one, then the optimizer cannot take advantage of the ordering.
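As a rough illustration of why a clustered index makes insertion order irrelevant, here is a toy Python model. It is only a sketch: `ToyClusteredTable` is a hypothetical class that keeps rows in a sorted list, whereas real engines use B-tree pages on disk. The point it shows is just that the stored order follows the key, not the arrival order.

```python
import bisect


class ToyClusteredTable:
    """Toy model of a clustered table: rows are always kept in key order."""

    def __init__(self):
        self._rows = []  # list of (key, payload) tuples, sorted by key

    def insert(self, key, payload):
        # insort places the row at its sorted position, wherever it arrives
        bisect.insort(self._rows, (key, payload))

    def scan(self):
        """Return rows in their stored (key) order."""
        return list(self._rows)


ids = list(range(1, 25))
modulo_order = [i for r in range(12) for i in ids if i % 12 == r]

chronological = ToyClusteredTable()
for i in ids:
    chronological.insert(i, f"row{i}")

chunked = ToyClusteredTable()
for i in modulo_order:
    chunked.insert(i, f"row{i}")
```

Both tables end up with identical stored order, which is the behaviour described above for a table with a clustered index.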

There are some types of queries (notably EXISTS) whose performance might be affected by the order in which rows are actually read. But this would not be common. You could also have bad performance if the tables do not fit in memory and you are relying on similar data being co-located for performance. That is generally a bad assumption, but it could be underlying some queries.

There are some cases where data ordering might be producing results that seem correct, but these are "bad" SQL:

  • A query that has no ORDER BY clause but expects the results in a particular order.
  • A query that uses the MySQL (mis)feature that allows non-aggregated, non-key columns in the SELECT of an aggregation query.
  • A query that depends on the ordering of values in a GROUP_CONCAT() that has no ORDER BY clause.

These are "bad" because they are depending on observed behavior of the system, rather than documented behavior (and no doubt, I may have missed some).
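A quick way to check the first case is to load the same rows in two different orders and confirm that a query with an explicit ORDER BY returns the same result either way. The sketch below uses Python's built-in `sqlite3` as a stand-in for MySQL, purely so it is self-contained; the `events` table and the `load` helper are hypothetical.

```python
import sqlite3


def load(ids):
    """Create an in-memory table and insert rows in the given id order."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER, label TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [(i, f"row{i}") for i in ids],
    )
    return conn


ids = list(range(1, 25))
chronological = load(ids)
chunked = load([i for r in range(12) for i in ids if i % 12 == r])

# The explicit ORDER BY is what makes the result independent of load order.
query = "SELECT id, label FROM events ORDER BY id"
rows_a = chronological.execute(query).fetchall()
rows_b = chunked.execute(query).fetchall()
```

Without the ORDER BY, the order of the returned rows is whatever the engine happens to produce, so any code that relies on it is depending on observed rather than documented behavior.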

Of course, you can test your new system to see if this is the case. But a priori the ordering of the inserts would not be my first concern.

Gordon Linoff answered Jun 19 '26 11:06



If you have the same indexes, the data will be stored in more or less the same way. For example, if you have a hash index on a column, the implementation of that structure will be similar in MySQL and in MS SQL Server. The problem is that an OLTP workload is different from an OLAP one, so a good index for OLTP is not necessarily a good index for a data warehouse, but it depends on your data. Have a look at this article to better understand the differences between OLTP and OLAP: oltp vs olap. Try to think about how you can reduce table cardinalities. Say that in your OLTP system you store information about sales and you have something like this:

|  DateTime        | Product | QTY |
| ---------------- | --------|-----|
| 2018-03-05 10:50 |  prod1  |  5  |

in a table with 10^8 records. Maybe you only want to store the total quantity of products sold on each date, with something like this:

|    Date    |     Qty     |
|------------|-------------|
| 2018-03-05 |    10000    |

This will reduce the cardinality of your table and improve the efficiency of your application.
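The roll-up described above could look roughly like this. It is a sketch only, using Python's built-in `sqlite3` as a stand-in for the warehouse; the table names `sales` and `daily_sales`, the column names, and every row except the first are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sold_at TEXT, product TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2018-03-05 10:50", "prod1", 5),  # the row from the example table
        ("2018-03-05 14:20", "prod2", 3),  # hypothetical extra rows
        ("2018-03-06 09:15", "prod1", 7),
    ],
)

# Collapse per-sale rows into one row per calendar date.
conn.execute(
    """
    CREATE TABLE daily_sales AS
    SELECT date(sold_at) AS sale_date, SUM(qty) AS qty
    FROM sales
    GROUP BY date(sold_at)
    """
)

daily = conn.execute(
    "SELECT sale_date, qty FROM daily_sales ORDER BY sale_date"
).fetchall()
```

Three detail rows become two daily rows here. Whether that trade-off is acceptable depends on whether your reports ever need the per-product or per-minute detail that the roll-up discards.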

GJCode answered Jun 19 '26 12:06
