I have a table with 70 million records and a missing index. I want to estimate how long adding the index will take, without backing up the table and building the index on the backup first.
I am just wondering if it will be twice as slow (linear) or if it grows exponentially.
database: mysql 5.0
Thanks a lot
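For concreteness, the statement in question would look roughly like the sketch below; the table name orders, column customer_id and index name idx_customer_id are placeholders, since the question does not give the real ones.

    -- Hypothetical example: add a secondary index to the 70M-row table.
    -- In MySQL 5.0 this rebuilds the table and blocks writes while it runs.
    ALTER TABLE orders ADD INDEX idx_customer_id (customer_id);

    -- From a second session you can watch what the server is doing; the State
    -- column shows phases such as "copy to tmp table" (exact wording depends
    -- on the storage engine and version).
    SHOW PROCESSLIST;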
(Disclaimer: I have minimal experience on MySQL)
It should be somewhere in-between.
The absolute lower bound on the complexity of the whole operation is that of simply reading all the records in order, which is a linear process - O(n). This is an I/O-bound operation and there is not much that can be done about it - modern OS caching may help, but only for a DB that is in use and fits in the available memory.
In most SQL engines, indexes are some variation of a B-tree. The CPU complexity of inserting a single record into such a tree is roughly O(log(n)), where n is its size. For n records we get O(n log(n)), so the total complexity of the operation should be O(n log(n)).
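To put rough numbers on that (an illustrative back-of-the-envelope figure, not a timing prediction): with n = 70,000,000, log2(n) is about 26, so an O(n log(n)) build is on the order of 1.8 billion comparisons versus 70 million for a single linear pass.

    -- Back-of-the-envelope comparison for n = 70 million (MySQL's LOG2 is the base-2 log):
    SELECT
        70000000                  AS linear_ops,    -- one sequential pass over the rows
        70000000 * LOG2(70000000) AS n_log_n_ops;   -- roughly 1.8 billion comparisons

The two differ only by a factor of about 26, which is one way to see why the operation stays nowhere near exponential growth.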
Of course, it's not quite that simple. Computing the index tree is not really CPU-heavy, and since the index pages should fit in RAM on any modern system, inserting a single node while the tree is not being rebalanced would be close to O(1) time-wise: a single disk operation to update a leaf page of the index.
Since the tree does get rebalanced, however, things are probably a bit more complex. Multiple index pages may have to be committed to disk, thus increasing the necessary time. As a rough guess, I'd say O(n log(n)) is a good start...
It should never come anywhere close to an exponential complexity, though.
EDIT:
It just occurred to me that 70,000,000 B-tree entries may not, in fact, fit in the in-memory cache. It would depend heavily on what is being indexed. INTEGER columns would probably be fine, but TEXT columns are another story altogether. If the average field length is 100 bytes (e.g. HTTP links or 30 characters of non-English UTF-8 text), you'd need more than 7GB of memory to store the index.
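A quick way to sanity-check that figure against your own server (the 100-byte key length is this answer's assumption; substitute your real column size) is shown below. B-tree page and row-pointer overhead push the real index size above the raw key data.

    -- Raw key data: 70M entries at ~100 bytes each, before B-tree overhead:
    SELECT 70000000 * 100 / POW(1024, 3) AS raw_key_data_gb;   -- ~6.5 GB, more than 7GB with overhead

    -- Memory MySQL can actually devote to caching indexes:
    SHOW VARIABLES LIKE 'key_buffer_size';           -- MyISAM index cache
    SHOW VARIABLES LIKE 'innodb_buffer_pool_size';   -- InnoDB data+index cache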
Bottom line:
If the index fits in the cache, then, since building the index should be a single DB transaction, the operation would be I/O-bound and roughly linear: all the records have to be read, and then the index itself has to be written out to permanent storage.
If the index does not fit in the cache, then the complexity rises, as I/O wait-times on the index itself become involved in each operation.
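One practical way to see which of those two cases applies is to compare the table's current index footprint (plus the estimate above for the new index) against the cache sizes; the table name orders is again a placeholder.

    -- Current on-disk data and index sizes (information_schema is available in MySQL 5.0):
    SELECT table_name,
           data_length  / POW(1024, 3) AS data_gb,
           index_length / POW(1024, 3) AS index_gb
    FROM information_schema.TABLES
    WHERE table_name = 'orders';

    -- Roughly the same numbers, per table:
    SHOW TABLE STATUS LIKE 'orders';

If the existing indexes plus the new one clearly exceed the configured cache, expect the build to slow down well beyond the roughly linear case.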