
Uniqueidentifier PK: Is a SQL Server heap the right choice?

OK. I've read things here and there about SQL Server heaps, but nothing too definitive to really guide me. I am going to try to measure performance, but was hoping for some guidance on what I should be looking into. This is SQL Server 2008 Enterprise. Here are the tables:

Jobs

  • JobID (PK, GUID, externally generated)
  • StartDate (datetime2)
  • AccountId
  • Several more accounting fields, mainly decimals and bigints

JobSteps

  • JobStepID (PK, GUID, externally generated)
  • JobID FK
  • StartDate
  • Several more accounting fields, mainly decimals and bigints

Usage: Lots of inserts (hundreds/sec), usually 1 JobStep per Job. Estimate perhaps 100-200M rows per month. No updates at all, and the only deletes are from archiving data older than 3 months.

We run ~10 queries/sec against the data. Some join JobSteps to Jobs; some look only at Jobs. Almost all queries range on StartDate, and most include AccountId and some of the other accounting fields (we have indexes on them). The queries are pretty simple - the largest part of each execution plan is the join to JobSteps.

The priority is the insert performance. Some lag (5 minutes or so) is tolerable for data to appear in the queries, so replicating to other servers and running queries off them is certainly allowable.

Lookup based on the GUIDs is very rare, apart from joining JobSteps to Jobs.

Current Setup: No clustered index. The only column that seems like a candidate is StartDate, but it isn't perfectly increasing: jobs can be inserted anywhere in a 3-hour window after their StartDate, which could mean a million rows arrive in an order that is not final.

Data size for one Job plus one JobStep row, with my current indexes, is about 500 bytes.
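For concreteness, the current layout can be sketched roughly like this (a minimal sketch - the column list is abbreviated and any names beyond those given above are placeholders; with no clustered index, both tables are heaps):

```sql
-- Rough sketch of the current heap layout (abbreviated; the extra
-- accounting columns are placeholders). PRIMARY KEY NONCLUSTERED means
-- neither table gets a clustered index, i.e. both remain heaps.
CREATE TABLE dbo.Jobs
(
    JobID      uniqueidentifier NOT NULL
        CONSTRAINT PK_Jobs PRIMARY KEY NONCLUSTERED,  -- externally generated GUID
    StartDate  datetime2        NOT NULL,
    AccountId  int              NOT NULL
    -- ... several more decimal/bigint accounting columns
);

CREATE TABLE dbo.JobSteps
(
    JobStepID  uniqueidentifier NOT NULL
        CONSTRAINT PK_JobSteps PRIMARY KEY NONCLUSTERED,
    JobID      uniqueidentifier NOT NULL
        CONSTRAINT FK_JobSteps_Jobs REFERENCES dbo.Jobs (JobID),
    StartDate  datetime2        NOT NULL
    -- ... several more decimal/bigint accounting columns
);
```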

Questions:

  • Is this a good use of a heap?

  • What's the effect of clustering on StartDate, when it's pretty much non-sequential for ~2 hours/1 million rows? My guess is the constant re-ordering would kill insert perf.

  • Should I just add bigint PKs just to have smaller, always increasing keys? (I'd still need the guids for lookups.)

I read *GUIDs as PRIMARY KEYs and/or the clustering key*, and it seemed to suggest that even inventing a surrogate key will save considerable space in the other indexes. Some resources also suggest that heaps have performance issues in general, but I'm not sure whether that still applies in SQL Server 2008.

And again, yes, I'm going to try to perf test and measure. I'm just trying to get some guidance or links to other articles so I can make a more informed decision on what paths to consider.

MichaelGG asked Dec 22 '22 09:12

1 Answer

Yes, heaps have issues. Your data will become logically fragmented all over the place and cannot easily be defragmented.

Imagine tearing the pages out of your telephone directory, throwing them into a bucket, and then trying to find "Bob Smith". Now compare that with a conventional telephone directory, which is effectively clustered on (lastname, firstname).

The overhead of maintaining the index is trivial.

StartDate, unless unique, is not a good choice on its own. The clustered key must be internally unique because it is used as the row locator in every non-clustered index; if the key is not declared unique, SQL Server silently adds a hidden 4-byte "uniquifier" to duplicate key values.
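One way to avoid the uniquifier, sketched below under the assumption that the GUID is kept (index name is illustrative): declare the clustered index unique by appending the GUID as a tiebreaker, so the leading StartDate column still supports the range queries.

```sql
-- Sketch: make the clustered key unique by appending JobID as a
-- tiebreaker, so SQL Server does not add the hidden 4-byte uniquifier.
-- Range predicates on StartDate still seek on the leading column.
CREATE UNIQUE CLUSTERED INDEX CIX_Jobs_StartDate
    ON dbo.Jobs (StartDate, JobID);
```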

Yes, I'd use an int or bigint surrogate key to make life easier. As for GUIDs as clustering keys, that topic has been covered at length in many related questions.

Edit:

Note that the PK and the clustered index are two separate concerns, even though SQL Server by default makes the PK clustered.
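A minimal sketch of that separation (names are illustrative, not prescriptive): the PK stays on the externally generated GUID but is declared NONCLUSTERED, while the clustered index lives on a small, ever-increasing bigint IDENTITY.

```sql
-- Sketch: bigint IDENTITY as the clustered key (narrow, ever-increasing,
-- so inserts always go to the end), with the externally generated GUID
-- kept as a NONCLUSTERED primary key for the rare direct lookups.
CREATE TABLE dbo.Jobs
(
    JobKey     bigint IDENTITY(1,1) NOT NULL,
    JobID      uniqueidentifier     NOT NULL
        CONSTRAINT PK_Jobs PRIMARY KEY NONCLUSTERED,
    StartDate  datetime2            NOT NULL,
    AccountId  int                  NOT NULL
    -- ... accounting columns
);

CREATE UNIQUE CLUSTERED INDEX CIX_Jobs_JobKey
    ON dbo.Jobs (JobKey);
```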

gbn answered Mar 01 '23 22:03