Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Indexing Performance BigInt vs VarChar

This is a FACT Table in a Data Warehouse

It has a composite index as follows

ALTER TABLE [dbo].[Fact_Data] 
ADD  CONSTRAINT [PK_Fact_Data] 
PRIMARY KEY CLUSTERED 
(
    [Column1_VarChar_10] ASC,
    [Column2_VarChar_10] ASC,
    [Column3_Int] ASC,
    [Column4_Int] ASC,
    [Column5_VarChar_10] ASC,
    [Column6_VarChar_10] ASC,
    [Column7_DateTime] ASC,
    [Column8_DateTime] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, 
SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, 
ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON
) ON [PRIMARY]
GO

In this structure, all of the varchar 10 columns have numeric values only. Is it going to be beneficial for me to change this 78 million row structure to hold BIGINT instead of VARCHAR in terms of querying and indexing?

Any other benefits / drawbacks that I should consider?

like image 377
Raj More Avatar asked Oct 21 '09 20:10

Raj More


People also ask

What is difference between BIGINT and varchar?

BIGINT is always 8 bytes, VARCHAR(15) is 1.. 16 bytes depending on value length, so BIGINT needs less memory on large numbers, but more memory on small numbers (shorter than 7 digits). Also, BIGINT is faster.

Is int faster than varchar?

1 Answer. Int comparisons are faster than varchar comparisons, and ints take much less space than varchars. This is applicable true for both unindexed and indexed access. You can use an indexed int column to make it faster.

Does BIGINT take up more space?

BIGINT (-2^63 (-9,223,372,036,854,775,808) to 2^63-1 (9,223,372,036,854,775,807)) can store a much larger number than INT (-2^31 (-2,147,483,648) to 2^31-1 (2,147,483,647)) can, but uses double the storage space (8 bytes compared to 4 bytes).

Can varchar be indexed?

As you might know, the key columns of an index in SQL Server (Clustered Index, Non-Clustered Index) have a size limitation of a maximum of 900 bytes. Imagine now that you want to fool SQL Server, and create an index on a VARCHAR(8000) column and insert more than 900 bytes in the index key column.


2 Answers

You should DEFINITELY introduce a surrogate INT IDENTITY() primary key!! INT already gives you potentially up to 2 billion rows - isn't that enough??

This primary key / clustered key on SQL Server will be up to 64 bytes in size (instead of 4, for a INT) - which will make your clustered index AND all your non-clustered index be bloated beyond recognition. The whole clustering key (all your 8 columns) will be included on every single page of every single non-clustered index on that table - wasting lots and lots of space for sure.

So on any given index table, you would have up to 16 times more entries with a surrogate INT clustered key - that means lots less I/O, lots less time wasted reading index pages.

And just imagine trying to establish a foreign-key relationship to that table.... any child table would have to have all 8 columns of your primary key as foreign key columns, and specify all 8 columns in every join - what a nightmare!!

At 78 million rows, even just changing the clustering key to INT IDENTITY will save you up to 60 bytes per row - that alone would come out to be up to 4 GByte of disk space (and RAM usage in your server). And that's not even beginning to calculate the savings on the non-clustered indices.......

And of course, yes, I would also change the VARCHAR(10) to INT or BIGINT - if it's a number, make the field type numeric - no point in leaving it at VARCHAR(10), really. But that alone is not going to make a huge difference in terms of speed or performance - it just makes working with the data that much easier (don't have to always cast around to numeric types when e.g. comparing values and so forth).

Marc

like image 88
marc_s Avatar answered Nov 07 '22 05:11

marc_s


Two things that can affect index (and overall DB) performance:

1) Size of index page 2) Comparison speed

So for the first one, in general the smaller your index/data page is, the more pages you can hold in memory, and the greater the likelihood that a given query will be able to find the page in cache vs. slow disk. Thus, you'd want to use the smallest datatype that can comfortably fit your existing and proposed future needs.

BigInt is 8 bytes; the VARCHAR's can be smaller if the size of the data is small, so it really depends on your data. However, 10 character long numbers may be able to fit in SQL Server's INT datatype (http://msdn.microsoft.com/en-us/library/ms187745.aspx) depending on the size, so int vs. bigint depends on your domain.

Also, if your entire row is of a fixed length there are some certain optimizations SQL Server can do in scans since it knows exactly where on disk the next row will be (assuming the rows are contiguous). An edge case, to be sure, but it can help.

For the second one, it is faster to compare integers than unicode strings. So, if you are only storing number data, you definitely should switch to an appropriately sized numeric datatype.

Finally, Marc is correct that this becomes a very convoluted primary key. However, if your data warrants it -- such as these being your ONLY columns and you are never doing add'l queries -- you may be perfectly fine making the optimized version (with Bigints etc.) your primary key. Kind of a code smell, though, so I will echo his advise to really take a look at your data model and see if this is correct.

like image 42
Matt Rogish Avatar answered Nov 07 '22 06:11

Matt Rogish