Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How should I store short text strings into a SQL Server database?

varchar(255), varchar(256), nvarchar(255), nvarchar(256), nvarchar(max), etc?

256 seems like a nice, round, space-efficient number. But I've seen 255 used a lot. Why?

What's the difference between varchar and nvarchar?

like image 876
Zack Peterson Avatar asked Sep 10 '08 15:09

Zack Peterson


2 Answers

In MS SQL Server (7.0 and up), varchar data is represented internally with up to three values:

  • The actual string of characters, which will be from 0 to something over 8000 bytes (it’s based on page size, the other columns stored for the row, and a few other factors)
  • Two bytes used to indicate how long the data string is (which produces a value from 0 to 8000+)
  • If the column is nullable, one bit in the row’s null bitmask (so the null status of up to eight nullable columns can be represented in one byte)

The important part is that two-byte data length indicator. If it was one byte, you could only properly record strings of length 0 to 255; with two bytes, you can record strings of length 0 to something over 64000+ (specifically, 2^16 -1). However, the SQL Server page length is 8k, which is where that 8000+ character limit comes from. (There's data overflow stuff in SQL 2005, but if your strings are going to be that long you should just go with varchar(max).)

So, no matter how long you declare your varchar datatype column to be (15, 127, 511), what you will actually be storing for each and every row is:

  • 2 bytes to indicate how long the string is
  • The actual string, i.e. the number of characters in that string

Which gets me to my point: a number of older systems used only 1 byte to store the string length, and that limited you to a maximum length of 255 characters, which isn’t all that long. With 2 bytes, you have no such arbitrary limit... and so I recommend picking a number that makes sense to the (presumed non-technically oriented) user. , I like 50, 100, 250, 500, even 1000. Given that base of 8000+ bytes of storage, 255 or 256 is just as efficient as 200 or 250, and less efficient when it comes time to explain things to the end users.

This applies to single byte data (i.e. ansii, SQL_Latin1*_*General_CP1, et. al.). If you have to store data for multiple code pages or languages using different alphabets, you’ll need to work with the nvarchar data type (which I think works the same, two bytes for number of charactesr, but each actual character of data requires two bytes of storage). If you have strings likely to go over 8000, or over 4000 in nvarchar, you will need to use the [n]varchar(max) datatypes.

And if you want to know why it is so very important to take up space with extra bytes just to track how long the data is, check out http://www.joelonsoftware.com/articles/fog0000000319.html

Philip

like image 96
Philip Kelley Avatar answered Oct 20 '22 07:10

Philip Kelley


There are a couple of other points to consider when defining char/varchar and the N variations.

First, there is some overhead to storing variable length strings in the database. A good general rule of thumb is to use CHAR for strings less than 10 chars long, since N/VARCHAR stores both the string and the length and the difference between storing short strings in N/CHAR vs. N/VARCHAR under 10 isn't worth the overhead of the string length.

Second, a table in SQL server is stored on 8KB pages, so the max size of the row of data is 8060 bytes (the other 192 are used for overhead by SQL). That's why SQL allows a max defined column of VARCHAR(8000) and NVARCHAR(4000). Now, you can use VARCHAR(MAX) and the unicode version. But there can be extra overhead associated with that.

If I'm not mistaken, SQL server will try to store the data on the same page as the rest of the row but, if you attempt to put too much data into a VARCHAR(Max) column, it will treat it as binary and store it on another page.

Another big difference between CHAR and VARCHAR has to do with page splits. Given that SQL Server stores data in 8KB pages, you could have any number of rows of data stored on a page. If you UPDATE a VARCHAR column with a value that is large enough that the row will no longer fit on the page, the server will split that page, moving off some number of records. If the database has no available pages and the database is set to auto grow, the server will first grow the database to allocate blank pages to it, then allocate blank pages to the table and finally split the single page into two.

like image 20
Josef Avatar answered Oct 20 '22 08:10

Josef