I have a contacts table which contains fields such as postcode, first name, last name, town, country, phone number, etc., all of which are defined as VARCHAR(255) even though none of these fields will ever come close to having 255 characters. (If you're wondering, it's this way because Ruby on Rails migrations map String fields to VARCHAR(255) by default and I never bothered to override it.)
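For concreteness, the table looks roughly like this (a sketch; the exact column names and the non-string columns are illustrative):

CREATE TABLE contacts (
  id           INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  first_name   VARCHAR(255),
  last_name    VARCHAR(255),
  postcode     VARCHAR(255),
  town         VARCHAR(255),
  country      VARCHAR(255),
  phone_number VARCHAR(255)
);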
Since VARCHAR will only store the number of actual characters of the field (along with the field length), is there any distinct advantage (performance or otherwise) to using, say, VARCHAR(16) over VARCHAR(255)?
Additionally, most of these fields have indexes on them. Does a larger VARCHAR size on the field affect the size or performance of the index at all?
FYI I'm using MySQL 5.
In storage, VARCHAR(255) is smart enough to store only the length you need on a given row, unlike CHAR(255), which would always store 255 characters.
But since you tagged this question with MySQL, I'll mention a MySQL-specific tip: as rows are copied from the storage engine layer to the SQL layer, VARCHAR fields are converted to CHAR to gain the advantage of working with fixed-width rows. So the strings in memory become padded out to the maximum length of your declared VARCHAR column.
When your query implicitly generates a temporary table, for instance while sorting or during a GROUP BY, this can use a lot of memory. If you use a lot of VARCHAR(255) fields for data that doesn't need to be that long, this can make the temporary table very large.
You may also like to know that this "padding out" behavior means that a string declared with the utf8 character set pads out to three bytes per character, even for strings you store with single-byte content (e.g. ascii or latin1 characters). Likewise, the utf8mb4 character set causes the string to pad out to four bytes per character in memory.
So a VARCHAR(255) in utf8 storing a short string like "No opinion" takes 11 bytes on disk (ten lower-charset characters, plus one byte for length), but it takes 765 bytes in memory, and thus in temp tables or sorted results.
I have helped MySQL users who unknowingly created 1.5GB temp tables frequently and filled up their disk space. They had lots of VARCHAR(255) columns that in practice stored very short strings.
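If you want to check whether this is happening on your own server, one way (using standard MySQL status counters, shown here as a sketch) is to watch how often implicit temp tables spill to disk:

SHOW GLOBAL STATUS LIKE 'Created_tmp%tables';
-- Created_tmp_tables:      implicit temp tables created
-- Created_tmp_disk_tables: those too large for memory, written to disk

-- The in-memory cutoff is the smaller of these two settings:
SHOW VARIABLES LIKE 'tmp_table_size';
SHOW VARIABLES LIKE 'max_heap_table_size';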
It's best to define the column based on the type of data you intend to store. This helps enforce application-related constraints, as other folks have mentioned, but it also has the physical benefit of avoiding the memory waste described above.
It's hard to know what the longest postal address is, of course, which is why many people choose a long VARCHAR that is certainly longer than any address. And 255 is customary because it is the maximum length of a VARCHAR for which the length can be encoded with one byte. It was also the maximum VARCHAR length in MySQL older than 5.0.
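To make that one-byte boundary concrete (a sketch; the prefix size is determined by the column's maximum byte length, so this assumes a single-byte character set like latin1):

CREATE TABLE length_prefix_demo (
  one_byte_prefix VARCHAR(255) CHARACTER SET latin1,  -- length stored in 1 byte
  two_byte_prefix VARCHAR(256) CHARACTER SET latin1   -- length now needs 2 bytes
);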
In addition to the size and performance considerations of setting the size of a varchar (and possibly more important than those, as storage and processing get cheaper every second), the disadvantage of using varchar(255) "just because" is reduced data integrity.
Defining maximum limits for strings is a good thing to do to prevent longer than expected strings from entering the RDBMS and causing buffer overruns or exceptions/errors later when retrieving and parsing values from the database that are longer (more bytes) than expected.
For example, if you have a field that accepts two-character strings for country abbreviations then you have no conceivable reason to expect your users (in this context, programmers) to input full country names. Since you don't want them to enter "Antigua and Barbuda" (AG) or "Heard Island and McDonald Islands" (HM), you don't allow it at the database layer. Also, it is likely some programmers have not yet RTFMed the design documentation (which surely exists) to know not to do this.
Set the field to accept two characters and let the RDBMS deal with it (either gracefully by truncating or ungracefully by rejecting their SQL with an error).
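A minimal sketch of that approach (hypothetical table; whether MySQL truncates or rejects depends on the sql_mode in effect):

CREATE TABLE countries (code CHAR(2) NOT NULL);

SET sql_mode = 'STRICT_ALL_TABLES';
INSERT INTO countries (code) VALUES ('Antigua and Barbuda');
-- ERROR 1406 (22001): Data too long for column 'code' at row 1

-- Without strict mode, the value is truncated to 'An' with a warning instead.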
Examples of real data that has no reason to exceed a certain length:
- ISO country codes: two or three characters
- US state abbreviations: two characters
- US ZIP codes: five digits (nine for ZIP+4)
And so on...
Take the time to think about your data and its limits. If you're an architect, developer, or programmer, it's your job, after all.
By using a varchar(n) instead of varchar(255) you eliminate the problem where users (end-users, programmers, other programs) enter unexpectedly long data that will come back to haunt your code later.
And I didn't say you shouldn't also implement this restriction in the business logic code used by your application.
I'm with you. Fussy attention to detail is a pain in the neck and has limited value.
Once upon a time, disk was a precious commodity and we used to sweat bullets to optimize it. The price of storage has fallen by a factor of 1,000, making the time spent on squeezing every byte less valuable.
If you use only CHAR fields, you can get fixed-length rows. This can save some disk real estate if you pick accurate sizes for your fields. You might get more densely packed data (fewer I/Os for table scans) and faster updates (it's easier to locate open spaces in a block for updates and inserts).

However, if you over-estimate your sizes, or if your actual data sizes are variable, you'll wind up wasting space with CHAR fields, and the data will be less densely packed (leading to more I/Os for big retrievals).
Generally, the performance benefits from attempting to put a size on variable fields are minor. You can easily benchmark by comparing VARCHAR(255) with CHAR(x) to see if you can measure the difference.
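One way to run that comparison (a sketch with hypothetical table names): load the same data into a CHAR variant and a VARCHAR variant, then compare the footprint reported by information_schema:

CREATE TABLE t_char    (name CHAR(255));
CREATE TABLE t_varchar (name VARCHAR(255));

-- ...load identical sample data into both tables...

SELECT table_name, data_length, avg_row_length
FROM information_schema.TABLES
WHERE table_schema = DATABASE()
  AND table_name IN ('t_char', 't_varchar');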
However, sometimes, I need to provide a "small", "medium", "large" hint. So I use 16, 64, and 255 for the sizes.
Nowadays, I can't imagine it really matters any more.
There's a computational overhead to using variable-length fields, but with the surplus of CPU power today, it's not even worth considering. The I/O system is so slow that any computational cost of handling varchars is effectively non-existent. In fact, the computational price of a varchar is probably a net win when set against the disk space saved by using variable-length fields over fixed-length fields: you most likely get greater row density.
Now, the complication with varchar fields is that you can't easily locate a record via its record number. When you have a fixed-length row size (with fixed-length fields), it's trivial to compute the disk block that a row id points to (for example, with 128-byte rows, row 1000 begins at byte offset 128 × 1000 = 128,000). With a variable-length row size, that kind of goes out the window.
So now you need to maintain some kind of record-number index, just like any other primary key, or you need to make a robust row identifier that encodes details (such as the block, etc.) into the identifier. If you do that, though, the id has to be recalculated whenever the row is moved on persistent storage. No big deal: you just need to rewrite all of the index entries and make sure you either a) never expose the identifier to the consumer or b) never assert that the number is reliable.
But since we have varchar fields today, the only value of varchar(16) over varchar(255) is that the DB will enforce the 16-character limit on the varchar(16). If the DB model is supposed to be actually representative of the physical data model, then having field lengths can be of value. If, however, it's simply "storage" rather than a "model AND storage", there's no need whatsoever.
Then you simply need to discern between a text field that is indexable (such as varchar) and one that is not (like a text or CLOB field). The indexable fields tend to have a limit on size to facilitate the index, whereas the CLOB fields do not (within reason).
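It's also worth noting, since the question asked about indexes (a MySQL-specific sketch, hypothetical names): you can index just a prefix of a long varchar, which keeps index entries small even when the declared column length is generous:

-- Index only the first 16 characters of last_name; the index stays compact
-- no matter how wide the column is declared.
ALTER TABLE contacts ADD INDEX idx_last_name_prefix (last_name(16));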