How are varchar columns handled internally by a database engine?
For a column defined as char(100)
, the DBMS allocates 100 contiguous bytes on the disk. However, for a column defined as varchar(100)
, that presumably isn't the case, since the whole point of varchar
is to not allocate any more space than required to store the actual data value stored in the column. So, when a user updates a database row containing an empty varchar(100)
column to a value consisting of 80 characters for instance, where does the space for that 80 characters get allocated from?
It seems that varchar
columns must result in a fair amount of fragmentation of the actual database rows, at least in scenarios where column values are initially inserted as blank or NULL, and then updated later with actual values. Does this fragmentation result in degraded performance on database queries, as opposed to using char type values, where the space for the columns stored in the rows is allocated contiguously? Obviously using varchar
results in less disk space than using char, but is there a performance hit when optimizing for query performance, especially for columns whose values are frequently updated after the initial insert?
There is no performance impact whether you use the full length VARCHAR declaration VARCHAR(16777216) or use a smaller precision VARCHAR datatype column.
MySQL VARCHAR vs. TEXT: The Short Answer In most circumstances, VARCHAR provides better performance, it's more flexible, and can be fully indexed. If you need to store longer strings, use MEDIUMTEXT or LONGTEXT, but be aware that very large amounts of data can be stored in columns of these types.
Index Fragmentation Can Hinder Performance As you insert data into a table, if the data is under the SQL Server's data page size, then SQL Server will allocate one page to store that data. Otherwise, SQL Server will allocate multiple pages to store the data, and these data pages are often not full.
Because of the fixed field lengths, data is pulled straight from the column without doing any data manipulation and index lookups against varchar are slower than that of char fields. CHAR is better than VARCHAR performance wise, however, it takes unnecessary memory space when the data does not have a fixed-length.
You make a lot of assumptions in your question that aren't necessarily true.
The type of the a column in any DBMS tells you nothing at all about the nature of the storage of that data unless the documentation clearly tells you how the data is stored. IF that's not stated, you don't know how it is stored and the DBMS is free to change the storage mechanism from release to release.
In fact some databases store CHAR fields internally as VARCHAR, while others make a decision about how to the store the column based on the declared size of the column. Some database store VARCHAR with the other columns, some with BLOB data, and some implement other storage, Some databases always rewrite the entire row when a column is updated, others don't. Some pad VARCHARs to allow for limited future updating without relocating the storage.
The DBMS is responsible for figuring out how to store the data and return it to you in a speedy and consistent fashion. It always amazes me how many people to try out think the database, generally in advance of detecting any performance problem.
The data structures used inside a database engine is far more complex than you are giving it credit for! Yes, there are issues of fragmentation and issues where updating a varchar with a large value can cause a performance hit, however its difficult to explain /understand what the implications of those issues are without a fuller understanding of the datastructures involved.
For MS Sql server you might want to start with understanding pages - the fundamental unit of storage (see http://msdn.microsoft.com/en-us/library/ms190969.aspx)
In terms of the performance implications of fixes vs variable storage types on performance there are a number of points to consider:
As you can see, the situation is rather complex - generally speaking however you can trust the database engine to be pretty good at dealing with variable data types and they should be the data type of choice when there may be a significant variance of the length of data held in a column.
At this point I'm also going to recommend the excellent book "Microsoft Sql Server 2008 Internals" for some more insight into how complex things like this really get!
The answer will depend on the specific DBMS. For Oracle, it is certainly possible to end up with fragmentation in the form of "chained rows", and that incurs a performance penalty. However, you can mitigate against that by pre-allocating some empty space in the table blocks to allow for some expansion due to updates. However, CHAR columns will typically make the table much bigger, which has its own impact on performance. CHAR also has other issues such as blank-padded comparisons which mean that, in Oracle, use of the CHAR datatype is almost never a good idea.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With