
Does varchar result in performance hit due to data fragmentation?

How are varchar columns handled internally by a database engine?

For a column defined as char(100), the DBMS allocates 100 contiguous bytes on the disk. However, for a column defined as varchar(100), that presumably isn't the case, since the whole point of varchar is to allocate no more space than is required to store the actual value in the column. So, when a user updates a database row containing an empty varchar(100) column to a value consisting of, say, 80 characters, where does the space for those 80 characters get allocated from?

It seems that varchar columns must result in a fair amount of fragmentation of the actual database rows, at least in scenarios where column values are initially inserted as blank or NULL and then updated later with actual values. Does this fragmentation degrade query performance compared with char columns, where the space for each column in the row is allocated contiguously? Obviously varchar uses less disk space than char, but is there a cost to query performance, especially for columns whose values are frequently updated after the initial insert?

E Brown asked Oct 27 '09 14:10



3 Answers

You make a lot of assumptions in your question that aren't necessarily true.

The type of a column in any DBMS tells you nothing at all about how that data is stored unless the documentation clearly tells you. If that's not stated, you don't know how it is stored, and the DBMS is free to change the storage mechanism from release to release.

In fact, some databases store CHAR fields internally as VARCHAR, while others decide how to store the column based on its declared size. Some databases store VARCHAR with the other columns, some store it with BLOB data, and some implement other storage. Some databases always rewrite the entire row when a column is updated; others don't. Some pad VARCHARs to allow for limited future updating without relocating the storage.

The DBMS is responsible for figuring out how to store the data and return it to you in a speedy and consistent fashion. It always amazes me how many people try to outthink the database, generally before detecting any performance problem.

Larry Lustig answered Nov 10 '22 10:11


The data structures used inside a database engine are far more complex than you are giving them credit for! Yes, there are issues of fragmentation, and updating a varchar with a large value can cause a performance hit. However, it's difficult to explain or understand the implications of those issues without a fuller understanding of the data structures involved.

For MS SQL Server, you might want to start by understanding pages, the fundamental unit of storage (see http://msdn.microsoft.com/en-us/library/ms190969.aspx).

When comparing the performance of fixed and variable-length storage types, there are a number of points to consider:

  • Using variable length columns can improve performance as it allows more rows to fit on a single page, meaning fewer reads
  • Using variable length columns requires special offset values, and maintaining these values adds a slight overhead; however, this extra overhead is generally negligible.
  • Another potential cost is the cost of increasing the size of a column when the page containing that row is nearly full
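
The first two points can be made concrete with some rough arithmetic. The sketch below is a simplification, not SQL Server's exact row format: the page size, header size, slot entry and per-value offset are assumed round figures, and the helper names are hypothetical. It compares how many rows fit on one page when a 100-character column is stored at full width versus at an average actual length of 20 bytes plus offset bookkeeping.

```python
# Illustrative figures only (assumptions); real engines also store row
# headers, null bitmaps and other per-row metadata that are ignored here.
PAGE_SIZE = 8192    # bytes per page
PAGE_HEADER = 96    # bytes reserved for the page header
SLOT_ENTRY = 2      # bytes per row in the page's slot array
VAR_OFFSET = 2      # bytes of offset bookkeeping per variable-length value


def rows_per_page(row_bytes: int) -> int:
    """Number of rows of the given size that fit on a single page."""
    usable = PAGE_SIZE - PAGE_HEADER
    return usable // (row_bytes + SLOT_ENTRY)


def char_row_bytes(declared_len: int) -> int:
    """A fixed-width column always occupies its declared length."""
    return declared_len


def varchar_row_bytes(actual_len: int) -> int:
    """A variable-width column occupies its actual length plus an offset."""
    return actual_len + VAR_OFFSET


fixed = rows_per_page(char_row_bytes(100))      # every row padded to 100 bytes
variable = rows_per_page(varchar_row_bytes(20))  # rows average 20 bytes
print(f"char(100): {fixed} rows/page, varchar(100) at 20 bytes: {variable} rows/page")
```

Under these assumptions the variable-length layout packs several times as many rows into a page, which is why the small offset overhead is usually a good trade when actual lengths vary widely.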

As you can see, the situation is rather complex. Generally speaking, however, you can trust the database engine to be pretty good at dealing with variable data types, and they should be the data type of choice when the length of the data held in a column may vary significantly.

At this point I'm also going to recommend the excellent book "Microsoft SQL Server 2008 Internals" for some more insight into how complex things like this really get!

Justin answered Nov 10 '22 09:11


The answer will depend on the specific DBMS. For Oracle, it is certainly possible to end up with fragmentation in the form of "chained rows", and that incurs a performance penalty. You can mitigate that by pre-allocating some empty space in the table blocks to allow for expansion due to updates. On the other hand, CHAR columns will typically make the table much bigger, which has its own impact on performance. CHAR also has other issues, such as blank-padded comparisons, which mean that in Oracle the CHAR datatype is almost never a good idea.
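
To illustrate why reserving free space helps, here is a toy model, not Oracle's actual algorithm: a fixed-size block stops accepting inserts once a threshold is reached, and a row that no longer fits after an update is counted as relocated. The `Block` class, `count_relocations` function, block size and row lengths are all hypothetical choices made for the sketch.

```python
class Block:
    """A toy fixed-size storage block holding variable-length rows."""

    def __init__(self, size: int, pct_free: int) -> None:
        self.size = size
        # Inserts stop once used space would exceed this limit; the rest of
        # the block is held back for updates that make existing rows longer.
        self.insert_limit = size * (100 - pct_free) // 100
        self.rows = {}  # row id -> current length in bytes

    def used(self) -> int:
        return sum(self.rows.values())

    def insert(self, row_id: int, length: int) -> bool:
        """Store a new row if it fits under the insert limit."""
        if self.used() + length > self.insert_limit:
            return False
        self.rows[row_id] = length
        return True

    def update(self, row_id: int, new_length: int) -> bool:
        """Return True if the row grew in place, False if it had to move."""
        growth = new_length - self.rows[row_id]
        if self.used() + growth > self.size:
            del self.rows[row_id]  # the row leaves this block
            return False
        self.rows[row_id] = new_length
        return True


def count_relocations(pct_free: int) -> int:
    """Fill a block with 20-byte rows, grow each to 25 bytes, count moves."""
    block = Block(size=1000, pct_free=pct_free)
    row_id = 0
    while block.insert(row_id, 20):
        row_id += 1
    moved = 0
    for rid in list(block.rows):
        if not block.update(rid, 25):
            moved += 1
    return moved


for pct in (0, 20):
    print(f"pct_free={pct}: {count_relocations(pct)} rows relocated")
```

In this model, a block packed completely full has to push some rows out as soon as values grow, while a block that kept a fifth of its space in reserve absorbs the same updates in place. The price is fewer rows per block at insert time, which is the trade-off described above.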

Tony Andrews answered Nov 10 '22 10:11