It is perfectly acceptable to use a varchar column as the primary key. This is often the case when one uses a natural key that doesn't happen to be an integer.
Int comparisons are faster than varchar comparisons, for the simple fact that ints take up much less space than varchars. This holds true both for unindexed and indexed access. The fastest way to go is an indexed int column.
By itself, a primary key does not have a direct effect on performance. But indirectly, it does, because when you add a primary key to a table, SQL Server creates a unique index (clustered by default) that is used to enforce entity integrity.
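As a minimal sketch (the table and column names here are hypothetical), this is the implicit index in SQL Server:

-- Declaring a PRIMARY KEY makes SQL Server create a unique clustered
-- index behind the scenes, which is what joins and lookups actually use.
CREATE TABLE dbo.Country (
    CountryCode char(2) NOT NULL PRIMARY KEY,  -- unique clustered index created here
    CountryName varchar(100) NOT NULL
);
-- The same thing stated explicitly:
-- ALTER TABLE dbo.Country
--     ADD CONSTRAINT PK_Country PRIMARY KEY CLUSTERED (CountryCode);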
Making the primary key an integer is easier to manage and makes its index more efficient. As you know, primary keys are indexed automatically, and the index is stored as a B-tree, where fixed-size integer keys are cheap to compare while traversing. There is no restriction forcing a key to be an int; you can declare it as a varchar too.
You make a good point that you can avoid some number of joined queries by using what's called a natural key instead of a surrogate key. Only you can assess if the benefit of this is significant in your application.
That is, measure the queries in your application that most need to be fast, because they work with large volumes of data or are executed very frequently. If those queries benefit from eliminating a join, and do not suffer by using a varchar primary key, then do it.
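As a hedged illustration with made-up tables, this is the join a natural key can eliminate:

-- Surrogate-key design: filtering orders by country needs a join
-- to resolve the ISO code.
SELECT o.order_id
FROM orders o
JOIN countries c ON o.country_id = c.id
WHERE c.iso_code = 'US';

-- Natural-key design: orders stores the ISO code itself, so no join.
SELECT o.order_id
FROM orders o
WHERE o.country_code = 'US';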
Don't use either strategy for all tables in your database. It's likely that in some cases, a natural key is better, but in other cases a surrogate key is better.
Other folks make a good point that it's rare in practice for a natural key to never change or have duplicates, so surrogate keys are usually worthwhile.
It's not about performance. It's about what makes a good primary key: unique and unchanging over time. You may think an entity such as a country code never changes over time and would be a good candidate for a primary key. But bitter experience shows that is seldom so.
INT AUTO_INCREMENT meets the "unique and unchanging over time" condition. Hence the preference.
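A minimal sketch of that preference (hypothetical table): the surrogate stays put even if the natural attribute has to change.

-- INT AUTO_INCREMENT surrogate: unique and unchanging over time.
CREATE TABLE country (
    id       INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    iso_code CHAR(2) NOT NULL UNIQUE,  -- natural key stays unique, but can be updated freely
    name     VARCHAR(100) NOT NULL
) ENGINE=InnoDB;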
I was a bit annoyed by the lack of benchmarks for this online, so I ran a test myself.
Note though that I don't do this on a regular basis, so please check my setup and steps for any factors that could have influenced the results unintentionally, and post your concerns in comments.
The setup was as follows:
The tables:
create table jan_int (data1 varchar(255), data2 int(10), myindex tinyint(4)) ENGINE=InnoDB;
create table jan_int_index (data1 varchar(255), data2 int(10), myindex tinyint(4), INDEX (myindex)) ENGINE=InnoDB;
create table jan_char (data1 varchar(255), data2 int(10), myindex char(6)) ENGINE=InnoDB;
create table jan_char_index (data1 varchar(255), data2 int(10), myindex char(6), INDEX (myindex)) ENGINE=InnoDB;
create table jan_varchar (data1 varchar(255), data2 int(10), myindex varchar(63)) ENGINE=InnoDB;
create table jan_varchar_index (data1 varchar(255), data2 int(10), myindex varchar(63), INDEX (myindex)) ENGINE=InnoDB;
Then, I filled 10 million rows in each table with a PHP script whose essence is like this:
$pdo = get_pdo();  // get_pdo() and generate_random_string() are helper functions
$keys = ['alabam', 'massac', 'newyor', 'newham', 'delawa', 'califo', 'nevada', 'texas_', 'florid', 'ohio__'];
for ($k = 0; $k < 10; $k++) {
    for ($j = 0; $j < 1000; $j++) {
        $val = '';
        for ($i = 0; $i < 1000; $i++) {
            // each row: a 10-char random string, a random int, and a 6-char state key
            $val .= '("' . generate_random_string() . '", ' . rand(0, 10000) . ', "' . ($keys[rand(0, 9)]) . '"),';
        }
        $val = rtrim($val, ',');
        $pdo->query('INSERT INTO jan_char VALUES ' . $val);
    }
    echo "\n" . ($k + 1) . ' million(s) rows inserted.';
}
For int tables, the bit ($keys[rand(0, 9)]) was replaced with just rand(0, 9), and for varchar tables I used full US state names, without cutting or extending them to 6 characters. generate_random_string() generates a 10-character random string.
Then I ran in MySQL:
SET SESSION query_cache_type=0;
For the jan_int table:
SELECT count(*) FROM jan_int WHERE myindex = 5;
SELECT BENCHMARK(1000000000, (SELECT count(*) FROM jan_int WHERE myindex = 5));
with myindex = 'califo' for char tables and myindex = 'california' for varchar tables. Times of the BENCHMARK query on each table:
Regarding table and index sizes, here's the output of show table status from janperformancetest; (with a few columns not shown):
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Name | Engine | Version | Row_format | Rows | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Collation |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| jan_int | InnoDB | 10 | Dynamic | 9739094 | 43 | 422510592 | 0 | 0 | 4194304 | NULL | utf8mb4_unicode_520_ci |
| jan_int_index | InnoDB | 10 | Dynamic | 9740329 | 43 | 420413440 | 0 | 132857856 | 7340032 | NULL | utf8mb4_unicode_520_ci |
| jan_char | InnoDB | 10 | Dynamic | 9726613 | 51 | 500170752 | 0 | 0 | 5242880 | NULL | utf8mb4_unicode_520_ci |
| jan_char_index | InnoDB | 10 | Dynamic | 9719059 | 52 | 513802240 | 0 | 202342400 | 5242880 | NULL | utf8mb4_unicode_520_ci |
| jan_varchar | InnoDB | 10 | Dynamic | 9722049 | 53 | 521142272 | 0 | 0 | 7340032 | NULL | utf8mb4_unicode_520_ci |
| jan_varchar_index | InnoDB | 10 | Dynamic | 9738381 | 49 | 486539264 | 0 | 202375168 | 7340032 | NULL | utf8mb4_unicode_520_ci |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
My conclusion is that there's no performance difference for this particular use case.
It depends on the length. If the varchar key is 20 bytes and the int is 4, then by using the int your index will have five times as many entries per page of index space on disk, which means traversing the index will require roughly one fifth as many physical and/or logical reads.
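As a rough worked example (assuming InnoDB's default 16 KB page for concreteness, and ignoring per-entry overhead and child pointers): 16384 / 4 ≈ 4096 int keys fit on a page, versus 16384 / 20 ≈ 819 twenty-byte varchar keys, i.e. about five times as many.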
So, if performance is an issue, given the opportunity, always use an integral, non-meaningful key (called a surrogate) for your tables, and for the foreign keys that reference the rows in those tables.
At the same time, to guarantee data consistency, every table where it matters should also have a meaningful, non-numeric alternate key (or unique index) to ensure that duplicate rows cannot be inserted (duplicate based on meaningful table attributes).
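A sketch of that pattern (hypothetical MySQL tables): the surrogate key carries the foreign keys and joins, while the alternate key blocks duplicates.

-- Surrogate key for joins and FKs; alternate unique key for consistency.
CREATE TABLE state (
    state_id   INT NOT NULL AUTO_INCREMENT PRIMARY KEY, -- surrogate
    state_code CHAR(2) NOT NULL,
    UNIQUE KEY uq_state_code (state_code)               -- alternate key: no duplicate states
) ENGINE=InnoDB;

CREATE TABLE address (
    address_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    state_id   INT NOT NULL,                            -- 4-byte FK instead of text
    CONSTRAINT fk_address_state
        FOREIGN KEY (state_id) REFERENCES state (state_id)
) ENGINE=InnoDB;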
For the specific use you are talking about (like state lookups), it really doesn't matter, because the table is so small. In general, there is no performance impact from indexes on tables with fewer than a few thousand rows.
Absolutely not.
I have done several... several... performance checks between INT, VARCHAR, and CHAR.
10 million record table with a PRIMARY KEY (unique and clustered) had the exact same speed and performance (and subtree cost) no matter which of the three I used.
That being said... use whatever is best for your application. Don't worry about the performance.
For short codes, there's probably no difference. This is especially true because the table holding these codes is likely to be very small (a couple thousand rows at most) and not to change often (when was the last time we added a new US state?).
For larger tables with wider variation among the keys, this can be dangerous. Think about using the e-mail address/user name from a User table, for example. What happens when you have a few million users, and some of those users have long names or e-mail addresses? Now, any time you need to join to this table using that key, it becomes much more expensive.
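A hypothetical illustration (made-up schema): both queries return the same rows, but the first compares a wide varchar on every joined row and stores that wide value in every index entry.

SELECT p.title
FROM posts p
JOIN users u ON p.user_email = u.email   -- long varchar compared per row
WHERE u.signup_date > '2024-01-01';

SELECT p.title
FROM posts p
JOIN users u ON p.user_id = u.id         -- 4-byte int compared per row
WHERE u.signup_date > '2024-01-01';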