PostgreSQL using UUID vs Text as primary key

Tags: uuid, postgresql, primary-key

Our current PostgreSQL database uses GUIDs as primary keys and stores them in text columns.

My initial reaction is that performing even a minimal join on these keys would be an indexing nightmare when trying to find all the matching records. However, perhaps my limited understanding of database indexing is wrong here.

I'm thinking that we should be using the uuid type, since it stores a binary representation of the GUID whereas text does not, and my assumption is that the amount of indexing you get on a text column is minimal.

It would be a significant project to change these, and I'm wondering if it would be worth it?

asked Nov 20 '15 21:11

Scottie

2 Answers

When dealing with UUID numbers, store them as data type uuid. Always. There is simply no good reason to even consider text as an alternative. Input and output are done via the text representation by default anyway. The cast is very cheap.

The data type text requires more space in RAM and on disk, is slower to process and more error prone. @khampson's answer provides most of the rationale. Oddly, he doesn't seem to arrive at the same conclusion.
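To illustrate the "more error prone" point, here is a minimal sketch in Python rather than in PostgreSQL itself. It assumes Python's standard `uuid` module as a stand-in for the uuid type, and the UUID value is made up for the example. Several spellings of one UUID are all different strings, but they are the same value once parsed:

```python
import uuid

# Four spellings of one UUID (the value itself is made up for illustration).
spellings = [
    "a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11",
    "A0EEBC99-9C0B-4EF8-BB6D-6BB9BD380A11",
    "{a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11}",
    "a0eebc999c0b4ef8bb6d6bb9bd380a11",
]

# Compared as plain strings, these are four different keys.
distinct_as_text = len(set(spellings))

# Parsed into a 128-bit value, they collapse to a single key.
distinct_as_uuid = len({uuid.UUID(s) for s in spellings})

print(distinct_as_text, distinct_as_uuid)  # 4 1
```

A text primary key would happily accept all four as separate rows, while a typed column normalizes them on input.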

This has all been asked and answered and discussed before. Related questions on dba.SE with detailed explanation:

Would index lookup be noticeably faster with char vs varchar when all values are 36 chars
What is the optimal data type for an MD5 field?

`bigint`?

Maybe you don't need UUIDs (GUIDs) at all. Consider bigint instead. It only occupies 8 bytes and is faster in every respect. Its range is often underestimated:

-9223372036854775808 to +9223372036854775807

That's 9.2 millions of millions of millions positive numbers. IOW, nine quintillion two hundred twenty-three quadrillion three hundred seventy-two trillion thirty-six something billion.

If you burn 1 million IDs per second (which is an insanely high number) you can keep doing so for 292471 years. And then another 292471 years for negative numbers. "Tens or hundreds of millions" is not even close.
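The arithmetic can be checked with a few lines of Python. This is only a sketch of the calculation; it assumes a 365-day year, which is what reproduces the figure above:

```python
# Largest positive value of a signed 64-bit integer (bigint).
BIGINT_MAX = 2**63 - 1

IDS_PER_SECOND = 1_000_000
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: 365-day year

# Whole years until the positive range is used up at that rate.
years = BIGINT_MAX // (IDS_PER_SECOND * SECONDS_PER_YEAR)

print(BIGINT_MAX)  # 9223372036854775807
print(years)       # 292471
```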

UUID is really just for distributed systems and other special cases.

answered Sep 24 '22 13:09

Erwin Brandstetter

As @Kevin mentioned, the only way to know for sure with your exact data would be to compare and contrast both methods, but from what you've described, I don't see why this would be different from any other case where a string was either the primary key in a table or part of a unique index.

What can be said up front is that your indexes will probably be larger, since they have to store larger string values, and in theory the comparisons for the index will take a bit longer, but I wouldn't advocate premature optimization if to do so would be painful.

In my experience, I have seen very good performance on a unique index using md5sums on a table with billions of rows. I have found it tends to be other factors about a query which tend to result in performance issues. For example, when you end up needing to query over a very large swath of the table, say hundreds of thousands of rows, a sequential scan ends up being the better choice, so that's what the query planner chooses, and it can take much longer.

There are other mitigating strategies for that type of situation, such as chunking the query and then UNIONing the results (e.g. a manual simulation of the sort of thing that would be done in Hive or Impala in the Hadoop sphere).

Re: your concern about indexing of text, while I'm sure there are some cases where a dataset produces a key distribution such that it performs terribly, GUIDs, much like md5sums, sha1's, etc. should index quite well in general and not require sequential scans (unless, as I mentioned above, you query a huge swath of the table).

One of the big factors about how an index would perform is how many unique values there are. For that reason, a boolean index on a table with a large number of rows isn't likely to help, since it basically is going to end up having a huge number of row collisions for any of the values (true, false, and potentially NULL) in the index. A GUID index, on the other hand, is likely to have a huge number of values with no collision (in theory definitionally, since they are GUIDs).
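As a rough illustration of that difference, the sketch below counts how many rows share each distinct value in a boolean column versus a GUID column. It is plain Python over made-up, randomly generated data, not a model of how the PostgreSQL planner actually estimates selectivity:

```python
import random
import uuid

ROWS = 10_000
rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical columns: a boolean flag and a random 128-bit GUID per row.
flags = [rng.choice([True, False]) for _ in range(ROWS)]
guids = [uuid.UUID(int=rng.getrandbits(128)) for _ in range(ROWS)]

def avg_rows_per_value(column):
    """Average number of rows that share one distinct value."""
    return len(column) / len(set(column))

print(avg_rows_per_value(flags))  # thousands of rows per value
print(avg_rows_per_value(guids))  # one row per value
```

An index lookup on the flag still leaves roughly half the table to read, while a lookup on the GUID narrows to a single row.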

Edit in response to comment from OP:

So are you saying that a UUID guid is the same thing as a Text guid as far as the indexing goes? Our entire table structure is using Text fields with a guid-like string, but I'm not sure Postgre recognizes it as a Guid. Just a string that happens to be unique.

Not literally the same, no. However, I am saying that they should have very similar performance for this particular case, and I don't see why optimizing up front is worth doing, especially given that you say to do so would be a very involved task.

You can always change things later if, in your specific environment, you run into performance problems. However, as I mentioned earlier, I think if you hit that scenario, there are other things that would likely yield better performance than changing the PK data types.

A UUID is a 128-bit data type (so, 16 bytes), whereas text has 1 or 4 bytes of overhead plus the actual length of the string. For a GUID, that would mean a minimum of 33 bytes (32 hex digits plus 1 byte of overhead), or 37 bytes in the usual hyphenated form, and it could vary further depending on the encoding used.
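A quick way to see those sizes is to measure one value in Python. This is a sketch using the standard `uuid` module with a made-up UUID; the 1-byte header is the short-string overhead mentioned above, and on-disk alignment padding is ignored:

```python
import uuid

# Made-up value, for illustration only.
value = uuid.UUID("a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11")

TEXT_HEADER = 1  # assumption: the 1-byte overhead case for short strings

binary_size = len(value.bytes)                         # raw 128-bit value
text_size_plain = TEXT_HEADER + len(value.hex)         # 32 hex digits, no hyphens
text_size_hyphenated = TEXT_HEADER + len(str(value))   # usual 36-character form

print(binary_size, text_size_plain, text_size_hyphenated)  # 16 33 37
```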

So, with that in mind, indexes on text-based UUIDs will certainly be larger since the values are larger, and comparing two strings is in theory less efficient than comparing two numerical values, but that is not likely to make a huge difference in this case, at least not in usual cases.

I would not optimize up front when to do so would be a significant cost and is likely to never be needed. That bridge can be crossed if that time does come (although I would pursue other query optimizations first, as I mentioned above).

Regarding whether Postgres knows the string is a GUID, it definitely does not by default. As far as it's concerned, it's just a unique string. But that should be fine for most cases, e.g. matching rows and such. If you find yourself needing some behavior that specifically requires a GUID (for example, some non-equality based comparisons where a GUID comparison may differ from a purely lexical one), then you can always cast the string to a UUID, and Postgres will treat the value as such during that query.

e.g. for a text column foo, you can do foo::uuid to cast it to a uuid.

There's also a module available for generating uuids, uuid-ossp.

answered Sep 21 '22 13:09

khampson
