
PostgreSQL using UUID vs Text as primary key

Our current PostgreSQL database uses GUIDs as primary keys and stores them in text columns.

My initial reaction is that even a minimal Cartesian join would be an indexing nightmare when it comes to finding all the matching records. However, perhaps my limited understanding of database indexing is wrong here.

I'm thinking that we should be using the uuid type, since it stores a binary representation of the GUID whereas text does not, and the amount of indexing you get on a text column is minimal.

It would be a significant project to change these, and I'm wondering whether it would be worth it.

Scottie asked Nov 20 '15 21:11




2 Answers

When dealing with UUIDs, store them as data type uuid. Always. There is simply no good reason to even consider text as an alternative. Input and output are done via the text representation by default anyway. The cast is very cheap.

The data type text requires more space in RAM and on disk, is slower to process, and is more error-prone. @khampson's answer provides most of the rationale. Oddly, he doesn't seem to arrive at the same conclusion.
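To illustrate the "error prone" point: a text column accepts any string, so the same GUID can end up stored in several spellings, and malformed values go unnoticed. The sketch below uses Python's standard uuid module as a stand-in for what a typed column gives you. The helper name normalize and the sample values are made up for illustration, and PostgreSQL's own parsing rules may differ in detail.

```python
import uuid

def normalize(value: str) -> str:
    """Parse a string as a UUID and return its canonical lowercase form."""
    return str(uuid.UUID(value))

a = "A0EEBC99-9C0B-4EF8-BB6D-6BB9BD380A11"
b = "a0eebc999c0b4ef8bb6d6bb9bd380a11"  # same value, different spelling

print(a == b)                        # False: compared as plain text
print(normalize(a) == normalize(b))  # True: compared as UUID values

# A typed value rejects garbage up front; a text column would store it.
try:
    normalize("not-a-guid")
    rejected = False
except ValueError:
    rejected = True
print(rejected)                      # True
```

With text keys, that kind of normalization and validation has to be enforced by every writer (or by an extra constraint); with a UUID type it comes with the column.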

This has all been asked and answered and discussed before. Related questions on dba.SE with detailed explanation:

  • Would index lookup be noticeably faster with char vs varchar when all values are 36 chars
  • What is the optimal data type for an MD5 field?

bigint?

Maybe you don't need UUIDs (GUIDs) at all. Consider bigint instead. It only occupies 8 bytes and is faster in every respect. Its range is often underestimated:

-9223372036854775808 to +9223372036854775807 

That's 9.2 million million million positive numbers. In words: roughly nine quintillion two hundred twenty-three quadrillion three hundred seventy-two trillion thirty-six billion.

If you burn 1 million IDs per second (which is an insanely high number) you can keep doing so for 292471 years. And then another 292471 years for negative numbers. "Tens or hundreds of millions" is not even close.
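The arithmetic behind that figure can be checked with a few lines of Python. This is only a sketch, assuming 365-day years and a constant rate of one million IDs per second; the constant names are mine.

```python
BIGINT_MAX = 2**63 - 1                 # 9223372036854775807, upper bound of bigint
IDS_PER_SECOND = 1_000_000             # the "insanely high" rate from the text
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: 365-day years, no leap days

seconds_until_exhausted = BIGINT_MAX // IDS_PER_SECOND
years = seconds_until_exhausted // SECONDS_PER_YEAR

print(years)  # 292471
```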

UUID is really just for distributed systems and other special cases.

Erwin Brandstetter answered Sep 24 '22 13:09



As @Kevin mentioned, the only way to know for sure with your exact data would be to compare and contrast both methods, but from what you've described, I don't see why this would be different from any other case where a string was either the primary key in a table or part of a unique index.

What can be said up front is that your indexes will probably be larger, since they have to store larger string values, and in theory the index comparisons will take a bit longer, but I wouldn't advocate premature optimization if doing so would be painful.

In my experience, I have seen very good performance on a unique index using md5sums on a table with billions of rows. I have found that other factors about a query tend to cause the performance issues. For example, when you end up needing to query a very large swath of the table, say hundreds of thousands of rows, a sequential scan ends up being the better choice, so that's what the query planner chooses, and it can take much longer.

There are other mitigating strategies for that type of situation, such as chunking the query and then UNIONing the results (i.e. a manual simulation of the sort of thing that would be done in Hive or Impala in the Hadoop sphere).

Re: your concern about indexing of text, while I'm sure there are some cases where a dataset produces a key distribution that performs terribly, GUIDs, much like md5sums, SHA-1 hashes, etc., should index quite well in general and not require sequential scans (unless, as I mentioned above, you query a huge swath of the table).

One of the big factors in how an index performs is how many unique values there are. For that reason, a boolean index on a table with a large number of rows isn't likely to help, since each of the values in the index (true, false, and potentially NULL) matches a huge number of rows. A GUID index, on the other hand, is likely to have a huge number of distinct values with no collisions (by definition, at least in theory, since they are GUIDs).
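A rough way to picture the difference is the share of distinct values in the column. The sketch below uses plain Python lists as stand-ins for two columns; distinct_ratio is a made-up helper for illustration, not how PostgreSQL gathers its statistics.

```python
import uuid

def distinct_ratio(values):
    """Share of distinct values; close to 1.0 means nearly every row is unique."""
    return len(set(values)) / len(values)

n = 10_000
bool_column = [i % 2 == 0 for i in range(n)]    # only two possible values
guid_column = [uuid.uuid4() for _ in range(n)]  # random version-4 UUIDs

print(distinct_ratio(bool_column))  # 0.0002: each value matches half the table
print(distinct_ratio(guid_column))  # 1.0: each value matches a single row
```

An index lookup on the boolean column still leaves half the table to read, while a lookup on the GUID column narrows it down to one row.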

Edit in response to comment from OP:

So are you saying that a UUID guid is the same thing as a Text guid as far as the indexing goes? Our entire table structure is using Text fields with a guid-like string, but I'm not sure Postgre recognizes it as a Guid. Just a string that happens to be unique.

Not literally the same, no. However, I am saying that they should have very similar performance for this particular case, and I don't see why optimizing up front is worth doing, especially given that you say to do so would be a very involved task.

You can always change things later if, in your specific environment, you run into performance problems. However, as I mentioned earlier, I think if you hit that scenario, there are other things that would likely yield better performance than changing the PK data types.

A UUID is a 128-bit data type (so, 16 bytes), whereas text has 1 or 4 bytes of overhead plus the actual length of the string. For a GUID, that would mean a minimum of 33 bytes (32 hex digits without hyphens, or 37 bytes in the usual hyphenated form), but could vary significantly depending on the encoding used.
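Those sizes are easy to verify with Python's standard uuid module. This is a sketch only: the sample value is arbitrary, and the 1-byte header is the short-string overhead assumed above, with one byte per character as in ASCII or UTF-8.

```python
import uuid

u = uuid.UUID("a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11")  # arbitrary sample value

binary_size = len(u.bytes)    # raw 128-bit value
hex_size = len(u.hex)         # hex digits only, no hyphens
canonical_size = len(str(u))  # usual hyphenated form

TEXT_HEADER = 1  # assumption: 1-byte length header for short text values

print(binary_size, hex_size + TEXT_HEADER, canonical_size + TEXT_HEADER)  # 16 33 37
```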

So, with that in mind, indexes on text-based UUIDs will certainly be larger since the values are larger, and comparing two strings is in theory less efficient than comparing two numerical values, but that is not likely to make a huge difference in this case, at least not in the usual cases.

I would not optimize up front when doing so would come at a significant cost and is likely never to be needed. That bridge can be crossed if that time does come (although I would pursue other query optimizations first, as I mentioned above).

Regarding whether Postgres knows the string is a GUID, it definitely does not by default. As far as it's concerned, it's just a unique string. But that should be fine for most cases, e.g. matching rows and such. If you find yourself needing some behavior that specifically requires a GUID (for example, some non-equality based comparisons where a GUID comparison may differ from a purely lexical one), then you can always cast the string to a UUID, and Postgres will treat the value as such during that query.

For example, for a text column foo, you can write foo::uuid to cast it to a uuid.
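As an illustration of how a lexical comparison can differ from a GUID comparison, here is a small Python sketch. The two keys are hypothetical and deliberately differ in letter case; Python's string ordering goes by code point, which may not match the collation your database uses for text.

```python
import uuid

# Hypothetical keys stored as text with inconsistent letter case.
keys = [
    "B0000000-0000-4000-8000-000000000001",
    "a0000000-0000-4000-8000-000000000002",
]

lexical = sorted(keys)                  # "B" (0x42) sorts before "a" (0x61)
by_value = sorted(keys, key=uuid.UUID)  # ordered by the 128-bit value

print(lexical[0])   # B0000000-0000-4000-8000-000000000001
print(by_value[0])  # a0000000-0000-4000-8000-000000000002
```

Equality lookups are unaffected as long as the stored strings are spelled consistently; it is range comparisons and ORDER BY where the two interpretations can diverge.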

There's also a module available for generating uuids, uuid-ossp.

khampson answered Sep 21 '22 13:09
