Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PostgreSQL: Difference between "bytea" and "bit varying" types

The PostgreSQL types bytea and bit varying sound similar:

  • bytea stores binary strings.
  • bit varying stores strings of 1's and 0's.

The documentation does not mention a maximum size for either. Is it 1GB like character varying?

I have two separate use cases, both over a table with millions of rows:

Storing MD5 hashes

That would be a bytea with a length of 16 bytes or a bit(128). It would be used for:

  • Deduplication: Heavy use of GROUP BY, with an index I suppose.
  • Querying with WHERE md5 = for exact matches only.
  • Displaying as a hex string for human use.

Storing arbitrary binary data

Strings of binary data of varying length up to 4kB for:

  • Bitwise operations to find the strings matching a certain mask. Example at the end of this post.
  • Extracting some bytes, for instance get the integer value of the byte 14 in my string.
  • Some deduplication.

Working example for the bitwise operation, using bit varying. The mask is X'00FF00' and the it returns only the row X'AAAAAA'. I shortened the strings for the example but it would be over their full length, up to 4kB. Is it possible to do something similar with bytea?

CREATE TABLE test1 (mystring bit varying);
INSERT INTO test1 VALUES (X'AAAAAA'), (X'ABCABC');
SELECT * FROM test1 WHERE mystring & X'00FF00' = X'00AA00';

Which of bytea and bit varying is the more appropriate?

I saw the UUID type is made to store exactly 16 bytes, would that be any advantage to store the MD5's?

like image 798
Victor Avatar asked Oct 29 '14 16:10

Victor


People also ask

What is Bytea type in Postgres?

The bytea data type allows the storage of binary strings or what is typically thought of as “raw bytes”. Materialize supports both the typical formats for input and output: the hex format and the historical PostgreSQL escape format. The hex format is preferred.

What is type Bytea?

The BYTEA data type allows storage of binary strings. It stores a LOB within the table, respectively using TOAST. It is thus limited to 1 GB. The storage is octal and allows non printable characters (in contrast to character strings which don't). The input/output format is HEX (as of PostgreSQL 9.0).

What is serial data type in Postgres?

PostgreSQL has a special kind of database object generator called SERIAL. It is used to generate a sequence of integers which are often used as the Primary key of a table. Syntax: variable_name SERIAL.

What is integer [] in PostgreSQL?

Integer ( INT ) is a 4-byte integer that has a range from -2,147,483,648 to 2,147,483,647. Serial is the same as integer except that PostgreSQL will automatically generate and populate values into the SERIAL column. This is similar to AUTO_INCREMENT column in MySQL or AUTOINCREMENT column in SQLite.


Video Answer


2 Answers

In general, if you're not using bitwise operations you should be using bytea.

I store larger values in bytea and then convert substrings to bit varying for bitwise operations where possible, mostly because clients understand bytea much more consistently than bit varying and the I/O format is more compact.

MD5 values should be stored as bytea. Bitwise operations on them make no sense, and you generally want to fetch them as binary.

I think bit varying really has two uses:

  • To store flags fields that are literally bit strings; and
  • As an interim data type for internal calculations

For pretty much everything else, use bytea.

There's nothing stopping you storing a 4k bitfield if that's what it is, though.

like image 70
Craig Ringer Avatar answered Oct 17 '22 15:10

Craig Ringer


  1. It appears the maximum length of bytea is 1 GB. [1]
  2. For bitwise operation use bit varying (explanation see below)
  3. For storing MD5 hash use bytea. It will take less storage than bit varying
  4. The benefit using UUID is UUID algorithm somehow guarantees your uniqueness, not only in your table, but also in your database or even across your database (even if you generate UUID in your application). I think if you are using UUID without dashes it will be more efficient for storing, comparing and sorting in UUID (comparison between bytea and UUID see below).

For bitwise operation use bit varying

If you concern about storage: bit varying takes more storage than bytea. If you are okay then you should try comparing the function they both offer:

bit varying vs bytea

So far I can see bit varying will be more suitable for you to do bitwise operation though bytea is generally accepted way to store arbitrary data.

PostgreSQL offers a single bytea operator: concatenation. You can append one byte value to another bytea value using the concatenation operator ||. [1]

Note that you cannot compare two bytea value, even for equality/inequality. You can, of course, convert bytea value into another value using the CAST(), and that opens up other operators. [1]

Comparison between UUID and bytea

  create table u(uuid uuid primary key, payload character(300));
  create table b( bytea bytea primary key, payload character(300));

  INSERT INTO u                                                  
  SELECT uuid_generate_v4()                                                     
  FROM generate_series(1,1000*1000);

  INSERT INTO b                                                   
  SELECT random_bytea(16)                                                       
  FROM generate_series(1,1000*1000);

  VACUUM ANALYZE u;
  VACUUM ANALYZE b;

  ## Your table size
  SELECT pg_size_pretty(pg_total_relation_size('u'));
  pg_size_pretty 
  ---------------- 
  81 MB

  SELECT pg_size_pretty(pg_total_relation_size('b'));
  pg_size_pretty 
  ---------------- 
  101 MB

  ## Speed comparison
  \timing on

  ## Common select
  select * from u limit 1000;
  Time: 1.433 ms

  select * from b limit 1000;
  Time: 1.396 ms

  ## Random Select
  SELECT * FROM u OFFSET random()*1000 LIMIT 10000;
  Time: 42.453 ms

  SELECT * FROM b OFFSET random()*1000 LIMIT 10000;
  Time: 10.962 ms

Conclusion : I don't think there will be more benefit using UUID except its uniqueness and smaller size (will be faster to insert)

Note: No Index, there is only one connection

Some source :

  1. PostgreSQL: "The Comprehensive Guide to Building, Programming, And Administratoring PostgreSQL Databases" Book
like image 9
Bagus Trihatmaja Avatar answered Oct 17 '22 14:10

Bagus Trihatmaja