PostgreSQL: Difference between "bytea" and "bit varying" types

Tags: bit-manipulation, postgresql

The PostgreSQL types bytea and bit varying sound similar:

bytea stores binary strings.
bit varying stores strings of 1's and 0's.

The documentation does not mention a maximum size for either. Is it 1GB like character varying?

I have two separate use cases, both over a table with millions of rows:

Storing MD5 hashes

That would be a bytea with a length of 16 bytes or a bit(128). It would be used for:

Deduplication: Heavy use of GROUP BY, with an index I suppose.
Querying with WHERE md5 = for exact matches only.
Displaying as a hex string for human use.

Storing arbitrary binary data

Strings of binary data of varying length up to 4kB for:

Bitwise operations to find the strings matching a certain mask. Example at the end of this post.
Extracting some bytes, for instance get the integer value of the byte 14 in my string.
Some deduplication.

Working example of the bitwise operation, using bit varying. The mask is X'00FF00' and it returns only the row X'AAAAAA'. I shortened the strings for the example, but the operation would run over their full length, up to 4kB. Is it possible to do something similar with bytea?

CREATE TABLE test1 (mystring bit varying);
INSERT INTO test1 VALUES (X'AAAAAA'), (X'ABCABC');
SELECT * FROM test1 WHERE mystring & X'00FF00' = X'00AA00';

Which of bytea and bit varying is the more appropriate?

I saw that the UUID type is made to store exactly 16 bytes. Would it offer any advantage for storing the MD5s?

asked Oct 29 '14 16:10

Victor

2 Answers

In general, if you're not using bitwise operations you should be using bytea.

I store larger values in bytea and then convert substrings to bit varying for bitwise operations where possible, mostly because clients understand bytea much more consistently than bit varying and the I/O format is more compact.
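As a hedged illustration of that approach, the sketch below stores the question's sample values as bytea and works on one byte at a time. The table name test2 is made up for this example. It assumes the default hex input format for bytea literals; get_byte uses zero-based offsets, and casting an integer to bit(8) keeps its lowest 8 bits.

```sql
-- Hypothetical table mirroring test1 from the question, but using bytea.
CREATE TABLE test2 (mystring bytea);
INSERT INTO test2 VALUES ('\xAAAAAA'), ('\xABCABC');

-- Integer value of a single byte (offsets start at 0).
SELECT get_byte(mystring, 1) AS second_byte FROM test2;

-- Mask the second byte as an integer: 255 = 0xFF, 170 = 0xAA.
SELECT encode(mystring, 'hex') AS hex
FROM test2
WHERE (get_byte(mystring, 1) & 255) = 170;

-- The same byte converted to a bit string for bit-string operators.
SELECT (get_byte(mystring, 1)::bit(8) & B'11110000') AS high_nibble
FROM test2;
```

Only the AAAAAA row satisfies the WHERE clause, since the second byte of ABCABC is 0xCA. This handles one byte per expression; a mask spanning the full 4kB value would need one such test per byte of interest, which is where keeping the data as bit varying may be simpler.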

MD5 values should be stored as bytea. Bitwise operations on them make no sense, and you generally want to fetch them as binary.
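A minimal sketch of the MD5 use case from the question, assuming a hypothetical table documents with a column md5_hash (neither name comes from the original post). It uses the built-in md5(), decode() and encode() functions.

```sql
-- Hypothetical table; the CHECK constraint enforces the 16-byte MD5 length.
CREATE TABLE documents (
    id       bigserial PRIMARY KEY,
    md5_hash bytea NOT NULL CHECK (octet_length(md5_hash) = 16)
);
CREATE INDEX documents_md5_hash_idx ON documents (md5_hash);

-- md5() returns 32 hex digits as text; decode() turns them into 16 raw bytes.
INSERT INTO documents (md5_hash)
VALUES (decode(md5('hello world'), 'hex')),
       (decode(md5('hello world'), 'hex')),
       (decode(md5('another value'), 'hex'));

-- Exact-match lookup, which can use the index.
SELECT id FROM documents
WHERE md5_hash = decode(md5('hello world'), 'hex');

-- Deduplication report, shown as hex for human readers.
SELECT encode(md5_hash, 'hex') AS md5_hex, count(*) AS copies
FROM documents
GROUP BY md5_hash
HAVING count(*) > 1;
```

Converting to hex only at display time keeps the stored value at 16 bytes of data while still covering the "hex string for human use" requirement.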

I think bit varying really has two uses:

To store flags fields that are literally bit strings; and
As an interim data type for internal calculations
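The flags case might look like the following sketch. The table account_flags and the meaning of each bit are assumptions made up for illustration.

```sql
-- Hypothetical table with an 8-bit flags column.
CREATE TABLE account_flags (
    account_id integer PRIMARY KEY,
    flags      bit(8) NOT NULL DEFAULT B'00000000'
);
INSERT INTO account_flags VALUES (1, B'00000101'), (2, B'00000010');

-- Rows that have the third-lowest bit set.
SELECT account_id FROM account_flags
WHERE (flags & B'00000100') = B'00000100';

-- Set one more flag without touching the others.
UPDATE account_flags SET flags = flags | B'00001000' WHERE account_id = 2;
```

Both operands of & and | must have the same length, which is why every literal here is written with all 8 bits.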

For pretty much everything else, use bytea.

There's nothing stopping you storing a 4k bitfield if that's what it is, though.

answered Oct 17 '22 15:10

Craig Ringer

It appears the maximum length of bytea is 1 GB. [1]
For bitwise operations, use bit varying (see the explanation below).
For storing an MD5 hash, use bytea. It takes less storage than bit varying.
The benefit of UUID is that generated UUIDs are designed to be unique not only within your table, but also across your database or even across databases (even if you generate the UUID in your application). I think a 16-byte value written as hex without dashes is stored, compared and sorted more efficiently in a UUID column (see the comparison between bytea and UUID below).

For bitwise operations, use bit varying

If storage is a concern: bit varying takes more storage than bytea. If that is acceptable, compare the functions the two types offer:

bit varying vs bytea

As far as I can see, bit varying is more suitable for your bitwise operations, although bytea is the generally accepted way to store arbitrary data.
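The storage difference between the two types can be checked rather than assumed. The query below is a small sketch that measures the same 16 bytes in both representations with the built-in octet_length, bit_length and pg_column_size functions; the hex value is just sample data. Results are deliberately not quoted here, because sizes of values stored in a table can differ from sizes of literals.

```sql
-- The same 16 bytes as bytea and as bit(128); compare length and datum size.
SELECT octet_length('\x5eb63bbbe01eeed093cb22bb8f5acdc3'::bytea)        AS bytea_bytes,
       pg_column_size('\x5eb63bbbe01eeed093cb22bb8f5acdc3'::bytea)      AS bytea_datum_size,
       bit_length(X'5eb63bbbe01eeed093cb22bb8f5acdc3')                  AS bit_count,
       pg_column_size(X'5eb63bbbe01eeed093cb22bb8f5acdc3'::bit(128))    AS bit_datum_size;
```

For a decision about a real table, running pg_column_size against the actual columns after loading representative data would be more reliable than measuring literals.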

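The cited book [1] says that PostgreSQL offers a single bytea operator, concatenation: you can append one bytea value to another using the || operator.

It also says that you cannot compare two bytea values, even for equality or inequality. That does not hold for current PostgreSQL releases: bytea supports the usual comparison operators, which is what makes the bytea primary key in the comparison below possible. You can also convert a bytea value to another type using CAST(), and that opens up other operators. [1]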
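On current PostgreSQL releases, bytea values can be compared and sliced directly. The query below is a short sketch using only built-in operators and functions on literal sample values; it assumes the default hex input format for bytea.

```sql
SELECT '\xDEADBEEF'::bytea = '\xdeadbeef'::bytea    AS is_equal,      -- hex input is case-insensitive
       '\x00FF'::bytea < '\x0100'::bytea            AS sorts_before,  -- byte-wise ordering
       '\xDEAD'::bytea || '\xBEEF'::bytea           AS concatenated,
       octet_length('\xDEADBEEF'::bytea)            AS byte_count,
       substring('\xDEADBEEF'::bytea FROM 2 FOR 2)  AS middle_bytes,  -- positions start at 1
       get_byte('\xDEADBEEF'::bytea, 0)             AS first_byte;    -- offsets start at 0
```

Note the mixed conventions: substring counts positions from 1, while get_byte counts offsets from 0.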

Comparison between UUID and bytea

  -- uuid_generate_v4() comes from the uuid-ossp extension.
  -- random_bytea() is not a built-in function; it has to be defined by the user.
  CREATE TABLE u (uuid uuid PRIMARY KEY, payload character(300));
  CREATE TABLE b (bytea bytea PRIMARY KEY, payload character(300));

  INSERT INTO u
  SELECT uuid_generate_v4()
  FROM generate_series(1, 1000*1000);

  INSERT INTO b
  SELECT random_bytea(16)
  FROM generate_series(1, 1000*1000);

  VACUUM ANALYZE u;
  VACUUM ANALYZE b;

  -- Table size
  SELECT pg_size_pretty(pg_total_relation_size('u'));
  pg_size_pretty
  ----------------
  81 MB

  SELECT pg_size_pretty(pg_total_relation_size('b'));
  pg_size_pretty
  ----------------
  101 MB

  -- Speed comparison
  \timing on

  -- Common select
  SELECT * FROM u LIMIT 1000;
  Time: 1.433 ms

  SELECT * FROM b LIMIT 1000;
  Time: 1.396 ms

  -- Random select
  SELECT * FROM u OFFSET random()*1000 LIMIT 10000;
  Time: 42.453 ms

  SELECT * FROM b OFFSET random()*1000 LIMIT 10000;
  Time: 10.962 ms
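The session above calls random_bytea, which is not part of PostgreSQL, and the answer does not show how it was defined. One possible definition is sketched below; it is an assumption for reproducing the test, not the author's code. uuid_generate_v4 also needs the uuid-ossp extension to be installed first.

```sql
-- Needed for uuid_generate_v4().
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

-- Hypothetical helper: returns len random bytes (not cryptographically secure).
CREATE OR REPLACE FUNCTION random_bytea(len integer) RETURNS bytea
LANGUAGE sql VOLATILE AS $$
    -- Build 2 hex digits per byte, then decode the hex string into bytea.
    SELECT decode(
               string_agg(lpad(to_hex(floor(random() * 256)::int), 2, '0'), ''),
               'hex')
    FROM generate_series(1, len);
$$;

-- Quick check: the result should be 16 bytes long.
SELECT octet_length(random_bytea(16)) AS byte_count;
```

With a million rows of 16 random bytes, a primary-key collision is extremely unlikely but not impossible, so a rerun may be needed in the rare case that the INSERT fails on a duplicate.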

Conclusion: I don't think UUID offers any benefit beyond its uniqueness and smaller size (which should make inserts faster).
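For the question's specific idea of keeping MD5 hashes in a uuid column, the sketch below shows that the conversion itself is possible, using only built-in functions. It relies on the uuid input format accepting 32 hex digits without dashes. Treat it as a storage trick: the resulting value is not a UUID generated by any UUID algorithm.

```sql
-- uuid input accepts 32 hex digits without dashes, so md5() text casts directly.
SELECT md5('hello world')::uuid AS md5_as_uuid;

-- Back to the plain 32-digit hex form for display.
SELECT replace(md5('hello world')::uuid::text, '-', '') AS md5_hex;

-- Compare the datum sizes of the two representations.
SELECT pg_column_size(md5('hello world')::uuid)            AS uuid_size,
       pg_column_size(decode(md5('hello world'), 'hex'))   AS bytea_size;
```

A uuid is fixed-length, whereas bytea is variable-length and carries a length header, which is consistent with the smaller table size measured above.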

Note: there are no indexes other than the primary keys, and only one connection was used.

Source:

[1] PostgreSQL: The Comprehensive Guide to Building, Programming, and Administering PostgreSQL Databases (book)

answered Oct 17 '22 14:10

Bagus Trihatmaja
