For various reasons that don't matter here we are storing chunks of text in either an encrypted or base64 encoded format in PostgreSQL. However, we want to be able to use PostgreSQL's fulltext search to find and return data which in its unencrypted/decoded form matches a search query.
How would one go about accomplishing this? I've seen other posts mention the ability to build the tsvector values before sending data to the database, but I was hoping there would be something available on the Postgres end of things (at least for the base64 text).
The most basic approach to searching through encrypted data is to download the data to the client's computer, decrypt it locally, and then search for the desired results in the plaintext data. For many reasons, this approach is neither practical nor scalable due to the possible significant data size.
In PostgreSQL, you use two functions to perform Full Text Search. They are to_tsvector() and to_tsquery(). Let's see how they work and to use them first. to_tsvector() function breaks up the input string and creates tokens out of it, which is then used to perform Full Text Search using the to_tsquery() function.
Yes, You Can Keep Full-Text Search in Postgres You can get even deeper and make your Postgres full-text search even more robust, by implementing features such as highlighting results, or writing your own custom dictionaries or functions.
PostgreSQL offers encryption at several levels, and provides flexibility in protecting data from disclosure due to database server theft, unscrupulous administrators, and insecure networks. Encryption might also be required to secure sensitive data such as medical records or financial transactions.
For encrypted values you can't. Even if you created the tsvector
client-side, the tsvector would contain a form of the encrypted text so it wouldn't be acceptable for most applications. Observe:
regress=> SELECT to_tsvector('my secret password is CandyStrip3r');
to_tsvector
------------------------------------------
'candystrip3r':5 'password':3 'secret':2
(1 row)
... whoops. It doesn't matter if you create that value client side instead of using to_tsvector
, it'll still have your password in cleartext. You could encrypt the tsvector, but then you couldn't use it for fulltext seach.
Sure, given the encrypted value:
CREATE EXTENSION pgcrypto;
regress=> SELECT encrypt( convert_to('my s3kritPassw1rd','utf-8'), '\xdeadbeef', 'aes');
encrypt
--------------------------------------------------------------------
\x10441717bfc843677d2b76ac357a55ac5566ffe737105332552f98c2338480ff
(1 row)
you can (but shouldn't) do something like this:
regress=> SELECT to_tsvector( convert_from(decrypt('\x10441717bfc843677d2b76ac357a55ac5566ffe737105332552f98c2338480ff', '\xdeadbeef', 'aes'), 'utf-8') );
to_tsvector
--------------------
's3kritpassw1rd':2
(1 row)
... but if the problems with that aren't immediately obvious after scrolling right in the code display box then you should really be getting somebody else to do your security design for you ;-)
There's been tons of research on ways to perform operations on encrypted values without decrypting them, like adding two encrypted numbers together to produce a result that's encrypted with the same key, so the process doing the adding doesn't need the ability to decrypt the inputs in order to get the output. It's possible some of this could be applied to fts - but it's way beyond my level of expertise in the area and likely to be horribly inefficient and/or cryptographically weak anyway.
For base64 you decode
the base64 before feeding it into to_tsvector
. Because decode
returns a bytea
and you know the encoded data is text you need to use convert_from
to decode the bytea
into text
in the database encoding, eg:
regress=> SELECT encode(convert_to('some text to search','utf-8'), 'base64');
encode
------------------------------
c29tZSB0ZXh0IHRvIHNlYXJjaA==
(1 row)
regress=> SELECT to_tsvector(convert_from( decode('c29tZSB0ZXh0IHRvIHNlYXJjaA==', 'base64'), getdatabaseencoding() ));
to_tsvector
---------------------
'search':4 'text':2
(1 row)
In this case I've used the database encoding as the input to convert_from
, but you need to make sure you use the encoding that the underlying base64
encoded text was in. Your application is responsible for getting this right. I suggest either storing the encoding in a 2nd column or ensuring that your application always encodes the text as utf-8
before applying base64
encoding.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With