Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do you do a PostgreSQL fulltext search on encoded or encrypted data?

For various reasons that don't matter here we are storing chunks of text in either an encrypted or base64 encoded format in PostgreSQL. However, we want to be able to use PostgreSQL's fulltext search to find and return data which in its unencrypted/decoded form matches a search query.

How would one go about accomplishing this? I've seen other posts mention the ability to build the tsvector values before sending data to the database, but I was hoping there would be something available on the Postgres end of things (at least for the base64 text).

like image 876
Seth Avatar asked Mar 14 '13 21:03

Seth


People also ask

How do you query encrypted data?

The most basic approach to searching through encrypted data is to download the data to the client's computer, decrypt it locally, and then search for the desired results in the plaintext data. For many reasons, this approach is neither practical nor scalable due to the possible significant data size.

How do I do a full text search in PostgreSQL?

In PostgreSQL, you use two functions to perform Full Text Search. They are to_tsvector() and to_tsquery(). Let's see how they work and to use them first. to_tsvector() function breaks up the input string and creates tokens out of it, which is then used to perform Full Text Search using the to_tsquery() function.

Is PostgreSQL good for full text search?

Yes, You Can Keep Full-Text Search in Postgres You can get even deeper and make your Postgres full-text search even more robust, by implementing features such as highlighting results, or writing your own custom dictionaries or functions.

Does Postgres store data encrypted?

PostgreSQL offers encryption at several levels, and provides flexibility in protecting data from disclosure due to database server theft, unscrupulous administrators, and insecure networks. Encryption might also be required to secure sensitive data such as medical records or financial transactions.


1 Answers

Encrypted values

For encrypted values you can't. Even if you created the tsvector client-side, the tsvector would contain a form of the encrypted text so it wouldn't be acceptable for most applications. Observe:

regress=> SELECT to_tsvector('my secret password is CandyStrip3r');
               to_tsvector                
------------------------------------------
 'candystrip3r':5 'password':3 'secret':2
(1 row)

... whoops. It doesn't matter if you create that value client side instead of using to_tsvector, it'll still have your password in cleartext. You could encrypt the tsvector, but then you couldn't use it for fulltext seach.

Sure, given the encrypted value:

CREATE EXTENSION pgcrypto;

regress=> SELECT encrypt( convert_to('my s3kritPassw1rd','utf-8'), '\xdeadbeef', 'aes');
                              encrypt                               
--------------------------------------------------------------------
 \x10441717bfc843677d2b76ac357a55ac5566ffe737105332552f98c2338480ff
(1 row)

you can (but shouldn't) do something like this:

regress=> SELECT to_tsvector( convert_from(decrypt('\x10441717bfc843677d2b76ac357a55ac5566ffe737105332552f98c2338480ff', '\xdeadbeef', 'aes'), 'utf-8') );
    to_tsvector     
--------------------
 's3kritpassw1rd':2
(1 row)

... but if the problems with that aren't immediately obvious after scrolling right in the code display box then you should really be getting somebody else to do your security design for you ;-)

There's been tons of research on ways to perform operations on encrypted values without decrypting them, like adding two encrypted numbers together to produce a result that's encrypted with the same key, so the process doing the adding doesn't need the ability to decrypt the inputs in order to get the output. It's possible some of this could be applied to fts - but it's way beyond my level of expertise in the area and likely to be horribly inefficient and/or cryptographically weak anyway.

Base64-encoded values

For base64 you decode the base64 before feeding it into to_tsvector. Because decode returns a bytea and you know the encoded data is text you need to use convert_from to decode the bytea into text in the database encoding, eg:

regress=> SELECT encode(convert_to('some text to search','utf-8'), 'base64');
            encode            
------------------------------
 c29tZSB0ZXh0IHRvIHNlYXJjaA==
(1 row)

regress=> SELECT to_tsvector(convert_from( decode('c29tZSB0ZXh0IHRvIHNlYXJjaA==', 'base64'), getdatabaseencoding() ));
     to_tsvector     
---------------------
 'search':4 'text':2
(1 row)

In this case I've used the database encoding as the input to convert_from, but you need to make sure you use the encoding that the underlying base64 encoded text was in. Your application is responsible for getting this right. I suggest either storing the encoding in a 2nd column or ensuring that your application always encodes the text as utf-8 before applying base64 encoding.

like image 96
Craig Ringer Avatar answered Oct 31 '22 21:10

Craig Ringer