 

Postgres Large Text Search Advice

Tags:

postgresql

I'm quite new to databases, and am looking for some high level advice.

The Situation
I'm building a database using Postgres 9.3, within the database is a table in which I store log files.

CREATE TABLE errorlogs (
     id SERIAL PRIMARY KEY,
     archive_id INTEGER NOT NULL REFERENCES archives,
     filename VARCHAR(256) NOT NULL,
     content TEXT);

The text in content can vary in length anywhere from 1 kB to 50 MB.

The Problem
I'd like to be able to perform reasonably fast text searches on the data within the "content" column (e.g., WHERE content LIKE '%some_error%'). Right now the searches are very slow (>10 minutes to search through 8206 rows).

I know that indexing is intended to be the solution to my problem, but I don't seem to be able to create indexes -- whenever I try I get errors that the index would be too large.

=# CREATE INDEX error_logs_content_idx ON errorlogs (content text_pattern_ops);
ERROR: index row requires 1796232 bytes, maximum size is 8191

I was hoping for some advice on how to get around this problem. Can I change the maximum index size? Or should I not be trying to use Postgres for full text search on text fields as large as this?

Any advice is much appreciated!

asked Jan 15 '15 by JBeFat

1 Answer

Text search vectors can't handle data this big; see the documented limits. Their strength is fuzzy searching: you can search for 'swim' and find 'swim', 'swimming', 'swam', and 'swum' in the same call. They are not meant to replace grep.
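
As a small illustration (not from the original answer), the stock english configuration stems both the document and the query, so 'swimming' and 'swim' reduce to the same lexeme; exact behavior depends on the dictionary in use:

-- Stemming demo: both sides reduce to the lexeme 'swim', so this returns true
SELECT to_tsvector('english', 'She was swimming laps')
       @@ to_tsquery('english', 'swim');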

The reasons for the limits are in the source code as MAXSTRLEN (and MAXSTRPOS). Text search vectors are stored in one long, continuous array up to 1 MiB in length (the total of all characters for all unique lexemes). To access these, the tsvector index structure allows 11 bits for a word's length and 20 bits for its position in the array. These limits let the index structure fit into a 32-bit unsigned int.
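
You can hit the 1 MiB limit directly by feeding to_tsvector more than roughly 1 MiB of unique lexemes (a hedged sketch; the exact error wording varies by version):

-- Roughly 2 MiB of unique "words": to_tsvector rejects this with an error
-- along the lines of "string is too long for tsvector"
SELECT to_tsvector(string_agg(md5(i::text), ' '))
FROM generate_series(1, 70000) AS i;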

You are probably running into one or both of these limits if a file has too many unique words or if words are repeated very frequently, which is quite possible with a 50 MB log file of quasi-random data.

Are you sure you need to store log files in a database? You're basically replicating the file system, and grep or Python can do the searching there quite nicely. If you really need to, though, you might consider this:

CREATE TABLE errorlogs (
    id SERIAL PRIMARY KEY
    , archive_id INTEGER NOT NULL REFERENCES archives
    , filename VARCHAR(256) NOT NULL
);

CREATE TABLE log_lines (
    line SERIAL PRIMARY KEY
    , errorlog INTEGER REFERENCES errorlogs(id)
    , context TEXT
    , tsv TSVECTOR
);

CREATE INDEX log_lines_tsv_idx ON log_lines USING gin( tsv );
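
The answer doesn't show how tsv gets filled in. One option on 9.3, where generated columns aren't available, is the built-in tsvector_update_trigger; this is a sketch (the trigger name is made up, column names as above):

-- Keep tsv in sync with the context column on every insert/update;
-- arguments are: tsvector column, text search configuration, source column(s)
CREATE TRIGGER log_lines_tsv_update
    BEFORE INSERT OR UPDATE ON log_lines
    FOR EACH ROW
    EXECUTE PROCEDURE tsvector_update_trigger(tsv, 'pg_catalog.english', context);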

Here, you treat each log line as a "document." To search, you'd do something like

SELECT e.id, e.filename, g.line, g.context
FROM errorlogs e JOIN log_lines g ON e.id = g.errorlog 
WHERE g.tsv @@ to_tsquery('some & error');
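
Loading a file into this schema means splitting its text into lines; a hypothetical load step (column names as above, errorlog id 1 assumed) could use regexp_split_to_table:

-- Split one file's raw text into per-line rows; with the trigger above in
-- place, tsv is populated automatically
INSERT INTO log_lines (errorlog, context)
SELECT 1, l
FROM regexp_split_to_table(E'first line\nsecond line', E'\n') AS l;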
answered Sep 22 '22 by afs76