Full-text search in Postgres or CouchDB?

Tags: performance, indexing, full-text-search, postgresql, couchdb

I took geonames.org and imported all their data of German cities with all districts.

If I enter "Hamburg", it lists "Hamburg Center", "Hamburg Airport" and so on. The application runs in a closed network with no internet access, so I can't use the geonames.org web services and have to import the data. :( The city with all of its districts works as an autocomplete, so each keystroke results in an XHR request.

Now my customer asked whether it is possible to include the data for the whole world. That would be about 5,000,000 rows with 45,000,000 alternative names etc.

Postgres needs about 3 seconds per query, which makes the autocomplete unusable.

Now I thought of CouchDB, which I have already worked with. My question:

I would like to post "Ham" and I want CouchDB to get all documents starting with "Ham". If I enter "Hamburg" I want it to return Hamburg and so forth.

Is CouchDB the right database for it? Which other DBs can you recommend that respond with low latency (maybe in-memory) and handle millions of records? The dataset doesn't change regularly, it's rather static!


asked Mar 12 '11 21:03

Jan L.

3 Answers

If I understand your problem correctly, everything you need is probably already built into CouchDB.

To get a range of documents with names beginning with, e.g., "Ham", you can use a request with a string range: startkey="Ham"&endkey="Ham\ufff0"
If you need a more comprehensive search, you can create a view containing the names of other places as keys. You can then query ranges using the same technique.

Here is a view function that does this:

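function(doc) {
    // for...in iterates over property names, so this assumes doc.places
    // is an object keyed by place name.
    for (var name in doc.places) {
        // Key: place name (for range queries); value: the document id.
        emit(name, doc._id);
    }
}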
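As a rough sketch of how a client could build such a range request: view keys are JSON values, so each key is JSON-encoded and then URL-encoded. The host, the database name (`places`), the design document (`search`) and the view name (`by_name`) are made up for illustration; `limit` is a standard view query parameter.

```javascript
// Build the URL for a prefix search against a CouchDB view.
// baseUrl points at a view; the one used below is hypothetical.
function prefixQueryUrl(baseUrl, prefix, limit) {
  // Keys are JSON strings, so JSON-encode first, then URL-encode.
  const startkey = encodeURIComponent(JSON.stringify(prefix));
  // "\ufff0" sorts after ordinary characters and closes the range.
  const endkey = encodeURIComponent(JSON.stringify(prefix + "\ufff0"));
  return `${baseUrl}?startkey=${startkey}&endkey=${endkey}&limit=${limit}`;
}

const url = prefixQueryUrl(
  "http://localhost:5984/places/_design/search/_view/by_name", "Ham", 20);
console.log(url);
```

Capping the result with `limit` keeps the response small for short prefixes such as "Ham".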

Also see the CouchOne blog post about CouchDB typeahead and autocomplete search and this discussion on the mailing list about CouchDB autocomplete.

answered Sep 23 '22 19:09

ssmir

Optimized search with PostgreSQL

Your search is anchored at the start and no fuzzy search logic is required. This is not the typical use case for full text search.

If it gets more fuzzy or your search is not anchored at the start, look here for more:

Similar UTF-8 strings for autocomplete field
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL

In PostgreSQL you can make use of advanced index features that should make the query very fast. In particular look at operator classes and indexes on expressions.

1) `text_pattern_ops`

Assuming your column is of type text, you would use a special index for text pattern operators like this:

CREATE INDEX name_text_pattern_ops_idx
ON tbl (name text_pattern_ops);

SELECT name
FROM   tbl
WHERE  name ~~ ('Hambu' || '%');

This is assuming that you operate with a database locale other than C - most likely de_DE.UTF-8 in your case. You could also set up a database with locale 'C'. I quote the manual here:

If you do use the C locale, you do not need the xxx_pattern_ops operator classes, because an index with the default operator class is usable for pattern-matching queries in the C locale.
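The reason an anchored pattern can use a B-tree index at all is that, in a sorted sequence, all strings sharing a prefix sit next to each other, so the match becomes a range lookup. The following is only an illustration of that idea with a sorted array and binary search, not how Postgres implements it; the sample names and the "\uffff" upper-bound sentinel are assumptions of the sketch.

```javascript
// Return the first index in sorted array `arr` whose element is >= target.
function lowerBound(arr, target) {
  let lo = 0;
  let hi = arr.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (arr[mid] < target) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Return all entries of sorted `names` that start with `prefix`.
// Assumes no name contains "\uffff" right after the prefix.
function prefixRange(names, prefix) {
  const start = lowerBound(names, prefix);
  const end = lowerBound(names, prefix + "\uffff");
  return names.slice(start, end);
}

const names = ["Kiel", "Hamm", "Hamburg Airport", "Berlin", "Hannover", "Hamburg"].sort();
console.log(prefixRange(names, "Ham")); // [ 'Hamburg', 'Hamburg Airport', 'Hamm' ]
```

Two binary searches find the boundaries, and everything in between is a match, which is why the lookup stays fast even with millions of entries.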

2) Index on expression

I'd imagine you would also want to make that search case-insensitive. So let's take another step and make that an index on an expression:

CREATE INDEX lower_name_text_pattern_ops_idx
ON tbl (lower(name) text_pattern_ops);

SELECT name
FROM   tbl
WHERE  lower(name) ~~ (lower('Hambu') || '%');

To make use of the index, the WHERE clause has to match the index expression.
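The pattern in these queries is the search string with `%` appended. If the search string comes straight from user input, any `%` or `_` in it would act as a wildcard. A minimal sketch of escaping the input on the application side, assuming the default LIKE escape character (backslash); the function name is hypothetical, and JavaScript's `toLowerCase()` may not agree with Postgres `lower()` for every character:

```javascript
// Turn raw user input into a LIKE pattern that matches it literally
// as a prefix. Assumes the default LIKE escape character, backslash.
function likePrefixPattern(input) {
  // Prefix each backslash, percent sign and underscore with a backslash,
  // then append the trailing wildcard.
  return input.toLowerCase().replace(/[\\%_]/g, "\\$&") + "%";
}

console.log(likePrefixPattern("Hambu"));   // hambu%
console.log(likePrefixPattern("50%_off")); // 50\%\_off%
```

The resulting string would be passed to the query as a bound parameter in place of the concatenated pattern.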

3) Optimize index size and speed

Finally, you might also want to impose a limit on the number of leading characters to minimize the size of your index and speed things up even further:

CREATE INDEX lower_left_name_text_pattern_ops_idx
ON tbl (lower(left(name,10)) text_pattern_ops);

SELECT name
FROM   tbl
WHERE  lower(left(name,10)) ~~ (lower('Hambu') || '%');

left() was introduced with Postgres 9.1. Use substring(name, 1,10) in older versions.

4) Cover all possible requests

What about strings with more than 10 characters?

SELECT name
FROM   tbl
WHERE  lower(left(name,10)) ~~ (lower(left('Hambu678910',10)) || '%')
AND    lower(name) ~~ (lower('Hambu678910') || '%');

This looks redundant, but you need to spell it out this way to actually use the index. The index search narrows it down to a few entries, and the additional clause filters the rest. Experiment to find the sweet spot; it depends on data distribution and typical use cases. 10 characters seems like a good starting point. For more than 10 characters, left() effectively turns into a very fast and simple hashing algorithm that's good enough for many (but not all) use cases.
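To make the two-step filtering concrete, here is a small in-memory imitation: a coarse match on the truncated, lowercased key (standing in for the index condition), followed by a recheck on the full name. It is only a sketch; the sample names are invented, and an array scan replaces the actual index lookup.

```javascript
const KEY_LEN = 10; // same length as in left(name, 10)

// Stand-in for the expression index: each row carries its truncated,
// lowercased key next to the full name.
function buildIndex(names) {
  return names.map(name => ({ key: name.slice(0, KEY_LEN).toLowerCase(), name }));
}

function search(index, input) {
  const needle = input.toLowerCase();
  const coarse = needle.slice(0, KEY_LEN);
  return index
    .filter(row => row.key.startsWith(coarse))                // index condition
    .filter(row => row.name.toLowerCase().startsWith(needle)) // recheck on full name
    .map(row => row.name);
}

const index = buildIndex([
  "Hamburg Airport", "Hamburg Alsterdorf", "Hamburg Altona", "Hamburg-Mitte", "Hamm",
]);
console.log(search(index, "Hamburg Al"));  // [ 'Hamburg Alsterdorf', 'Hamburg Altona' ]
console.log(search(index, "hamburg alt")); // [ 'Hamburg Altona' ]
```

For the 11-character input, the coarse step still returns both "Hamburg Al..." candidates and only the recheck removes "Hamburg Alsterdorf", which is the role of the second WHERE condition.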

5) Optimize disk representation with `CLUSTER`

So, the predominant access pattern will be to retrieve a bunch of adjacent rows according to our index lower_left_name_text_pattern_ops_idx. And you mostly read and hardly ever write. This is a textbook case for CLUSTER. The manual:

When a table is clustered, it is physically reordered based on the index information.

With a huge table like yours, this can dramatically improve response time because all rows to be fetched are in the same or adjacent blocks on disk.

First call:

CLUSTER tbl USING lower_left_name_text_pattern_ops_idx;

The information about which index to use is saved, and subsequent calls re-cluster the table:

CLUSTER tbl;
CLUSTER;    -- cluster all tables in the db that have previously been clustered.

If you don't want to repeat it:

ALTER TABLE tbl SET WITHOUT CLUSTER;

However, CLUSTER takes an exclusive lock on the table. If that's a problem, look into pg_repack or pg_squeeze, which can do the same without exclusive lock on the table.

6) Prevent too many rows in the result

Demand a minimum of, say, 3 or 4 characters for the search string. I add this for completeness, you probably do it anyway.
And LIMIT the number of rows returned:

SELECT name
FROM   tbl
WHERE  lower(left(name,10)) ~~ (lower('Hambu') || '%')
LIMIT  501;

LIMIT 501 fetches one row more than the 500 you are willing to show. If your query returns more than 500 rows, tell the user to narrow down their search.
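On the application side, both rules could look roughly like this. The constants, function names and response shape are assumptions for illustration, not part of any library:

```javascript
const MIN_CHARS = 3;  // minimum input length before querying
const MAX_ROWS = 500; // rows the UI is willing to show

// Decide whether the input is long enough to send a query at all.
function shouldQuery(input) {
  return input.trim().length >= MIN_CHARS;
}

// `rows` is the result of a query that ran with LIMIT MAX_ROWS + 1.
// Getting the extra row back means there are more matches than allowed.
function toResponse(rows) {
  if (rows.length > MAX_ROWS) {
    return { tooMany: true, names: [] };
  }
  return { tooMany: false, names: rows };
}

console.log(shouldQuery("Ha"));             // false
console.log(toResponse(["Hamburg"]).names); // [ 'Hamburg' ]
```

Asking for one extra row is cheaper than a separate count query, since the database can stop as soon as it has found 501 matches.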

7) Optimize filter method (operators)

If you absolutely must squeeze out every last microsecond, you can utilize operators of the text_pattern_ops family. Like this:

SELECT name
FROM   tbl
WHERE  lower(left(name, 10)) ~>=~ lower('Hambu')
AND    lower(left(name, 10)) ~<=~ (lower('Hambu') || chr(1114111));  -- U+10FFFF, the highest valid Unicode code point

You gain very little with this last stunt. Normally, standard operators are the better choice.

If you do all that, search time will be reduced to a matter of milliseconds.

answered Sep 23 '22 19:09

Erwin Brandstetter

I think a better approach is to keep your data in your database (Postgres or CouchDB) and index it with a full-text search engine like Lucene, Solr or ElasticSearch.

Having said that, there's a project integrating CouchDB with Lucene.

answered Sep 22 '22 19:09

deluan

