Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Implement find/find next algorithm

Tags:

I have a database table (mysql/pgsql) with the following format:

id|text  1| the cat is black  2| a cat is a cat  3| a dog 

I need to select the line that contains nth match of a word:

eg: "Select the 3rd match for the word cat, that is the number 2 entry." Results: the 2nd row from the result where the 3rd word is cat

The only solution I could find is to search for all entries that have the text cat, load them in memory and find the match by counting them. But this is not efficient for a big number of matches(>1 million).

How would you handle this in an efficient way? Is there anything you can do directly in the database? Maybe using other technologies like lucene?

Update: having 1 million strings in memory might not be a big issue but the expectation of the application is to have between 1k-50k active users that might do this operation concurrently.

like image 819
johnlemon Avatar asked Oct 23 '15 19:10

johnlemon


2 Answers

Consider creating another table with the below structure

Table : index_table columns :           index_id , word, occurrence, id(foreign key to your original table) 

Do one time indexing process as below:

Iterate over each entry in your original table split the text into words and for each word lookup in the new table for existence if not present insert a new entry with occurrence set as 1. If exists insert a new entry with occurrence = existing occurrence +1

Once you have done this one off indexing your selects become pretty simple. For example for cat with 3rd match will be

SELECT *  FROM original_table o, index_table idx WHERE idx.word = 'cat'    AND idx.occurrence = 3    AND o.id = idx.id 
like image 179
gipsy Avatar answered Oct 07 '22 10:10

gipsy


You do not need Lucene for this job. Furthermore, if you have a large number of positive matches, the effort to pump all required data out of your DB will well exceed the computational cost.

Here's a simple solution:

Index: we require two properties:

  1. efficiently access the words for each id
  2. efficiently access all IDs in ascending order

as follows:

create index i_words on example_data (id, string_to_array(txt, ' ')); 

Query: find the ID associated with the nth match with the following query:

select id from (     select id, unnest(string_to_array(txt, ' ')) as word     from example_data ) words where word = :w     -- :w = 'cat' offset :n - 1       -- :n = 3 limit 1; 

Executes in 2ms on 1 million rows.


Here's the full PostgreSQL setup if you'd rather try for yourself than take my word for it:

drop table if exists example_data; create table example_data (     id integer primary key,     txt text not null );  insert into example_data (select generate_series(1, 1000000, 3) as id, 'the cat is black' as txt union all select generate_series(2, 1000000, 3), 'a cat is a cat' union all select generate_series(3, 1000000, 3), 'a dog' order by id);  commit;  drop index if exists i_words; create index i_words on example_data (id, string_to_array(txt, ' '));  select id from (     select id, unnest(string_to_array(txt, ' ')) as word     from example_data ) words where word = 'cat' offset 3 - 1 limit 1;  select      id, word from (     select id, unnest(string_to_array(txt, ' ')) as word     from example_data ) words where word = 'cat' offset 3 - 1 limit 1; 
like image 34
blubb Avatar answered Oct 07 '22 11:10

blubb