I have a database table (MySQL/PostgreSQL) with the following format:
id | text
---+-----------------
 1 | the cat is black
 2 | a cat is a cat
 3 | a dog
I need to select the row that contains the nth match of a word.
For example: "Select the 3rd match of the word cat" should return row 2, because the 3rd occurrence of "cat" falls in that row (row 1 contains the 1st occurrence, row 2 the 2nd and 3rd).
The only solution I could find is to select all entries that contain the text "cat", load them into memory, and find the match by counting. But this is not efficient for a large number of matches (> 1 million).
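For concreteness, a minimal sketch of this naive approach (the names example_data, id, and txt are taken from the PostgreSQL setup further down; the actual counting would happen in application code):

-- Pull every row containing 'cat'; the application then scans these rows
-- in memory and counts word matches until it reaches the nth one.
select id, txt
from example_data
where txt like '%cat%'
order by id;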
How would you handle this in an efficient way? Is there anything you can do directly in the database? Maybe using other technologies like Lucene?

Update: having 1 million strings in memory might not be a big issue, but the application is expected to have between 1k and 50k active users that might perform this operation concurrently.
Consider creating another table with the structure below:

Table: index_table
Columns: index_id, word, occurrence, id (foreign key to your original table)
Do a one-time indexing pass as follows:

Iterate over each entry in your original table and split the text into words. For each word, look it up in the new table: if it is not present, insert a new entry with occurrence set to 1; if it already exists, insert a new entry with occurrence = highest existing occurrence + 1.
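A minimal sketch of this one-off indexing step, assuming PostgreSQL and assuming the original table is original_table(id integer primary key, txt text); the DDL and index name below are illustrative. Instead of the row-by-row loop described above, it uses a single set-based insert with a window function, which produces the same occurrence numbering:

-- Hypothetical DDL matching the structure described above.
create table index_table (
    index_id   serial  primary key,
    word       text    not null,
    occurrence integer not null,
    id         integer not null references original_table (id)
);

-- Number every occurrence of every word in one pass: row_number() per word,
-- ordered by row id, reproduces the occurrence counter the loop would build.
insert into index_table (word, occurrence, id)
select word,
       row_number() over (partition by word order by id) as occurrence,
       id
from (
    select id, unnest(string_to_array(txt, ' ')) as word
    from original_table
) words;

-- Lookups by (word, occurrence) then benefit from an index on those columns.
create index i_word_occurrence on index_table (word, occurrence);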
Once you have done this one-off indexing, your selects become pretty simple. For example, the 3rd match for "cat" would be:
SELECT *
FROM original_table o, index_table idx
WHERE idx.word = 'cat'
  AND idx.occurrence = 3
  AND o.id = idx.id
You do not need Lucene for this job. Furthermore, if you have a large number of positive matches, the effort of pumping all the required data out of your DB would well exceed the computational cost of the query itself.
Here's a simple solution:
Index: we require two properties: the row id and the individual words of the text. A composite index covers both, as follows:
create index i_words on example_data (id, string_to_array(txt, ' '));
Query: find the ID associated with the nth match with the following query:
select id
from (
    select id, unnest(string_to_array(txt, ' ')) as word
    from example_data
) words
where word = :w  -- :w = 'cat'
offset :n - 1    -- :n = 3
limit 1;
Executes in 2ms on 1 million rows.
Here's the full PostgreSQL setup if you'd rather try for yourself than take my word for it:
drop table if exists example_data;

create table example_data (
    id  integer primary key,
    txt text    not null
);

insert into example_data
(select generate_series(1, 1000000, 3) as id, 'the cat is black' as txt
 union all
 select generate_series(2, 1000000, 3), 'a cat is a cat'
 union all
 select generate_series(3, 1000000, 3), 'a dog'
 order by id);
commit;

drop index if exists i_words;
create index i_words on example_data (id, string_to_array(txt, ' '));

select id
from (
    select id, unnest(string_to_array(txt, ' ')) as word
    from example_data
) words
where word = 'cat'
offset 3 - 1
limit 1;

select id, word
from (
    select id, unnest(string_to_array(txt, ' ')) as word
    from example_data
) words
where word = 'cat'
offset 3 - 1
limit 1;