
Selecting n random rows from a huge database with conditions

I have a database of more than 8 million rows from which I want to randomly select n rows. I've read the popular and similar question here on StackOverflow and the article on MSDN, but I feel that the answers still don't suit my needs.

The offered solutions work great if I want a certain percentage of rows selected randomly without extra conditions. But what I want is to select n rows randomly (e.g. at most 5 rows), all matching a certain condition.

My database contains words with information such as their part of speech, tag, lemma and token. Now I want to perform a query to select 5 random words, all similar to the word in the query (e.g. give me 5 words similar to "fuzzy"). Similarity is determined by looking only at words with the same part of speech and a Levenshtein distance value above a certain threshold. I have a function in SQL Server that can calculate the Levenshtein distance.
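For reference, the post never shows how `dbo.edit_distance` maps an edit distance to a value compared against 0.5, so the normalization below is an assumption: similarity = 1 - distance / max(len(a), len(b)), which yields 1.0 for identical strings. A minimal Python sketch of that idea:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def edit_similarity(a: str, b: str) -> float:
    """Assumed normalization: similarity in [0, 1], 1.0 for identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

Under this assumed scaling, a threshold of `> 0.5` keeps words where more than half the characters line up, e.g. `edit_similarity('monarchie', 'monarchies')` is 0.9.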

The problem with the aforementioned methods is that they either have to scan all the records and calculate the Levenshtein distance for each one (which takes a lot of time!), or they only let me select a percentage of rows instead of n rows.

A query that works sufficiently well is:

SELECT DISTINCT TOP 5 LOWER(Words.TOKEN) AS LTOKEN, Words.LEMMA, TagSet.POS_Simplified, TagSet.TAG
FROM Words JOIN TagSet ON Words.TAG = TagSet.TAG
WHERE Words.LEMMA <> 'monarchie' AND TagSet.POS_Simplified = 'noun'
AND TagSet.TAG = 'NOM' AND Words.TOKEN <> 'monarchie'
AND dbo.edit_distance('monarchie', Words.Token) > 0.5

However, with TOP alone I always get the same results; I need the top rows to be random. Methods like ORDER BY NEWID() first go over the entire table and then select randomly, which is not my intended behavior because it takes far too long.

Does anyone have an idea to select n random rows fast on a huge database?


EDIT:

Someone (not on StackOverflow) offered me a solution using the OPTION clause with the FAST query hint, which returns the first n rows it finds as quickly as possible.

With OPTION (FAST 5) I get the best performance so far (10 seconds on a table of more than 8 million rows). I also changed the Levenshtein function from a SQL implementation to a C# library implementation, which sped up the performance considerably.

SELECT TOP 5 * FROM (
    SELECT DISTINCT LOWER(Words.TOKEN) AS LTOKEN, Words.LEMMA, TagSet.POS_Simplified, TagSet.TAG
    FROM Words JOIN TagSet ON Words.TAG = TagSet.TAG
    WHERE Words.LEMMA <> 'monarchie' AND TagSet.POS_Simplified = 'noun'
    AND TagSet.TAG = 'NOM' AND Words.TOKEN <> 'monarchie'
    AND dbo.clrEditDistance('monarchie', Words.Token) > 0.5
) AS T
ORDER BY NEWID()
OPTION (FAST 5)
Floris Devriendt asked Nov 04 '13 09:11

1 Answer

Avoiding a full scan is going to be difficult. If you had a column that is cheap to sample at random — say you happened to have a "dense" identity column with few gaps — you could replace Klark's approach with the following modification:

declare @results table (id bigint, name varchar(100))

while (select count(*) from @results) < 5
begin
    insert  @results (name)
    select  name
    from    (
            select  *
            from    dbo.Words
            -- probe a single random row; RAND() must be scaled *before*
            -- the convert, otherwise convert(int, rand()) is always 0
            where   IDCOLUMN = convert(int, rand() * APPX_NUMBER_OF_ROWS)
            ) as SubQueryAlias
    where   dbo.edit_distance(left(name, 4), 'APRS', 100) < 3
end

select  *
from    @results
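The loop above is essentially rejection sampling: probe random ids, keep only rows that pass the filter, and stop once n have been collected. A self-contained Python sketch of the same idea (the table, ids and predicate below are made up for illustration):

```python
import random

def sample_matching(table: dict, n: int, predicate, max_probes: int = 100_000):
    """Collect n distinct rows satisfying predicate by probing random ids.

    table maps a dense integer id to a row; max_probes bounds the loop so a
    condition with few (or no) matches cannot spin forever.
    """
    ids = list(table)
    results = {}
    probes = 0
    while len(results) < n and probes < max_probes:
        probes += 1
        rid = random.choice(ids)  # like convert(int, rand() * APPX_NUMBER_OF_ROWS)
        if rid not in results and predicate(table[rid]):
            results[rid] = table[rid]
    return list(results.values())

# Toy table: 1000 "words"; keep rows whose id is even.
words = {i: f"word{i}" for i in range(1000)}
sample = sample_matching(words, 5, lambda w: w[-1] in "02468")
```

Note the `max_probes` guard, which the T-SQL loop lacks: if the condition matches fewer than n rows (or the id column has many gaps), an unbounded loop never terminates.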
user3003007 answered Nov 06 '22 20:11