Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find longest matching ngrams in MySQL

Tags:

sql

mysql

rdbms

Given a column containing ngrams in a VARCHAR with utf8mb4_unicode_ci collation:

+---------------------------+
| ngram                     |
+---------------------------+
| stack overflow            |
| stack                     |
| overflow                  |
| stack overflow protection |
| overflow protection       |
| protection                |
+---------------------------+

And a query:

SELECT * FROM ngrams WHERE ngram IN ('stack', 'stack overflow', 'protection', 'overflow')

Given the rows returned by this query, how can I keep only the rows with the longest ngrams from the returned rows?

In this example, I get 3 rows: stack, stack overflow, and protection.

Then, I need to filter rows like this:

  • I filter out stack, because stack overflow exists in the returned rows
  • I keep stack overflow, because no other returned row is a ngram containing stack overflow (there is stack overflow protection in the table, but it's not in the returned rows)
  • I keep protection too
  • I filter out overflow, because stack overflow exists in the returned rows

It must be done in MySQL because of collations (comparisons outside of MySQL wouldn't give the same results than in MySQL). (Unless I'm not aware of some MySQL function allowing to expose the collated version of a string.)


I can think of the following solution: (sql fiddle)

SELECT  ngram
FROM    ngrams n1
WHERE   n1.ngram IN ('stack', 'stack overflow', 'protection')
AND     NOT EXISTS (
    SELECT  1
    FROM    ngrams n2
    WHERE   n2.ngram IN ('stack', 'stack overflow', 'protection')
    AND     LENGTH(n2.ngram) > LENGTH(n1.ngram)
    AND     CONCAT(' ', n2.ngram, ' ') LIKE CONCAT('% ', n1.ngram, ' %')
)

It's inefficient, though, since the sub-query will be executed for every matched ngram.


So I'm searching for

  • either a way to make this query efficient
  • or a way to do this reliably outside of MySQL (taking collations into account)
like image 900
Arnaud Le Blanc Avatar asked May 05 '14 18:05

Arnaud Le Blanc


6 Answers

If I understand your logic correctly, this query should give you the correct result:

SELECT n1.ngram
FROM
  ngrams n1 LEFT JOIN ngrams n2
  ON
    n2.ngram IN ('stack', 'stack overflow', 'protection')
    AND n2.ngram LIKE CONCAT('%', n1.ngram, '%')
    AND CHAR_LENGTH(n1.ngram) < CHAR_LENGTH(n2.ngram)
WHERE
  n1.ngram IN ('stack', 'stack overflow', 'protection')
  AND n2.ngram IS NULL;

Please see fiddle here. But since I expect that your table could have a lot of records, while your list of words is certanly much limited, why not remove the shortest ngrams from this list before executing the actual query? My idea is to reduce the list

('stack', 'stack overflow', 'protection')

to

('stack overflow', 'protection')

and this query should do the trick:

SELECT *
FROM
  ngrams
WHERE
  ngram IN (
    SELECT s1.ngram
    FROM (
      SELECT DISTINCT ngram
      FROM ngrams
      WHERE ngram IN ('stack','stack overflow','protection')
    ) s1 LEFT JOIN (
      SELECT DISTINCT ngram
      FROM ngrams
      WHERE ngram IN ('stack','stack overflow','protection')
    ) s2
      ON s2.ngram LIKE CONCAT('%', s1.ngram, '%')
         AND CHAR_LENGTH(s1.ngram) < CHAR_LENGTH(s2.ngram)
    WHERE
      s2.ngram IS NULL
  );

Yes I'm querying the table ngrams twice before joining the result back to ngrams again, because we have to make sure that the longest value actually exists in the table, but if you have a proper index on the ngram column the two derived queries that use DISTINCT should be very efficient:

ALTER TABLE ngrams ADD INDEX idx_ngram (ngram);

Fiddle is here.

Edit:

As samuil correctly noted, if you just need to find the shortest ngram and not the whole rows associated to it, then you don't need the outer query, and you can just execute the inner query. With the proper index, two SELECT DISTINCT queries will be very efficient, and even if the JOIN cannot be optimized (n2.ngram LIKE CONCAT('%', n1.ngram, '%') can't take advantage of an index) it will be executed only on a few already filtered records and should be quite fast.

like image 66
fthiella Avatar answered Nov 11 '22 20:11

fthiella


After doing this without first looking at the other solutions, I see that it's similar to your existing best solution, but slightly simpler to read and possibly a bit more efficient;

SELECT n1.ngram
FROM ngrams n1
LEFT JOIN ngrams n2
  ON n2.ngram IN ('stack', 'stack overflow', 'protection', 'overflow')
 AND n1.ngram <> n2.ngram
 AND INSTR(n2.ngram, n1.ngram) > 0
WHERE n1.ngram IN ('stack', 'stack overflow', 'protection', 'overflow')
 AND n2.ngram IS NULL;

An SQLfiddle to test with.

Since there is no calculation on the AND n1.ngram <> n2.ngram line, the query should be able to use indexes a bit more efficiently.

like image 43
Joachim Isaksson Avatar answered Nov 11 '22 20:11

Joachim Isaksson


You are trying to filter the ngrams in the query itself. It is probably more efficient to do it in two steps. Start with a table with all possible ngrams:

CREATE TABLE original (ngram varchar(100) NOT NULL)
GO

CREATE TABLE refined (ngram varchar(100) NOT NULL PRIMARY KEY)
GO

INSERT INTO original (ngram)
SELECT DISTINCT ngram
FROM ngrams
WHERE ngram IN ('stack', 'stack overflow', 'protection')
GO

INSERT INTO refined (ngram)
SELECT ngram
FROM original

Then delete the ones you do not want. For each ngram, generate all possible substrings. For each substring, delete that entry (if any) from the list. It takes a couple of nested loops, but unless your ngrams contain an extremely large number of words, it should not take much time.

CREATE PROCEDURE refine()
BEGIN
    DECLARE done INT DEFAULT FALSE;
    DECLARE words varchar(100);
    DECLARE posFrom, posTo int;
    DECLARE cur CURSOR FOR SELECT ngram FROM original;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;

    OPEN cur;

    read_loop: LOOP
        FETCH cur INTO words;
        IF done THEN
            LEAVE read_loop;
        END IF;

        SET posFrom = 1;
        REPEAT
            SET posTo = LOCATE(' ', words, posFrom);
            WHILE posTo > 0 DO
                DELETE FROM refined WHERE ngram = SUBSTRING(words, posFrom, posTo - posFrom);
                SET posTo = LOCATE(' ', words, posTo + 1);
            END WHILE;
            IF posFrom > 1 THEN
                DELETE FROM refined WHERE ngram = SUBSTRING(words, posFrom);
            END IF;
            SET posFrom = LOCATE(' ', words, posFrom) + 1;
        UNTIL posFrom = 1 END REPEAT;
    END LOOP;

    CLOSE cur;
END

What's left, is a table with only the longest ngrams:

CALL refine;

SELECT ngram FROM refined;

SQL Fiddle: http://sqlfiddle.com/#!2/029dc/1/1


EDIT: I added an index on table refined; now it should run in O(n) time.

like image 5
Ruud Helderman Avatar answered Nov 11 '22 19:11

Ruud Helderman


I think you can use self inner join on LIKE %original string% and choose only those rows that have ngram length equal to the longest joined ngram length.

SELECT n1.* FROM ngrams n1
  INNER JOIN ngrams n2 ON
    n2.ngram LIKE CONCAT('%', `n1`.`ngram`, '%')
    AND n2.ngram IN ('stack overflow', 'stack')
  WHERE n1.ngram IN ('stack overflow', 'stack')
  GROUP BY n1.ngram
  HAVING MAX(CHAR_LENGTH(n2.ngram)) = CHAR_LENGTH(n1.ngram);

Downside of this solution is that you need to provide your string list twice.


It turns out that you don't need to provide list twice:

SELECT n1.*
  FROM ngrams n1
  INNER JOIN ngrams n2 ON
    n2.ngram LIKE CONCAT('%', `n1`.`ngram`, '%')
    AND n2.ngram IN ('stack overflow', 'stack')
  GROUP BY n1.ngram
  HAVING MAX(CHAR_LENGTH(n2.ngram)) = CHAR_LENGTH(n1.ngram);
like image 4
samuil Avatar answered Nov 11 '22 19:11

samuil


This slight modification to your query:

SELECT  ngram
FROM    ngrams n1
WHERE   n1.ngram IN ('stack', 'stack overflow', 'protection') AND
        NOT EXISTS (SELECT  1
                    FROM    ngrams n2
                    WHERE   n2.ngram IN ('stack', 'stack overflow', 'protection') AND
                            n2.ngram <> n1.ngram AND
                            n2.ngram LIKE CONCAT('% ', n1.ngram, ' %')
                   );

Should be pretty optimally fast with an index on ngrams(ngram). Note that this simplifies the like condition. I see no reason why you should be worried about word boundaries. Wouldn't "stacks" be a longer version of "stack"? (Although the items referred to by n-grams can be words, I associate them with letters unless otherwise noted.)

With the index, this should be equivalent in performance to other solutions using join.

If I had to do this zillions of times and the ngram table were not too big, I would preprocess it to get all pairs of "generalizations" -- ngram_pairs. This changes the above to

SELECT  ngram
FROM    ngrams n1
WHERE   n1.ngram IN ('stack', 'stack overflow', 'protection') AND
        NOT EXISTS (SELECT  1
                    FROM    ngram_pairs np
                    WHERE   np.ngram1 = n1.ngram and
                            np.ngram2 in ('stack', 'stack overflow', 'protection') 
                   )

This should perform much better than the like with an index on ngram_pairs(ngram1, ngram2). The following is the code for generating ngram_pairs:

create table ngram_pairs as
    select n1.ngram as ngram1, n2.ngram as ngram2
    from ngrams n1 join
         ngrams n2
         on length(n1.ngram) < length(n2.ngram) and
            n2.ngram like concat('%', n1.ngram, '%');

create index ngram_pairs_ngram1_ngram2 on ngram_pairs(ngram1, ngram2);
like image 4
Gordon Linoff Avatar answered Nov 11 '22 19:11

Gordon Linoff


Try this query using user variable

select 
  ngram
from 
  (select 
    ngram, 
    @t:=if(@prev=rank, @t+1, 1) as num,
    @prev:=rank
  from 
    (select 
      ngram,
      @rank:=if(@prev like concat(ngram,'%'), @rank, @rank+1) as rank,
      CHAR_LENGTH(ngram) as size,
      @prev:=ngram
    from 
      tbl 
    join 
      (select 
         @prev:='', 
         @rank:=1) t 
    where 
       ngram in ('stack overflow', 'stack', 'protection')
    order by 
       rank, size desc
   )t
  join 
    (select 
       @t:=0, 
       @prev:=0) t1
    ) t 
  where 
    num =1

Fiddle

|          NGRAM |
|----------------|
| stack overflow |
|     protection |
like image 3
Meherzad Avatar answered Nov 11 '22 20:11

Meherzad