Given a column containing ngrams in a VARCHAR
with utf8mb4_unicode_ci
collation:
+---------------------------+
| ngram |
+---------------------------+
| stack overflow |
| stack |
| overflow |
| stack overflow protection |
| overflow protection |
| protection |
+---------------------------+
And a query:
SELECT * FROM ngrams WHERE ngram IN ('stack', 'stack overflow', 'protection', 'overflow')
Given the rows returned by this query, how can I keep only the rows with the longest ngrams from the returned rows?
In this example, I get 3 rows: stack
, stack overflow
, and protection
.
Then, I need to filter rows like this:
stack
, because stack overflow
exists in the returned rowsstack overflow
, because no other returned row is a ngram containing stack overflow
(there is stack overflow protection
in the table, but it's not in the returned rows)protection
toooverflow
, because stack overflow
exists in the returned rowsIt must be done in MySQL because of collations (comparisons outside of MySQL wouldn't give the same results than in MySQL). (Unless I'm not aware of some MySQL function allowing to expose the collated version of a string.)
I can think of the following solution: (sql fiddle)
SELECT ngram
FROM ngrams n1
WHERE n1.ngram IN ('stack', 'stack overflow', 'protection')
AND NOT EXISTS (
SELECT 1
FROM ngrams n2
WHERE n2.ngram IN ('stack', 'stack overflow', 'protection')
AND LENGTH(n2.ngram) > LENGTH(n1.ngram)
AND CONCAT(' ', n2.ngram, ' ') LIKE CONCAT('% ', n1.ngram, ' %')
)
It's inefficient, though, since the sub-query will be executed for every matched ngram.
So I'm searching for
If I understand your logic correctly, this query should give you the correct result:
SELECT n1.ngram
FROM
ngrams n1 LEFT JOIN ngrams n2
ON
n2.ngram IN ('stack', 'stack overflow', 'protection')
AND n2.ngram LIKE CONCAT('%', n1.ngram, '%')
AND CHAR_LENGTH(n1.ngram) < CHAR_LENGTH(n2.ngram)
WHERE
n1.ngram IN ('stack', 'stack overflow', 'protection')
AND n2.ngram IS NULL;
Please see fiddle here. But since I expect that your table could have a lot of records, while your list of words is certanly much limited, why not remove the shortest ngrams from this list before executing the actual query? My idea is to reduce the list
('stack', 'stack overflow', 'protection')
to
('stack overflow', 'protection')
and this query should do the trick:
SELECT *
FROM
ngrams
WHERE
ngram IN (
SELECT s1.ngram
FROM (
SELECT DISTINCT ngram
FROM ngrams
WHERE ngram IN ('stack','stack overflow','protection')
) s1 LEFT JOIN (
SELECT DISTINCT ngram
FROM ngrams
WHERE ngram IN ('stack','stack overflow','protection')
) s2
ON s2.ngram LIKE CONCAT('%', s1.ngram, '%')
AND CHAR_LENGTH(s1.ngram) < CHAR_LENGTH(s2.ngram)
WHERE
s2.ngram IS NULL
);
Yes I'm querying the table ngrams
twice before joining the result back to ngrams
again, because we have to make sure that the longest value actually exists in the table, but if you have a proper index on the ngram column the two derived queries that use DISTINCT should be very efficient:
ALTER TABLE ngrams ADD INDEX idx_ngram (ngram);
Fiddle is here.
Edit:
As samuil correctly noted, if you just need to find the shortest ngram and not the whole rows associated to it, then you don't need the outer query, and you can just execute the inner query. With the proper index, two SELECT DISTINCT queries will be very efficient, and even if the JOIN cannot be optimized (n2.ngram LIKE CONCAT('%', n1.ngram, '%')
can't take advantage of an index) it will be executed only on a few already filtered records and should be quite fast.
After doing this without first looking at the other solutions, I see that it's similar to your existing best solution, but slightly simpler to read and possibly a bit more efficient;
SELECT n1.ngram
FROM ngrams n1
LEFT JOIN ngrams n2
ON n2.ngram IN ('stack', 'stack overflow', 'protection', 'overflow')
AND n1.ngram <> n2.ngram
AND INSTR(n2.ngram, n1.ngram) > 0
WHERE n1.ngram IN ('stack', 'stack overflow', 'protection', 'overflow')
AND n2.ngram IS NULL;
An SQLfiddle to test with.
Since there is no calculation on the AND n1.ngram <> n2.ngram
line, the query should be able to use indexes a bit more efficiently.
You are trying to filter the ngrams in the query itself. It is probably more efficient to do it in two steps. Start with a table with all possible ngrams:
CREATE TABLE original (ngram varchar(100) NOT NULL)
GO
CREATE TABLE refined (ngram varchar(100) NOT NULL PRIMARY KEY)
GO
INSERT INTO original (ngram)
SELECT DISTINCT ngram
FROM ngrams
WHERE ngram IN ('stack', 'stack overflow', 'protection')
GO
INSERT INTO refined (ngram)
SELECT ngram
FROM original
Then delete the ones you do not want. For each ngram, generate all possible substrings. For each substring, delete that entry (if any) from the list. It takes a couple of nested loops, but unless your ngrams contain an extremely large number of words, it should not take much time.
CREATE PROCEDURE refine()
BEGIN
DECLARE done INT DEFAULT FALSE;
DECLARE words varchar(100);
DECLARE posFrom, posTo int;
DECLARE cur CURSOR FOR SELECT ngram FROM original;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
OPEN cur;
read_loop: LOOP
FETCH cur INTO words;
IF done THEN
LEAVE read_loop;
END IF;
SET posFrom = 1;
REPEAT
SET posTo = LOCATE(' ', words, posFrom);
WHILE posTo > 0 DO
DELETE FROM refined WHERE ngram = SUBSTRING(words, posFrom, posTo - posFrom);
SET posTo = LOCATE(' ', words, posTo + 1);
END WHILE;
IF posFrom > 1 THEN
DELETE FROM refined WHERE ngram = SUBSTRING(words, posFrom);
END IF;
SET posFrom = LOCATE(' ', words, posFrom) + 1;
UNTIL posFrom = 1 END REPEAT;
END LOOP;
CLOSE cur;
END
What's left, is a table with only the longest ngrams:
CALL refine;
SELECT ngram FROM refined;
SQL Fiddle: http://sqlfiddle.com/#!2/029dc/1/1
EDIT: I added an index on table refined
; now it should run in O(n) time.
I think you can use self inner join on LIKE %original string%
and choose only those rows that have ngram length equal to the longest joined ngram length.
SELECT n1.* FROM ngrams n1
INNER JOIN ngrams n2 ON
n2.ngram LIKE CONCAT('%', `n1`.`ngram`, '%')
AND n2.ngram IN ('stack overflow', 'stack')
WHERE n1.ngram IN ('stack overflow', 'stack')
GROUP BY n1.ngram
HAVING MAX(CHAR_LENGTH(n2.ngram)) = CHAR_LENGTH(n1.ngram);
Downside of this solution is that you need to provide your string list twice.
It turns out that you don't need to provide list twice:
SELECT n1.*
FROM ngrams n1
INNER JOIN ngrams n2 ON
n2.ngram LIKE CONCAT('%', `n1`.`ngram`, '%')
AND n2.ngram IN ('stack overflow', 'stack')
GROUP BY n1.ngram
HAVING MAX(CHAR_LENGTH(n2.ngram)) = CHAR_LENGTH(n1.ngram);
This slight modification to your query:
SELECT ngram
FROM ngrams n1
WHERE n1.ngram IN ('stack', 'stack overflow', 'protection') AND
NOT EXISTS (SELECT 1
FROM ngrams n2
WHERE n2.ngram IN ('stack', 'stack overflow', 'protection') AND
n2.ngram <> n1.ngram AND
n2.ngram LIKE CONCAT('% ', n1.ngram, ' %')
);
Should be pretty optimally fast with an index on ngrams(ngram)
. Note that this simplifies the like
condition. I see no reason why you should be worried about word boundaries. Wouldn't "stacks" be a longer version of "stack"? (Although the items referred to by n-grams can be words, I associate them with letters unless otherwise noted.)
With the index, this should be equivalent in performance to other solutions using join
.
If I had to do this zillions of times and the ngram table were not too big, I would preprocess it to get all pairs of "generalizations" -- ngram_pairs
. This changes the above to
SELECT ngram
FROM ngrams n1
WHERE n1.ngram IN ('stack', 'stack overflow', 'protection') AND
NOT EXISTS (SELECT 1
FROM ngram_pairs np
WHERE np.ngram1 = n1.ngram and
np.ngram2 in ('stack', 'stack overflow', 'protection')
)
This should perform much better than the like
with an index on ngram_pairs(ngram1, ngram2)
. The following is the code for generating ngram_pairs
:
create table ngram_pairs as
select n1.ngram as ngram1, n2.ngram as ngram2
from ngrams n1 join
ngrams n2
on length(n1.ngram) < length(n2.ngram) and
n2.ngram like concat('%', n1.ngram, '%');
create index ngram_pairs_ngram1_ngram2 on ngram_pairs(ngram1, ngram2);
Try this query using user variable
select
ngram
from
(select
ngram,
@t:=if(@prev=rank, @t+1, 1) as num,
@prev:=rank
from
(select
ngram,
@rank:=if(@prev like concat(ngram,'%'), @rank, @rank+1) as rank,
CHAR_LENGTH(ngram) as size,
@prev:=ngram
from
tbl
join
(select
@prev:='',
@rank:=1) t
where
ngram in ('stack overflow', 'stack', 'protection')
order by
rank, size desc
)t
join
(select
@t:=0,
@prev:=0) t1
) t
where
num =1
| NGRAM |
|----------------|
| stack overflow |
| protection |
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With