Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get the longest "overlapping" words

Tags:

sql

oracle

Premise: Suppose you have a table containing words, where some may be distinct and some "may overlap", meaning a longer word starts with a shorter one, eg:

---------------
|    word     |
---------------
| dog         | *
| games       | *
| stat        |
| state       | 
| statement   | *
| fulfill     |
| fulfilled   | *
| fulfillment | *
---------------

Question: How does one write a query that returns the list of non-overlapping + the longest overlapping words in such cases?

In the example above, the desired words are identified by a *, as per the following extended explanation:

  • dog and games don't overlap with anything, so they're the longest in their "solo/distinct" category
  • statement overlaps with state and stat and is the longest
  • fulfilled overlaps with fulfill and is longer (does NOT overlap with fulfillment)
  • fulfillment overlaps with fulfill and is longer (does NOT overlap with fulfilled)

Note: Please be aware that for the sake of simplicity, the data sample is reduced. In reality there are a few million records to query and, there are no search terms known before hand, so it's not possible to directly use constructs such as WHERE word LIKE 'stat%'. Not sure if relevant or not, but the words have a relatively short max length, eg 20.

like image 833
Morfic Avatar asked Nov 02 '21 13:11

Morfic


2 Answers

Something like

select word
from   your_table t1
where  not exists (
                    select word
                    from   your_table t2
                    where  t2.word like t1.word || '_%'
                  )
;

The query would benefit from the existence of an index on word. But even then it may take a long time. In any case, you can give it a try and let us know what happens.

like image 81
mathguy Avatar answered Sep 30 '22 12:09

mathguy


As long as you compare prefixes and if one word is entirely included as prefix of the other, you may try match_recognize to sequentially check prefix-matching in one pass.

But exists is more clear, though you should examine performace on your real dataset with index on word.

with a as (
  select
    column_value as word
  from sys.odcivarchar2list(
    'dog'
    , 'games'
    , 'stat'
    , 'state'
    , 'statement'
    , 'fulfill'
    , 'fulfilled'
    , 'fulfillment'
  )
)
select *
from a 
match_recognize(
  order by word desc /*Longest first*/
  measures
    a.word as word
  one row per match
  /*Exclude any number of words
    each of which is a prefix of the previous one
  */
  pattern (a {- b* -})
  define
    /*Current word is a prefix of the previous*/
    b as prev(word) like word || '%'
) t
| WORD        |
| :---------- |
| statement   |
| games       |
| fulfillment |
| fulfilled   |
| dog         |

db<>fiddle here

like image 33
astentx Avatar answered Sep 30 '22 12:09

astentx