How to get the longest "overlapping" words

Question

Premise: Suppose you have a table containing words, where some may be distinct and some "may overlap", meaning a longer word starts with a shorter one, eg:

---------------
|    word     |
---------------
| dog         | *
| games       | *
| stat        |
| state       | 
| statement   | *
| fulfill     |
| fulfilled   | *
| fulfillment | *
---------------

Question: How does one write a query that returns the list of non-overlapping + the longest overlapping words in such cases?

In the example above, the desired words are identified by a *, as per the following extended explanation:

dog and games don't overlap with anything, so they're the longest in their "solo/distinct" category
statement overlaps with state and stat and is the longest
fulfilled overlaps with fulfill and is longer (does NOT overlap with fulfillment)
fulfillment overlaps with fulfill and is longer (does NOT overlap with fulfilled)

Note: Please be aware that for the sake of simplicity, the data sample is reduced. In reality there are a few million records to query and, there are no search terms known before hand, so it's not possible to directly use constructs such as WHERE word LIKE 'stat%'. Not sure if relevant or not, but the words have a relatively short max length, eg 20.

mathguy · Accepted Answer

Something like

select word
from   your_table t1
where  not exists (
                    select word
                    from   your_table t2
                    where  t2.word like t1.word || '_%'
                  )
;

The query would benefit from the existence of an index on word. But even then it may take a long time. In any case, you can give it a try and let us know what happens.

astentx · Answer

As long as you compare prefixes and if one word is entirely included as prefix of the other, you may try match_recognize to sequentially check prefix-matching in one pass.

But exists is more clear, though you should examine performace on your real dataset with index on word.

with a as (
  select
    column_value as word
  from sys.odcivarchar2list(
    'dog'
    , 'games'
    , 'stat'
    , 'state'
    , 'statement'
    , 'fulfill'
    , 'fulfilled'
    , 'fulfillment'
  )
)
select *
from a 
match_recognize(
  order by word desc /*Longest first*/
  measures
    a.word as word
  one row per match
  /*Exclude any number of words
    each of which is a prefix of the previous one
  */
  pattern (a {- b* -})
  define
    /*Current word is a prefix of the previous*/
    b as prev(word) like word || '%'
) t

| WORD        |
| :---------- |
| statement   |
| games       |
| fulfillment |
| fulfilled   |
| dog         |

db<>fiddle here

How to get the longest "overlapping" words

Tags:

sql

oracle

Morfic

2 Answers

mathguy

astentx

Recent Activity

Donate For Us

How to get the longest "overlapping" words

Tags:

sql

oracle

Morfic

2 Answers

mathguy

astentx

Related questions

Recent Activity

Donate For Us