Premise: Suppose you have a table containing words, where some may be distinct and some "may overlap", meaning a longer word starts with a shorter one, eg:
---------------
| word |
---------------
| dog | *
| games | *
| stat |
| state |
| statement | *
| fulfill |
| fulfilled | *
| fulfillment | *
---------------
Question: How does one write a query that returns the list of non-overlapping + the longest overlapping words in such cases?
In the example above, the desired words are identified by a *
, as per the following extended explanation:
dog
and games
don't overlap with anything, so they're the longest in their "solo/distinct" categorystatement
overlaps with state
and stat
and is the longestfulfilled
overlaps with fulfill
and is longer (does NOT overlap with fulfillment
)
fulfillment
overlaps with fulfill
and is longer (does NOT overlap with fulfilled
)
Note: Please be aware that for the sake of simplicity, the data sample is reduced. In reality there are a few million records to query and, there are no search terms known before hand, so it's not possible to directly use constructs such as WHERE word LIKE 'stat%'
. Not sure if relevant or not, but the words have a relatively short max length, eg 20.
Something like
select word
from your_table t1
where not exists (
select word
from your_table t2
where t2.word like t1.word || '_%'
)
;
The query would benefit from the existence of an index on word
. But even then it may take a long time. In any case, you can give it a try and let us know what happens.
As long as you compare prefixes and if one word is entirely included as prefix of the other, you may try match_recognize
to sequentially check prefix-matching in one pass.
But exists
is more clear, though you should examine performace on your real dataset with index on word
.
with a as ( select column_value as word from sys.odcivarchar2list( 'dog' , 'games' , 'stat' , 'state' , 'statement' , 'fulfill' , 'fulfilled' , 'fulfillment' ) ) select * from a match_recognize( order by word desc /*Longest first*/ measures a.word as word one row per match /*Exclude any number of words each of which is a prefix of the previous one */ pattern (a {- b* -}) define /*Current word is a prefix of the previous*/ b as prev(word) like word || '%' ) t
| WORD | | :---------- | | statement | | games | | fulfillment | | fulfilled | | dog |
db<>fiddle here
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With