Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Abbreviation of Strings that Remains Unique

I have a unique list of strings (the original idea was the column names in a table). The task is to perform a maximal possible abbreviation of the list, so the list remains distinct.

For example AAA, AB can be abbreviated to AA, AB. (But not to A, AB – as A could be prefix of both AAA and AB). AAAA, BAAAA can be shorten to A, B. But A1, A2 can’t be abbreviated at all.

Here are the sample data

create table tab as 
select 'AAA' col from dual union all
select 'AABA' col from dual union all
select 'COL1' col from dual union all
select 'COL21' col from dual union all
select 'AAAAAA' col from dual union all
select 'BBAA' col from dual union all
select 'BAAAA' col from dual union all
select 'AB' col from dual;

The expected result is

COL    ABR_COL                
------ ------------------------
AAA    AAA                      
AAAAAA AAAA                     
AABA   AAB                      
AB     AB                       
BAAAA  BA                       
BBAA   BB                       
COL1   COL1                     
COL21  COL2        

I managed a brute force solution consisting of four subqueries, which I do not post on purpose, because I hope there exists a more simple solution from which I do not want to distract.

Btw there is a similar function in r called abbreviate, but I’m looking for SQL solution. Prefered Oracle solutions for other RDBMS are welcommed.

like image 464
Marmite Bomber Avatar asked Oct 06 '18 07:10

Marmite Bomber


People also ask

What are short forms used for?

An abbreviation, simply put, is a shortened form of a word. In writing, abbreviations are useful when you need to squeeze a lot of writing into a small space. You can also use them in place of long or cumbersome phrases to make your sentences easier to read.

What is the name of short forms?

An abbreviation (from Latin brevis, meaning short) is a shortened form of a word or phrase, by any method.

How do you abbreviate for each?

ea. written abbreviation for each: used to give the price, weight, etc.


2 Answers

I would do the filtering in the recursive CTE:

with potential_abbreviations(col, abbr, lev) as (
      select col, col as abbr, 1 as lev
      from tab
      union all
      select pa.col, substr(pa.abbr, 1, length(pa.abbr) - 1) as abbr, lev + 1
      from potential_abbreviations pa
      where length(abbr) > 1 and
            not exists (select 1
                        from tab
                        where tab.col like substr(pa.abbr, 1, length(pa.abbr) - 1) || '%' and
                              tab.col <> pa.col
                       )
     )
select pa.col, pa.abbr
from (select pa.*, row_number() over (partition by pa.col order by pa.lev desc) as seqnum
      from potential_abbreviations pa
     ) pa
where seqnum = 1

Here is a db<>fiddle.

The lev is strictly not necessary. You can use length(abbr) desc in the order by. But, I usually include a recursion counter when I use recursive CTEs, so this is habit.

Doing the extra comparison in the CTE may look more complicated, but it simplifies the execution -- the recursion stops at the correct value.

This is also tested on unique single letter col values.

like image 159
Gordon Linoff Avatar answered Sep 25 '22 14:09

Gordon Linoff


This is actually possible using a recursive CTE. I don't really get it shorter than three subqueries (plus one query), but at least it is not constrained by string length. The steps are roughly as follows:

  1. Calculate all potential abbreviations with a recursive CTE. This selects all column names themselves and then the column names shortened by one letter, recursively:

Table:

 col    abbr
 --- -------
 AAA    AAA
 AAA    AA
 AAA    A
 ...
  1. For each abbreviation, count how often it occurs

Table

ABBR    CONFLICT
----    --------
AA      3
AAA     2
AABA    1
...
  1. Select the abbreviations that are the unique shortest ones, and also the abbreviations that are just the column name itself, and rank these by length of the abbreviation. In the example, you see that AAA conflicts with some other abbreviation but still must be chosen as it is equal to its unshortened name.

Table

COL     ABBR    CONFLICT    POS
-------------------------------
AAA     AAA     2           1
AAAAAA  AAAA    1           1
AAAAAA  AAAAA   1           2
AAAAAA  AAAAAA  1           3
AABA    AAB     1           1
...
  1. Choose the first ranked abbreviation (or column name itself) for each column.

Table

COL     ABBR    POS
-------------------
AAA     AAA     1
AAAAAA  AAAA    1
AABA    AAB     1
...

Complete SQL

This results in the following SQL, with the above steps as CTEs:

with potential_abbreviations(col,abbr) as (
  select
      col
    , col as abbr
  from tab
  union all
  select
    col
  , substr(abbr, 1, length(abbr)-1 ) as abbr
  from potential_abbreviations
  where length(abbr) > 1
)
, abbreviation_counts as (
  select abbr
       , count(*) as conflict
  from potential_abbreviations
  group by abbr
)
, all_unique_abbreviations(col,abbr,conflict,pos) as (
select
    p.col
  , p.abbr
  , conflict
  , rank() over (partition by col order by p.abbr) as pos
  from potential_abbreviations p
    join abbreviation_counts c on p.abbr = c.abbr
    where conflict = 1 or p.col = p.abbr
)
select col, abbr, pos
from all_unique_abbreviations
where pos = 1
 order by col, abbr

Result

COL     ABBR
------- ----
AAA     AAA
AAAAAA  AAAA
AABA    AAB
AB      AB
AC1     AC
AD      AD
BAAAA   BA
BBAA    BB
COL1    COL1
COL21   COL2

SQL Fiddle

like image 29
Corion Avatar answered Sep 24 '22 14:09

Corion