Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Count number of unique characters in a string

I'm looking for a sql statement to count the number of unique characters in a string.

e.g.

3333333333 -> returns 1
1113333333 -> returns 2
1112222444 -> returns 3

I did some tests with REGEX and mysql-string-functions, but I didn't find a solution.

like image 423
DynamicTornado Avatar asked Apr 30 '15 12:04

DynamicTornado


People also ask

How do you count unique characters in a string?

Approach: The given problem can be solved using the set data structure. The idea is to initialize an unordered set that stores all the distinct characters of the given string. The size of the set after the string is traversed is the required answer.

How do you count unique characters in a string in Python?

Python3. If not present append the characters to empty string. Now the empty string consists of only unique characters, use len() method to display the length.

How do you get distinct characters and their counts in a string in CPP?

First we will initialize all values of counter array to 0 and all values of index array to n (length of string). On traversal of the string str and for every character c, increase count[x], if count[x] = 1, index[x] = i. If count[x] = 2, index[x] = n. Sort indexes and print characters.


2 Answers

This is for fun right?

SQL is all about processing sets of rows, so if we can convert a 'word' into a set of characters as rows then we can use the 'group' functions to do useful stuff.

Using a 'relational database engine' to do simple character manipulation feels wrong. Still, is it possible to answer your question with just SQL? Yes it is...

Now, i always have a table that has one integer column that has about 500 rows in it that has the ascending sequence 1 .. 500. It is called 'integerseries'. It is a really small table that used a lot so it gets cached in memory. It is designed to replace the from 'select 1 ... union ... text in queries.

It is useful for generating sequential rows (a table) of anything that you can calculate that is based on a integer by using it in a cross join (also any inner join). I use it for generating days for a year, parsing comma delimited strings etc.

Now, the sql mid function can be used to return the character at a given position. By using the 'integerseries' table i can 'easily' convert a 'word' into a characters table with one row per character. Then use the 'group' functions...

SET @word='Hello World';

SELECT charAtIdx, COUNT(charAtIdx)
FROM (SELECT charIdx.id,
    MID(@word, charIdx.id, 1) AS charAtIdx 
    FROM integerseries AS charIdx
    WHERE charIdx.id <= LENGTH(@word)
    ORDER BY charIdx.id ASC
    ) wordLetters
GROUP BY
   wordLetters.charAtIdx
ORDER BY charAtIdx ASC  

Output:

charAtIdx  count(charAtIdx)  
---------  ------------------
                            1
d                           1
e                           1
H                           1
l                           3
o                           2
r                           1
W                           1

Note: The number of rows in the output is the number of different characters in the string. So, if the number of output rows is counted then the number of 'different letters' will be known.

This observation is used in the final query.

The final query:

The interesting point here is to move the 'integerseries' 'cross join' restrictions (1 .. length(word)) into the actual 'join' rather than do it in the where clause. This provides the optimizer with clues as to how to restrict the data produced when doing the join.

SELECT 
   wordLetterCounts.wordId,
   wordLetterCounts.word,   
   COUNT(wordLetterCounts.wordId) AS letterCount
FROM 
     (SELECT words.id AS wordId,
             words.word AS word,
             iseq.id AS charPos,
             MID(words.word, iseq.id, 1) AS charAtPos,
             COUNT(MID(words.word, iseq.id, 1)) AS charAtPosCount
     FROM
          words
          JOIN integerseries AS iseq
               ON iseq.id BETWEEN 1 AND words.wordlen 
      GROUP BY
            words.id,
            MID(words.word, iseq.id, 1)
      ) AS wordLetterCounts
GROUP BY
   wordLetterCounts.wordId  

Output:

wordId  word                  letterCount  
------  --------------------  -------------
     1  3333333333                        1
     2  1113333333                        2
     3  1112222444                        3
     4  Hello World                       8
     5  funny - not so much?             13

Word Table and Data:

CREATE TABLE `words` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `word` varchar(128) COLLATE utf8mb4_unicode_ci NOT NULL,
  `wordlen` int(11) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

/*Data for the table `words` */

insert  into `words`(`id`,`word`,`wordlen`) values (1,'3333333333',10);
insert  into `words`(`id`,`word`,`wordlen`) values (2,'1113333333',10);
insert  into `words`(`id`,`word`,`wordlen`) values (3,'1112222444',10);
insert  into `words`(`id`,`word`,`wordlen`) values (4,'Hello World',11);
insert  into `words`(`id`,`word`,`wordlen`) values (5,'funny - not so much?',20);

Integerseries table: range 1 .. 30 for this example.

CREATE TABLE `integerseries` (
  `id` int(11) unsigned NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=500 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
like image 200
Ryan Vincent Avatar answered Sep 27 '22 19:09

Ryan Vincent


There is no direct or easy way of doing it. You may need to write a store function to do the job and by looking at all the characters you may expect in the data. Here is an example for just digits , which could be extended for all the characters in a stored function

mysql> select * from test ;
+------------+
| val        |
+------------+
| 11111111   |
| 111222222  |
| 1113333222 |
+------------+


select 
val, 
sum(case when locate('1',val) > 0 then 1 else 0 end ) 
+ sum( case when locate('2',val) > 0 then 1 else 0 end)
+ sum(case when locate('3',val) > 0 then 1 else 0 end)
+sum(case when locate('4',val) > 0 then 1 else 0 end ) as occurence
from test group by val


+------------+-----------+
| val        | occurence |
+------------+-----------+
| 11111111   |         1 |
| 111222222  |         2 |
| 1113333222 |         3 |
+------------+-----------+

Or if you have enough time , create a lookup table with all the characters you could think of. And make the query in 2 lines

mysql> select * from test ;
+------------+
| val        |
+------------+
| 11111111   |
| 111222222  |
| 1113333222 |
+------------+
3 rows in set (0.00 sec)

mysql> select * from look_up ;
+------+------+
| id   | val  |
+------+------+
|    1 | 1    |
|    2 | 2    |
|    3 | 3    |
|    4 | 4    |
+------+------+
4 rows in set (0.00 sec)

select 
t1.val, 
sum(case when locate(t2.val,t1.val) > 0 then 1 else 0 end ) as occ 
from test t1,(select * from look_up)t2 
group by t1.val ;

+------------+------+
| val        | occ  |
+------------+------+
| 11111111   |    1 |
| 111222222  |    2 |
| 1113333222 |    3 |
+------------+------+
like image 43
Abhik Chakraborty Avatar answered Sep 27 '22 20:09

Abhik Chakraborty