I'm looking for a sql statement to count the number of unique characters in a string.
e.g.
3333333333 -> returns 1
1113333333 -> returns 2
1112222444 -> returns 3
I did some tests with REGEX and mysql-string-functions, but I didn't find a solution.
Approach: The given problem can be solved using the set data structure. The idea is to initialize an unordered set that stores all the distinct characters of the given string. The size of the set after the string is traversed is the required answer.
Python3. If not present append the characters to empty string. Now the empty string consists of only unique characters, use len() method to display the length.
First we will initialize all values of counter array to 0 and all values of index array to n (length of string). On traversal of the string str and for every character c, increase count[x], if count[x] = 1, index[x] = i. If count[x] = 2, index[x] = n. Sort indexes and print characters.
This is for fun right?
SQL is all about processing sets of rows, so if we can convert a 'word' into a set of characters as rows then we can use the 'group' functions to do useful stuff.
Using a 'relational database engine' to do simple character manipulation feels wrong. Still, is it possible to answer your question with just SQL? Yes it is...
Now, i always have a table that has one integer column that has about 500 rows in it that has the ascending sequence 1 .. 500. It is called 'integerseries'. It is a really small table that used a lot so it gets cached in memory. It is designed to replace the from 'select 1 ... union ...
text in queries.
It is useful for generating sequential rows (a table) of anything that you can calculate that is based on a integer by using it in a cross join
(also any inner join
). I use it for generating days for a year, parsing comma delimited strings etc.
Now, the sql mid
function can be used to return the character at a given position. By using the 'integerseries' table i can 'easily' convert a 'word' into a characters table with one row per character. Then use the 'group' functions...
SET @word='Hello World';
SELECT charAtIdx, COUNT(charAtIdx)
FROM (SELECT charIdx.id,
MID(@word, charIdx.id, 1) AS charAtIdx
FROM integerseries AS charIdx
WHERE charIdx.id <= LENGTH(@word)
ORDER BY charIdx.id ASC
) wordLetters
GROUP BY
wordLetters.charAtIdx
ORDER BY charAtIdx ASC
Output:
charAtIdx count(charAtIdx)
--------- ------------------
1
d 1
e 1
H 1
l 3
o 2
r 1
W 1
Note: The number of rows in the output is the number of different characters in the string. So, if the number of output rows is counted then the number of 'different letters' will be known.
This observation is used in the final query.
The final query:
The interesting point here is to move the 'integerseries' 'cross join' restrictions (1 .. length(word)) into the actual 'join' rather than do it in the where
clause. This provides the optimizer with clues as to how to restrict the data produced when doing the join
.
SELECT
wordLetterCounts.wordId,
wordLetterCounts.word,
COUNT(wordLetterCounts.wordId) AS letterCount
FROM
(SELECT words.id AS wordId,
words.word AS word,
iseq.id AS charPos,
MID(words.word, iseq.id, 1) AS charAtPos,
COUNT(MID(words.word, iseq.id, 1)) AS charAtPosCount
FROM
words
JOIN integerseries AS iseq
ON iseq.id BETWEEN 1 AND words.wordlen
GROUP BY
words.id,
MID(words.word, iseq.id, 1)
) AS wordLetterCounts
GROUP BY
wordLetterCounts.wordId
Output:
wordId word letterCount
------ -------------------- -------------
1 3333333333 1
2 1113333333 2
3 1112222444 3
4 Hello World 8
5 funny - not so much? 13
Word Table and Data:
CREATE TABLE `words` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`word` varchar(128) COLLATE utf8mb4_unicode_ci NOT NULL,
`wordlen` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
/*Data for the table `words` */
insert into `words`(`id`,`word`,`wordlen`) values (1,'3333333333',10);
insert into `words`(`id`,`word`,`wordlen`) values (2,'1113333333',10);
insert into `words`(`id`,`word`,`wordlen`) values (3,'1112222444',10);
insert into `words`(`id`,`word`,`wordlen`) values (4,'Hello World',11);
insert into `words`(`id`,`word`,`wordlen`) values (5,'funny - not so much?',20);
Integerseries table: range 1 .. 30 for this example.
CREATE TABLE `integerseries` (
`id` int(11) unsigned NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=500 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
There is no direct or easy way of doing it. You may need to write a store function to do the job and by looking at all the characters you may expect in the data. Here is an example for just digits , which could be extended for all the characters in a stored function
mysql> select * from test ;
+------------+
| val |
+------------+
| 11111111 |
| 111222222 |
| 1113333222 |
+------------+
select
val,
sum(case when locate('1',val) > 0 then 1 else 0 end )
+ sum( case when locate('2',val) > 0 then 1 else 0 end)
+ sum(case when locate('3',val) > 0 then 1 else 0 end)
+sum(case when locate('4',val) > 0 then 1 else 0 end ) as occurence
from test group by val
+------------+-----------+
| val | occurence |
+------------+-----------+
| 11111111 | 1 |
| 111222222 | 2 |
| 1113333222 | 3 |
+------------+-----------+
Or if you have enough time , create a lookup table with all the characters you could think of. And make the query in 2 lines
mysql> select * from test ;
+------------+
| val |
+------------+
| 11111111 |
| 111222222 |
| 1113333222 |
+------------+
3 rows in set (0.00 sec)
mysql> select * from look_up ;
+------+------+
| id | val |
+------+------+
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
+------+------+
4 rows in set (0.00 sec)
select
t1.val,
sum(case when locate(t2.val,t1.val) > 0 then 1 else 0 end ) as occ
from test t1,(select * from look_up)t2
group by t1.val ;
+------------+------+
| val | occ |
+------------+------+
| 11111111 | 1 |
| 111222222 | 2 |
| 1113333222 | 3 |
+------------+------+
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With