Count number of unique characters in a string

Tags: sql, database, mysql

I'm looking for a SQL statement to count the number of unique characters in a string.

For example:

3333333333 -> returns 1
1113333333 -> returns 2
1112222444 -> returns 3

I ran some tests with REGEXP and the MySQL string functions, but I didn't find a solution.

asked Apr 30 '15 12:04

DynamicTornado

2 Answers

This is for fun, right?

SQL is all about processing sets of rows, so if we can convert a 'word' into a set of characters as rows then we can use the 'group' functions to do useful stuff.

Using a 'relational database engine' to do simple character manipulation feels wrong. Still, is it possible to answer your question with just SQL? Yes it is...

Now, I always keep a table with a single integer column that holds the ascending sequence 1 .. 500 (about 500 rows). It is called 'integerseries'. It is a really small table that is used a lot, so it stays cached in memory. It is designed to replace the from 'select 1 ... union ...' text in queries.

It is useful for generating sequential rows (a table) of anything that can be calculated from an integer, by using it in a cross join (or any inner join). I use it for generating the days of a year, parsing comma-delimited strings, etc.

Now, the SQL mid function returns the character at a given position. By using the 'integerseries' table I can 'easily' convert a 'word' into a table of characters with one row per character. Then use the 'group' functions...

SET @word = 'Hello World';

-- One row per character position, then group identical characters.
SELECT charAtIdx, COUNT(charAtIdx)
FROM (SELECT charIdx.id,
    MID(@word, charIdx.id, 1) AS charAtIdx
    FROM integerseries AS charIdx
    -- CHAR_LENGTH counts characters; LENGTH counts bytes
    WHERE charIdx.id <= CHAR_LENGTH(@word)
    ) wordLetters
GROUP BY
   wordLetters.charAtIdx
ORDER BY charAtIdx ASC;

Output:

charAtIdx  count(charAtIdx)  
---------  ------------------
                            1
d                           1
e                           1
H                           1
l                           3
o                           2
r                           1
W                           1

Note: The number of rows in the output is the number of different characters in the string. So, if the number of output rows is counted then the number of 'different letters' will be known.

This observation is used in the final query.
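
As a minimal sketch of that observation for a single string, the row count can also be taken directly with COUNT(DISTINCT ...). This is a variation rather than part of the original answer; it assumes the same 'integerseries' table and uses CHAR_LENGTH so that multi-byte characters are counted once.

```sql
SET @word = 'Hello World';

-- Count the distinct characters of one string in a single pass.
-- Assumes integerseries holds at least CHAR_LENGTH(@word) rows.
SELECT COUNT(DISTINCT MID(@word, charIdx.id, 1)) AS distinctChars
FROM integerseries AS charIdx
WHERE charIdx.id <= CHAR_LENGTH(@word);
```

Which characters count as 'the same' depends on the collation in effect: with a case-insensitive collation, 'a' and 'A' fall into one group.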

The final query:

The interesting point here is that the 'integerseries' restriction (1 .. length(word)) moves into the join condition itself rather than sitting in the where clause. This gives the optimizer a clue about how to limit the rows produced by the join.

SELECT
   wordLetterCounts.wordId,
   wordLetterCounts.word,
   COUNT(*) AS letterCount  -- one inner row per distinct character
FROM
     -- charPos is dropped here: it is neither grouped nor aggregated,
     -- which ONLY_FULL_GROUP_BY (default since MySQL 5.7) rejects.
     (SELECT words.id AS wordId,
             words.word AS word,
             MID(words.word, iseq.id, 1) AS charAtPos,
             COUNT(*) AS charAtPosCount
      FROM
          words
          JOIN integerseries AS iseq
               ON iseq.id BETWEEN 1 AND words.wordlen
      GROUP BY
            words.id,
            words.word,
            MID(words.word, iseq.id, 1)
      ) AS wordLetterCounts
GROUP BY
   wordLetterCounts.wordId,
   wordLetterCounts.word;

Output:

wordId  word                  letterCount  
------  --------------------  -------------
     1  3333333333                        1
     2  1113333333                        2
     3  1112222444                        3
     4  Hello World                       8
     5  funny - not so much?             13
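
A more compact variant is sketched here, under the assumption that the per-character counts are not needed. It is not the query from the answer: it collapses the two grouping levels into one COUNT(DISTINCT ...) and derives the length with CHAR_LENGTH instead of relying on the stored 'wordlen' column.

```sql
-- One row per word, counting distinct characters directly.
-- Assumes the words and integerseries tables defined in this answer,
-- and that integerseries covers the longest word.
SELECT w.id   AS wordId,
       w.word AS word,
       COUNT(DISTINCT MID(w.word, iseq.id, 1)) AS letterCount
FROM words AS w
JOIN integerseries AS iseq
     ON iseq.id BETWEEN 1 AND CHAR_LENGTH(w.word)
GROUP BY w.id, w.word;
```

An empty string produces no join rows, so such a word would be missing from the result; a LEFT JOIN would keep it with a count of 0.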

Word Table and Data:

CREATE TABLE `words` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `word` varchar(128) COLLATE utf8mb4_unicode_ci NOT NULL,
  `wordlen` int(11) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

/*Data for the table `words` */

insert  into `words`(`id`,`word`,`wordlen`) values (1,'3333333333',10);
insert  into `words`(`id`,`word`,`wordlen`) values (2,'1113333333',10);
insert  into `words`(`id`,`word`,`wordlen`) values (3,'1112222444',10);
insert  into `words`(`id`,`word`,`wordlen`) values (4,'Hello World',11);
insert  into `words`(`id`,`word`,`wordlen`) values (5,'funny - not so much?',20);

Integerseries table: range 1 .. 30 for this example.

CREATE TABLE `integerseries` (
  `id` int(11) unsigned NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=500 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
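
The answer does not show how the table is filled. One possible way, sketched here as an assumption rather than the author's method, is to cross join three small digit lists so that every number from 1 to 500 is produced once:

```sql
-- Fill integerseries with 1 .. 500.
-- ones and tens run 0..9, hundreds runs 0..4, so the sum covers 0..499.
INSERT INTO integerseries (id)
SELECT ones.d + 10 * tens.d + 100 * hundreds.d + 1 AS id
FROM (SELECT 0 AS d UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
      UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6
      UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) AS ones
CROSS JOIN
     (SELECT 0 AS d UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
      UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6
      UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) AS tens
CROSS JOIN
     (SELECT 0 AS d UNION ALL SELECT 1 UNION ALL SELECT 2
      UNION ALL SELECT 3 UNION ALL SELECT 4) AS hundreds;
```

The statement avoids recursive CTEs, so it should also run on MySQL versions older than 8.0.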

answered Sep 27 '22 19:09

Ryan Vincent

There is no direct or easy way of doing it. You may need to write a stored function that checks for every character you expect in the data. Here is an example for just a few digits, which could be extended to all characters in a stored function.

mysql> select * from test ;
+------------+
| val        |
+------------+
| 11111111   |
| 111222222  |
| 1113333222 |
+------------+


select
val,
-- each comparison evaluates to 1 (found) or 0 (not found) in MySQL
(locate('1',val) > 0)
+ (locate('2',val) > 0)
+ (locate('3',val) > 0)
+ (locate('4',val) > 0) as occurence
from test ;


+------------+-----------+
| val        | occurence |
+------------+-----------+
| 11111111   |         1 |
| 111222222  |         2 |
| 1113333222 |         3 |
+------------+-----------+
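
A stored function along those lines might look like the sketch below. The name count_distinct_chars and the 255-character limit are assumptions made for illustration. Instead of testing a fixed list of characters, it walks the string and remembers each character the first time it appears.

```sql
-- DELIMITER is a mysql command-line client directive, not SQL itself.
DELIMITER //

CREATE FUNCTION count_distinct_chars(s VARCHAR(255))
RETURNS INT
DETERMINISTIC
NO SQL
BEGIN
  DECLARE seen VARCHAR(255) DEFAULT '';  -- characters found so far
  DECLARE i INT DEFAULT 1;               -- 1-based position in s
  DECLARE c VARCHAR(1);

  -- A NULL input skips the loop and yields 0.
  WHILE i <= CHAR_LENGTH(s) DO
    SET c = SUBSTRING(s, i, 1);
    -- Append the character only if it has not been seen yet.
    IF LOCATE(c, seen) = 0 THEN
      SET seen = CONCAT(seen, c);
    END IF;
    SET i = i + 1;
  END WHILE;

  RETURN CHAR_LENGTH(seen);
END //

DELIMITER ;

-- Possible usage against the test table shown in this answer:
SELECT val, count_distinct_chars(val) AS occurence FROM test;
```

LOCATE compares using the collation of its arguments, so with a case-insensitive collation upper and lower case letters count as one character.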

Or, if you have enough time, create a lookup table with all the characters you can think of and reduce the query to a couple of lines:

mysql> select * from test ;
+------------+
| val        |
+------------+
| 11111111   |
| 111222222  |
| 1113333222 |
+------------+
3 rows in set (0.00 sec)

mysql> select * from look_up ;
+------+------+
| id   | val  |
+------+------+
|    1 | 1    |
|    2 | 2    |
|    3 | 3    |
|    4 | 4    |
+------+------+
4 rows in set (0.00 sec)

select
t1.val,
count(distinct t2.val) as occ  -- each lookup character counted once per string
from test t1
left join look_up t2 on locate(t2.val, t1.val) > 0
group by t1.val ;

+------------+------+
| val        | occ  |
+------------+------+
| 11111111   |    1 |
| 111222222  |    2 |
| 1113333222 |    3 |
+------------+------+

answered Sep 27 '22 20:09

Abhik Chakraborty
