
How to get all distinct words of a specified minimum length from multiple columns in a MySQL table?

Tags: regex, text, sql, mysql

In a MySQL 5.6 database I have a table, tablename, which has (among others) three TEXT columns: col_a, col_b, col_c.

I want to extract all unique words (with words being separated by spaces) from these three columns that are at least 5 characters long. By "word" I mean any string of non-space characters, e.g. "foo-123" would be a word, as would "099423". The columns are all utf8 InnoDB columns.

Is there a single query to do this?

EDIT: As requested, here's an example: (in the real data col_a, col_b and col_c are TEXT fields and could have a large number of words.)

select id, col_a, col_b, col_c from tablename;

id  | col_a              | col_b          | col_c
----|--------------------|----------------|----------------------
1   | apple orange plum  | red green blue | bill dave sue
2   | orange plum banana | yellow red     | frank james
3   | kiwi fruit apple   | green pink     | bill sarah-jane frank

expected_result: ["apple", "orange", "banana", "fruit", 
                  "green", "yellow", "frank", "james", "sarah-jane"]

I don't care about the order of results. Thanks!

EDIT: in my example above, everything is in lowercase, as that's how I happen to store everything in my real-life table that this question relates to. But, for the sake of argument, if it did contain some capitalisation I would prefer the query to ignore capitalisation (this is the setting of my DB config as it happens).

EDIT2: in case it helps, all of the text columns have a FULLTEXT index on them.

EDIT3: here is the SQL to create the sample data:

DROP TABLE IF EXISTS `tablename`;
CREATE TABLE `tablename` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `col_a` text,
  `col_b` text,
  `col_c` text,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=utf8;
LOCK TABLES `tablename` WRITE;
INSERT INTO `tablename` VALUES (1,'apple orange plum','red green blue','bill dave sue'),(2,'orange plum banana','yellow red','frank james'),(3,'kiwi fruit apple','green pink','bill sarah-jane frank');
UNLOCK TABLES;


asked May 16 '19 10:05

Max Williams

2 Answers

The best solution is to not store the data in this structure at all and instead normalize your database. But if you can't normalize the database, and you can't upgrade to a MySQL version that supports CTEs (8.0 or later), you could create a simple stored procedure that splits the strings into words and stores them in a temporary table. For example, the stored procedure might look like:

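DELIMITER //
CREATE PROCEDURE split_string_to_table (str LONGTEXT)
BEGIN
  DECLARE val TEXT DEFAULT NULL;

  -- Scratch table that collects every word found in str
  DROP TEMPORARY TABLE IF EXISTS temp_values;
  CREATE TEMPORARY TABLE temp_values (
     `value` varchar(200)
  );

  iterator:
  LOOP
    -- Stop once nothing but spaces is left
    IF str IS NULL OR CHAR_LENGTH(TRIM(str)) = 0 THEN
      LEAVE iterator;
    END IF;
    -- Everything up to the first space is the next word
    SET val = SUBSTRING_INDEX(str, ' ', 1);
    INSERT INTO temp_values (`value`) VALUES (TRIM(val));
    -- Remove the word and the space after it. INSERT() counts characters,
    -- so use CHAR_LENGTH; LENGTH counts bytes and would cut too much
    -- from multi-byte utf8 text.
    SET str = INSERT(str, 1, CHAR_LENGTH(val) + 1, '');
  END LOOP;

  -- Runs of spaces produce empty strings, which the length filter discards
  SELECT DISTINCT `value` FROM temp_values WHERE CHAR_LENGTH(`value`) >= 5;
END //
DELIMITER ;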

After that, you can concatenate all of the column values into one string, store it in a user variable, and pass that variable to the stored procedure:

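-- GROUP_CONCAT output is truncated at group_concat_max_len (1024 by default),
-- so raise it for this session; pick a value large enough for your data
-- (the result is also capped by max_allowed_packet)
SET SESSION group_concat_max_len = 1000000;

SELECT CONCAT_WS(' ',
                 GROUP_CONCAT(col_a SEPARATOR ' '),
                 GROUP_CONCAT(col_b SEPARATOR ' '),
                 GROUP_CONCAT(col_c SEPARATOR ' ')
       ) INTO @text
FROM tablename;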

CALL split_string_to_table(@text);

Result:

value
----------
apple
orange
banana
fruit
green
yellow
frank
james
sarah-jane

You can see a demo of this implementation on DBFiddle.
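For readers who are able to run MySQL 8.0 or later, the CTE route mentioned at the top of this answer might look roughly like the following. This is only a sketch and not part of the original answer: it assumes the sample table tablename from the question, and the CTE names src and words are made up for illustration. It will not run on 5.6.

```sql
-- Sketch for MySQL 8.0+ only (recursive CTEs do not exist in 5.6).
-- The default cte_max_recursion_depth is 1000, so a single value with more
-- words than that needs the session variable raised first, e.g.:
--   SET SESSION cte_max_recursion_depth = 100000;
WITH RECURSIVE
src AS (
  -- One row per column value, with a trailing space as a terminator
  SELECT CONCAT(col_a, ' ') AS txt FROM tablename WHERE col_a IS NOT NULL
  UNION ALL
  SELECT CONCAT(col_b, ' ') FROM tablename WHERE col_b IS NOT NULL
  UNION ALL
  SELECT CONCAT(col_c, ' ') FROM tablename WHERE col_c IS NOT NULL
),
words AS (
  -- Anchor: peel the first word off each string
  SELECT SUBSTRING_INDEX(txt, ' ', 1)         AS word,
         SUBSTRING(txt, LOCATE(' ', txt) + 1) AS rest
  FROM src
  UNION ALL
  -- Recursive step: keep peeling until nothing is left
  SELECT SUBSTRING_INDEX(rest, ' ', 1),
         SUBSTRING(rest, LOCATE(' ', rest) + 1)
  FROM words
  WHERE rest <> ''
)
SELECT DISTINCT word
FROM words
WHERE CHAR_LENGTH(word) >= 5;
```

Each recursive step copies the remainder of the string, so this is likely to be slow on long TEXT values; treat it as a convenience for occasional use rather than something to run on every request.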

answered Sep 25 '22 01:09

Maksym Fedorov

From your performance requirements and comments, it appears that you need to run this query regularly. Unfortunately, your data just isn't stored at the right granularity to do this neatly or succinctly.

I would consider adding a summary table of sorts to assist with the final query. By maintaining the summary table as and when data in the main table changes, you should be able to keep things simpler.

A suggested format for this summary table would be:

summary_table - id, main_table_id, column_name, word

Here main_table_id is a foreign key to your main table's id column.

You could also place a composite unique index on (main_table_id, column_name, word).
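As a rough sketch, the summary table described above could be declared like this. The column sizes, the index and constraint names, and the extra index on word are assumptions for illustration, not something this answer specifies; adjust them to your data.

```sql
-- Sketch only: sizes and names below are assumptions.
CREATE TABLE summary_table (
  id            INT NOT NULL AUTO_INCREMENT,
  main_table_id INT NOT NULL,            -- same type as tablename.id
  column_name   VARCHAR(16) NOT NULL,    -- 'col_a', 'col_b' or 'col_c'
  word          VARCHAR(200) NOT NULL,
  PRIMARY KEY (id),
  -- One row per distinct word within a given row and column
  UNIQUE KEY uq_row_column_word (main_table_id, column_name, word),
  -- Helps SELECT DISTINCT word read from an index
  KEY idx_word (word),
  -- Summary rows disappear together with their source row
  CONSTRAINT fk_summary_main FOREIGN KEY (main_table_id)
    REFERENCES tablename (id) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
```

With a case-insensitive collation, which the question says is the database default, the unique index treats "Apple" and "apple" as the same word.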

Whenever a relevant column value in the main table is edited, you should adjust the summary table:

Remove the existing words for the main_table_id and column_name.
Insert a new list of unique words, of at least 5 characters, for the main_table_id and column_name.

This could be done either at the application level or using a trigger.
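If you go the trigger route, the two steps above might be sketched as follows. This assumes a summary_table with the columns listed earlier and a unique index on (main_table_id, column_name, word); the procedure name refresh_summary_words and the trigger names are hypothetical.

```sql
DELIMITER //

-- Hypothetical helper: rebuilds the summary rows for one row/column pair.
CREATE PROCEDURE refresh_summary_words (
  p_id INT, p_column VARCHAR(16), p_text LONGTEXT
)
BEGIN
  DECLARE v_word LONGTEXT;

  -- Step 1: remove the existing words for this row and column
  DELETE FROM summary_table
   WHERE main_table_id = p_id AND column_name = p_column;

  -- Step 2: peel off space-separated words and keep those of 5+ characters
  SET p_text = TRIM(IFNULL(p_text, ''));
  WHILE CHAR_LENGTH(p_text) > 0 DO
    SET v_word = SUBSTRING_INDEX(p_text, ' ', 1);
    IF CHAR_LENGTH(v_word) >= 5 THEN
      -- IGNORE skips words the unique index has already seen for this pair
      INSERT IGNORE INTO summary_table (main_table_id, column_name, word)
      VALUES (p_id, p_column, v_word);
    END IF;
    -- Drop the word just read plus any run of spaces after it
    SET p_text = TRIM(LEADING ' ' FROM SUBSTRING(p_text, CHAR_LENGTH(v_word) + 1));
  END WHILE;
END //

CREATE TRIGGER tablename_ai AFTER INSERT ON tablename
FOR EACH ROW
BEGIN
  CALL refresh_summary_words(NEW.id, 'col_a', NEW.col_a);
  CALL refresh_summary_words(NEW.id, 'col_b', NEW.col_b);
  CALL refresh_summary_words(NEW.id, 'col_c', NEW.col_c);
END //

CREATE TRIGGER tablename_au AFTER UPDATE ON tablename
FOR EACH ROW
BEGIN
  -- <=> is the NULL-safe comparison, so changes to or from NULL are caught
  IF NOT (NEW.col_a <=> OLD.col_a) THEN
    CALL refresh_summary_words(NEW.id, 'col_a', NEW.col_a);
  END IF;
  IF NOT (NEW.col_b <=> OLD.col_b) THEN
    CALL refresh_summary_words(NEW.id, 'col_b', NEW.col_b);
  END IF;
  IF NOT (NEW.col_c <=> OLD.col_c) THEN
    CALL refresh_summary_words(NEW.id, 'col_c', NEW.col_c);
  END IF;
END //

DELIMITER ;
```

Deleted rows are not covered by these triggers; an ON DELETE CASCADE foreign key on main_table_id or an AFTER DELETE trigger would be one way to handle them. Existing rows also need a one-off backfill, for example by calling the procedure for each row and column.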

This would make the final query much simpler:

SELECT DISTINCT word
  FROM summary_table

answered Sep 25 '22 01:09

Arth
