Problem:
Say there is a simple (yet big) table foods
id name
-- -----------
01 ginger beer
02 white wine
03 red wine
04 ginger wine
I'd like to count how many entries match specific hardcoded patterns, say contain the word 'ginger' (LIKE '%ginger%') or 'wine' (LIKE '%wine%'), or whatever else, and write these numbers into rows alongside comments. The result I'm looking for is the following:
comment total
--------------- -----
contains ginger 2
for wine lovers 3
Solution 1 (good format but inefficient):
It is possible to use UNION ALL and construct the following:
SELECT * FROM
(
    (
        SELECT
            'contains ginger' AS comment,
            sum((name LIKE '%ginger%')::INT) AS total
        FROM foods
    )
    UNION ALL
    (
        SELECT
            'for wine lovers' AS comment,
            sum((name LIKE '%wine%')::INT) AS total
        FROM foods
    )
) AS t
Apparently it works much like executing multiple separate queries and stitching the results together afterwards, which is very inefficient.
Solution 2 (efficient but bad format):
The following is several times faster than the previous solution:
SELECT
    sum((name LIKE '%ginger%')::INT) AS contains_ginger,
    sum((name LIKE '%wine%')::INT) AS for_wine_lovers
FROM foods
And the result is
contains_ginger for_wine_lovers
--------------- ---------------
              2               3
So it is definitely possible to get the same information much faster, but in the wrong format...
Discussion:
What is the best overall approach? What should I do to get the result I want efficiently and in the format I prefer? Or is it really impossible?
By the way, I am writing this for Redshift (based on PostgreSQL).
Thanks.
Both of the queries use the LIKE operator. Alternatively, we can use POSITION to find the location of the hardcoded words in the name; if a hardcoded word is present in the name, a number greater than 0 is returned.
SELECT
    unnest(array['ginger', 'wine']) AS comment,
    unnest(array[ginger, wine])     AS total
FROM (
    SELECT
        sum(contains_ginger) AS ginger,
        sum(contains_wine)   AS wine
    FROM (
        SELECT
            CASE WHEN position('ginger' in name) > 0 THEN 1 END AS contains_ginger,
            CASE WHEN position('wine'   in name) > 0 THEN 1 END AS contains_wine
        FROM foods
    ) t
) t1
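Note that unnest() is a PostgreSQL set-returning function and, as far as I know, is not available in plain Redshift SQL. A minimal sketch of the same idea without it, assuming a CTE is acceptable, is to compute the one-row aggregate once and pivot it back into rows with UNION ALL (the aliases are only illustrative):
-- Sketch: single scan of foods, then a cheap pivot of the one-row result.
WITH wide AS (
    SELECT
        sum(CASE WHEN position('ginger' in name) > 0 THEN 1 ELSE 0 END) AS ginger,
        sum(CASE WHEN position('wine'   in name) > 0 THEN 1 ELSE 0 END) AS wine
    FROM foods
)
SELECT 'contains ginger' AS comment, ginger AS total FROM wide
UNION ALL
SELECT 'for wine lovers' AS comment, wine   AS total FROM wide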
option 1: manually reshape
CREATE TEMPORARY TABLE wide AS (
    SELECT
        sum((name LIKE '%ginger%')::INT) AS contains_ginger,
        sum((name LIKE '%wine%')::INT) AS for_wine_lovers
        ...
    FROM foods
);

SELECT
    'contains ginger', contains_ginger FROM wide
UNION ALL
SELECT
    'for wine lovers', for_wine_lovers FROM wide
UNION ALL
...;
option 2: create a categories table & use a join
-- not sure if redshift supports values, hence I'm using the union all to build the table
WITH categories (category_label, food_part) AS (
    SELECT 'contains ginger', 'ginger'
    UNION ALL
    SELECT 'for wine lovers', 'wine'
    ...
)
SELECT
    categories.category_label,
    COUNT(foods.name)  -- count only actual matches, so an empty category reports 0
FROM categories
LEFT JOIN foods ON foods.name LIKE ('%' || categories.food_part || '%')
GROUP BY 1
Since you consider your solution 2 fast enough, option 1 should work for you.
Option 2 should also be fairly efficient, and it is much easier to write and extend; as an added bonus, it will tell you when no foods exist in a given category.
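If the category list is fairly stable, a variation on option 2 (a sketch only; the table and column names below are my own, not part of the original answer) is to persist the categories as a real table instead of building them in a CTE, so new categories can be added without touching the query:
-- Sketch: persist the categories instead of constructing them inline.
CREATE TABLE food_categories (
    category_label VARCHAR(64),
    food_part      VARCHAR(64)
);

-- Redshift supports multi-row inserts of this form.
INSERT INTO food_categories VALUES
    ('contains ginger', 'ginger'),
    ('for wine lovers', 'wine');

SELECT
    c.category_label,
    COUNT(f.name) AS total   -- matches only, so empty categories report 0
FROM food_categories c
LEFT JOIN foods f ON f.name LIKE ('%' || c.food_part || '%')
GROUP BY 1;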
Option 3: Reshape & redistribute your data to better match the grouping keys.
You could also pre-process your dataset if query execution time is very important. A lot of the benefit of this depends on your data volume and data distribution: do you only have a few hardcoded categories, or will they be searched dynamically from some sort of interface?
For example:
If the dataset were reshaped like this:
content  id
-------- ----
ginger   01
ginger   04
beer     01
white    02
wine     02
wine     04
wine     03
Then you could shard & distribute on content, and each instance could execute that part of the aggregation in parallel.
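As a rough sketch (the table and column names, and the word-based tokenization, are my own assumptions rather than anything from the original answer), the reshaped table could be declared with content as the Redshift distribution key and populated by splitting each name into words. regexp_split_to_table() below is stock PostgreSQL; Redshift does not provide it, so there the split step would need a different approach (e.g. SPLIT_PART joined against a numbers table):
-- Illustrative Redshift-style DDL: distribute the reshaped rows on "content".
CREATE TABLE reshaped_food_table (
    content VARCHAR(64),
    id      VARCHAR(16)
)
DISTKEY (content);

-- Illustrative load on stock PostgreSQL: one row per (word, food id).
-- regexp_split_to_table() is not available in Redshift.
INSERT INTO reshaped_food_table (content, id)
SELECT regexp_split_to_table(name, '\s+') AS content, id
FROM foods;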
Here an equivalent query might look like this:
WITH content_count AS (
    SELECT content, COUNT(*) AS total
    FROM reshaped_food_table
    GROUP BY 1
)
SELECT
    CASE content
        WHEN 'ginger' THEN 'contains ginger'
        WHEN 'wine'   THEN 'for wine lovers'
        ELSE 'other'
    END AS category,
    total
FROM content_count