Problem:
Say there is a simple (yet big) table foods
id name
-- -----------
01 ginger beer
02 white wine
03 red wine
04 ginger wine
I'd like to count how many entries match specific hardcoded patterns, say contain the word 'ginger' (LIKE '%ginger%') or 'wine' (LIKE '%wine%'), or whatever else, and write these numbers into rows alongside comments. The result I'm looking for is the following:
comment total
--------------- -----
contains ginger 2
for wine lovers 3
Solution 1 (good format but inefficient):
It is possible to use UNION ALL and construct the following:
SELECT * FROM
(
    (
        SELECT
            'contains ginger' AS comment,
            sum((name LIKE '%ginger%')::INT) AS total
        FROM foods
    )
    UNION ALL
    (
        SELECT
            'for wine lovers' AS comment,
            sum((name LIKE '%wine%')::INT) AS total
        FROM foods
    )
) AS t
Apparently it works much like executing multiple separate queries and stitching the results together afterwards, which is very inefficient.
Solution 2 (efficient but bad format):
The following is several times faster than the previous solution:
SELECT
    sum((name LIKE '%ginger%')::INT) AS contains_ginger,
    sum((name LIKE '%wine%')::INT) AS for_wine_lovers
FROM foods
And the result is
contains_ginger for_wine_lovers
--------------- ---------------
              2               3
So it is definitely possible to get the same information much faster, but in the wrong format...
Discussion:
What is the best overall approach? What should I do to get the result I want efficiently and in the format I prefer? Or is it really impossible?
By the way, I am writing this for Redshift (based on PostgreSQL).
Thanks.
Both of the queries use the LIKE operator. Alternatively, we can use POSITION to find the location of the hardcoded words in the name; if a hardcoded word is present in the name, a number greater than 0 is returned.
SELECT
    unnest(array['ginger', 'wine']) AS comment,
    unnest(array[ginger, wine])     AS total
FROM (
    SELECT
        sum(contains_ginger) AS ginger,
        sum(contains_wine)   AS wine
    FROM (
        SELECT
            CASE WHEN position('ginger' in name) > 0 THEN 1 END AS contains_ginger,
            CASE WHEN position('wine'   in name) > 0 THEN 1 END AS contains_wine
        FROM foods
    ) t
) t1
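Note that unnest() is a PostgreSQL set-returning function and, as far as I know, is not available in plain Redshift SQL. A minimal sketch of the same idea without it, assuming a CTE is acceptable, is to compute the one-row aggregate once and pivot it back into rows with UNION ALL (the aliases are only illustrative):
-- Sketch: single scan of foods, then a cheap pivot of the one-row result.
WITH wide AS (
    SELECT
        sum(CASE WHEN position('ginger' in name) > 0 THEN 1 ELSE 0 END) AS ginger,
        sum(CASE WHEN position('wine'   in name) > 0 THEN 1 ELSE 0 END) AS wine
    FROM foods
)
SELECT 'contains ginger' AS comment, ginger AS total FROM wide
UNION ALL
SELECT 'for wine lovers' AS comment, wine   AS total FROM wide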
option 1: manually reshape
CREATE TEMPORARY TABLE wide AS (
    SELECT
        sum((name LIKE '%ginger%')::INT) AS contains_ginger,
        sum((name LIKE '%wine%')::INT) AS for_wine_lovers
        ...
    FROM foods
);

SELECT
    'contains ginger', contains_ginger FROM wide
UNION ALL
SELECT
    'for wine lovers', for_wine_lovers FROM wide
UNION ALL
...;
option 2: create a categories table & use a join
-- not sure if redshift supports values, hence I'm using the union all to build the table
WITH categories (category_label, food_part) AS (
    SELECT 'contains ginger', 'ginger'
    UNION ALL
    SELECT 'for wine lovers', 'wine'
    ...
)
SELECT
    categories.category_label,
    COUNT(foods.name)  -- count only actual matches, so an empty category reports 0
FROM categories
LEFT JOIN foods ON foods.name LIKE ('%' || categories.food_part || '%')
GROUP BY 1
Since you consider your solution 2 fast enough, option 1 should work for you.
Option 2 should also be fairly efficient, and it is much easier to write and extend; as an added bonus, it will tell you when no foods exist in a given category.
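If the category list is fairly stable, a variation on option 2 (a sketch only; the table and column names below are my own, not part of the original answer) is to persist the categories as a real table instead of building them in a CTE, so new categories can be added without touching the query:
-- Sketch: persist the categories instead of constructing them inline.
CREATE TABLE food_categories (
    category_label VARCHAR(64),
    food_part      VARCHAR(64)
);

-- Redshift supports multi-row inserts of this form.
INSERT INTO food_categories VALUES
    ('contains ginger', 'ginger'),
    ('for wine lovers', 'wine');

SELECT
    c.category_label,
    COUNT(f.name) AS total   -- matches only, so empty categories report 0
FROM food_categories c
LEFT JOIN foods f ON f.name LIKE ('%' || c.food_part || '%')
GROUP BY 1;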
Option 3: Reshape & redistribute your data to better match the grouping keys.
You could also pre-process your dataset if query execution time is very important. A lot of the benefit of this depends on your data volume and data distribution: do you only have a few hardcoded categories, or will they be searched dynamically from some sort of interface?
For example:
If the dataset were reshaped like this:
content  id
-------- ----
ginger   01
ginger   04
beer     01
white    02
wine     02
wine     04
wine     03
Then you could shard & distribute on content, and each instance could execute that part of the aggregation in parallel.
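As a rough sketch (the table and column names, and the word-based tokenization, are my own assumptions rather than anything from the original answer), the reshaped table could be declared with content as the Redshift distribution key and populated by splitting each name into words. regexp_split_to_table() below is stock PostgreSQL; Redshift does not provide it, so there the split step would need a different approach (e.g. SPLIT_PART joined against a numbers table):
-- Illustrative Redshift-style DDL: distribute the reshaped rows on "content".
CREATE TABLE reshaped_food_table (
    content VARCHAR(64),
    id      VARCHAR(16)
)
DISTKEY (content);

-- Illustrative load on stock PostgreSQL: one row per (word, food id).
-- regexp_split_to_table() is not available in Redshift.
INSERT INTO reshaped_food_table (content, id)
SELECT regexp_split_to_table(name, '\s+') AS content, id
FROM foods;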
Here an equivalent query might look like this:
WITH content_count AS (
    SELECT content, COUNT(*) AS total
    FROM reshaped_food_table
    GROUP BY 1
)
SELECT
    CASE content
        WHEN 'ginger' THEN 'contains ginger'
        WHEN 'wine'   THEN 'for wine lovers'
        ELSE 'other'
    END AS category,
    total
FROM content_count