List or concatenate in SQL window function

Tags: sql, sql-server, tsql, count, window-functions

A member of my team who is relatively new to SQL was writing a query that happened to use a window function. While reviewing it, I noticed that they had structured the window function like this:

COUNT(*) OVER(PARTITION BY Part1+Part2) AS A

I immediately left a feedback note saying it should be written like this:

COUNT(*) OVER(PARTITION BY Part1, Part2) AS A

Both Part1 and Part2 are nvarchars.

Then I paused to reflect, and I couldn't actually work out why the first form would be wrong. As far as I can see, both forms should produce identical results on our data (I checked, and they do). The actual execution plans are nearly identical, aside from an extra Compute Scalar step after the initial table scan in the first query (0% of the query cost). The I/O statistics show that the first version has 5 fewer logical reads (12,665 versus 12,670).

So is there any benefit or detriment to using either form, aside from coding conventions? Is it the case that this works fine in this instance but could produce inconsistent results in other circumstances?


asked Sep 30 '19 13:09

CobaltZorch

2 Answers

Both expressions are valid, but they do not do the same thing.

Consider the following data:

Part1    Part2
AB       C
A        BC

With PARTITION BY Part1+Part2, both rows concatenate to 'ABC' and therefore fall into the same partition, whereas with PARTITION BY Part1, Part2 they belong to different partitions.
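To see the difference directly, here is a minimal T-SQL sketch that loads the two example rows into a table variable and computes both counts side by side. The table variable @Sample, the nvarchar(10) length, and the aliases ConcatCount and ListCount are made up for illustration.

```sql
-- Hypothetical table variable holding the two example rows.
DECLARE @Sample TABLE (Part1 nvarchar(10), Part2 nvarchar(10));

INSERT INTO @Sample (Part1, Part2)
VALUES (N'AB', N'C'),
       (N'A',  N'BC');

SELECT Part1,
       Part2,
       -- Both rows concatenate to N'ABC', so they share one partition.
       COUNT(*) OVER (PARTITION BY Part1 + Part2) AS ConcatCount,
       -- The (Part1, Part2) pairs differ, so each row is its own partition.
       COUNT(*) OVER (PARTITION BY Part1, Part2)  AS ListCount
FROM @Sample;
```

On these two rows, ConcatCount comes out as 2 for both rows, while ListCount is 1 for both.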

So the question actually comes down to: what are the correct partitioning criteria for your use case? Usually, unless you are doing something fancy, you want PARTITION BY Part1, Part2. But this has to be answered from a functional perspective, based on your real use case.


answered Oct 19 '22 00:10

GMB

The PARTITION BY argument is exactly that: an expression. You can put almost any expression there, and its value is used to partition the rows.

In terms of inconsistent results, you will run into a problem if you have this case:

Part1    Part2    Part1 + Part2
'yummy'  'sushi'  'yummysushi'
'yumm'   'ysushi' 'yummysushi'

Both rows would be considered to be part of the same partition, even though the columns have different values.
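If you want to check whether your own data contains such collisions, something along these lines might help. It is only a sketch: @Sample stands in for the real table, the third row is invented filler, and the Combined and DistinctPairs aliases are illustrative.

```sql
-- Hypothetical table variable; replace with the real table in practice.
DECLARE @Sample TABLE (Part1 nvarchar(20), Part2 nvarchar(20));

INSERT INTO @Sample (Part1, Part2)
VALUES (N'yummy', N'sushi'),
       (N'yumm',  N'ysushi'),
       (N'plain', N'rice');   -- filler row that collides with nothing

-- List every concatenated value that is produced by more than one
-- distinct (Part1, Part2) pair. Any row returned here is a case where
-- the two PARTITION BY forms would disagree.
SELECT Part1 + Part2 AS Combined,
       COUNT(*)      AS DistinctPairs
FROM (SELECT DISTINCT Part1, Part2 FROM @Sample) AS Pairs
GROUP BY Part1 + Part2
HAVING COUNT(*) > 1;
```

An empty result would explain why both forms happened to agree on a given data set. Note that with the default CONCAT_NULL_YIELDS_NULL setting, Part1 + Part2 is NULL whenever either column is NULL, so distinct pairs containing NULLs would also be reported together under a NULL Combined value.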

In terms of performance, my only guess is that if you have an index or a partitioning scheme set up on those particular columns, the (Part1, Part2) form might be able to benefit from it, whereas the concatenated expression is less likely to.
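As a rough illustration of the index point: an index keyed on (Part1, Part2) returns rows already ordered by the partitioning columns, which may let the optimizer avoid a Sort for the column-list form. This is an assumption about how the plans might differ, not something measured here. The temp table #Parts and the index name IX_Parts_Part1_Part2 are hypothetical.

```sql
-- Hypothetical temp table standing in for the real one.
CREATE TABLE #Parts (
    Part1 nvarchar(20) NOT NULL,
    Part2 nvarchar(20) NOT NULL
);

INSERT INTO #Parts (Part1, Part2)
VALUES (N'yummy', N'sushi'),
       (N'yumm',  N'ysushi'),
       (N'yummy', N'sushi');

-- Index keyed on the partitioning columns, in the same order.
CREATE NONCLUSTERED INDEX IX_Parts_Part1_Part2
    ON #Parts (Part1, Part2);

-- Column-list form: the index order matches the partitioning columns.
SELECT Part1, Part2,
       COUNT(*) OVER (PARTITION BY Part1, Part2) AS A
FROM #Parts;

-- Concatenated form: the partition key is a computed value, so the
-- index order on (Part1, Part2) does not directly match it.
SELECT Part1, Part2,
       COUNT(*) OVER (PARTITION BY Part1 + Part2) AS A
FROM #Parts;

DROP TABLE #Parts;
```

On a table this small the optimizer may well pick the same plan for both queries, so comparing the actual execution plans on realistically sized data is the only way to know whether the index makes a difference.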

Your best bet is the second form you specified: PARTITION BY Part1, Part2.

answered Oct 18 '22 23:10

ravioli
