Why are "first" or "any" aggregate functions not commonly used in database engines?

Tags:

sql

database

Some database engines, like Microsoft Access, support FIRST() as an aggregate function, and I was using it in cases where I know the column will only have one value in the group.

Potentially, the database engine can optimise this: as soon as it reaches any value, it can mark the result as already calculated. So it is surprising that this is not supported in, for example, Oracle or SQL Server and, more importantly, not in the SQL standard.

In practice, people use MIN() or MAX() instead, but both require the following:

1. The underlying data type has a natural ordering, and that ordering matters to the user;
2. The database engine has to compare the intermediate value with the value in each row.

So this is not optimal in many cases.

Are there any specific reasons why people don't want to allow SELECT ANY(FIELD) ...? (I can think of two variants: ANY() gives any non-null value of the column in the result set; FIRST() gives the column value of the first row in the result set, or null if there are no rows.)

asked Oct 18 '17 05:10

Earth Engine

2 Answers

Regarding first/last

The syntax supported by Microsoft Access SQL doesn't make sense in standard SQL:

SELECT  
       First(LastName) as First,
       Last(LastName) as Last
  FROM Employees

(source)

In standard SQL, grouping takes place before sorting. Normally, groups are not sorted. That means it is undefined which row comes first or last. Standard SQL generally aims to avoid constructs that have nondeterministic behaviour (exceptions exist).

Standard SQL offers so-called ordered-set functions that accept a WITHIN GROUP (ORDER BY ...) clause to establish an order in a group prior to aggregation:

SELECT
        PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY val)
  FROM ...

The argument of percentile_disc ranges from 0 to 1, where 0 is the first result and 1 the last. 0.5 is the median (this is the common use case for percentile_disc).

However, standard SQL does not offer first/last as ordered-set functions, but percentile_disc with an argument of 0 is basically first, while an argument of 1 basically gives you the last result.
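To make the 0 / 0.5 / 1 behaviour concrete, here is a minimal Python sketch of how I read the percentile_disc definition: sort the group, then return the first value whose cumulative distribution (position divided by row count) reaches the requested fraction. The function and the sample names are illustrative assumptions, not code from any DBMS.

```python
def percentile_disc(values, fraction):
    """Return the first value, in sort order, whose cumulative
    distribution (position / row count) is at least `fraction`."""
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    # Like other aggregates, ignore NULLs (None) before ordering the group.
    ordered = sorted(v for v in values if v is not None)
    if not ordered:
        return None  # an empty group yields NULL
    n = len(ordered)
    for position, value in enumerate(ordered, start=1):
        if position / n >= fraction:
            return value


names = ["Miller", "Adams", "Young", "Baker"]  # hypothetical sample data
first = percentile_disc(names, 0)     # behaves like first
median = percentile_disc(names, 0.5)  # discrete median
last = percentile_disc(names, 1)      # behaves like last
```

With an argument of 0 the very first row already satisfies the condition, and with 1 only the final row does, which is why the two extremes act like first and last.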

The more common SQL way to obtain the first/last value is to use a top-n query:

SELECT LastName
  FROM Employees
 ORDER BY ...
 FETCH FIRST 1 ROW ONLY

Fetching the first and last values in one go is a little awkward.
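A small runnable sketch of the top-n approach, using Python's built-in sqlite3 module. Two assumptions to note: SQLite does not accept FETCH FIRST 1 ROW ONLY and spells it LIMIT 1 instead, and the HireDate column and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table layout; only LastName appears in the query above.
conn.execute("CREATE TABLE Employees (LastName TEXT, HireDate TEXT)")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?)",
    [("Miller", "2015-03-01"), ("Adams", "2012-07-15"), ("Young", "2019-11-30")],
)

# Top-n query: an explicit ORDER BY makes "first" well-defined.
first_hired = conn.execute(
    "SELECT LastName FROM Employees ORDER BY HireDate LIMIT 1"
).fetchone()[0]

# First and last in one go: one scalar subquery per end of the ordering.
first_and_last = conn.execute(
    """
    SELECT (SELECT LastName FROM Employees ORDER BY HireDate ASC  LIMIT 1),
           (SELECT LastName FROM Employees ORDER BY HireDate DESC LIMIT 1)
    """
).fetchone()
```

The second query shows the awkwardness: each end of the ordering needs its own subquery with its own ORDER BY.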

Other than that, standard SQL also offers the window functions first_value and last_value to pick those values out of a partition without grouping.
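As a rough illustration of the window-function route, the following sketch runs first_value and last_value through Python's sqlite3 module (this needs SQLite 3.25 or newer). The Dept and HireDate columns and the rows are assumptions for the example. The explicit frame matters: with the default frame, last_value only sees rows up to the current one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table layout and data, for illustration only.
conn.execute("CREATE TABLE Employees (Dept TEXT, LastName TEXT, HireDate TEXT)")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [
        ("Sales", "Miller", "2015-03-01"),
        ("Sales", "Adams", "2012-07-15"),
        ("IT", "Young", "2019-11-30"),
        ("IT", "Baker", "2018-01-10"),
    ],
)

# Window functions keep every row, so DISTINCT collapses the repeats
# into one row per department.
rows = conn.execute(
    """
    SELECT DISTINCT
           Dept,
           first_value(LastName) OVER (
               PARTITION BY Dept ORDER BY HireDate
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS first_hired,
           last_value(LastName) OVER (
               PARTITION BY Dept ORDER BY HireDate
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS last_hired
      FROM Employees
     ORDER BY Dept
    """
).fetchall()
```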

Regarding any

Standard SQL has an aggregate function any but for a different use case. Again, what you (MS Access SQL) suggest for any gives you a non-deterministic result, which is not what standard SQL encourages.

The standard SQL function any returns a boolean that is true if any of the conditions is true. It is best used in having clauses:

SELECT
       *
  FROM ..
 GROUP BY ...
HAVING ANY(<condition>)

This removes all groups where no <condition> evaluates to true.
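Not every DBMS implements the ANY aggregate. SQLite, for instance, does not, so the sketch below emulates it with MAX over a comparison, which SQLite evaluates to 0 or 1. This is an emulation under that assumption, not the standard syntax, and the table, columns, and salary threshold are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table layout and data, for illustration only.
conn.execute("CREATE TABLE Employees (Dept TEXT, LastName TEXT, Salary INTEGER)")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [
        ("Sales", "Miller", 4000),
        ("Sales", "Adams", 7000),
        ("IT", "Young", 3000),
        ("IT", "Baker", 3500),
    ],
)

# Keep only groups in which at least one row satisfies the condition.
# MAX(Salary > 5000) = 1 stands in for HAVING ANY(Salary > 5000).
groups_with_high_earner = conn.execute(
    """
    SELECT Dept, COUNT(*) AS num_employees
      FROM Employees
     GROUP BY Dept
    HAVING MAX(Salary > 5000) = 1
     ORDER BY Dept
    """
).fetchall()
```

Only the Sales group survives, because it is the only one with a row above the threshold.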

References:

Slides regarding WITHIN GROUP: https://www.slideshare.net/MarkusWinand/modern-sql/105
Blog post on the every function (which is similar to any): https://blog.jooq.org/2014/12/18/a-true-sql-gem-you-didnt-know-yet-the-every-aggregate-function/

answered Oct 24 '22 20:10

Markus Winand

FIRST (or better ANY_VALUE as MySQL calls the function) is not in the SQL standard. This is probably because it is hardly needed in a standard-compliant DBMS.

You say you use FIRST "in cases I know the column will only have one value in the group". In a well-built database such a case should hardly ever occur. Maybe you are using it because MS Access (and several other DBMSs) violates the standard when it comes to aggregation with a GROUP BY clause. An example:

select department_id, d.department_name, count(*) as num_employees
from employees e
join departments d using (department_id)
group by d.department_id;

You may want to use FIRST(d.department_name) here, because you know that there will be just one department_name per department_id, of course. But so does the DBMS (or rather: it should!). In standard SQL the above query is valid, because department_name is functionally dependent on department_id. Hence there is no need for a FIRST or ANY_VALUE function.
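Here is the same query run through Python's sqlite3 module, with made-up tables and rows. One caveat: SQLite accepts bare columns in any grouped query and does not verify the functional dependency itself, so this only demonstrates that the result is well-defined when department_id is the primary key of departments. It does not demonstrate standard-conforming checking.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: department_id is the primary key, so each id
# determines exactly one department_name.
conn.execute(
    "CREATE TABLE departments (department_id INTEGER PRIMARY KEY, department_name TEXT)"
)
conn.execute("CREATE TABLE employees (employee_name TEXT, department_id INTEGER)")
conn.executemany("INSERT INTO departments VALUES (?, ?)", [(1, "Sales"), (2, "IT")])
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Miller", 1), ("Adams", 1), ("Young", 2)],
)

# department_name is selected without FIRST, MIN, or ANY_VALUE.
rows = conn.execute(
    """
    select department_id, d.department_name, count(*) as num_employees
      from employees e
      join departments d using (department_id)
     group by d.department_id
     order by d.department_id
    """
).fetchall()
```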

MySQL introduced ANY_VALUE mainly in order to deal with cases where the DBMS fails to detect the functional dependency, but again these cases should be extremely rare. I like the function, because it gives you the opportunity to say "I don't care which", e.g. give me the departments and one leader per department (i.e. in case there are two department leaders: one of them arbitrarily chosen). But well, I guess the standard committee decided that MIN or MAX would serve the purpose in such rare cases, too.
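The "one leader per department" case with MIN standing in for "I don't care which" could look like the sketch below, again via Python's sqlite3 module. The department_leaders table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: a department may have more than one leader.
conn.execute("CREATE TABLE department_leaders (department_name TEXT, leader_name TEXT)")
conn.executemany(
    "INSERT INTO department_leaders VALUES (?, ?)",
    [("IT", "Baker"), ("Sales", "Miller"), ("Sales", "Adams")],
)

# MIN picks one leader per department. The choice is deterministic
# (alphabetically first) even though the caller does not care which.
one_leader_per_department = conn.execute(
    """
    SELECT department_name, MIN(leader_name) AS some_leader
      FROM department_leaders
     GROUP BY department_name
     ORDER BY department_name
    """
).fetchall()
```

The cost is the one raised in the question: the engine has to compare values to find the minimum, even though any of them would do.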

answered Oct 24 '22 21:10

Thorsten Kettner
