Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Should an ORDER BY ever affect the number of rows returned a SELECT query?

Tags:

sql

ingres

I am going through the pain of normalising a horrible legacy database, and have discovered what I think is a bug with the DBMS.

This query returns the results as I expect:

SELECT DISTINCT RIGHT(SQUEEZE(thing_id), 2) AS thing_id, TRIM(thing_name)
FROM thing
ORDER BY thing_id, thing_name;
(16 rows)

The first time I ran the query, I had inadvertently used the wrong columns in the ORDER BY, as below:

SELECT DISTINCT RIGHT(SQUEEZE(thing_id), 2) AS thing_id, TRIM(thing_name)
FROM thing
ORDER BY thing_name, location;
(33 rows)

Note that the only thing to change is the ORDER BY, and the number of rows returned increases from 16 to 33. The results it gives are not DISTINCT as the query specifies.

I believe this is an outright bug, but a coworker says that it is normal because when we order by "location", it is selected an included invisibly in the results.

Should the ORDER BY ever affect the number of rows returned in a SELECT query?

EDIT: I had another person look at the queries AND I copied the queries to two seperate files, then ran a diff command on them. It is 100% certain that the only difference between the two queries is the columns listed in the ORDER BY.

UPDATE: Ingres have since release patch 14301 with bugfix: "Bug 126640 (GENERIC) Query with order-by expression and distinct aggregate returns more rows than expected. The columns in the order-by expression are not in the select list."

i.e. The query in question will now result in an error as the results are not correct.

like image 799
rusty_turkey Avatar asked May 09 '12 22:05

rusty_turkey


2 Answers

The issue I see is that the second query has a column (location) in the ORDER BY that is not included in the SELECT DISTINCT list. Actually both of the queries are invalid SQL (despite that Ingres seems to allow them). I simplified them (so the first is ok) to:

Query one (valid SQL):

SELECT DISTINCT 
      thing_id 
    , thing_name
FROM thing
ORDER BY thing_id
       , thing_name ;

Query two (invalid SQL, should produce error):

SELECT DISTINCT 
      thing_id 
    , thing_name
FROM thing
ORDER BY thing_name
       , location;

Why should it give error? Because the ORDER BY should be processed after SELECTand DISTINCT. So, two or more rows in the original table may have same thing_id and thing_name but different location. These rows will be collapsed into one. So, there is no location value to be useed for the ordering. Even if there was (a hidden location valiue) kept, which one of the many should it be?

SELECT DISTINCT queries can be rewritten with SELECT ALL and GROUP BY (also invalid in this case):

SELECT ALL
      thing_id 
    , thing_name
FROM thing
GROUP BY thing_id 
       , thing_name
ORDER BY thing_name
       , location;

The above (query 2) actually does produce error in PostgreSQL, SQL-Server and Oracle. Test it in SQL-Fiddle


From the erroneous number of rows returned by the second query in Ingres, I guess that what happens behind the scenes is that he location is secretly kept in the SELECT list so it can be used for the ORDER BY and then removed. This, in combination with the DISTINCT results in the non-standard buggy behaviour:

SELECT DISTINCT 
      thing_id 
    , thing_name
   (, location         --- hidden column) 
FROM thing
ORDER BY thing_name
       , location;

You can call it either a bug or a feature, it doesn't matter, as long as you know it should really not be allowed in the first place.

It seems that a similar issue has been reported a year ago in Actian forum: Problem with DISTINCT + ORDER BY and supposedly fixed. No idea which version they are refering to or if it has been actually fixed or not (and what "fix" means).


If you want the query to be valid and behave as you expect, you could use somethign like this:

SELECT
      RIGHT(SQUEEZE(thing_id), 2)  AS squeezed_thing_id
    , TRIM(thing_name)             AS trimmed_thing_name 
    , MIN(location)                AS a_location            --- or MAX()
FROM 
    thing
GROUP BY 
      RIGHT(SQUEEZE(thing_id), 2)
    , TRIM(thing_name)     
ORDER BY 
      trimmed_thing_name
    , a_location ;                
like image 122
ypercubeᵀᴹ Avatar answered Nov 04 '22 15:11

ypercubeᵀᴹ


Since the order of operations in SQL is:

FROM > WHERE > GROUP BY > HAVING > SELECT > ORDER BY

this does seem to be a bug indeed. Which DBMS is it?

like image 22
MarioDS Avatar answered Nov 04 '22 14:11

MarioDS