Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

BINARY_CHECKSUM - different result depending on number of rows

I wonder why the BINARY_CHECKSUM function returns different result for the same:

SELECT *, BINARY_CHECKSUM(a,b) AS bc
FROM (VALUES(1, NULL, 100),
            (2, NULL, NULL),
            (3, 1, 2)) s(id,a,b);

SELECT *, BINARY_CHECKSUM(a,b) AS bc
FROM (VALUES(1, NULL, 100),
            (2, NULL, NULL)) s(id,a,b);

Ouput:

+-----+----+------+-------------+
| id  | a  |  b   |     bc      |
+-----+----+------+-------------+
|  1  |    | 100  |        -109 |
|  2  |    |      | -2147483640 |
|  3  | 1  |   2  |          18 |
+-----+----+------+-------------+

-- -109 vs 100
+-----+----+------+------------+
| id  | a  |  b   |     bc     |
+-----+----+------+------------+
|  1  |    | 100  |        100 |
|  2  |    |      | 2147483647 |
+-----+----+------+------------+

And for second sample I get what I would anticipate:

SELECT *, BINARY_CHECKSUM(a,b) AS bc
FROM (VALUES(1, 1, 100),
            (2, 3, 4),
            (3,1,1)) s(id,a,b);

SELECT *, BINARY_CHECKSUM(a,b) AS bc
FROM (VALUES(1, 1, 100),
            (2, 3, 4)) s(id,a,b);

Ouptut for both first two rows:

+-----+----+------+-----+
| id  | a  |  b   | bc  |
+-----+----+------+-----+
|  1  | 1  | 100  | 116 |
|  2  | 3  |   4  |  52 |
+-----+----+------+-----+

db<>fiddle demo


It has strange consequences when I want to compare two tables/queries:

WITH t AS (
  SELECT 1 AS id, NULL AS a, 100 b
  UNION ALL SELECT 2, NULL, NULL
  UNION ALL SELECT 3, 1, 2           -- comment this out
), s AS (
  SELECT 1 AS id ,100 AS a, NULL as b
  UNION ALL SELECT 2, NULL, NULL
  UNION ALL SELECT 3, 2, 1           -- comment this out
)
SELECT t.*,s.*
  ,BINARY_CHECKSUM(t.a, t.b) AS bc_t, BINARY_CHECKSUM(s.a, s.b) AS bc_s
FROM t
JOIN s
  ON s.id = t.id
WHERE BINARY_CHECKSUM(t.a, t.b) = BINARY_CHECKSUM(s.a, s.b);

db<>fiddle demo2

For 3 rows I get single result:

+-----+----+----+-----+----+----+--------------+-------------+
| id  | a  | b  | id  | a  | b  |    bc_t      |    bc_s     |
+-----+----+----+-----+----+----+--------------+-------------+
|  2  |    |    |  2  |    |    | -2147483640  | -2147483640 |
+-----+----+----+-----+----+----+--------------+-------------+

but for 2 rows I get also id = 1:

+-----+----+------+-----+------+----+-------------+------------+
| id  | a  |  b   | id  |  a   | b  |    bc_t     |    bc_s    |
+-----+----+------+-----+------+----+-------------+------------+
|  1  |    | 100  |  1  | 100  |    |        100  |        100 |
|  2  |    |      |  2  |      |    | 2147483647  | 2147483647 |
+-----+----+------+-----+------+----+-------------+------------+

Remarks:

  • I am not searching for alternatives like(HASH_BYTES/MD5/CHECKSUM)

  • I am aware that BINARY_CHECKSUM could lead to collisions(two different calls produce the same output) here scenario is a bit different

For this definition, we say that null values, of a specified type, compare as equal values. If at least one of the values in the expression list changes, the expression checksum can also change. However, this is not guaranteed. Therefore, to detect whether values have changed, we recommend use of BINARY_CHECKSUM only if your application can tolerate an occasional missed change.

It is strange for me that hash function returns different result for the same input arguments. Is this behaviour by design or it is some kind of glitch?

EDIT:

As @scsimon points out it works for materialized tables but not for cte. db<>fiddle actual table

Metadata for cte:

SELECT name, system_type_name
FROM sys.dm_exec_describe_first_result_set('
SELECT *
FROM (VALUES(1, NULL, 100),
            (2, NULL, NULL),
            (3, 1, 2)) s(id,a,b)', NULL,0);

SELECT name, system_type_name
FROM sys.dm_exec_describe_first_result_set('
SELECT *
FROM (VALUES(1, NULL, 100),
            (2, NULL, NULL)) s(id,a,b)', NULL,0)

-- working workaround
SELECT name, system_type_name
FROM sys.dm_exec_describe_first_result_set('
SELECT *
FROM (VALUES(1, cast(NULL as int), 100),
            (2, NULL, NULL)) s(id,a,b)', NULL,0)

For all cases all columns are INT but with explicit CAST it behaves as it should.

db<>fidde metadata

like image 749
Lukasz Szozda Avatar asked Dec 31 '18 15:12

Lukasz Szozda


2 Answers

This has nothing to do with the number of rows. It is because the values in one of the columns of the 2-row version are always NULL. The default type of NULL is int and the default type of a numeric constant (of this length) is int, so these should be comparable. But from a values() derived table, these are (apparently) not exactly the same type.

In particular, a column with only typeless NULLs from a derived table is not comparable, so it is excluded from the binary checksum calculation. This does not occur in a real table, because all columns have types.

The rest of the answer illustrates what is happening.

The code behaves as expected with type conversions:

SELECT *, BINARY_CHECKSUM(a, b) AS bc
FROM (VALUES(1, cast(NULL as int), 100),
            (2, NULL, NULL)
     ) s(id,a,b);

Here is a db<>fiddle.

Actually creating tables with the values suggests that columns with only NULL values have exactly the same type as columns with explicit numbers. That suggests that the original code should work. But an explicit cast also fixes the problem. Very strange.

This is really, really strange. Consider the following:

select v.*, checksum(a, b), checksum(c,b)
FROM (VALUES(1, NULL, 100, NULL),
            (2, 1, 2, 1.0)
     ) v(id, a, b, c);

The change in type for "d" affects the binary_checksum() for the second row, but not for the first.

This is my conclusion. When all the values in a column are binary, then binary_checksum() is aware of this and the column is in the category of "noncomparable data type". The checksum is then based on the remaining columns.

You can validate this by seeing the error when you run:

select v.*, binary_checksum(a)
FROM (VALUES(1, NULL, 100, NULL),
            (2, NULL,    2,   1.0)
     ) v(    id,a,    b,   c);

It complains:

Argument data type NULL is invalid for argument 1 of checksum function.

Ironically, this is not a problem if you save the results into a table and use binary_checksum(). The issue appears to be some interaction with values() and data types -- but something that is not immediately obvious in the information_schema.columns table.

The happyish news is that the code should work on tables, even if it does not work on values() generated derived tables -- as this SQL Fiddle demonstrates.

I also learned that a column filled with NULLs really is typeless. The assignment of the int data type in a select into seems to happen when the table is being defined. The "typeless" type is converted to an int.

like image 191
Gordon Linoff Avatar answered Nov 19 '22 18:11

Gordon Linoff


For the literal NULL without the CAST (and without any typed values in the column) it entirely ignores it and just gives you the same result as BINARY_CHECKSUM(b).

This seems to happen very early on. The initial tree representation output from

SELECT *, BINARY_CHECKSUM(a,b) AS bc
FROM (VALUES(1, NULL, 100),
            (2, NULL, NULL)) s(id,a,b)
OPTION (RECOMPILE, QUERYTRACEON 8605, QUERYTRACEON 3604);

Already shows that it has decided to just use one column as input to the function

ScaOp_Intrinsic binary_checksum

    ScaOp_Identifier COL: Union1008 

This compares with the following output for your first query

ScaOp_Intrinsic binary_checksum

    ScaOp_Identifier COL: Union1011 

    ScaOp_Identifier COL: Union1010 

If you try and get the BINARY_CHECKSUM with

SELECT *, BINARY_CHECKSUM(a) AS bc
FROM (VALUES(1, NULL, 100)) s(id,a,b)

It gives the error

Msg 8184, Level 16, State 1, Line 8 Error in binarychecksum. There are no comparable columns in the binarychecksum input.

This is not the only place where an untyped NULL constant is treated differently from an explicitly typed one.

Another case is

SELECT COALESCE(CAST(NULL AS INT),CAST(NULL AS INT))

vs

SELECT COALESCE(NULL,NULL)

I'd err on the side of "glitch" in this case rather than "by design" though as the columns from the derived table are supposed to be int before they get to the checksum function.

SELECT COALESCE(a,b)
FROM (VALUES(NULL, NULL)) s(a,b)

Does work as expected without this glitch.

like image 37
Martin Smith Avatar answered Nov 19 '22 16:11

Martin Smith