Rounding issue in LOG and EXP functions

I am trying to perform cumulative multiplication, and I have tried two methods to do it.

Sample data:

DECLARE @TEST TABLE
  (
     PAR_COLUMN INT,
     PERIOD     INT,
     VALUE      NUMERIC(22, 6)
  ) 
INSERT INTO @TEST VALUES 
(1,601,10 ),
(1,602,20 ),
(1,603,30 ),
(1,604,40 ),
(1,605,50 ),
(1,606,60 ),
(2,601,100),
(2,602,200),
(2,603,300),
(2,604,400),
(2,605,500),
(2,606,600)

Note: The data in the VALUE column will never be integers; the values will always have a decimal part. I have kept the example values as integers only to make the approximation problem easy to see.


Method 1: EXP + LOG + SUM() Over(Order by)

In this method I am using the EXP + LOG + SUM() OVER (ORDER BY) technique to compute the cumulative multiplication. The values it produces are not accurate; there is a rounding and approximation issue in the result.

SELECT *,
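       -- NULLIF turns 0 into NULL so LOG(0) is never evaluated; ABS guards against
       -- LOG of negative values (signs and zeros would need separate handling)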
       Exp(Sum(Log(Abs(NULLIF(VALUE, 0))))
             OVER(
               PARTITION BY PAR_COLUMN
               ORDER BY PERIOD)) AS CUM_MUL
FROM   @TEST;

Result:

PAR_COLUMN  PERIOD  VALUE       CUM_MUL
----------  ------  ---------   ----------------
1           601     10.000000   10
1           602     20.000000   200              -- 10 * 20 = 200 (correct)
1           603     30.000000   6000.00000000001 -- 200 * 30 should be 6000.000000000, not 6000.00000000001 (incorrect)
1           604     40.000000   240000
1           605     50.000000   12000000
1           606     60.000000   720000000.000001  -- 12000000 * 60 should be 720000000.000000, not 720000000.000001 (incorrect)
2           601     100.000000  100
2           602     200.000000  20000
2           603     300.000000  5999999.99999999 -- 20000.000000 * 300.000000 should be 6000000.000000, not 5999999.99999999 (incorrect)
2           604     400.000000  2399999999.99999  
2           605     500.000000  1199999999999.99
2           606     600.000000  719999999999998

Method 2: Traditional Multiplication (Recursive CTE)

This method works perfectly without any rounding or approximation problem.

;WITH CTE
     AS (SELECT TOP 1 WITH TIES PAR_COLUMN,
                                PERIOD,
                                VALUE,
                                CUM_MUL = VALUE
         FROM   @TEST
         ORDER  BY PERIOD
         UNION ALL
         SELECT T.PAR_COLUMN,
                T.PERIOD,
                T.VALUE,
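                -- the explicit CAST keeps this column's type identical to the
                -- anchor's NUMERIC(22, 6); recursive CTEs require matching types
                -- between the anchor and the recursive member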
                Cast(T.VALUE * C.CUM_MUL AS NUMERIC(22, 6))
         FROM   CTE C
                INNER JOIN @TEST T
                        ON C.PAR_COLUMN = T.PAR_COLUMN
                           AND T.PERIOD = C.PERIOD + 1)
SELECT *
FROM   CTE 
ORDER BY PAR_COLUMN,PERIOD

Result

PAR_COLUMN  PERIOD  VALUE       CUM_MUL
----------  ------  ---------   ----------------
1           601     10.000000   10.000000
1           602     20.000000   200.000000
1           603     30.000000   6000.000000
1           604     40.000000   240000.000000
1           605     50.000000   12000000.000000
1           606     60.000000   720000000.000000
2           601     100.000000  100.000000
2           602     200.000000  20000.000000
2           603     300.000000  6000000.000000
2           604     400.000000  2400000000.000000
2           605     500.000000  1200000000000.000000
2           606     600.000000  720000000000000.000000

Can anyone tell me why the values in method 1 are not accurate, and how to fix it? I tried changing the data type to FLOAT and increasing the scale of the NUMERIC, but it did not help.

I really want to use method 1, which is much faster than method 2.

Edit: I now understand the reason for the approximation. Can anyone find a fix for this problem?

Asked by Pரதீப் on Nov 26 '15

3 Answers

In pure T-SQL, LOG and EXP operate on the float type (8 bytes), which carries only 15-17 significant digits. Even that 15th digit can become inaccurate if you sum large enough values. Your data is numeric(22, 6), so 15 significant digits is not enough.
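
You can reproduce the drift without any table; a minimal sketch of the float round trip (the exact trailing digits may vary):

SELECT EXP(LOG(10e0) + LOG(20e0) + LOG(30e0)) AS RoundTrip;
-- approximately 6000.00000000001 rather than exactly 6000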

POWER can return a numeric type with potentially higher precision, but it is of little use to us, because both LOG and LOG10 can return only float anyway.
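
You can verify POWER's return type with SQL_VARIANT_PROPERTY; a quick sketch:

SELECT SQL_VARIANT_PROPERTY(POWER(CAST(10 AS numeric(15, 0)), 2), 'BaseType') AS PowerType;
-- returns 'numeric': POWER keeps the type of its first argument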

To demonstrate the problem I'll change the type in your example to numeric(15,0) and use POWER instead of EXP:

DECLARE @TEST TABLE
  (
     PAR_COLUMN INT,
     PERIOD     INT,
     VALUE      NUMERIC(15, 0)
  );

INSERT INTO @TEST VALUES 
(1,601,10 ),
(1,602,20 ),
(1,603,30 ),
(1,604,40 ),
(1,605,50 ),
(1,606,60 ),
(2,601,100),
(2,602,200),
(2,603,300),
(2,604,400),
(2,605,500),
(2,606,600);

SELECT *,
    POWER(CAST(10 AS numeric(15,0)),
        Sum(LOG10(
            Abs(NULLIF(VALUE, 0))
            ))
        OVER(PARTITION BY PAR_COLUMN ORDER BY PERIOD)) AS Mul
FROM @TEST;

Result

+------------+--------+-------+-----------------+
| PAR_COLUMN | PERIOD | VALUE |       Mul       |
+------------+--------+-------+-----------------+
|          1 |    601 |    10 |              10 |
|          1 |    602 |    20 |             200 |
|          1 |    603 |    30 |            6000 |
|          1 |    604 |    40 |          240000 |
|          1 |    605 |    50 |        12000000 |
|          1 |    606 |    60 |       720000000 |
|          2 |    601 |   100 |             100 |
|          2 |    602 |   200 |           20000 |
|          2 |    603 |   300 |         6000000 |
|          2 |    604 |   400 |      2400000000 |
|          2 |    605 |   500 |   1200000000000 |
|          2 |    606 |   600 | 720000000000001 |
+------------+--------+-------+-----------------+

Each step here loses precision: calculating LOG loses precision, SUM loses precision, and EXP/POWER loses precision. With these built-in functions I don't think you can do much about it.


So, the answer is: use CLR with the C# decimal type (not double), which supports higher precision (28-29 significant digits). Your original SQL type numeric(22, 6) would fit into it, and you wouldn't need the LOG/EXP trick at all.


Oops. I tried to make a CLR aggregate that calculates a product. It works in my tests, but only as a simple aggregate.

This works:

SELECT T.PAR_COLUMN, [dbo].[Product](T.VALUE) AS P
FROM @TEST AS T
GROUP BY T.PAR_COLUMN;

And even OVER (PARTITION BY) works:

SELECT *,
    [dbo].[Product](T.VALUE) 
    OVER (PARTITION BY PAR_COLUMN) AS P
FROM @TEST AS T;

But a running product using OVER (PARTITION BY ... ORDER BY ...) doesn't work (checked with SQL Server 2014 Express 12.0.2000.8):

SELECT *,
    [dbo].[Product](T.VALUE) 
    OVER (PARTITION BY T.PAR_COLUMN ORDER BY T.PERIOD 
          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS CUM_MUL
FROM @TEST AS T;

Incorrect syntax near the keyword 'ORDER'.

A search found this Connect item, which is closed as "Won't Fix", and this related question.
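
One possible workaround (a sketch, using the @TEST data and the [dbo].[Product] aggregate defined below): fall back on the simple-aggregate form and emulate the running product with CROSS APPLY over the preceding periods. This rescans each partition, so it is only practical when partitions are small:

SELECT T.PAR_COLUMN, T.PERIOD, T.VALUE, CA.CUM_MUL
FROM @TEST AS T
CROSS APPLY (
    SELECT [dbo].[Product](T2.VALUE) AS CUM_MUL  -- the CLR aggregate defined below
    FROM @TEST AS T2
    WHERE T2.PAR_COLUMN = T.PAR_COLUMN
      AND T2.PERIOD <= T.PERIOD
) AS CA;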


The C# code:

using System;
using System.Data.SqlTypes;
using System.IO;
using Microsoft.SqlServer.Server;

namespace RunningProduct
{
    [Serializable]
    [SqlUserDefinedAggregate(
        Format.UserDefined,
        MaxByteSize = 17,
        IsInvariantToNulls = true,
        IsInvariantToDuplicates = false,
        IsInvariantToOrder = true,
        IsNullIfEmpty = true)]
    public struct Product : IBinarySerialize
    {
        private bool m_bIsNull; // 1 byte storage
        private decimal m_Product; // 16 bytes storage

        public void Init()
        {
            this.m_bIsNull = true;
            this.m_Product = 1;
        }

        public void Accumulate(
            [SqlFacet(Precision = 22, Scale = 6)] SqlDecimal ParamValue)
        {
            if (ParamValue.IsNull) return;

            this.m_bIsNull = false;
            this.m_Product *= ParamValue.Value;
        }

        public void Merge(Product other)
        {
            SqlDecimal otherValue = other.Terminate();
            this.Accumulate(otherValue);
        }

        [return: SqlFacet(Precision = 22, Scale = 6)]
        public SqlDecimal Terminate()
        {
            if (m_bIsNull)
            {
                return SqlDecimal.Null;
            }
            else
            {
                return m_Product;
            }
        }

        public void Read(BinaryReader r)
        {
            this.m_bIsNull = r.ReadBoolean();
            this.m_Product = r.ReadDecimal();
        }

        public void Write(BinaryWriter w)
        {
            w.Write(this.m_bIsNull);
            w.Write(this.m_Product);
        }
    }
}

Install CLR assembly:

-- Turn advanced options on
EXEC sys.sp_configure @configname = 'show advanced options', @configvalue = 1 ;
GO
RECONFIGURE WITH OVERRIDE ;
GO
-- Enable CLR
EXEC sys.sp_configure @configname = 'clr enabled', @configvalue = 1 ;
GO
RECONFIGURE WITH OVERRIDE ;
GO

CREATE ASSEMBLY [RunningProduct]
AUTHORIZATION [dbo]
FROM 'C:\RunningProduct\RunningProduct.dll'
WITH PERMISSION_SET = SAFE;
GO

CREATE AGGREGATE [dbo].[Product](@ParamValue numeric(22,6))
RETURNS numeric(22,6)
EXTERNAL NAME [RunningProduct].[RunningProduct.Product];
GO
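
A quick smoke test after deployment (a sketch; the expected output assumes the numeric(22, 6) facet above):

SELECT [dbo].[Product](V.VALUE) AS P
FROM (VALUES (CAST(2.5 AS numeric(22, 6))),
             (CAST(4.0 AS numeric(22, 6)))) AS V(VALUE);
-- expect 10.000000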

This question discusses the calculation of a running SUM in great detail, and Paul White shows in his answer how to write a CLR function that calculates a running SUM efficiently. It would be a good starting point for writing a function that calculates a running product.

Note that he uses a different approach. Instead of making a custom aggregate function, Paul makes a function that returns a table: the function reads the original data into memory and performs all the required calculations.

It may be easier to achieve the desired effect by implementing these calculations on the client side in the programming language of your choice: just read the whole table and calculate the running product on the client. Creating a CLR function makes sense if the running product calculated on the server is an intermediate step in a more complex calculation that aggregates the data further.


One more idea that comes to mind.

Find a third-party .NET math library that offers Log and Exp functions with high precision, and make CLR versions of these scalar functions. Then use the EXP + LOG + SUM() OVER (ORDER BY) approach, where SUM is the built-in T-SQL function (which supports OVER (ORDER BY)), and Exp and Log are custom CLR functions that return not float but a high-precision decimal.

Note that high-precision calculations may also be slow, and using CLR scalar functions in the query may slow it down as well.

Answered by Vladimir Baranov on Oct 04 '22


LOG() and EXP() implicitly convert their arguments to the float data type, which holds only approximate values.
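
You can confirm the implicit conversion with SQL_VARIANT_PROPERTY; a quick sketch:

SELECT SQL_VARIANT_PROPERTY(LOG(CAST(10 AS numeric(22, 6))), 'BaseType') AS LogType,
       SQL_VARIANT_PROPERTY(EXP(CAST(1 AS numeric(22, 6))), 'BaseType') AS ExpType;
-- both columns return 'float'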

Answered by Bacon Bits on Oct 04 '22


You can round to a big multiple. For your data:

-- 720000000000000 must be a multiple of 600

select
   round( 719999999999998/600,  0 ) * 600

--result: 720000000000000

Test it at SQLFiddle

create TABLE T 
  (
     PAR_COLUMN INT,
     PERIOD     INT,
     VALUE      NUMERIC(22, 6)
  ) 
INSERT INTO T VALUES 
(1,601,10.1 ),    -- <--- I put a decimal value here just to test!
(1,602,20 ),
(1,603,30 ),
(1,604,40 ),
(1,605,50 ),
(1,606,60 ),
(2,601,100),
(2,602,200),
(2,603,300),
(2,604,400),
(2,605,500),
(2,606,600)

Query 1:

with T1 as (
SELECT *,
       Exp(Sum(Log(Abs(NULLIF(VALUE, 0))))
             OVER(
               PARTITION BY PAR_COLUMN
               ORDER BY PERIOD)) AS CUM_MUL,
       VALUE AS CUM_MAX1,
       LAG( VALUE , 1, 1.) 
             OVER(
               PARTITION BY PAR_COLUMN
               ORDER BY PERIOD ) AS CUM_MAX2,
       LAG( VALUE , 2, 1.) 
             OVER(
               PARTITION BY PAR_COLUMN
               ORDER BY PERIOD ) AS CUM_MAX3
FROM   T )
select PAR_COLUMN,  PERIOD,  VALUE, 
       ( round( ( CUM_MUL  / ( CUM_MAX1 * CUM_MAX2 * CUM_MAX3) ) ,6) 
         * 
         cast( ( 1000000 * CUM_MAX1 * CUM_MAX2 * CUM_MAX3) as bigint )
       ) / 1000000.
       as CUM_MUL
FROM T1

Results:

| PAR_COLUMN | PERIOD | VALUE |         CUM_MUL |
|------------|--------|-------|-----------------|
|          1 |    601 |  10.1 |            10.1 | -- ok! (the 10.1 from my test data)
|          1 |    602 |    20 |             202 |
|          1 |    603 |    30 |            6060 |
|          1 |    604 |    40 |          242400 |
|          1 |    605 |    50 |        12120000 |
|          1 |    606 |    60 |       727200000 |
|          2 |    601 |   100 |             100 |
|          2 |    602 |   200 |           20000 |
|          2 |    603 |   300 |         6000000 |
|          2 |    604 |   400 |      2400000000 |
|          2 |    605 |   500 |   1200000000000 |
|          2 |    606 |   600 | 720000000000000 |

Notice that I multiply by 1000000 so the bigint cast works without decimals; the final division by 1000000. restores the six-decimal scale.
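
To see the correction in isolation, here is the arithmetic for the (PAR_COLUMN 1, PERIOD 603) row as a standalone sketch (the drifted input value is illustrative):

SELECT ROUND(6059.99999999999 / (30 * 20 * 10.1), 6)          -- ratio of drifted product to the last three factors, snaps to 1.000000
       * CAST(1000000 * 30 * 20 * 10.1 AS bigint) / 1000000.; -- scale up, cast, scale back down
-- result: 6060.000000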

Answered by dani herrera on Oct 04 '22