Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Query and/or function that subsets the data on one field for all types and calcs on other field

I am trying to go from this:

+------+------+------+------+
| fld1 | fld2 | fld3 | etc… |
+------+------+------+------+
| a    |    5 |    1 |      |
| b    |    5 |    0 |      |
| c    |    6 |    0 |      |
| b    |    2 |    5 |      |
| b    |    1 |    6 |      |
| c    |    0 |    6 |      |
| a    |    8 |    9 |      |
+------+------+------+------+

To:

+--------+--------+-----------+-----+-----+------+
| Factor |  Agg   | CalcDate  | Sum | Avg | etc… |
+--------+--------+-----------+-----+-----+------+
| fld2   | fld1/a | 8/14/2015 |  13 | 6.5 |      |
| fld2   | fld1/b | 8/14/2015 |   8 | 2.7 |      |
| fld2   | fld1/c | 8/14/2015 |   6 | 3   |      |
| fld3   | fld1/a | 8/14/2015 |  10 | 5   |      |
| fld3   | fld1/b | 8/14/2015 |  11 | 3.7 |      |
| fld3   | fld1/c | 8/14/2015 |   6 | 3   |      |
+--------+--------+-----------+-----+-----+------+

Notes:

  • Obviously this data is simplified quite a bit.
  • I have a ton of fields I need to do this for
  • I included easy aggregation calcs here so it might be easier for someone to help me. The exhaustive list is: NaPct, Mean, Sd, Low, Q1, Median, Q3, High, IQR, Kurt, Skew, Obs. Where NaPct = Percent that are NULL, Sd = Standard deviation, Q1 = quartile 1, Q3 = quartile 3, IQR = Inter quartile range, Kurt = Kurtosis, Skew = Skewness, Obs = number of observations that are not NULL.
  • In reality, in the second table above the factor field will be FactorID, Agg will be AggID, and CalcDate will be CalcDateID, but I put the actual values in there for illustration purposes. Shouldn't matter to the question/answer though.
  • Speed is very important as I have 1305 fields and several aggregations to do calculations on before the work day starts.
  • Answers using only MS Access, SQL, and VBA. Sorry business requirement. That said, a SQL only answer that works in MS Access would be best for simplicity.
  • Below is code that uses a custom domain function (DCalcForQueries) and supporting functions that I built that return one calculated aggregate value per field and selected aggregation. Aka, not what I want. Maybe that code is usable for what I want, maybe not. Nevertheless it has the calculations I want in it which hopefully will help.
  • The message boxes are just how I debug while I am alpha testing: not necessary.
  • To use the code, put all of the code in a VBA module, change table "tbl_DatedModel_2015_0702_0" to a table you have in MS Access, change field "Rk-IU Mkt Cap" to a field in your table, and run the TestIT() sub and you should get calculated values in the Immediate window.
  • Don't worry about the calculations so much. I'll deal with that. I just need to know what is the best way to get from the first table above to the second table above in a way that allows for the calculations that I want. Thanks!

    Sub TestIt()
    Dim x
    Set x = GetOrOpenAndGetExcel
    
    Dim rst As DAO.Recordset
    Dim sSql As String
    Dim q As String
    q = VBA.Chr(34)
    sSql = "SELECT " & _
                "DCalcForQueries(" & q & "NaPct" & q & ", " & q & "tbl_DatedModel_2015_0702_0" & q & ", " & q & "Rk-IU Mkt Cap" & q & ", " & q & "[Rk-IU Mkt Cap] IS NOT NULL AND [GICS Sector] = 'Consumer Discretionary'" & q & ") AS NaPct ," & _
                "DCalcForQueries(" & q & "Mean" & q & ", " & q & "tbl_DatedModel_2015_0702_0" & q & ", " & q & "Rk-IU Mkt Cap" & q & ", " & q & "[Rk-IU Mkt Cap] IS NOT NULL AND [GICS Sector] = 'Consumer Discretionary'" & q & ") AS Mean ," & _
                "DCalcForQueries(" & q & "Sd" & q & ", " & q & "tbl_DatedModel_2015_0702_0" & q & ", " & q & "Rk-IU Mkt Cap" & q & ", " & q & "[Rk-IU Mkt Cap] IS NOT NULL AND [GICS Sector] = 'Consumer Discretionary'" & q & ") AS Sd ," & _
                "DCalcForQueries(" & q & "Low" & q & ", " & q & "tbl_DatedModel_2015_0702_0" & q & ", " & q & "Rk-IU Mkt Cap" & q & ", " & q & "[Rk-IU Mkt Cap] IS NOT NULL AND [GICS Sector] = 'Consumer Discretionary'" & q & ") AS Low ," & _
                "DCalcForQueries(" & q & "Q1" & q & ", " & q & "tbl_DatedModel_2015_0702_0" & q & ", " & q & "Rk-IU Mkt Cap" & q & ", " & q & "[Rk-IU Mkt Cap] IS NOT NULL AND [GICS Sector] = 'Consumer Discretionary'" & q & ") AS Q1 ," & _
                "DCalcForQueries(" & q & "Median" & q & ", " & q & "tbl_DatedModel_2015_0702_0" & q & ", " & q & "Rk-IU Mkt Cap" & q & ", " & q & "[Rk-IU Mkt Cap] IS NOT NULL AND [GICS Sector] = 'Consumer Discretionary'" & q & ") AS Median ," & _
                "DCalcForQueries(" & q & "Q3" & q & ", " & q & "tbl_DatedModel_2015_0702_0" & q & ", " & q & "Rk-IU Mkt Cap" & q & ", " & q & "[Rk-IU Mkt Cap] IS NOT NULL AND [GICS Sector] = 'Consumer Discretionary'" & q & ") AS Q3 ," & _
                "DCalcForQueries(" & q & "High" & q & ", " & q & "tbl_DatedModel_2015_0702_0" & q & ", " & q & "Rk-IU Mkt Cap" & q & ", " & q & "[Rk-IU Mkt Cap] IS NOT NULL AND [GICS Sector] = 'Consumer Discretionary'" & q & ") AS High ," & _
                "DCalcForQueries(" & q & "IQR" & q & ", " & q & "tbl_DatedModel_2015_0702_0" & q & ", " & q & "Rk-IU Mkt Cap" & q & ", " & q & "[Rk-IU Mkt Cap] IS NOT NULL AND [GICS Sector] = 'Consumer Discretionary'" & q & ") AS IQR ," & _
                "DCalcForQueries(" & q & "Kurt" & q & ", " & q & "tbl_DatedModel_2015_0702_0" & q & ", " & q & "Rk-IU Mkt Cap" & q & ", " & q & "[Rk-IU Mkt Cap] IS NOT NULL AND [GICS Sector] = 'Consumer Discretionary'" & q & ") AS Kurt ," & _
                "DCalcForQueries(" & q & "Skew" & q & ", " & q & "tbl_DatedModel_2015_0702_0" & q & ", " & q & "Rk-IU Mkt Cap" & q & ", " & q & "[Rk-IU Mkt Cap] IS NOT NULL AND [GICS Sector] = 'Consumer Discretionary'" & q & ") AS Skew ," & _
                "DCalcForQueries(" & q & "Obs" & q & ", " & q & "tbl_DatedModel_2015_0702_0" & q & ", " & q & "Rk-IU Mkt Cap" & q & ", " & q & "[Rk-IU Mkt Cap] IS NOT NULL AND [GICS Sector] = 'Consumer Discretionary'" & q & ") AS Obs " & _
                "FROM tbl_DatedModel_2015_0702_0;"
    Debug.Print sSql
    Set rst = CurrentDb.OpenRecordset(sSql, dbOpenSnapshot)
    rst.MoveFirst
    
    Debug.Print rst.RecordCount
    Debug.Print rst.Fields("NaPct")
    Debug.Print rst.Fields("Mean")
    Debug.Print rst.Fields("Sd")
    Debug.Print rst.Fields("Low")
    Debug.Print rst.Fields("Q1")
    Debug.Print rst.Fields("Median")
    Debug.Print rst.Fields("Q3")
    Debug.Print rst.Fields("High")
    Debug.Print rst.Fields("IQR")
    Debug.Print rst.Fields("Kurt")
    Debug.Print rst.Fields("Skew")
    Debug.Print rst.Fields("Obs")
    
    
    End Sub
    Public Function DCalcForQueries(sCalc As String, Optional sTbl As String = "", Optional sMainFld As String = "", Optional sWhereClause As String = "", Optional k As Double) As Variant
    
    Dim dblData() As Double
    Dim oxl As Object
    On Error Resume Next
    Set oxl = GetObject(, "Excel.Application")
    If Err.Number <> 0 Then
        MsgBox "Excel object must be openned by the calling sub of DCalcForQueries so it isn't opened over and over, which is very slow"
        GoTo cleanup
    End If
    
    Dim x As Integer
    
    Dim aV() As Variant
    Dim tmp
    Dim lObsCnt As Long
    Dim lNaCnt As Long
    Dim i As Long
    Dim vTmp As Variant
    Dim lTtl As Long
    Dim bDoCalc As Boolean
    
    aV = a2dvGetSubsetFromQuery(sTbl, sMainFld, sWhereClause, "Numeric")
    If aV(0, 0) = "Not Numeric" Then
        MsgBox "Data returned by query was not numeric. Press OK to Stop and debug."
        Stop
    End If
    
    If sCalc = "Percentile" Or sCalc = "Q1" Or sCalc = "Q2" Or sCalc = "Q3" Or sCalc = "Q4" Then
        DCalcForQueries = oxl.WorksheetFunction.Percentile_Exc(aV, k)
    ElseIf sCalc = "Median" Then
        DCalcForQueries = oxl.WorksheetFunction.Median(aV)
    ElseIf sCalc = "Kurt" Or sCalc = "Kurt" Then
        DCalcForQueries = oxl.WorksheetFunction.Kurt(aV)
    ElseIf sCalc = "Minimum" Or sCalc = "Low" Then
        DCalcForQueries = oxl.WorksheetFunction.Min(aV)
    ElseIf sCalc = "Maximum" Or sCalc = "High" Then
        DCalcForQueries = oxl.WorksheetFunction.Max(aV)
    ElseIf sCalc = "IQR" Then
        DCalcForQueries = oxl.WorksheetFunction.Quartile_Exc(aV, 3) - oxl.WorksheetFunction.Quartile_Exc(aV, 1)
    ElseIf sCalc = "Obs" Then
        lObsCnt = 0
        For Each tmp In aV
            If Not IsNull(tmp) Then
                lObsCnt = lObsCnt + 1
            End If
        Next
        DCalcForQueries = lObsCnt
    ElseIf sCalc = "%NA" Or sCalc = "PctNa" Or sCalc = "NaPct" Or sCalc = "%Null" Or sCalc = "PctNull" Then
        lNaCnt = 0
        lTtl = UBound(aV, 2) + 1
        For Each tmp In aV
            If IsNull(tmp) Then
                lNaCnt = lNaCnt + 1
            End If
        Next
        DCalcForQueries = (lNaCnt / lTtl) * 100
    ElseIf sCalc = "Skewness" Or sCalc = "Skew" Then
        DCalcForQueries = oxl.WorksheetFunction.Skew(aV)
    ElseIf sCalc = "StDev" Or sCalc = "Sd" Then
        DCalcForQueries = oxl.WorksheetFunction.StDev_S(aV)
    ElseIf sCalc = "Mean" Then
        DCalcForQueries = oxl.WorksheetFunction.Average(aV)
    Else
        MsgBox "sCalc parameter not recognized: " & sCalc
    End If
    
    cleanup:
    
    
    End Function
    
    Function a2dvGetSubsetFromQuery(sTbl As String, sMainFld As String, sWhereClause As String, sTest As String) As Variant()
    'sTest can be  "Numeric" or "None" ...will implement more as needed
    Dim iFieldType As Integer
    Dim rst As DAO.Recordset
    Dim db As Database
    Set db = CurrentDb
    Dim sMainFldFull As String
    Dim sSubSetFldFull As String
    Dim sSql As String
    
    sMainFldFull = "[" & sMainFld & "]"
    sSubSetFldFull = ""
    sSql = ""
    
    sSql = "SELECT " & sMainFldFull & " FROM " & sTbl
    If Len(sWhereClause) > 0 Then
        sSql = sSql & " WHERE " & sWhereClause
    End If
    
    Set rst = db.OpenRecordset(sSql, dbOpenSnapshot)
    
    'make sure the data is the right type
    
    iFieldType = rst(sMainFld).Type
    
    If sTest = "Numeric" Then
        If iFieldType = dbByte Or _
            iFieldType = dbInteger Or _
            iFieldType = dbLong Or _
            iFieldType = dbCurrency Or _
            iFieldType = dbSingle Or _
            iFieldType = dbDouble _
            Then
            rst.MoveLast
            rst.MoveFirst
    
            a2dvGetSubsetFromQuery = rst.GetRows(rst.RecordCount)
    
        Else
            Dim aV(0 To 1, 0 To 1) As Variant
            aV(0, 0) = "Not Numeric"
            a2dvGetSubsetFromQuery = aV
    
        End If
    ElseIf sTest = "None" Then
        'don't do any testing
        rst.MoveLast
        rst.MoveFirst
    
        a2dvGetSubsetFromQuery = rst.GetRows(rst.RecordCount)
    Else
        MsgBox "Test type (sTest) can only be 'None' or 'Numeric'. It was: " & sTest
        Stop
    End If
    
    cleanup:
    rst.Close
    Set rst = Nothing
    End Function
    Public Function GetOrOpenAndGetExcel() As Object
    'if excel is open it will return the excel object
    'if excel is not open it will open excel and return the excel object
    On Error GoTo 0
    On Error Resume Next
    Set GetOrOpenAndGetExcel = GetObject(, "Excel.Application")
    
    If Err.Number <> 0 Then
        Set GetOrOpenAndGetExcel = CreateObject("Excel.Application")
    End If
    
    On Error GoTo 0
    End Function
    

Edit1: The code I provide above is just to illustrate my attempt and the calculations. I'm pretty sure it isn't directly related to a good answer, but I'm not 100% sure. If I use what I have above, it produces one record at a time and I'd have to add (INSERT INTO) each record one at a time, which would be quite slow. My plan was to build a 2d array of the results and use that 2d array to add the records in batches, but was told that you can't do that without looping through the array adding each record once at a time, which would defeat the purpose. I am pretty sure a solution that includes looping through the fld1 types or one query with sub-queries that can do it in one step is the direction that should be taken. What I have done to optimize so far: I pulled the creation of the Excel object out so is created only once in the TestIt() Sub.

Edit2: I have 1305 fields to do calculations on. They are not all in the same table; however, for the purposes of this question I just need a working answer that does more than one field at a time. I.e. your answer can assume all fields are all in the same table and for simplicity your answer can include only 2 fields and I can expand it from there. In the code above I have calculated 12 metrics on one field "Rk-IU Mkt Cap" aggregating on one type,'Consumer Discretionary' ([GICS Sector] = 'Consumer Discretionary'"). What I have is not what I am after.

like image 527
mountainclimber11 Avatar asked Mar 16 '23 02:03

mountainclimber11


1 Answers

Would something like this work, just using pure tSql?

1: Create table and insert some sample data

CREATE TABLE [dbo].[FLD](
    [fld1] [nvarchar](2) NOT NULL,
    [fld2] [int] NULL,
    [fld3] [int] NULL
) ON [PRIMARY]

GO

INSERT FLD VALUES ('a', 5, 9)
INSERT FLD VALUES ('b', 1, 8)
INSERT FLD VALUES ('a', 3, 7)

2: Use a nested UNPIVOT to create the factors

SELECT t.factor,t.val + '/' + t.v  AS Agg, SUM(value) AS [Sum], AVG(value) AS [AVG]
FROM
(
    SELECT * from
    (
        select * from FLD f
        UNPIVOT
        (
            v
            for val in (fld1)
        ) piv
        ) f
    UNPIVOT 
    (
        value
        for factor in (fld2, fld3)
    ) s
) t 
group by t.v, t.factor, t.val
like image 150
Johnv2020 Avatar answered Mar 17 '23 16:03

Johnv2020