Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to Optimize this UDF

I have this UDF which I use to lookup dates and return values based on the condition.
Basically two(2) conditions only, either < or > the date.
Also, I just use built-in Excel Functions as well and just added some conditions.

Public Function CLOOKUP(lookup_value, table_array As Range, column_index As Long, _
                        rv_operator, reference_value, Optional range_lookup, _
                        Optional return_index) As Variant

Dim NT_array, S_array
Dim ORGLOOKUP, REFLOOKUP
Dim row_count As Long, row_less As Long

With Application.WorksheetFunction
    If column_index > 0 And column_index <= table_array.Columns.Count Then

        On Error Resume Next
        ORGLOOKUP = .VLookup(lookup_value, table_array, column_index, range_lookup)
        If Err.number <> 0 Then CLOOKUP = CVErr(xlErrNA): Exit Function
        On Error GoTo 0

        Select Case rv_operator
        Case "<"
            Do While ORGLOOKUP > reference_value
                Set NT_array = table_array.Resize(, 1)
                row_count = .CountA(NT_array)
                Set S_array = table_array.Resize(row_count)
                row_less = .Match(lookup_value, NT_array, 0)
                Set table_array = S_array.Offset(row_less, 0).Resize(row_count - row_less)

                On Error Resume Next
                ORGLOOKUP = .VLookup(lookup_value, table_array, column_index, range_lookup)
                If Err.number <> 0 Then CLOOKUP = CVErr(xlErrNA): Exit Function
                On Error GoTo 0
            Loop
        Case ">"
            Do While ORGLOOKUP < reference_value
                Set NT_array = table_array.Resize(, 1)
                row_count = .CountA(NT_array)
                Set S_array = table_array.Resize(row_count)
                row_less = .Match(lookup_value, NT_array, 0)
                Set table_array = S_array.Offset(row_less, 0).Resize(row_count - row_less)

                On Error Resume Next
                ORGLOOKUP = .VLookup(lookup_value, table_array, column_index, range_lookup)
                If Err.number <> 0 Then CLOOKUP = CVErr(xlErrNA): Exit Function
                On Error GoTo 0
            Loop
        Case Else
            CLOOKUP = CVErr(xlErrNA)
        End Select

        Select Case True
        Case IsMissing(return_index)
            CLOOKUP = ORGLOOKUP
        Case Else
            If return_index <= table_array.Columns.Count Then
                REFLOOKUP = .VLookup(lookup_value, table_array, return_index, range_lookup)
                CLOOKUP = REFLOOKUP
            Else
                CLOOKUP = CVErr(xlErrNA)
            End If
        End Select
    Else
        CLOOKUP = CVErr(xlErrNA)
    End If
End With

End Function

It works fine but I want to optimize it a bit to improve calculation speed.
Usually I'm using this to lookup 10k rows in an excel file with 600k or more rows.
It takes 5~8 mins in a sorted data.
If someone can point me to the right direction on how to optimize this function, that would be great.

Edit1:

HERE is the workbook link.
Two(2) Sheets, Data Source and Data for Lookup, self-explanatory I guess.
I also included the function in the WB.
I used the function to populate the values on the Data for Lookup Sheet under the Manufacturing Date column and just leave the first cell with the actual formula to avoid problems in opening it.
For those who are not keen, here's the syntax on how to use the function:

lookup_value - what you are looking for
table_array - where you're looking
column_index - column from which you want to get information based on your lookup_value
rv_operator - criteria whether the value to return is less than or greater than the reference_value
reference_value - where your returned value is compared
range_lookup - exact or approximate match
return_index - alternative column index, just in case you need to return data aside from what get from the column_index

Remember that I use this to get DATES so the column_index always contains a date as well as the reference_value.
That is the reason there is a return_index since I may need to recover information that fall under the conditions but not actually interested in the dates.

So for example in my sample workbook, I need to get the manufacturing date of serial number 096364139403422056 but it shoud be less than the reference value 1/4/2014.
There are more than one occurence of this serial number so I need to get the closest to the reference value.
The result should be 11/15/2013 using the function: =CLOOKUP(B2,'Source Data'!A:B,2,"<",A2,0) Hope above explanation helps you guys a bit.

Btw, this can also be achieved using Array Formulas.
I just made this formula for the benefit of the other users who are not well versed with AF's.

like image 962
L42 Avatar asked Nov 01 '22 03:11

L42


1 Answers

I have created a solution that takes about 40 seconds in my laptop. My laptop takes about 7 minutes to copy the formula to all lookup rows.

When I measure the various bottlenecks in the original UDF, I found out that VLOOKUP is very expensive. Example using a row close to the bottom:

  • VLOOKUP: 31 ms
  • COUNTA: 7.8 ms
  • Match: 15 ms

Since you can potentially call the above functions multiple times (when there is a duplicate), it is even more time consuming.

My solution is using a VBA macro instead of optimizing the UDF. Also, instead of using VLOOKUP, I am using the Scripting.Dictionary object to store all the serial numbers. Lookup using Scripting.Dictionary is 100 times faster according to How to optimize vlookup for high search count ? (alternatives to VLOOKUP).

I tested it on Office 2010 running on Windows 7. Loading all the serial numbers into the Dictionary takes about 37 seconds, while the lookup and populating column C takes about 3 seconds ! Therefore, it's not a problem at all to have more rows in the lookup worksheet !

If the macro complains when creating Scripting.Dictionary, you may need to add a reference to Microsoft Scripting Runtime (see the above link for details).

When I compare the result with your UDF formula, I found some inconsistency which may be due to a bug in your UDF code. For example:

  1. In row 12739, serial number 096364139401213204, the reference date is 1/13/2013, the data is 1/3/2013 and 4/23/2013, but the result is #VALUE! So it looks like if ANY of the data is GREATER than the reference date, you want the result to be #VALUE!

  2. HOWEVER, in row 12779, serial number 096364139508732708, the reference date is 1/9/2013, the data is 8/10/2013 and 1/2/2013, your UDF produces 1/2/2013 instead of #VALUE! even though there is a row with Mfg date greater than the reference date.

I don't know what behavior you want, so I assume you want to display #VALUE! when ANY of the data is greater than the reference date. If you want to change the behavior, please let me know, or update the code yourselves (I put copious comment in the code).

Here's a link to download the spreadsheet plus the Macro to: https://www.dropbox.com/s/djqvu0a4a6h5a06/Sample%20Workbook%20Optimized.xlsm . I'm going to make it available only for 1 week. The Macro code is below:

Option Explicit
Sub Macro1()
'
' Macro1 Macro
'
Const COMPARISONMODE = "<"
Const SOURCESHEETNAME = "Source Data"
Const LOOKUPSHEETNAME = "Data for Lookup"

Dim oSource
Set oSource = CreateObject("Scripting.Dictionary")

Dim starttime, endtime, totalindex


'BUILD THE INDEX in oSource
'Column A = serial number
'Column B = mfg date
'Column C = extra data
'Each item contains a comma separated list of row numbers
starttime = Timer

Sheets(SOURCESHEETNAME).Activate
Dim rownum, serialno, mfgdate
rownum = 2
Do
  serialno = Cells(rownum, 1)
  If Not IsError(serialno) Then
    serialno = CStr(serialno)
    If serialno = "" Then Exit Do
    If oSource.Exists(serialno) Then
      oSource(serialno) = oSource(serialno) & "," & rownum
    Else
      oSource.Add serialno, CStr(rownum)
    End If
  End If
  rownum = rownum + 1
Loop

endtime = Timer

totalindex = endtime - starttime

starttime = Timer

'DO THE LOOKUP
'NOTE: Assume that there are no #VALUE! in columns A and B of the lookup table
Dim rownumlist, sourcerownum, aryRownumlist, refdate, closestmfgdate, closestextradata, j
Sheets(LOOKUPSHEETNAME).Activate
rownum = 2
Do
  refdate = CDate(Cells(rownum, 1))
  serialno = Cells(rownum, 2)
  If serialno = "" Then Exit Do
  If Not oSource.Exists(serialno) Then
    Cells(rownum, 3) = CVErr(xlErrNA)
    GoTo ContinueLoop
  End If
  aryRownumlist = Split(oSource(serialno), ",")
  closestmfgdate = ""
  closestextradata = ""
  'Find the closest manufacturing date to the reference date out of all matches
  For j = LBound(aryRownumlist) To UBound(aryRownumlist)
    sourcerownum = CLng(aryRownumlist(j))
    mfgdate = Sheets(SOURCESHEETNAME).Cells(sourcerownum, 2)
    If IsError(mfgdate) Then Exit For  'if any of the date in the matches is not valid, output N/A
    mfgdate = CDate(mfgdate)
    'Exclude depending on COMPARISONMODE
    'must be less than the reference date if COMPARISONMODE = "<", otherwise it has to be greater than
    'If comparison failed for ANY of the matches, we will output N/A
    'If you want the failed comparison match to be excluded but still output a date, instead of doing
    '   Exit For, you can do Goto ContinueFor.  Example:
    '      If mfgdate >= refdate Then Goto ContinueFor
    'QUESTION: What to do if it is equal?  Assume that we will output N/A as well
    If COMPARISONMODE = "<" Then
      If mfgdate >= refdate Then closestmfgdate = "": Exit For
    Else
      If mfgdate <= refdate Then closestmfgdate = "": Exit For
    End If
    'Now check whether it is closer to refdate
    If closestmfgdate = "" Then
        closestmfgdate = mfgdate
        closestextradata = Sheets(SOURCESHEETNAME).Cells(sourcerownum, 3)
    ElseIf Abs(DateDiff("d", closestmfgdate, refdate)) > Abs(DateDiff("d", mfgdate, refdate)) Then
        closestmfgdate = mfgdate
        closestextradata = Sheets(SOURCESHEETNAME).Cells(sourcerownum, 3)
    End If
ContinueFor:
  Next
  If closestmfgdate = "" Then
    Cells(rownum, 3) = CVErr(xlErrNA)
    Cells(rownum, 4) = ""
  Else
    Cells(rownum, 3) = closestmfgdate
    Cells(rownum, 4) = closestextradata
  End If
ContinueLoop:
  rownum = rownum + 1
Loop


endtime = Timer

MsgBox "Indexing time=" & totalindex & " seconds; lookup time=" & (endtime - starttime) & " seconds"

End Sub

If you find the above solution satisfactory, please award the bounty or at least accept the solution. Thanks.

like image 186
Anonymous Avatar answered Nov 15 '22 07:11

Anonymous