Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Excel VBA Performance - 1 million rows - Delete rows containing a value, in less than 1 min

I am trying to find a way to filter large data and remove rows in a worksheet, in less than one minute

The goal:

  • Find all records containing specific text in column 1, and delete the entire row
  • Keep all cell formatting (colors, font, borders, column widths) and formulas as they are

.

Test Data:

Test data:

.

How the code works:

  1. It starts by turning all Excel features Off
  2. If the workbook is not empty and the text value to be removed exists in column 1

    • Copies the used range of column 1 to an array
    • Iterates over every value in array backwards
    • When it finds a match:

      • Appends the cell address to a tmp string in the format "A11,A275,A3900,..."
      • If the tmp variable length is close to 255 characters
      • Deletes rows using .Range("A11,A275,A3900,...").EntireRow.Delete Shift:=xlUp
      • Resets tmp to empty and moves on to the next set of rows
  3. At the end, it turns all Excel features back On

.

The main issue is the Delete operation, and total duration time should be under one minute. Any code-based solution is acceptable as long as it performs under 1 minute.

This narrows the scope to very few acceptable answers. The answers already provided are also very short and easy to implement. One performs the operation in about 30 seconds, so there is at least one answer that provides an acceptable solution, and other may find it useful as well

.

My main initial function:

Sub DeleteRowsWithValuesStrings()
    Const MAX_SZ As Byte = 240

    Dim i As Long, j As Long, t As Double, ws As Worksheet
    Dim memArr As Variant, max As Long, tmp As String

    Set ws = Worksheets(1)
    max = GetMaxCell(ws.UsedRange).Row
    FastWB True:    t = Timer

    With ws
        If max > 1 Then
            If IndexOfValInRowOrCol("Test String", , ws.UsedRange) > 0 Then
                memArr = .Range(.Cells(1, 1), .Cells(max, 1)).Value2
                For i = max To 1 Step -1

                    If memArr(i, 1) = "Test String" Then
                        tmp = tmp & "A" & i & ","
                        If Len(tmp) > MAX_SZ Then
                           .Range(Left(tmp, Len(tmp) - 1)).EntireRow.Delete Shift:=xlUp
                           tmp = vbNullString

                        End If
                    End If

                Next
                If Len(tmp) > 0 Then
                    .Range(Left(tmp, Len(tmp) - 1)).EntireRow.Delete Shift:=xlUp
                End If
                .Calculate
            End If
        End If
    End With
    FastWB False:   InputBox "Duration: ", "Duration", Timer - t
End Sub

Helper functions (turn Excel features off and on):

Public Sub FastWB(Optional ByVal opt As Boolean = True)
    With Application
        .Calculation = IIf(opt, xlCalculationManual, xlCalculationAutomatic)
        .DisplayAlerts = Not opt
        .DisplayStatusBar = Not opt
        .EnableAnimations = Not opt
        .EnableEvents = Not opt
        .ScreenUpdating = Not opt
    End With
    FastWS , opt
End Sub

Public Sub FastWS(Optional ByVal ws As Worksheet = Nothing, _
                  Optional ByVal opt As Boolean = True)
    If ws Is Nothing Then
        For Each ws In Application.ActiveWorkbook.Sheets
            EnableWS ws, opt
        Next
    Else
        EnableWS ws, opt
    End If
End Sub

Private Sub EnableWS(ByVal ws As Worksheet, ByVal opt As Boolean)
    With ws
        .DisplayPageBreaks = False
        .EnableCalculation = Not opt
        .EnableFormatConditionsCalculation = Not opt
        .EnablePivotTable = Not opt
    End With
End Sub

Finds last cell with data (thanks @ZygD - now I tested it in several scenarios):

Public Function GetMaxCell(Optional ByRef rng As Range = Nothing) As Range

    'Returns the last cell containing a value, or A1 if Worksheet is empty

    Const NONEMPTY As String = "*"
    Dim lRow As Range, lCol As Range

    If rng Is Nothing Then Set rng = Application.ActiveWorkbook.ActiveSheet.UsedRange
    If WorksheetFunction.CountA(rng) = 0 Then
        Set GetMaxCell = rng.Parent.Cells(1, 1)
    Else
        With rng
            Set lRow = .Cells.Find(What:=NONEMPTY, LookIn:=xlFormulas, _
                                        After:=.Cells(1, 1), _
                                        SearchDirection:=xlPrevious, _
                                        SearchOrder:=xlByRows)
            If Not lRow Is Nothing Then
                Set lCol = .Cells.Find(What:=NONEMPTY, LookIn:=xlFormulas, _
                                            After:=.Cells(1, 1), _
                                            SearchDirection:=xlPrevious, _
                                            SearchOrder:=xlByColumns)

                Set GetMaxCell = .Parent.Cells(lRow.Row, lCol.Column)
            End If
        End With
    End If
End Function

Returns the index of a match in the array, or 0 if a match is not found:

Public Function IndexOfValInRowOrCol( _
                                    ByVal searchVal As String, _
                                    Optional ByRef ws As Worksheet = Nothing, _
                                    Optional ByRef rng As Range = Nothing, _
                                    Optional ByRef vertical As Boolean = True, _
                                    Optional ByRef rowOrColNum As Long = 1 _
                                    ) As Long

    'Returns position in Row or Column, or 0 if no matches found

    Dim usedRng As Range, result As Variant, searchRow As Long, searchCol As Long

    result = CVErr(9999) '- generate custom error

    Set usedRng = GetUsedRng(ws, rng)
    If Not usedRng Is Nothing Then
        If rowOrColNum < 1 Then rowOrColNum = 1
        With Application
            If vertical Then
                result = .Match(searchVal, rng.Columns(rowOrColNum), 0)
            Else
                result = .Match(searchVal, rng.Rows(rowOrColNum), 0)
            End If
        End With
    End If
    If IsError(result) Then IndexOfValInRowOrCol = 0 Else IndexOfValInRowOrCol = result
End Function

.

Update:

Tested 6 solutions (3 tests each): Excel Hero's solution is the fastest so far (removes formulas)

.

Here are the results, fastest to the slowest:

.

Test 1. Total of 100,000 records, 10,000 to be deleted:

1. ExcelHero()                    - 1.5 seconds

2. DeleteRowsWithValuesNewSheet() - 2.4 seconds

3. DeleteRowsWithValuesStrings()  - 2.45 minutes
4. DeleteRowsWithValuesArray()    - 2.45 minutes
5. QuickAndEasy()                 - 3.25 minutes
6. DeleteRowsWithValuesUnion()    - Stopped after 5 minutes

.

Test 2. Total of 1 million records, 100,000 to be deleted:

1. ExcelHero()                    - 16 seconds (average)

2. DeleteRowsWithValuesNewSheet() - 33 seconds (average)

3. DeleteRowsWithValuesStrings()  - 4 hrs 38 min (16701.375 sec)
4. DeleteRowsWithValuesArray()    - 4 hrs 37 min (16626.3051757813 sec)
5. QuickAndEasy()                 - 5 hrs 40 min (20434.2104492188 sec)
6. DeleteRowsWithValuesUnion()    - N/A

.

Notes:

  1. ExcelHero method: easy to implement, reliable, extremely fast, but removes formulas
  2. NewSheet method: easy to implement, reliable, and meets the target
  3. Strings method: more effort to implement, reliable, but doesn't meet requirement
  4. Array method: similar to Strings, but ReDims an array (faster version of Union)
  5. QuickAndEasy: easy to implement (short, reliable and elegant), but doesn't meet requirement
  6. Range Union: implementation complexity similar to 2 and 3, but too slow

I also made the test data more realistic by introducing unusual values:

  • empty cells, ranges, rows, and columns
  • special characters, like =[`~!@#$%^&*()_-+{}[]\|;:'",.<>/?, separate and multiple combinations
  • blank spaces, tabs, empty formulas, border, font, and other cell formatting
  • large and small numbers with decimals (=12.9999999999999 + 0.00000000000000001)
  • hyperlinks, conditional formatting rules
  • empty formatting inside and outside data ranges
  • anything else that might cause data issues
like image 692
paul bica Avatar asked Jun 20 '15 22:06

paul bica


3 Answers

I'm providing the first answer as a reference

Others may find it useful, if there are no other options available

  • Fastest way to achieve the result is not to use the Delete operation
  • Out of 1 million records it removes 100,000 rows in an average of 33 seconds

.

Sub DeleteRowsWithValuesNewSheet()  '100K records   10K to delete
                                    'Test 1:        2.40234375 sec
                                    'Test 2:        2.41796875 sec
                                    'Test 3:        2.40234375 sec
                                    '1M records     100K to delete
                                    'Test 1:        32.9140625 sec
                                    'Test 2:        33.1484375 sec
                                    'Test 3:        32.90625   sec
    Dim oldWs As Worksheet, newWs As Worksheet, rowHeights() As Long
    Dim wsName As String, t As Double, oldUsedRng As Range

    FastWB True:    t = Timer

    Set oldWs = Worksheets(1)
    wsName = oldWs.Name

    Set oldUsedRng = oldWs.Range("A1", GetMaxCell(oldWs.UsedRange))

    If oldUsedRng.Rows.Count > 1 Then                           'If sheet is not empty
        Set newWs = Sheets.Add(After:=oldWs)                    'Add new sheet
        With oldUsedRng
            .AutoFilter Field:=1, Criteria1:="<>Test String"
            .Copy                                               'Copy visible data
        End With
        With newWs.Cells
            .PasteSpecial xlPasteColumnWidths
            .PasteSpecial xlPasteAll                            'Paste data on new sheet
            .Cells(1, 1).Select                                 'Deselect paste area
            .Cells(1, 1).Copy                                   'Clear Clipboard
        End With
        oldWs.Delete                                            'Delete old sheet
        newWs.Name = wsName
    End If
    FastWB False:   InputBox "Duration: ", "Duration", Timer - t
End Sub

.

At high level:

  • It creates a new worksheet, and keeps a reference to the initial sheet
  • AutoFilters column 1 on the searched text: .AutoFilter Field:=1, Criteria1:="<>Test String"
  • Copies all (visible) data from initial sheet
  • Pastes column widths, formats, and data to the new sheet
  • Deletes initial sheet
  • Renames the new sheet to the old sheet name

It uses the same helper functions posted in the question

The 99% of the duration is used by the AutoFilter

.

There are a couple limitations I found so far, the first can be addressed:

  1. If there are any hidden rows on the initial sheet, it unhides them

    • A separate function is needed to hide them back
    • Depending on implementation, it might significantly increase duration
  2. VBA related:

    • It changes the Code Name of the sheet; other VBA referring to Sheet1 will be broken (if any)
    • It deletes all VBA code associated with the initial sheet (if any)

.

A few notes about using large files like this:

  • The binary format (.xlsb) reduce file size dramatically (from 137 Mb to 43 Mb)
  • Unmanaged Conditional Formatting rules can cause exponential performance issues

    • The same for Comments, and Data validation
  • Reading file or data from network is much slower than working with a locall file

like image 197
paul bica Avatar answered Oct 18 '22 22:10

paul bica


A significant gain in speed can be achieved if the source data do not contain formulas, or if the scenario would allow (or want) the formulas to be converted into hard values during the conditional row deletions.

With the above as a caveat, my solution uses the AdvancedFilter of the range object. It's about twice as fast as DeleteRowsWithValuesNewSheet().

Public Sub ExcelHero()
    Dim t#, crit As Range, data As Range, ws As Worksheet
    Dim r&, fc As Range, lc As Range, fr1 As Range, fr2 As Range
    FastWB True
    t = Timer

        Set fc = ActiveSheet.UsedRange.Item(1)
        Set lc = GetMaxCell
        Set data = ActiveSheet.Range(fc, lc)
        Set ws = Sheets.Add
        With data
            Set fr1 = data.Worksheet.Range(fc, fc.Offset(, lc.Column))
            Set fr2 = ws.Range(ws.Cells(fc.Row, fc.Column), ws.Cells(fc.Row, lc.Column))
            With fr2
                fr1.Copy
                .PasteSpecial xlPasteColumnWidths: .PasteSpecial xlPasteAll
                .Item(1).Select
            End With
            Set crit = .Resize(2, 1).Offset(, lc.Column + 1)
            crit = [{"Column 1";"<>Test String"}]
            .AdvancedFilter xlFilterCopy, crit, fr2
            .Worksheet.Delete
        End With

    FastWB False
    r = ws.UsedRange.Rows.Count
    Debug.Print "Rows: " & r & ", Duration: " & Timer - t & " seconds"
End Sub
like image 45
Excel Hero Avatar answered Oct 18 '22 23:10

Excel Hero


On my elderly Dell Inspiron 1564 (Win 7 Office 2007) this:

Sub QuickAndEasy()
    Dim rng As Range
    Set rng = Range("AA2:AA1000001")
    Range("AB1") = Now
    Application.ScreenUpdating = False
        With rng
            .Formula = "=If(A2=""Test String"",0/0,A2)"
            .Cells.SpecialCells(xlCellTypeFormulas, xlErrors).EntireRow.Delete
            .Clear
        End With
    Application.ScreenUpdating = True
    Range("AC1") = Now
End Sub

took about 10 seconds to run. I am assuming that column AA is available.

EDIT#1:

Please note that this code does not set Calculation to Manual. Performance will improve if the Calculation mode is set to Manual after the "helper" column is allowed to calculate.

like image 6
Gary's Student Avatar answered Oct 18 '22 23:10

Gary's Student