Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Need a better optimized code?

Need a much Optimized code.Well I Got a Project and I have Succefully made it work with the vba (Mostly helped by the stackoverflow programmers Thanks for that) But Today I got a Feedback. Its deleting 2 more unique entries in the record But I dont know why its deleting Them.

The Algorithm I have applied

I have Used the COUNTIF function Which I found on google

    ="countif(A$1:A2,A3)=0" A3 is the active cell, Checks A2,A1 for dupes

It Throws False if there is a duplicate in The A column and True If it is a unique.What I have understood about Countif is that It checks all the above columns values from that cell I mean let us take A4. SO it checks A2,A1,A3 for the duplicate. Similarly A10 checks for A1 to A9 and throws either TRue or False.Well It was working But I dont know what went wrong The code is not working for some entries.Its even showing False for the Unique entries sometimes.

And its taking more time to applye these formula as I have more amount of data. Im trying to make it cleaner and more Optimizing Way.People told me its not a c or some other Language to make it optimize but Im need of code that makes my code more optimized

I need code for these condtions can anyone help me as my countif failed.Im little helpless in doing so.

1)I have a column and I should check for duplicates in that column and delete that row if it is a duplicate

2) I have 35000 Old entries in the column and I have new entries 2000 everyweek these are appended. I need to check these 2000 entries from the total 37000 ( as we appened we get 35000+2000) and these delete operation need to be performed only on the newly appended 2000 entries but it should check the duplicates for entire column

Let me explain you clearly I have 2000 entries newly added,so Only these entries are to be checked for the duplicates from the 35000 entries and also from itself (2000 entries) and delete it if it is a duplicate and no duplicating operation should be performed on the 35000 entries old data.

I have found some codes but they are deleting even the duplicates of the 35000 entries. I have set the range but even though its not working. Can anyone help me with the best code that takes less time?please thank you

Updating my question with the sample code I have

   A       B            F       G        H       I              Y          
  PTY   39868.5         4       2       540      3      PTY39868.5425403 
  GTY   34446.1234      2       1       230      1      GTY34446.1234212301
  PTY   3945.678                2                2       PTY3945.67822
  GTY   34446.1234      2       1       230      1      GTY34446.1234212301
                  let us say these are old 35000 entries

Explaination to the above example.

The above are the 35000 entries. I have to check A,B,F,G,H,I columns for the dupes, if they are same I have to delete the row, I should not bother about the other columns c,d etc. so what I did is I have used one unused column Y and concatenated these 6 columns values into 1 at Y column using these

  = A2 & B2 & F2 & G2 & H2 &I2 with the respective columns

Now checking the Y column for dupes and delete the entire row. as 2003 supports only for one column as far to my knowledge.

Notice that even the 35000 entries may have duplicates in it but I should not delete them. Example you can see the 2 and last row in my example code are dupes but I should not delete as it is the old data.

   A       B            F       G        H       I              Y          
  PTY   39868.5         4       2       540      3      PTY39868.5425403     'old 
  GTY   34446.1234      2       1       230      1      GTY34446.1234212301   'old
  PTY   3945.678                2                2       PTY3945.67822        'old
  GTY   34446.1234      2       1       230      1      GTY34446.1234212301    'old
  PTY    3945.678       1       1       230      2      PTY3945.678112302      'new
  PTY    39868.5        4       2       540      3      PTY39868.5425403       'new 
  PTY    3945.678       1       1       230      2      PTY3945.678112302      'new

Now note that New entry PTY (from last 2nd) is a duplicate of the original record(PTY at first) So I hava to delete it.And the last new entry is a duplicate of the new entry itself so I should delete it even that . SO in the above code I have to delete only the last 2 rows which are dupes of original record and also from it . But should not delete the GTY which is the dupe but which is in orginal record.

I think I gave a clear view now. Is concatenating them into one cell . Is it better way to approach? as conactenatin for 40000 entries taking just 2 seconds i think that doesnt matter but any more algorithms to these is much aprreciated

I heard counif treats 45.00 and 45.00000 as different is that right may be that was the problem with it? since I have decimal points in my data. I think I should do

    = I2 & H2 & G2 & F2 & A2 & B2

which is better to concatenate? is this or the other i posted before?

like image 663
niko Avatar asked Nov 29 '22 10:11

niko


2 Answers

BIG UPDATE:

It think the original questions threw me off - there may be a problem with the logic in the question. The following assumes you want to delete the cell, not entire row, for the duplicate entries.

  • If the 35000 old records do not include duplicates, then all you need to do is remove all duplicates from the entire column - so long as you start from row 1, you run no risk of deleting any of the 'old' rows since no duplicates exist in them.

Here is one way:

Sub UniqueList()

Application.ScreenUpdating = False
Dim vArray As Variant
Dim i As Long, j As Long, lastrow As Long
Dim dictionary As Object
Set dictionary = CreateObject("scripting.dictionary")

lastrow = Range("A" & Rows.Count).End(xlUp).Row
vArray = Range("A1:A" & lastrow).Value

On Error Resume Next
For i = 1 To UBound(vArray, 1)
    For j = 1 To UBound(vArray, 2)
        If Len(vArray(i, j)) <> 0 Then
            dictionary(vArray(i, j)) = 1
        End If
    Next
Next

Columns("A:A").ClearContents
Range("A1").Resize(dictionary.Count).Value = _
Application.Transpose(dictionary.keys)

Application.ScreenUpdating = True

End Sub
  • If for some odd reason the 35000 old records DO include dupes and you only want to allow these 35000 records to do so, then you can use 2 dictionaries, but this would be an unusual case since you'd be treating the old records differently than new...
Sub RemoveNewDupes()

Application.ScreenUpdating = False
Dim lastRow As Long
Dim varray As Variant
Dim oldDict As Object, newDict As Object
Set oldDict = CreateObject("scripting.dictionary")
Set newDict = CreateObject("scripting.dictionary")

On Error Resume Next
lastRow = Range("A" & Rows.Count).End(xlUp).Row

'Add old entries to dictionary
varray = Range("A1:A35000").Value
For i = 1 To UBound(varray, 1)
    oldDict.Add varray(i, 1), 1
Next

'Check for dupes
varray = Range("A35001:A" & lastRow).Value
For i = 1 To UBound(varray, 1)
    If oldDict.exists(varray(i, 1)) = False Then
        newDict.Add varray(i, 1), 1
    End If
Next

'Delete and slap back on the unique list
Range("A35001", "A" & Rows.Count).ClearContents
Range("A35001").Resize(newDict.Count).Value = _
Application.Transpose(newDict.keys)

Application.ScreenUpdating = True
End Sub

Thanks to Reafidy for the advice and getting me to relook at this.

like image 66
aevanko Avatar answered Dec 05 '22 15:12

aevanko


This is also a response to some of the comments and solutions made by other members so sorry if it does not straight away answer your question.

Firstly I believe that using excel in a database scenario that raw data and presentation data should be separated. This usually means a single worksheet with raw data and multiple other worksheets with presentation data. Then delete the raw data when necessary or archive.

When speed testing it is very difficult to get a level playing field in excel as there are many things that affect the results. Computer specs, available RAM etc.. Code must first be compiled before running any of the procedures. The test data is also important, when considering duplicates - how many duplicates vs how many rows. This sub loads some test data, altering the amount of rows vs the range of random numbers (duplicates) will give very different results for your code. I don't know what your data looks like so we are kind of working blind and your results may be very different.

'// This is still not very good test data, but should suffice for this situation.
Sub TestFill()
    '// 300000 rows
    For i = 1 To 300000
        '// This populates a random number between 1 & 10000 - adjust to suit
        Cells(i, "A").value = Int((100000 + 1) * Rnd + 1)
    Next
End Sub

If we are talking about advanced filter vs an array & dictonary method then advanced filter will be quicker with a lower amount of rows but once you get above a certain amount of rows then the array method will be quicker. Then see what happens when you change the amount of duplicates.... :) As a guideline or as a general rule using excels built in functions will be faster and I recommend always develop attempting to use these inbuilt functions, however there are often exceptions, like above when removing duplicates. :)

Deleting rows can be slow when looping if used incorrectly. If looping is used then it is important to keep synchronisation between code and the workbook out of the loop. This usually means read data to an array, loop through the data, then load the data from the array back to the presentation worksheet essentially deleting the unwanted data.

Sub RemoveDuplicatesA()

    '// Copy raw data to presentation sheet
    Range("A1", Cells(Rows.Count, "A").End(xlUp)).AdvancedFilter _
        Action:=xlFilterCopy, CopyToRange:=Sheet2.Range("B1"), Unique:=True

End Sub

This will be the fastest method:

Sub RemoveDuplicatesB()        
    Dim vData As Variant, vArray As Variant
    Dim lCnt As Long, lRow As Long

    vData = ActiveSheet.UsedRange.Columns(1).value
    ReDim vArray(0 To UBound(vData, 1), 0)
    lCnt = 0

    With CreateObject("Scripting.Dictionary")
        For lRow = 1 To UBound(vData, 1)
            If Not .Exists(vData(lRow, 1)) Then
                vArray(lCnt, 0) = vData(lRow, 1): lCnt = lCnt + 1
                .Add vData(lRow, 1), Nothing
            End If
        Next lRow
    End With

    '// Copy raw data to presentation sheet
    Sheet2.Range("B1").Resize(lCnt).value = vArray

End Sub

Application transpose has a limitation of 65536 rows but as you are using 2003 you should be fine using it, therefore you can simplify the above code with:

Sub RemoveDuplicatesC()
    Dim vData As Variant
    Dim lRow As Long

    vData = ActiveSheet.UsedRange.Columns(1).value

    With CreateObject("Scripting.Dictionary")
        For lRow = 1 To UBound(vData, 1)
            If Not .exists(vData(lRow, 1)) Then
                .Add vData(lRow, 1), Nothing
            End If
        Next lRow

        '// Copy raw data to presentation sheet or replace raw data
        Sheet2.Columns(2).ClearContents
        Sheet2.Columns(2).Resize(.Count).value = Application.Transpose(.keys)
    End With

End Sub 

EDIT

Okay so @Issun has mentioned you want the entire row deleted. My suggestion was to improve your spreadsheet layout by having a raw data and presentation sheet which means you dont need to delete anything hence it would have been the fastest method. If you dont want to do that and would like to edit the raw data directly then try this:

 Sub RemoveDuplicatesD()
    Dim vData As Variant, vArray As Variant
    Dim lRow As Long       

    vData = ActiveSheet.UsedRange.Columns(1).value
    ReDim vArray(1 To UBound(vData, 1), 0)     

    With CreateObject("Scripting.Dictionary")
        For lRow = 1 To UBound(vData, 1)
            If Not .exists(vData(lRow, 1)) Then
                varray(lRow, 0) = "x"
                .Add vData(lRow, 1), Nothing
            End If
        Next lRow
    End With

    Application.ScreenUpdating = False

    '// Modify the raw data
    With ActiveSheet
        .Columns(2).Insert
        .Range("B1").Resize(lRow).value = vArray
        .Columns(2).SpecialCells(xlCellTypeBlanks).EntireRow.Delete
        .Columns(2).Delete
    End With

    Application.ScreenUpdating = True
End Sub
like image 41
Reafidy Avatar answered Dec 05 '22 15:12

Reafidy