Delete Duplicate records from large csv file C# .Net

Tags: c#, .net, csv

I have created a solution that reads a large CSV file, currently 20-30 MB in size. I have tried to delete the duplicate rows based on column values that the user chooses at run time, using the usual technique of finding duplicate rows, but it is so slow that the program appears to have stopped working.

What other technique can be applied to remove duplicate records from a CSV file?

Here's the code; I am definitely doing something wrong:

DataTable dtCSV = ReadCsv(file, columns);
// columns is a List<string> of the column names selected by the user
DataTable dt = RemoveDuplicateRecords(dtCSV, columns);

private DataTable RemoveDuplicateRecords(DataTable dtCSV, List<string> columns)
{
    DataView dv = dtCSV.DefaultView;
    string RowFilter = string.Empty;

    // dt accumulates the de-duplicated rows; start with an empty copy of the schema
    DataTable dt = dv.ToTable().Clone();

    foreach (DataRow row in dtCSV.Rows)
    {
        try
        {
            // Build a filter expression matching this row's values in the chosen columns
            RowFilter = string.Empty;
            foreach (string col in columns)
            {
                RowFilter += "[" + col + "]='" + row[col].ToString().Replace("'", "''") + "' and ";
            }
            RowFilter = RowFilter.Substring(0, RowFilter.Length - 5); // strip the trailing " and "

            dv.RowFilter = RowFilter;

            // RowExists is a helper (not shown) that checks dt for a row matching the filter
            if (!RowExists(dt, RowFilter))
            {
                DataRow dr = dt.NewRow();
                dr.ItemArray = dv.ToTable().Rows[0].ItemArray;
                dt.Rows.Add(dr);
            }
        }
        catch (Exception ex)
        {
        }
    }
    return dt;
}
asked Mar 11 '11 by Sandhurst

2 Answers

One way to do this would be to go through the table, building a HashSet<string> that contains the combined column values you're interested in. If you try to add a string that's already there, then you have a duplicate row. Something like:

HashSet<string> ScannedRecords = new HashSet<string>();

foreach (DataRow row in dtCSV.Rows)
{
    // Build a string that contains the combined column values
    StringBuilder sb = new StringBuilder();
    foreach (string col in columns)
    {
        sb.AppendFormat("[{0}={1}]", col, row[col].ToString());
    }

    // Try to add the string to the HashSet.
    // If Add returns false, then there is a prior record with the same values.
    if (!ScannedRecords.Add(sb.ToString()))
    {
        // This record is a duplicate.
    }
}

That should be very fast.
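
For a fuller picture, here is a minimal sketch of the same idea applied end to end. It assumes the same dtCSV and columns inputs as the question; the method name RemoveDuplicatesWithHashSet and the '\x1F' separator are illustrative choices, not anything from the original code.

using System;
using System.Collections.Generic;
using System.Data;
using System.Text;

static DataTable RemoveDuplicatesWithHashSet(DataTable dtCSV, List<string> columns)
{
    // Output table with the same schema as the input, but no rows yet
    DataTable result = dtCSV.Clone();

    // Keys already seen, built from the user-chosen columns
    HashSet<string> seen = new HashSet<string>();

    foreach (DataRow row in dtCSV.Rows)
    {
        StringBuilder sb = new StringBuilder();
        foreach (string col in columns)
        {
            // The separator keeps ("ab", "c") and ("a", "bc") from producing
            // the same key; any character unlikely to occur in the data will do
            sb.Append(row[col]).Append('\x1F');
        }

        // Add returns false if the key is already present, i.e. this row is a duplicate
        if (seen.Add(sb.ToString()))
        {
            result.ImportRow(row);
        }
    }
    return result;
}

Called as DataTable dt = RemoveDuplicatesWithHashSet(dtCSV, columns);, this makes a single pass over the table, so it is roughly O(n) rather than the O(n²) cost of re-filtering a DataView for every row.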

answered Sep 29 '22 by Jim Mischel

If you've implemented your de-duplication routine as a couple of nested for or foreach loops, you could optimise it by sorting the data on the columns you wish to de-duplicate against, then simply comparing each row to the previous one.
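
As a rough sketch of that idea, assuming the same dtCSV and columns inputs as the question (the method name and the string-based row key are illustrative only):

using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

static DataTable RemoveDuplicatesBySorting(DataTable dtCSV, List<string> columns)
{
    // Sort on the de-duplication columns so that duplicate rows become adjacent.
    // Column names containing special characters would need [] escaping.
    DataView dv = new DataView(dtCSV);
    dv.Sort = string.Join(", ", columns);

    DataTable result = dtCSV.Clone();
    string previousKey = null;

    foreach (DataRowView rowView in dv)
    {
        // Key for this row over the chosen columns
        string key = string.Join("\x1F", columns.Select(c => Convert.ToString(rowView[c])));

        // After sorting, a duplicate can only be the row immediately before this one
        if (key != previousKey)
        {
            result.ImportRow(rowView.Row);
            previousKey = key;
        }
    }
    return result;
}

The sort costs O(n log n) and the comparison pass is linear, so like the HashSet answer this avoids filtering the whole table once per row.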

Posting some code is a sure-fire way to get better answers, though; without an idea of how you've implemented it, anything you get will just be conjecture.

answered Sep 29 '22 by mdm