Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest method to remove Empty rows and Columns From Excel Files using Interop

Tags:

I have a lot of excel files that contains data and it contains empty rows and empty columns. like shown bellow

Excel preview

I am trying to remove Empty rows and columns from excel using interop. I create a simple winform application and used the following code and it works fine.

Dim lstFiles As New List(Of String) lstFiles.AddRange(IO.Directory.GetFiles(m_strFolderPath, "*.xls", IO.SearchOption.AllDirectories))  Dim m_XlApp = New Excel.Application Dim m_xlWrkbs As Excel.Workbooks = m_XlApp.Workbooks Dim m_xlWrkb As Excel.Workbook  For Each strFile As String In lstFiles     m_xlWrkb = m_xlWrkbs.Open(strFile)     Dim m_XlWrkSheet As Excel.Worksheet = m_xlWrkb.Worksheets(1)     Dim intRow As Integer = 1      While intRow <= m_XlWrkSheet.UsedRange.Rows.Count         If m_XlApp.WorksheetFunction.CountA(m_XlWrkSheet.Cells(intRow, 1).EntireRow) = 0 Then             m_XlWrkSheet.Cells(intRow, 1).EntireRow.Delete(Excel.XlDeleteShiftDirection.xlShiftUp)         Else             intRow += 1         End If     End While      Dim intCol As Integer = 1     While intCol <= m_XlWrkSheet.UsedRange.Columns.Count         If m_XlApp.WorksheetFunction.CountA(m_XlWrkSheet.Cells(1, intCol).EntireColumn) = 0 Then             m_XlWrkSheet.Cells(1, intCol).EntireColumn.Delete(Excel.XlDeleteShiftDirection.xlShiftToLeft)         Else             intCol += 1         End If     End While Next  m_xlWrkb.Save() m_xlWrkb.Close(SaveChanges:=True)  Marshal.ReleaseComObject(m_xlWrkb) Marshal.ReleaseComObject(m_xlWrkbs) m_XlApp.Quit() Marshal.ReleaseComObject(m_XlApp) 

But when cleaning big excel files it takes a lot of time. Any suggestions for optimizing this code? or another way to clean this excel files faster? Is there a function that can delete empty rows in one click?

I don't have problem if answers are using C#

EDIT:

I uploaded a sample file Sample File. But not all files have same structure.

like image 514
Hadi Avatar asked Nov 13 '16 12:11

Hadi


People also ask

How do I get rid of infinite blank rows?

To select all rows under the data range, select the first row under data and press CTRL + SHIFT + Down Arrow.


2 Answers

I found that looping through the excel worksheet can take some time if the worksheet is large. So my solution tried to avoid any looping in the worksheet. To avoid looping through the worksheet, I made a 2 dimensional object array from the cells returned from usedRange with:

Excel.Range targetCells = worksheet.UsedRange; object[,] allValues = (object[,])targetCells.Cells.Value; 

This is the array I loop through to get the indexes of the empty rows and columns. I make 2 int lists, one keeps the row indexes to delete the other keeps the column indexes to delete.

List<int> emptyRows = GetEmptyRows(allValues, totalRows, totalCols); List<int> emptyCols = GetEmptyCols(allValues, totalRows, totalCols); 

These lists will be sorted from high to low to simplify deleting rows from the bottom up and deleting columns from right to left. Then simply loop through each list and delete the appropriate row/col.

DeleteRows(emptyRows, worksheet); DeleteCols(emptyCols, worksheet); 

Finally after all the empty rows and columns have been deleted, I SaveAs the file to a new file name.

Hope this helps.

EDIT:

Addressed the UsedRange issue such that if there are empty rows at the top of the worksheet, those rows will now be removed. Also this will remove any empty columns to the left of the starting data. This allows for the indexing to work properly even if there are empty rows or columns before the data starts. This was accomplished by taking the address of the first cell in UsedRange this will be an address of the form “$A$1:$D$4”. This will allow the use of an offset if the empty rows at the top and empty columns to the left are to remain and not be deleted. In this case I am simply deleting them. To get the number of rows to delete from the top can be calculated by the first “$A$4” address where the “4” is the row that the first data appears. So we need to delete the top 3 rows. The Column address is of the form “A”, “AB” or even “AAD” this required some translation and thanks to How to convert a column number (eg. 127) into an excel column (eg. AA) I was able to determine how many columns on the left need to be deleted.

class Program {   static void Main(string[] args) {     Excel.Application excel = new Excel.Application();     string originalPath = @"H:\ExcelTestFolder\Book1_Test.xls";     Excel.Workbook workbook = excel.Workbooks.Open(originalPath);     Excel.Worksheet worksheet = workbook.Worksheets["Sheet1"];     Excel.Range usedRange = worksheet.UsedRange;      RemoveEmptyTopRowsAndLeftCols(worksheet, usedRange);      DeleteEmptyRowsCols(worksheet);      string newPath = @"H:\ExcelTestFolder\Book1_Test_Removed.xls";     workbook.SaveAs(newPath, Excel.XlSaveAsAccessMode.xlNoChange);      workbook.Close();     excel.Quit();     System.Runtime.InteropServices.Marshal.ReleaseComObject(workbook);     System.Runtime.InteropServices.Marshal.ReleaseComObject(excel);     Console.WriteLine("Finished removing empty rows and columns - Press any key to exit");     Console.ReadKey();   }    private static void DeleteEmptyRowsCols(Excel.Worksheet worksheet) {     Excel.Range targetCells = worksheet.UsedRange;     object[,] allValues = (object[,])targetCells.Cells.Value;     int totalRows = targetCells.Rows.Count;     int totalCols = targetCells.Columns.Count;      List<int> emptyRows = GetEmptyRows(allValues, totalRows, totalCols);     List<int> emptyCols = GetEmptyCols(allValues, totalRows, totalCols);      // now we have a list of the empty rows and columns we need to delete     DeleteRows(emptyRows, worksheet);     DeleteCols(emptyCols, worksheet);   }    private static void DeleteRows(List<int> rowsToDelete, Excel.Worksheet worksheet) {     // the rows are sorted high to low - so index's wont shift     foreach (int rowIndex in rowsToDelete) {       worksheet.Rows[rowIndex].Delete();     }   }    private static void DeleteCols(List<int> colsToDelete, Excel.Worksheet worksheet) {     // the cols are sorted high to low - so index's wont shift     foreach (int colIndex in colsToDelete) {       worksheet.Columns[colIndex].Delete();     }   }    private static List<int> GetEmptyRows(object[,] allValues, int totalRows, int totalCols) {     List<int> emptyRows = new List<int>();      for (int i = 1; i < totalRows; i++) {       if (IsRowEmpty(allValues, i, totalCols)) {         emptyRows.Add(i);       }     }     // sort the list from high to low     return emptyRows.OrderByDescending(x => x).ToList();   }    private static List<int> GetEmptyCols(object[,] allValues, int totalRows, int totalCols) {     List<int> emptyCols = new List<int>();      for (int i = 1; i < totalCols; i++) {       if (IsColumnEmpty(allValues, i, totalRows)) {         emptyCols.Add(i);       }     }     // sort the list from high to low     return emptyCols.OrderByDescending(x => x).ToList();   }    private static bool IsColumnEmpty(object[,] allValues, int colIndex, int totalRows) {     for (int i = 1; i < totalRows; i++) {       if (allValues[i, colIndex] != null) {         return false;       }     }     return true;   }    private static bool IsRowEmpty(object[,] allValues, int rowIndex, int totalCols) {     for (int i = 1; i < totalCols; i++) {       if (allValues[rowIndex, i] != null) {         return false;       }     }     return true;   }    private static void RemoveEmptyTopRowsAndLeftCols(Excel.Worksheet worksheet, Excel.Range usedRange) {     string addressString = usedRange.Address.ToString();     int rowsToDelete = GetNumberOfTopRowsToDelete(addressString);     DeleteTopEmptyRows(worksheet, rowsToDelete);     int colsToDelete = GetNumberOfLeftColsToDelte(addressString);     DeleteLeftEmptyColumns(worksheet, colsToDelete);   }    private static void DeleteTopEmptyRows(Excel.Worksheet worksheet, int startRow) {     for (int i = 0; i < startRow - 1; i++) {       worksheet.Rows[1].Delete();     }   }    private static void DeleteLeftEmptyColumns(Excel.Worksheet worksheet, int colCount) {     for (int i = 0; i < colCount - 1; i++) {       worksheet.Columns[1].Delete();     }   }    private static int GetNumberOfTopRowsToDelete(string address) {     string[] splitArray = address.Split(':');     string firstIndex = splitArray[0];     splitArray = firstIndex.Split('$');     string value = splitArray[2];     int returnValue = -1;     if ((int.TryParse(value, out returnValue)) && (returnValue >= 0))       return returnValue;     return returnValue;   }    private static int GetNumberOfLeftColsToDelte(string address) {     string[] splitArray = address.Split(':');     string firstindex = splitArray[0];     splitArray = firstindex.Split('$');     string value = splitArray[1];     return ParseColHeaderToIndex(value);   }    private static int ParseColHeaderToIndex(string colAdress) {     int[] digits = new int[colAdress.Length];     for (int i = 0; i < colAdress.Length; ++i) {       digits[i] = Convert.ToInt32(colAdress[i]) - 64;     }     int mul = 1; int res = 0;     for (int pos = digits.Length - 1; pos >= 0; --pos) {       res += digits[pos] * mul;       mul *= 26;     }     return res;   } } 

EDIT 2: For testing I made a method that loops thru the the worksheet and compared it to my code that loops thru an object array. It shows a significant difference.

enter image description here

Method to Loop thru the worksheet and delete empty rows and columns.

enum RowOrCol { Row, Column }; private static void ConventionalRemoveEmptyRowsCols(Excel.Worksheet worksheet) {   Excel.Range usedRange = worksheet.UsedRange;   int totalRows = usedRange.Rows.Count;   int totalCols = usedRange.Columns.Count;    RemoveEmpty(usedRange, RowOrCol.Row);   RemoveEmpty(usedRange, RowOrCol.Column); }  private static void RemoveEmpty(Excel.Range usedRange, RowOrCol rowOrCol) {   int count;   Excel.Range curRange;   if (rowOrCol == RowOrCol.Column)     count = usedRange.Columns.Count;   else     count = usedRange.Rows.Count;    for (int i = count; i > 0; i--) {     bool isEmpty = true;     if (rowOrCol == RowOrCol.Column)       curRange = usedRange.Columns[i];     else       curRange = usedRange.Rows[i];      foreach (Excel.Range cell in curRange.Cells) {       if (cell.Value != null) {         isEmpty = false;         break; // we can exit this loop since the range is not empty       }       else {         // Cell value is null contiue checking       }     } // end loop thru each cell in this range (row or column)      if (isEmpty) {       curRange.Delete();     }   } } 

Then a Main for testing/timing the two methods.

enum RowOrCol { Row, Column };  static void Main(string[] args) {   Excel.Application excel = new Excel.Application();   string originalPath = @"H:\ExcelTestFolder\Book1_Test.xls";   Excel.Workbook workbook = excel.Workbooks.Open(originalPath);   Excel.Worksheet worksheet = workbook.Worksheets["Sheet1"];   Excel.Range usedRange = worksheet.UsedRange;    // Start test for looping thru each excel worksheet   Stopwatch sw = new Stopwatch();   Console.WriteLine("Start stopwatch to loop thru WORKSHEET...");   sw.Start();   ConventionalRemoveEmptyRowsCols(worksheet);   sw.Stop();   Console.WriteLine("It took a total of: " + sw.Elapsed.Milliseconds + " Miliseconds to remove empty rows and columns...");    string newPath = @"H:\ExcelTestFolder\Book1_Test_RemovedLoopThruWorksheet.xls";   workbook.SaveAs(newPath, Excel.XlSaveAsAccessMode.xlNoChange);   workbook.Close();   Console.WriteLine("");    // Start test for looping thru object array   workbook = excel.Workbooks.Open(originalPath);   worksheet = workbook.Worksheets["Sheet1"];   usedRange = worksheet.UsedRange;   Console.WriteLine("Start stopwatch to loop thru object array...");   sw = new Stopwatch();   sw.Start();   DeleteEmptyRowsCols(worksheet);   sw.Stop();    // display results from second test   Console.WriteLine("It took a total of: " + sw.Elapsed.Milliseconds + " Miliseconds to remove empty rows and columns...");   string newPath2 = @"H:\ExcelTestFolder\Book1_Test_RemovedLoopThruArray.xls";   workbook.SaveAs(newPath2, Excel.XlSaveAsAccessMode.xlNoChange);   workbook.Close();   excel.Quit();   System.Runtime.InteropServices.Marshal.ReleaseComObject(workbook);   System.Runtime.InteropServices.Marshal.ReleaseComObject(excel);   Console.WriteLine("");   Console.WriteLine("Finished testing methods - Press any key to exit");   Console.ReadKey(); } 

EDIT 3 As per OP request... I updated and changed the code to match the OP code. With this I found some interesting results. See below.

I changed the code to match the functions you are using ie… EntireRow and CountA. The code below I found that it preforms terribly. Running some tests I found the code below was in the 800+ milliseconds execution time. However one subtle change made a huge difference.

On the line:

while (rowIndex <= worksheet.UsedRange.Rows.Count) 

This is slowing things down a lot. If you create a range variable for UsedRang and not keep regrabbibg it with each iteration of the while loop will make a huge difference. So… when I change the while loop to…

Excel.Range usedRange = worksheet.UsedRange; int rowIndex = 1;  while (rowIndex <= usedRange.Rows.Count) and while (colIndex <= usedRange.Columns.Count) 

This performed very close to my object array solution. I did not post the results, as you can use the code below and change the while loop to grab the UsedRange with each iteration or use the variable usedRange to test this.

private static void RemoveEmptyRowsCols3(Excel.Worksheet worksheet) {   //Excel.Range usedRange = worksheet.UsedRange;     // <- using this variable makes the while loop much faster    int rowIndex = 1;    // delete empty rows   //while (rowIndex <= usedRange.Rows.Count)     // <- changing this one line makes a huge difference - not grabbibg the UsedRange with each iteration...   while (rowIndex <= worksheet.UsedRange.Rows.Count) {     if (excel.WorksheetFunction.CountA(worksheet.Cells[rowIndex, 1].EntireRow) == 0) {       worksheet.Cells[rowIndex, 1].EntireRow.Delete(Excel.XlDeleteShiftDirection.xlShiftUp);     }     else {       rowIndex++;     }   }    // delete empty columns   int colIndex = 1;   // while (colIndex <= usedRange.Columns.Count) // <- change here also    while (colIndex <= worksheet.UsedRange.Columns.Count) {     if (excel.WorksheetFunction.CountA(worksheet.Cells[1, colIndex].EntireColumn) == 0) {       worksheet.Cells[1, colIndex].EntireColumn.Delete(Excel.XlDeleteShiftDirection.xlShiftToLeft);     }     else {       colIndex++;     }   } } 

UPDATE by @Hadi

You can alter DeleteCols and DeleteRows function to get better performance if excel contains extra blank rows and columns after the last used ones:

private static void DeleteRows(List<int> rowsToDelete, Microsoft.Office.Interop.Excel.Worksheet worksheet) {     // the rows are sorted high to low - so index's wont shift      List<int> NonEmptyRows = Enumerable.Range(1, rowsToDelete.Max()).ToList().Except(rowsToDelete).ToList();      if (NonEmptyRows.Max() < rowsToDelete.Max())     {          // there are empty rows after the last non empty row          Microsoft.Office.Interop.Excel.Range cell1 = worksheet.Cells[NonEmptyRows.Max() + 1,1];         Microsoft.Office.Interop.Excel.Range cell2 = worksheet.Cells[rowsToDelete.Max(), 1];          //Delete all empty rows after the last used row         worksheet.Range[cell1, cell2].EntireRow.Delete(Microsoft.Office.Interop.Excel.XlDeleteShiftDirection.xlShiftUp);       }    //else last non empty row = worksheet.Rows.Count        foreach (int rowIndex in rowsToDelete.Where(x => x < NonEmptyRows.Max()))     {         worksheet.Rows[rowIndex].Delete();     } }  private static void DeleteCols(List<int> colsToDelete, Microsoft.Office.Interop.Excel.Worksheet worksheet) {     // the cols are sorted high to low - so index's wont shift      //Get non Empty Cols     List<int> NonEmptyCols = Enumerable.Range(1, colsToDelete.Max()).ToList().Except(colsToDelete).ToList();      if (NonEmptyCols.Max() < colsToDelete.Max())     {          // there are empty rows after the last non empty row          Microsoft.Office.Interop.Excel.Range cell1 = worksheet.Cells[1,NonEmptyCols.Max() + 1];         Microsoft.Office.Interop.Excel.Range cell2 = worksheet.Cells[1,NonEmptyCols.Max()];          //Delete all empty rows after the last used row         worksheet.Range[cell1, cell2].EntireColumn.Delete(Microsoft.Office.Interop.Excel.XlDeleteShiftDirection.xlShiftToLeft);       }            //else last non empty column = worksheet.Columns.Count      foreach (int colIndex in colsToDelete.Where(x => x < NonEmptyCols.Max()))     {         worksheet.Columns[colIndex].Delete();     } } 

check my answer at Get Last non empty column and row index from excel using Interop

like image 174
JohnG Avatar answered Sep 28 '22 17:09

JohnG


Maybe something to consider:

Sub usedRangeDeleteRowsCols()     Dim LastRow, LastCol, i As Long      LastRow = Cells.Find(What:="*", SearchDirection:=xlPrevious, SearchOrder:=xlByRows).Row     LastCol = Cells.Find(What:="*", SearchDirection:=xlPrevious, SearchOrder:=xlByColumns).Column      For i = LastRow To 1 Step -1         If WorksheetFunction.CountA(Range(Cells(i, 1), Cells(i, LastCol))) = 0 Then             Cells(i, 1).EntireRow.Delete         End If     Next      For i = LastCol To 1 Step -1         If WorksheetFunction.CountA(Range(Cells(1, i), Cells(LastRow, i))) = 0 Then             Cells(1, i).EntireColumn.Delete         End If     Next End Sub 

I think there are two efficiencies compared to equivalent functions in the original code. Firstly, instead of using Excel's unreliable UsedRange property, we find the last value and only scan rows and columns within the genuine used range.

Secondly the worksheet count function again only works within the genuine used range - for example when searching for blank rows we only look in the range of used columns (rather than .EntireRow).

The For loops work backwards because, for example, every time a row is deleted, the row address of following data changes. Working backwards means the row addresses of "data to be worked on" doesn't change.

like image 25
David Avatar answered Sep 28 '22 17:09

David