
How to compare differences in very large csv files

I have to compare two CSV files of 2-3 GB each on a Windows platform.

I've tried loading the first one into a HashMap to compare it against the second one, but the result (as expected) is very high memory consumption.

The goal is to write the differences to another file.

The lines may appear in a different order, and some may be missing.

Any suggestions?

richarbernal asked May 17 '12 19:05


1 Answer

Assuming you want to do this programmatically in Java, the approach depends on whether the files are ordered.

Are both files ordered? If so, you don't need to read either file in whole. Start at the beginning of both files and repeat:

  1. If the entries match, advance the "current" line in both files.
  2. If the entries don't match, determine which file's line comes first in the sort order, output that line as a difference, and advance the current line in that file only.

When one file runs out, every remaining line in the other file is a difference.
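The two steps above could be sketched in Java roughly as follows. This is only a minimal sketch under some assumptions: both files are already sorted in `String.compareTo` order, whole lines are compared as plain strings, and the class name `SortedLineDiff`, the method names, and the `< ` / `> ` output markers are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.function.Consumer;

// Hypothetical helper, not part of any library.
final class SortedLineDiff {

    /**
     * Walks two sorted line sequences in lockstep. Lines found only on the
     * left are reported with "< ", lines found only on the right with "> ".
     * Only one line per side is held in memory at a time.
     */
    static void diff(Iterator<String> left, Iterator<String> right, Consumer<String> out) {
        String a = next(left);
        String b = next(right);
        while (a != null && b != null) {
            int cmp = a.compareTo(b);
            if (cmp == 0) {            // entries match: advance both files
                a = next(left);
                b = next(right);
            } else if (cmp < 0) {      // left line sorts first: it is missing on the right
                out.accept("< " + a);
                a = next(left);
            } else {                   // right line sorts first: it is missing on the left
                out.accept("> " + b);
                b = next(right);
            }
        }
        // One side is exhausted; whatever remains on the other side is a difference.
        for (; a != null; a = next(left)) out.accept("< " + a);
        for (; b != null; b = next(right)) out.accept("> " + b);
    }

    private static String next(Iterator<String> it) {
        return it.hasNext() ? it.next() : null;
    }

    /** File-based wrapper; lines are streamed lazily, never loaded all at once. */
    static void diffFiles(Path left, Path right, Path result) throws IOException {
        try (BufferedReader l = Files.newBufferedReader(left, StandardCharsets.UTF_8);
             BufferedReader r = Files.newBufferedReader(right, StandardCharsets.UTF_8);
             PrintWriter w = new PrintWriter(Files.newBufferedWriter(result, StandardCharsets.UTF_8))) {
            // Read errors during iteration surface as UncheckedIOException.
            diff(l.lines().iterator(), r.lines().iterator(), w::println);
        }
    }
}
```

Memory use stays flat no matter how large the files are, because only the two current lines are kept. Duplicate lines are matched one for one, so a line that appears twice in one file and once in the other is reported once.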

If the files aren't ordered, you could sort them before the diff. Since you need a low-memory solution, don't read an entire file in to sort it. Split the file into manageable chunks, sort each chunk, and then merge the sorted chunks into one sorted file (the merge step of a merge sort).
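A chunk-and-combine sort along those lines might look like the sketch below. Note that it combines the sorted chunks with a k-way merge (a priority queue over the current head line of each chunk), which is the usual way to combine pre-sorted runs. The class name `ExternalLineSort`, its method names, and the `maxLinesPerChunk` parameter are assumptions for illustration. It also assumes one record per line, so a header row or quoted fields containing line breaks would need extra handling.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper, not part of any library.
final class ExternalLineSort {

    /** One open chunk file plus the line currently at its head. */
    private static final class Cursor implements Comparable<Cursor> {
        final BufferedReader reader;
        String line;

        Cursor(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.line = reader.readLine();
        }

        @Override
        public int compareTo(Cursor other) {
            return line.compareTo(other.line);
        }
    }

    /** Sorts the lines of input into output, buffering at most maxLinesPerChunk lines. */
    static void sort(Path input, Path output, int maxLinesPerChunk) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try {
            // Phase 1: read a chunk, sort it in memory, spill it to a temp file.
            try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                List<String> buffer = new ArrayList<>();
                String line;
                while ((line = in.readLine()) != null) {
                    buffer.add(line);
                    if (buffer.size() >= maxLinesPerChunk) spill(buffer, chunks);
                }
                if (!buffer.isEmpty()) spill(buffer, chunks);
            }
            // Phase 2: merge all sorted chunks into the output file.
            merge(chunks, output);
        } finally {
            for (Path chunk : chunks) Files.deleteIfExists(chunk);
        }
    }

    private static void spill(List<String> buffer, List<Path> chunks) throws IOException {
        Collections.sort(buffer);
        Path chunk = Files.createTempFile("chunk", ".txt");
        chunks.add(chunk); // registered first so the caller's finally block can delete it
        Files.write(chunk, buffer, StandardCharsets.UTF_8);
        buffer.clear();
    }

    private static void merge(List<Path> chunks, Path output) throws IOException {
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        List<BufferedReader> readers = new ArrayList<>();
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path chunk : chunks) {
                BufferedReader reader = Files.newBufferedReader(chunk, StandardCharsets.UTF_8);
                readers.add(reader);
                Cursor cursor = new Cursor(reader);
                if (cursor.line != null) heap.add(cursor);
            }
            while (!heap.isEmpty()) {
                Cursor smallest = heap.poll();   // chunk whose head line sorts first
                out.write(smallest.line);
                out.newLine();
                smallest.line = smallest.reader.readLine();
                if (smallest.line != null) heap.add(smallest); // re-queue with its next line
            }
        } finally {
            for (BufferedReader reader : readers) reader.close();
        }
    }
}
```

During the merge, memory holds one line per chunk, so the chunk size can be tuned to whatever heap is available. With very many chunks the number of simultaneously open files becomes the limit, in which case merging in several passes is an option.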

Edwin Buck answered Sep 21 '22 17:09