I work for a company that does ETL work on various databases. I am tasked with creating a patch between two full historical data sets on the client machine, which would then be sent over to our servers. This patch needs to be programmatic so that it can be called from our software.
The datasets are simple text files. We have extraction software running on our client's systems to perform the extraction. Extraction files range in size up to 3 GB+. I have implemented a solution using Microsoft's FC.exe, but it has limitations.
I'm using FC to produce the comparison file, then parsing it in Perl on our side to extract the records that have been removed/updated and those that have been added.
FC works perfectly fine for me as long as no line of text exceeds 128 characters. When a line does, the output spills onto the next line of the comparison file and so shows up as an added/deleted record. I know that I could probably pre-process the files, but this would add a tremendous amount of time, probably defeating the purpose.
I have tried using diffutils, but it complains about large files.
I also toyed with some C# code to implement the patch process myself. This worked fine for small files, but was horribly inefficient when dealing with the big ones (I tested it on a 2.8 GB extract).
Are there any good command-line utilities or C# libraries that I can use to create this patch file? Barring that, is there an algorithm that I can use to implement this myself? Keep in mind that records may be updated, added, and deleted. (I know, it irks me too that the clients DELETE records rather than marking them inactive. This is out of my control.)
Edit for clarity:
I need to compare two separate database extracts from two different times. Usually these will be about one day apart.
Given the files below (these will obviously be much longer and much wider):
Old.txt
a
b
c
d
e
1
f
2
5
New.txt
a
3
b
c
4
d
e
1
f
g
The expected output would be:
3 added
4 added
2 removed
g added
5 removed
Here's a pretty efficient solution - I think it's roughly O(n), but it depends on the distribution of adds and deletes. Memory consumption is pretty low, but also depends on the number of consecutive adds and deletes.
Limitations: it assumes that unchanged records appear in the same relative order in both extracts (reordered records come out as a remove plus an add), and because the buffers are HashSets, duplicate lines within a file may not be reported correctly.
The algorithm keeps a pair of buffers, "added" and "removed", to track potentially added and removed lines as it runs through the files. Lines are tentatively marked "added" or "removed" when they don't match between the files. When a tentatively marked line turns up later in the other file (a "removed" line found in the target file, or an "added" line found in the source file), that signals that all the lines in the opposite buffer really do belong there, so the opposite buffer is flushed to the patch file and the reader is advanced one line in the file where the match was found.
For example:
Source Target     Added      Removed
A-------A         _          _
B-------X         +X         +B
C-------B         Flush X    -B
D--\ \-C          _          _
E-\ \---E         +E         +D
F \----F          -E         Flush D
Here's the code:
// Needs using directives for System.Collections.Generic, System.IO and System.Linq.
public void Diff(
    string sourcePath,
    string targetPath,
    string patchPath,
    string addedSuffix,
    string removedSuffix)
{
    using(var sourceReader = new StreamReader(sourcePath))
    using(var targetReader = new StreamReader(targetPath))
    using(var patchWriter = new StreamWriter(patchPath, append: false))
    {
        var sourceLine = sourceReader.ReadLine();
        var targetLine = targetReader.ReadLine();
        var added = new HashSet<string>();
        var removed = new HashSet<string>();
        do
        {
            if(sourceLine == targetLine)
            {
                sourceLine = sourceReader.ReadLine();
                targetLine = targetReader.ReadLine();
            }
            else
            {
                if(removed.Contains(targetLine))
                {
                    // Found targetLine among the tentatively removed lines, so it wasn't actually removed.
                    removed.Remove(targetLine);
                    // Since we found something we thought had been removed, we know that all tentatively added lines really are new.
                    Flush(patchWriter, added, addedSuffix);
                    added.Clear();
                    targetLine = targetReader.ReadLine();
                }
                else if(added.Contains(sourceLine))
                {
                    // Found sourceLine among the tentatively added lines, so it wasn't actually added.
                    added.Remove(sourceLine);
                    // We found something we thought had been added, so all candidates for removal really were removed.
                    Flush(patchWriter, removed, removedSuffix);
                    removed.Clear();
                    sourceLine = sourceReader.ReadLine();
                }
                else
                {
                    // Source and target don't match, so assume the source line was removed and the target line was added.
                    // If we're wrong, we'll clean it up when we come across the line later on.
                    removed.Add(sourceLine);
                    added.Add(targetLine);
                    sourceLine = sourceReader.ReadLine();
                    targetLine = targetReader.ReadLine();
                }
            }
        } while(sourceLine != null || targetLine != null);

        // Anything still buffered at end-of-file really was added or removed.
        Flush(patchWriter, added, addedSuffix);
        Flush(patchWriter, removed, removedSuffix);
    }
}

public void Flush(StreamWriter writer, IEnumerable<string> lines, string suffix)
{
    // The null filter skips the placeholder that gets buffered when one file ends before the other.
    foreach (var line in lines.Where(l => l != null))
    {
        writer.WriteLine("{0} {1}", line.Trim(), suffix);
    }
}
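For illustration, here's roughly how it could be called against the Old.txt/New.txt example from the question; the paths and the "added"/"removed" suffixes are just placeholders, not anything the method requires:

// Hypothetical call - the paths are placeholders.
Diff(
    sourcePath: @"C:\extracts\Old.txt",
    targetPath: @"C:\extracts\New.txt",
    patchPath: @"C:\extracts\patch.txt",
    addedSuffix: "added",
    removedSuffix: "removed");

For the sample files, patch.txt should end up containing the same five lines as the expected output above (3 added, 4 added, g added, 2 removed, 5 removed), though additions and removals are only written when a buffer is flushed, so the exact ordering can differ from the listing in the question.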
Here's some code that I used to generate test files:
var path = /* path */;
var sourcePath = Path.Combine(path, "source.txt");
var targetPath = Path.Combine(path, "target.txt");
var expectedPath = Path.Combine(path, "expected.txt");
var rnd = new Random(10);

using(var sourceWriter = new StreamWriter(sourcePath))
using(var targetWriter = new StreamWriter(targetPath))
using(var expectedWriter = new StreamWriter(expectedPath))
{
    var limit = 10.0 * 100000;
    for (int i = 0; i < limit; i++)
    {
        if(i % 10000 == 0) Console.Write("{0:P0} ...", i / limit);

        var guid = Guid.NewGuid().ToString();
        var r = rnd.Next(0, 10);
        var removed = 3;
        var added = 6;

        if(r >= 0 && r < removed)
        {
            // Line only in the source: it was "removed".
            sourceWriter.WriteLine(guid);
            expectedWriter.WriteLine(guid + " 0");
        }
        else if(r >= removed && r < added)
        {
            // Line only in the target: it was "added".
            targetWriter.WriteLine(guid);
            expectedWriter.WriteLine(guid + " 1");
        }
        else if(r >= added)
        {
            // Line in both files: unchanged.
            sourceWriter.WriteLine(guid);
            targetWriter.WriteLine(guid);
        }
    }
}
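And here's a rough way to check the result (just a sketch, assuming the generator above has already run and that System.IO and System.Linq are available): run Diff with suffixes "1" and "0" to match the markers the generator wrote into expected.txt, then sort both files and compare.

// Hypothetical verification - patch.txt is a placeholder path.
var patchPath = Path.Combine(path, "patch.txt");
Diff(sourcePath, targetPath, patchPath, "1", "0");

// The patch should contain the same lines as expected.txt, but not necessarily
// in the same order, so sort both before comparing.
var actual = File.ReadAllLines(patchPath).OrderBy(l => l).ToArray();
var expected = File.ReadAllLines(expectedPath).OrderBy(l => l).ToArray();
Console.WriteLine(actual.SequenceEqual(expected)
    ? "Patch matches expected.txt"
    : "Patch differs from expected.txt");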
See any mistakes or issues? Is this what you're looking for?