I work for a company that does ETL work on various databases. I am tasked with creating a patch between two full historical data sets on the client machine, which would then be sent over to our servers. This patch needs to be programmatic so that it can be called from our software.
The datasets are simple text files. We have extraction software running on our client's systems to perform the extraction. Extraction files range in size up to 3 GB+. I have implemented a solution using Microsoft's FC.exe, but it has limitations.
I'm using FC to produce the comparison file, then parsing it in Perl on our side to extract the records that have been removed/updated and those that have been added.
FC works perfectly fine for me as long as no line of text exceeds 128 characters. When a line does, the output spills onto the next line of the comparison file and so shows up as an added/deleted record. I know that I could probably pre-process the files, but this would add a tremendous amount of time, probably defeating the purpose.
I have tried using diffutils, but it complains about large files.
I also toyed with some C# code to implement the patch process myself. This worked fine for small files, but was horribly inefficient when dealing with the big ones (I tested it on a 2.8 GB extract).
Are there any good command-line utilities or C# libraries that I can use to create this patch file? Barring that, is there an algorithm that I can use to implement this myself? Keep in mind that records may be updated, added, and deleted. (I know, it irks me too that the clients DELETE records rather than marking them inactive. This is out of my control.)
Edit for clarity:
I need to compare two separate database extracts from two different times. Usually these will be about one day apart.
Given the files below (these will obviously be much longer and much wider):
Old.txt
a
b
c
d
e
1
f
2
5
New.txt
a
3
b
c
4
d
e
1
f
g
The expected output would be:
3 added
4 added
2 removed
g added
5 removed
Here's a pretty efficient solution - I think it's roughly O(n), but it depends on the distribution of adds and deletes. Memory consumption is pretty low, but also depends on the number of consecutive adds and deletes.
Limitations: it assumes that unchanged records appear in the same relative order in both extracts (reordered records come out as a remove plus an add), and because the buffers are HashSets, duplicate lines within a file may not be reported correctly.
The algorithm keeps a pair of buffers, "added" and "removed", to track potentially added and removed lines as it runs through the files. Lines are tentatively marked "added" or "removed" when they don't match between the files. When a tentatively marked line turns up later in the other file (a "removed" line found in the target file, or an "added" line found in the source file), that signals that all the lines in the opposite buffer really do belong there, so the opposite buffer is flushed to the patch file and the reader is advanced one line in the file where the match was found.
For example:
Source Target     Added      Removed
A-------A         _          _
B-------X         +X         +B
C-------B         Flush X    -B
D--\ \-C          _          _
E-\ \---E         +E         +D
F \----F          -E         Flush D
Here's the code:
// Needs using directives for System.Collections.Generic, System.IO and System.Linq.
public void Diff(
    string sourcePath,
    string targetPath,
    string patchPath,
    string addedSuffix,
    string removedSuffix)
{
    using(var sourceReader = new StreamReader(sourcePath))
    using(var targetReader = new StreamReader(targetPath))
    using(var patchWriter = new StreamWriter(patchPath, append: false))
    {
        var sourceLine = sourceReader.ReadLine();
        var targetLine = targetReader.ReadLine();
        var added = new HashSet<string>();
        var removed = new HashSet<string>();
        do
        {
            if(sourceLine == targetLine)
            {
                sourceLine = sourceReader.ReadLine();
                targetLine = targetReader.ReadLine();
            }
            else
            {
                if(removed.Contains(targetLine))
                {
                    // Found targetLine among the tentatively removed lines, so it wasn't actually removed.
                    removed.Remove(targetLine);
                    // Since we found something we thought had been removed, we know that all tentatively added lines really are new.
                    Flush(patchWriter, added, addedSuffix);
                    added.Clear();
                    targetLine = targetReader.ReadLine();
                }
                else if(added.Contains(sourceLine))
                {
                    // Found sourceLine among the tentatively added lines, so it wasn't actually added.
                    added.Remove(sourceLine);
                    // We found something we thought had been added, so all candidates for removal really were removed.
                    Flush(patchWriter, removed, removedSuffix);
                    removed.Clear();
                    sourceLine = sourceReader.ReadLine();
                }
                else
                {
                    // Source and target don't match, so assume the source line was removed and the target line was added.
                    // If we're wrong, we'll clean it up when we come across the line later on.
                    removed.Add(sourceLine);
                    added.Add(targetLine);
                    sourceLine = sourceReader.ReadLine();
                    targetLine = targetReader.ReadLine();
                }
            }
        } while(sourceLine != null || targetLine != null);

        // Anything still buffered at end-of-file really was added or removed.
        Flush(patchWriter, added, addedSuffix);
        Flush(patchWriter, removed, removedSuffix);
    }
}

public void Flush(StreamWriter writer, IEnumerable<string> lines, string suffix)
{
    // The null filter skips the placeholder that gets buffered when one file ends before the other.
    foreach (var line in lines.Where(l => l != null))
    {
        writer.WriteLine("{0} {1}", line.Trim(), suffix);
    }
}
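For illustration, here's roughly how it could be called against the Old.txt/New.txt example from the question; the paths and the "added"/"removed" suffixes are just placeholders, not anything the method requires:

// Hypothetical call - the paths are placeholders.
Diff(
    sourcePath: @"C:\extracts\Old.txt",
    targetPath: @"C:\extracts\New.txt",
    patchPath: @"C:\extracts\patch.txt",
    addedSuffix: "added",
    removedSuffix: "removed");

For the sample files, patch.txt should end up containing the same five lines as the expected output above (3 added, 4 added, g added, 2 removed, 5 removed), though additions and removals are only written when a buffer is flushed, so the exact ordering can differ from the listing in the question.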
Here's some code that I used to generate test files:
var path = /* path */;
var sourcePath = Path.Combine(path, "source.txt");
var targetPath = Path.Combine(path, "target.txt");
var expectedPath = Path.Combine(path, "expected.txt");
var rnd = new Random(10);

using(var sourceWriter = new StreamWriter(sourcePath))
using(var targetWriter = new StreamWriter(targetPath))
using(var expectedWriter = new StreamWriter(expectedPath))
{
    var limit = 10.0 * 100000;
    for (int i = 0; i < limit; i++)
    {
        if(i % 10000 == 0) Console.Write("{0:P0} ...", i / limit);

        var guid = Guid.NewGuid().ToString();
        var r = rnd.Next(0, 10);
        var removed = 3;
        var added = 6;

        if(r >= 0 && r < removed)
        {
            // Line only in the source: it was "removed".
            sourceWriter.WriteLine(guid);
            expectedWriter.WriteLine(guid + " 0");
        }
        else if(r >= removed && r < added)
        {
            // Line only in the target: it was "added".
            targetWriter.WriteLine(guid);
            expectedWriter.WriteLine(guid + " 1");
        }
        else if(r >= added)
        {
            // Line in both files: unchanged.
            sourceWriter.WriteLine(guid);
            targetWriter.WriteLine(guid);
        }
    }
}
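And here's a rough way to check the result (just a sketch, assuming the generator above has already run and that System.IO and System.Linq are available): run Diff with suffixes "1" and "0" to match the markers the generator wrote into expected.txt, then sort both files and compare.

// Hypothetical verification - patch.txt is a placeholder path.
var patchPath = Path.Combine(path, "patch.txt");
Diff(sourcePath, targetPath, patchPath, "1", "0");

// The patch should contain the same lines as expected.txt, but not necessarily
// in the same order, so sort both before comparing.
var actual = File.ReadAllLines(patchPath).OrderBy(l => l).ToArray();
var expected = File.ReadAllLines(expectedPath).OrderBy(l => l).ToArray();
Console.WriteLine(actual.SequenceEqual(expected)
    ? "Patch matches expected.txt"
    : "Patch differs from expected.txt");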
See any mistakes or issues? Is this what you're looking for?