Compare large files

Tags:

compare

delphi

I have several files (3-5) that I need to compare:
File 1.txt has 1 million strings.
File 2.txt has 10 million strings.
File 3.txt has 5 million strings.
All of these files are compared against keys.txt (10 thousand strings). If a line from the currently open file matches one of the lines in keys.txt, that line is written to output.txt.

This is what I have now:

function Thread.checkKeys(sLine: string): boolean;
var
  SR: TStreamReader;
  line: string;
begin
  Result := false;
  SR := TStreamReader.Create(sKeyFile); // sKeyFile - Path to file keys.txt
  try
    while (not(SR.EndOfStream)) and (not(Result))do
      begin
        line := SR.ReadLine;
        if LowerCase(line) = LowerCase(sLine) then
          begin
            saveStr(sLine);
            inc(iMatch);
            Result := true;
          end;
      end;
  finally
    SR.Free;
  end;
end;

procedure Thread.saveStr(sToSave: string);
var
  fOut: TStreamWriter;
begin
  fOut := TStreamWriter.Create('output.txt', true, TEncoding.UTF8);
  try
    fOut.WriteLine(sToSave);
  finally
    fOut.Free;
  end;
end;

procedure Thread.updateFiles;
begin
  fmMain.flDone.Caption := IntToStr(iFile);
  fmMain.flMatch.Caption := IntToStr(iMatch);
end;

And the main loop:

    fInput := TStreamReader.Create(tsFiles[iCurFile]);
    while not(fInput.EndOfStream) do
      begin
        sInput := fInput.ReadLine;
        checkKeys(sInput);
      end;
    fInput.Free;
    iFile := iCurFile + 1;
    Synchronize(updateFiles);

Comparing these 3 files against keys.txt takes about 4 hours. How can I reduce the comparison time?

Alex Hide asked Oct 16 '13 08:10




1 Answer

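An easy solution is to load your keys once into an associative container, which provides efficient lookup. The current code re-reads keys.txt from disk for every input line, so each of the 16 million input lines can cost up to 10 thousand string comparisons.

In Delphi you can use TDictionary<TKey,TValue> from Generics.Collections. This container hashes its keys and provides O(1) lookup on average.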

Declare the container like this:

Keys: TDictionary<string, Boolean>; 
// doesn't matter what type you use for the value, we pick Boolean since we
// have to pick something

Create and populate it like this:

Keys := TDictionary<string, Boolean>.Create;
SR := TStreamReader.Create(sKeyFile);
try
  while not SR.EndOfStream do
    Keys.Add(LowerCase(SR.ReadLine), True); 
    // exception raised if duplicate key found
finally
  SR.Free;
end;
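
If keys.txt might contain duplicate lines (or lines that differ only in case), Add will raise an exception as noted above. A minimal sketch of one way around that, assuming duplicates should simply be ignored, is to use AddOrSetValue instead. The program name and sample strings below are made up for illustration:

```delphi
program DuplicateKeysDemo;
{$APPTYPE CONSOLE}

// Unscoped unit names as used elsewhere on this page; this assumes the
// project's unit scope names include "System" (the default in recent versions).
uses
  SysUtils, Generics.Collections;

var
  Keys: TDictionary<string, Boolean>;
begin
  Keys := TDictionary<string, Boolean>.Create;
  try
    // Both strings lowercase to 'alpha'. AddOrSetValue overwrites the
    // existing entry instead of raising, so the second call is harmless.
    Keys.AddOrSetValue(LowerCase('Alpha'), True);
    Keys.AddOrSetValue(LowerCase('ALPHA'), True);

    Writeln(Keys.Count); // prints 1
  finally
    Keys.Free;
  end;
end.
```

Note that LowerCase only converts the ASCII letters A-Z; if the keys contain other letters, AnsiLowerCase may be the better choice.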

Then your checking function becomes:

function Thread.checkKeys(const sLine: string): boolean;
begin
  Result := Keys.ContainsKey(LowerCase(sLine));
  if Result then 
  begin
    saveStr(sLine);
    inc(iMatch);
  end;
end;
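
Putting the pieces together, here is a rough single-threaded sketch of the whole job as a console program. It is not the asker's thread class: the program name, the CopyMatches helper and the command-line layout (keys file first, then the input files) are assumptions made for illustration. It also keeps one TStreamWriter open for the entire run instead of reopening output.txt on every match, which is a further change on top of the dictionary lookup.

```delphi
program FilterByKeys;
{$APPTYPE CONSOLE}

// Assumes the project's unit scope names include "System".
uses
  SysUtils, Classes, Generics.Collections;

// Hypothetical helper: writes every line of InputFile whose lowercase form
// is in Keys to Writer, and returns the number of matching lines.
function CopyMatches(const InputFile: string;
  Keys: TDictionary<string, Boolean>; Writer: TStreamWriter): Integer;
var
  Reader: TStreamReader;
  Line: string;
begin
  Result := 0;
  Reader := TStreamReader.Create(InputFile);
  try
    while not Reader.EndOfStream do
    begin
      Line := Reader.ReadLine;
      if Keys.ContainsKey(LowerCase(Line)) then
      begin
        Writer.WriteLine(Line);
        Inc(Result);
      end;
    end;
  finally
    Reader.Free;
  end;
end;

var
  Keys: TDictionary<string, Boolean>;
  KeyReader: TStreamReader;
  Writer: TStreamWriter;
  i: Integer;
begin
  // Assumed usage: FilterByKeys keys.txt 1.txt 2.txt 3.txt
  Keys := TDictionary<string, Boolean>.Create;
  try
    // Read the key file exactly once.
    KeyReader := TStreamReader.Create(ParamStr(1));
    try
      while not KeyReader.EndOfStream do
        Keys.AddOrSetValue(LowerCase(KeyReader.ReadLine), True);
    finally
      KeyReader.Free;
    end;

    // One writer for the whole run; False overwrites any existing output.txt.
    Writer := TStreamWriter.Create('output.txt', False, TEncoding.UTF8);
    try
      for i := 2 to ParamCount do
        Writeln(ParamStr(i), ': ', CopyMatches(ParamStr(i), Keys, Writer), ' matches');
    finally
      Writer.Free;
    end;
  finally
    Keys.Free;
  end;
end.
```

In the threaded version from the question, the same idea applies: create the dictionary and the writer once before the loop over tsFiles, and free them after it.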
David Heffernan answered Nov 02 '22 02:11
