Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Removing duplicate lines from TStringList without sorting in Delphi

Tags:

delphi

I know how to remove duplicate strings from a TStringList using dupignore for a sorted Tstringlist.

CallData := TStringList.Create;
CallData.Sorted := True;
Call.Duplicates := dupIgnore;

But in my case strings must not be sorted .

Using a FOR loop finding duplicates is very slow (also using indexOF())when TStringList has hundreds of thousands of lines .

 if OpenDialog1.Execute then
  begin
    Try
      y := TStringList.create;
      f := TStreamReader.create(OpenDialog1.FileName, TEncoding.UTF8, True);
      while not f.EndOfStream do
      begin
        l := f.ReadLine;
        X.Add(l);
      end;

      g := Tstreamwriter.create('d:\logX.txt', True, TEncoding.UTF8);
      for I := 0 to X.count - 1 do
      begin


          if y.IndexOf(X[I]) = -1 then

          y.Add(X[I]);

      end;

      for j := 0 to y.count - 1 do
        g.WriteLine(y[j]);

    Finally
      f.free;
      y.free;
      g.free;
    End;
  end;

is there any better way ?

like image 261
Shahram Banazadeh Avatar asked Dec 14 '17 16:12

Shahram Banazadeh


3 Answers

Here's how I would approach this problem:

  1. Create a dictionary keyed on a string. It doesn't matter that the value type is.
  2. Iterate through the string list in reverse order.
  3. For each string, check whether or not it is in the dictionary.
  4. If it is in the dictionary, remove from the string list. Otherwise add to the dictionary.

If there are a large number of duplicates to be removed then the performance of the above will be affected by repeated removal from the string list. That's because each item to be removed results in the later items being shifted down one index. You can avoid this by copying into a new list rather than deleting inplace.

Alternatively, you can operate in place like this:

  1. Create a dictionary keyed on a string. It doesn't matter that the value type is.
  2. Initialise a variable named Count to zero.
  3. Iterate through the string list in forward order.
  4. For each string, check whether or not it is in the dictionary.
  5. If it is in the dictionary, do nothing. Otherwise add to the dictionary, copy into index Count of the list, and then increment Count.
  6. Once the iteration is complete, resize the list to have Count elements.

The point of the dictionary is that lookup is an O(1) operation and so the second algorithm has O(n) time complexity.

like image 94
David Heffernan Avatar answered Nov 08 '22 19:11

David Heffernan


I would use trickery, by having a sorted and an unsorted list. Like this:

  y := TStringList.create;
  s := TStringList.create;
  s.Sorted := TRUE;
  s.Duplicates := dupIgnore;

  f := TStreamReader.create(OpenDialog1.FileName, TEncoding.UTF8, True);
  while not f.EndOfStream do
  begin
    l := f.ReadLine;
    s.Add(l);
    if s.Count > y.Count then y.Add(l);
  end;

  // etc.
like image 30
Dsm Avatar answered Nov 08 '22 20:11

Dsm


function compareobjects
          (list     : Tstringlist;
           index1   : integer;
           index2   : integer
          )         : integer;
begin
  if index1 = index2 then
    result := 0
  else
    if integer(list.objects[index1]) < integer(list.objects[index2]) then
      result := -1
    else
      result := 1;
end;

begin
  Try
    y := TStringList.create;
    y.Sorted := true;
    y.Duplicates := dupignore;
    f := TStreamReader.create('c:\106x\q47780823.bat');
    i := 0;
    while not f.EndOfStream do
    begin
      inc(i);
      line := f.readline;
      y.Addobject(line,tobject(i));
    end;
    y.Sorted := false;
    y.CustomSort(compareobjects);

    for i := 0 to y.count - 1 do
      WriteLn(y[i]);

    Finally
      f.free;
      y.free;
  End;
  readln;
end.

I'd keep track of the line number (i) and assign it with the string by casting as an object; sort the list and remove duplicates as before, but then un-sort it using a custom sort on the objects.

like image 45
Magoo Avatar answered Nov 08 '22 18:11

Magoo