How to reduce CPU usage when scanning for folders/sub-folders/files?

Tags:

delphi

I have developed an application that scans basically everywhere for a file or list of files. When I scan a small folder with, say, 10 000 files and subfolders, there is no problem. But when I scan, for instance, my entire Users folder with more than 100 000 items, it is very heavy on my processor: it takes about 40% of my processor's power.

Is there a way to optimize this code so that it uses less CPU?

procedure GetAllSubFolders(sPath: String);
var
  Path: String;
  Rec: TSearchRec;
begin
  try
    Path := IncludeTrailingBackslash(sPath);
    if FindFirst(Path + '*.*', faAnyFile, Rec) = 0 then
      try
        repeat
          Application.ProcessMessages;
          if (Rec.Name <> '.') and (Rec.Name <> '..') then
          begin
            if (ExtractFileExt(Path + Rec.Name) <> '') and
              (ExtractFileExt(Path + Rec.Name).ToLower <> '.lnk') and
              (DirectoryExists(Path + Rec.Name + '\') = False) then
            begin
              if (Pos(Path + Rec.Name, main.Memo1.Lines.Text) = 0) then
              begin
                main.ListBox1.Items.Add(Path + Rec.Name);
                main.Memo1.Lines.Add(Path + Rec.Name)
              end;
            end;

            GetAllSubFolders(Path + Rec.Name);
          end;
        until FindNext(Rec) <> 0;
      finally
        FindClose(Rec);
      end;
  except
    on e: Exception do
      ShowMessage(e.Message);
  end;
end;

My app searches for all the files in a selected folder and its sub-folders, zips them, and copies them to another location you specify.

The Application.ProcessMessages call is there to make sure the application doesn't look like it is hanging, so the user doesn't close it, because finding 100 000 files can take an hour or so...

I am concerned about the processor usage; the memory is not really affected.

Note: The memo is to make sure the same files are not selected twice.

asked Aug 14 '14 by Marcel

2 Answers

I see the following performance problems:

  1. The call to Application.ProcessMessages is somewhat expensive. You are polling for messages rather than using a blocking wait, i.e. GetMessage. As well as the performance issue, the use of Application.ProcessMessages is generally an indication of poor design for various reasons and one should, in general, avoid the need to call it.
  2. A non-virtual list box performs badly with a lot of files.
  3. Using a memo control (a GUI control) to store a list of strings is exceptionally expensive.
  4. Every time you add to the GUI controls they update and refresh which is very expensive.
  5. The evaluation of Memo1.Lines.Text is extraordinarily expensive.
  6. The use of Pos is likewise massively expensive.
  7. The use of DirectoryExists is expensive and spurious. The attributes returned in the search record contain that information.

I would make the following changes:

  • Move the search code into a thread to avoid the need for ProcessMessages. You'll need to devise some way to transport the information back to the main thread for display in the GUI.
  • Use a virtual list view to display the files.
  • Store the list of files that you wish to check for duplicates in a dictionary, which gives you O(1) lookup. Take care with case-insensitivity of file names, an issue that you have perhaps neglected so far. This replaces the memo.
  • Check whether an item is a directory by using Rec.Attr, that is, check that Rec.Attr and faDirectory <> 0. (A sketch of these last two points follows this list.)
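
As a minimal sketch of the dictionary and attribute-check points (the names ScanFolder and Seen are illustrative, not from the question):

uses
  System.SysUtils, System.Generics.Collections;

procedure ScanFolder(const sPath: string; Seen: TDictionary<string, Boolean>;
  Results: TStrings);
var
  Path, FullName: string;
  Rec: TSearchRec;
begin
  Path := IncludeTrailingBackslash(sPath);
  if FindFirst(Path + '*.*', faAnyFile, Rec) = 0 then
    try
      repeat
        if (Rec.Name <> '.') and (Rec.Name <> '..') then
        begin
          FullName := Path + Rec.Name;
          if (Rec.Attr and faDirectory) <> 0 then
            ScanFolder(FullName, Seen, Results)  // recurse only into directories
          else if not Seen.ContainsKey(AnsiLowerCase(FullName)) then
          begin
            // lower-cased key makes the duplicate check case-insensitive
            Seen.Add(AnsiLowerCase(FullName), True);
            Results.Add(FullName);
          end;
        end;
      until FindNext(Rec) <> 0;
    finally
      FindClose(Rec);
    end;
end;

Run from a worker thread, ScanFolder can fill a plain TStringList that the main thread then loads into the virtual list view in one go.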
answered Nov 15 '22 by David Heffernan


I agree with the answer which says you would do best to do this in a background thread, and I don't want to encourage you to persist in doing it in your main thread.

However, if you go to a command prompt and do this:

dir c:\*.* /s > dump.txt & notepad dump.txt

you may be surprised quite how quickly Notepad pops into view.

So there are a few things you could do to speed up your GetAllSubFolders even if you keep it in your main thread, e.g. bracket the code with calls to main.Memo1.Lines.BeginUpdate and main.Memo1.Lines.EndUpdate, and likewise main.ListBox1.Items.BeginUpdate and EndUpdate. This stops these controls from updating while the scan executes, which is actually what your code spends most of its time doing (that, and the "if Pos(...)" business I've commented on below). And, if you haven't gathered already, Application.ProcessMessages is evil (mostly).
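
For example, a minimal bracketing sketch (SelectedPath is a stand-in for whatever folder your app scans):

main.Memo1.Lines.BeginUpdate;
main.ListBox1.Items.BeginUpdate;
try
  GetAllSubFolders(SelectedPath);
finally
  // re-enable repainting even if the scan raises an exception
  main.ListBox1.Items.EndUpdate;
  main.Memo1.Lines.EndUpdate;
end;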

I did some timings on my D: drive, which is a 500 GB SSD with 263562 files in 35949 directories.

  1. The code in your q: 6777 secs
  2. Doing a dir to Notepad as per the above: 15 secs
  3. The code below, in main thread: 9.7 secs

The reason I've included the code below in this answer is that you'll find it much easier to execute in a thread, because it gathers the results into a TStringList, whose contents you can then assign to your memo and listbox once the thread has completed.

A few comments on the code in your q, which I imagine you might have got from somewhere.

  • It pointlessly recurses even when the current entry in Rec is a plain file. The code below only recurses if the current Rec entry is a directory.

  • It apparently tries to avoid duplicates by the "if Pos(...)" business, which shouldn't be necessary (except perhaps if a symbolic link somewhere, e.g. one created with the MkLink command, points elsewhere on the drive) and does it in a highly inefficient manner, i.e. by searching for the filename in the memo contents, which get longer and longer as it finds more files. In the code below, the stringlist is set up to discard duplicates and has its Sorted property set to True, which makes its duplicate checking much quicker, because it can then do a binary search through its contents rather than a serial one.

  • It calculates Path + Rec.Name six times for each thing it finds, which is avoidably inefficient at run-time and inflates the source code. This is only a minor point, though, compared to the first two.

Code:

function GetAllSubFolders(sPath: String): TStringList;

  procedure GetAllSubFoldersInner(sPath: String);
  var
    Path, AFileName: String;
    Rec: TSearchRec;
  begin
    Path := IncludeTrailingBackslash(sPath);
    if FindFirst(Path + '*.*', faAnyFile, Rec) = 0 then
      try
        repeat
          if (Rec.Name <> '.') and (Rec.Name <> '..') then
          begin
            AFileName := Path + Rec.Name;  // calculate this once only
            if (Rec.Attr and faDirectory) = 0 then
              Result.Add(AFileName)
            else
              GetAllSubFoldersInner(AFileName);  // recurse only into directories
          end;
        until FindNext(Rec) <> 0;
      finally
        FindClose(Rec);  // release the search handle even on an exception
      end;
  end;

begin
  Result := TStringList.Create;
  Result.BeginUpdate;
  Result.Sorted := True;
  Result.Duplicates := dupIgnore;  // don't add duplicate filenames to the list

  GetAllSubFoldersInner(sPath);
  Result.EndUpdate;
end;

procedure TMain.Button1Click(Sender: TObject);
var
  T1, T2: Cardinal;  // GetTickCount returns a Cardinal (DWORD), not an Integer
  TL: TStringList;
begin
  T1 := GetTickCount;

  TL := GetAllSubFolders('D:\');
  try
    Memo1.Lines.BeginUpdate;
    try
      Memo1.Lines.Text := TL.Text;
    finally
      Memo1.Lines.EndUpdate;
    end;

    T2 := GetTickCount;

    Caption := Format('GetAll: %d, Load: %d, Files: %d',
      [T2 - T1, GetTickCount - T2, TL.Count]);
  finally
    TL.Free;
  end;
end;
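
If you do move the scan to a background thread, one possible sketch uses TThread.CreateAnonymousThread with TThread.Queue (Button2Click is a hypothetical handler, not part of the timed code above):

procedure TMain.Button2Click(Sender: TObject);
begin
  TThread.CreateAnonymousThread(
    procedure
    var
      TL: TStringList;
    begin
      TL := GetAllSubFolders('D:\');  // runs off the main thread
      TThread.Queue(nil,
        procedure
        begin
          try
            Memo1.Lines.Assign(TL);   // touch the GUI only on the main thread
          finally
            TL.Free;
          end;
        end);
    end).Start;
end;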
answered Nov 15 '22 by MartynA