Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast Search to see if a String Exists in Large Files with Delphi

I have a FindFile routine in my program which will list files, but if the "Containing Text" field is filled in, then it should only list files containing that text.

enter image description here

If the "Containing Text" field is entered, then I search each file found for the text. My current method of doing that is:

  var
    FileContents: TStringlist;

  begin
    FileContents.LoadFromFile(Filepath);
    if Pos(TextToFind, FileContents.Text) = 0 then
      Found := false
    else 
      Found := true;

The above code is simple, and it generally works okay. But it has two problems:

  1. It fails for very large files (e.g. 300 MB)

  2. I feel it could be faster. It isn't bad, but why wait 10 minutes searching through 1000 files, if there might be a simple way to speed it up a bit?

I need this to work for Delphi 2009 and to search text files that may or may not be Unicode. It only needs to work for text files.

So how can I speed this search up and also make it work for very large files?


Bonus: I would also want to allow an "ignore case" option. That's a tougher one to make efficient. Any ideas?


Solution:

Well, mghie pointed out my earlier question How Can I Efficiently Read The First Few Lines of Many Files in Delphi, and as I answered, it was different and didn't provide the solution.

But he got me thinking that I had done this before and I had. I built a block reading routine for large files that breaks it into 32 MB blocks. I use that to read the input file of my program which can be huge. The routine works fine and fast. So step one is to do the same for these files I am looking through.

So now the question was how to efficiently search within those blocks. Well I did have a previous question on that topic: Is There An Efficient Whole Word Search Function in Delphi? and RRUZ pointed out the SearchBuf routine to me.

That solves the "bonus" as well, because SearchBuf has options which include Whole Word Search (the answer to that question) and MatchCase/noMatchCase (the answer to the bonus).

So I'm off and running. Thanks once again SO community.

like image 480
lkessler Avatar asked Dec 02 '22 03:12

lkessler


2 Answers

The best approach here is probably to use memory mapped files.

First you need a file handle, use the CreateFile windows API function for that.

Then pass that to CreateFileMapping to get a file mapping handle. Finally use MapViewOfFile to map the file into memory.

To handle large files, MapViewOfFile is able to map only a certain range into memory, so you can e.g. map the first 32MB, then use UnmapViewOfFile to unmap it followed by a MapViewOfFile for the next 32MB and so on. (EDIT: as was pointed out below, make sure that the blocks you map this way overlap by a multiple of 4kb, and at least as much as the length of the text you are searching for, so that you are not overlooking any text which might be split at the block boundary)

To do the actual searching once the (part of) the file is mapped into memory, you can make a copy of the source for StrPosLen from SysUtils.pas (it's unfortunately defined in the implementation section only and not exposed in the interface). Leave one copy as is and make another copy, replacing Wide with Ansi every time. Also, if you want to be able to search in binary files which might contain embedded #0's, you can remove the (Str1[I] <> #0) and part.

Either find a way to identify if a file is ANSI or Unicode, or simply call both the Ansi and Unicode version on each mapped part of the file.

Once you are done with each file, make sure to call CloseHandle first on the file mapping handle and then on the file handling. (And don't forget to call UnmapViewOfFile first).

EDIT:

A big advantage of using memory mapped files instead of using e.g. a TFileStream to read the file into memory in blocks is that the bytes will only end up in memory once.

Normally, on file access, first Windows reads the bytes into the OS file cache. Then copies them from there into the application memory.

If you use memory mapped files, the OS can directly map the physical pages from the OS file cache into the address space of the application without making another copy (reducing the time needed for making the copy and halfing memory usage).

Bonus Answer: By calling StrLIComp instead of StrLComp you can do a case insensitive search.

like image 77
Thorsten Engler Avatar answered Dec 04 '22 08:12

Thorsten Engler


If you are looking for text string searches, look for the Boyer-Moore search algorithm. It uses memory mapped files and a really fast search engine. The is some delphi units around that contain implementations of this algorithm.

To give you an idea of the speed - i currently search through 10-20MB files and it takes in the order of milliseconds.

Oh just read that it might be unicode - not sure if it supports that - but definately look down this path.

like image 33
Simon Avatar answered Dec 04 '22 07:12

Simon