 

Faster way of searching a string in text files [closed]

I need to search for a string, roughly 13 characters long, in a group of text files using C#. The number of text files varies and can range between 100 and 1000. The size of each file can range between 1 KB and 10 MB.

I tried the naive way of opening each file, reading it line by line, and checking if the string exists (using IndexOf), but this is too slow. I also tried the Boyer-Moore algorithm, which improved the timing by about 5 seconds, but it still feels slow.

Any idea on how to speed up the search?

asked Feb 12 '13 by ocp1000


5 Answers

Whether to use a search engine depends on how many times you want to search. If you will search many times, use a search engine; otherwise, don't. I'll describe how to implement both scenarios here.

When using a search engine: it sounds like you're looking for substrings, which means you should index your files as such using your favorite search engine, preferably one you can customize (Lucene, Terrier, etc.). The technique you need here is trigram indexing, that is: all 3-character combinations have to be indexed. For example, 'foobar' will generate 'foo', 'oob', 'oba' and 'bar'. When searching, you do the same with your query and issue a search engine query with the AND of all these trigrams. (That will run a merge-join on the posting lists from the documents, which will return their IDs or whatever else you put in the posting lists.)
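A minimal C# sketch of the trigram idea, assuming an in-memory dictionary stands in for the engine's posting lists (a real setup would let Lucene or Terrier store these); all names here are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class TrigramIndex
{
    // trigram -> set of document IDs whose content contains it
    private readonly Dictionary<string, HashSet<int>> _postings = new();

    private static IEnumerable<string> Trigrams(string s) =>
        Enumerable.Range(0, Math.Max(0, s.Length - 2))
                  .Select(i => s.Substring(i, 3));   // 'foobar' -> foo, oob, oba, bar

    public void IndexDocument(int docId, string content)
    {
        foreach (var t in Trigrams(content))
        {
            if (!_postings.TryGetValue(t, out var set))
                _postings[t] = set = new HashSet<int>();
            set.Add(docId);
        }
    }

    // AND of the query's trigrams: intersect the posting lists.
    // Survivors are only candidates; verify them with a real substring
    // check, since matching trigrams need not be contiguous in the doc.
    public IEnumerable<int> Candidates(string query)
    {
        HashSet<int> acc = null;
        foreach (var t in Trigrams(query))
        {
            if (!_postings.TryGetValue(t, out var set))
                return Enumerable.Empty<int>();      // a trigram nobody has
            if (acc == null) acc = new HashSet<int>(set);
            else acc.IntersectWith(set);
        }
        return acc ?? Enumerable.Empty<int>();
    }
}
```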

Alternatively, you can implement suffix arrays and index your files once. This will give a little more flexibility if you want to search for short (1-2 char) substrings, but the indexes are harder to maintain. (There is some research at CWI in Amsterdam on fast indexing of suffix arrays.)
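A tiny suffix-array sketch for a single document, assuming the naive O(n² log n) construction is acceptable at these file sizes (production indexers use linear-time constructions such as SA-IS):

```csharp
using System;

static class SuffixSearch
{
    // Sort all suffix start positions lexicographically.
    public static int[] Build(string text)
    {
        var sa = new int[text.Length];
        for (int i = 0; i < sa.Length; i++) sa[i] = i;
        Array.Sort(sa, (a, b) =>
            text.AsSpan(a).CompareTo(text.AsSpan(b), StringComparison.Ordinal));
        return sa;
    }

    // Binary search: the needle occurs iff some suffix starts with it.
    public static bool Contains(string text, int[] sa, string needle)
    {
        int lo = 0, hi = sa.Length - 1;
        while (lo <= hi)
        {
            int mid = (lo + hi) / 2;
            var suffix = text.AsSpan(sa[mid]);
            if (suffix.StartsWith(needle, StringComparison.Ordinal))
                return true;
            if (suffix.CompareTo(needle.AsSpan(), StringComparison.Ordinal) < 0)
                lo = mid + 1;
            else
                hi = mid - 1;
        }
        return false;
    }
}
```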

When you want to search only a few times, the algorithm to use is either Boyer-Moore (I usually use Boyer-Moore-Sunday as described in [Graham A. Stephen, String Search]) or a compiled DFA (you can construct one from an NFA, which is easier to build). However, that will only give you a small speed increase, for the simple reason that disk IO is probably your bottleneck, and comparing a bunch of bytes that you need to decode anyway is quite fast.
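For illustration, here is my reading of the Boyer-Moore-Sunday (quick search) variant, not code from the book; on a mismatch it shifts based on the character just past the current window:

```csharp
using System;
using System.Collections.Generic;

static class Sunday
{
    public static int IndexOf(string text, string pattern)
    {
        int n = text.Length, m = pattern.Length;
        if (m == 0) return 0;

        // Shift table: for each pattern character, its distance from
        // the position just past the pattern's end.
        var shift = new Dictionary<char, int>();
        for (int i = 0; i < m; i++)
            shift[pattern[i]] = m - i;

        int pos = 0;
        while (pos + m <= n)
        {
            if (string.CompareOrdinal(text, pos, pattern, 0, m) == 0)
                return pos;
            if (pos + m == n)
                break;                       // no character past the window
            // Shift on the character just past the window.
            pos += shift.TryGetValue(text[pos + m], out int s) ? s : m + 1;
        }
        return -1;
    }
}
```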

The biggest improvement you can make is to read your file not line by line, but in blocks. Configure NTFS to use a 64 KB block size if you can, and read the files in multiples of 64 KB, think 4 MB or more in a single read. I'd even suggest using asynchronous IO so that you can read and process (previously read) data at the same time. Done correctly, that should already give you a split-second implementation for 10 MB on most modern hardware.
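A sketch of the block-reading idea, assuming an ASCII/UTF-8 needle so the scan can run on raw bytes; the carried-over tail keeps matches that straddle a block boundary, and the asynchronous double-buffering variant is omitted for brevity:

```csharp
using System;
using System.IO;
using System.Text;

static class BlockScan
{
    const int BlockSize = 4 * 1024 * 1024;   // 4 MB per read

    public static bool Contains(string path, string needle)
    {
        byte[] pattern = Encoding.UTF8.GetBytes(needle);
        byte[] buffer = new byte[BlockSize + pattern.Length - 1];
        int carry = 0;                       // tail bytes kept from the last block

        using var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                      FileShare.Read, 1 << 16,
                                      FileOptions.SequentialScan);
        int read;
        while ((read = fs.Read(buffer, carry, BlockSize)) > 0)
        {
            int valid = carry + read;
            if (Find(buffer, valid, pattern) >= 0)
                return true;
            // Keep the last (pattern.Length - 1) bytes so a match
            // crossing the block boundary is still seen next round.
            carry = Math.Min(pattern.Length - 1, valid);
            Array.Copy(buffer, valid - carry, buffer, 0, carry);
        }
        return false;
    }

    static int Find(byte[] haystack, int length, byte[] needle)
    {
        for (int i = 0; i + needle.Length <= length; i++)
        {
            int j = 0;
            while (j < needle.Length && haystack[i + j] == needle[j]) j++;
            if (j == needle.Length) return i;
        }
        return -1;
    }
}
```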

Last but not least, a neat trick used throughout information retrieval is to compress your data with a fast compression algorithm. Since disk IO is slower than memory/CPU operations, this will probably help as well. Google's Snappy compressor is a good example.

answered Oct 05 '22 by atlaste


You should consider using the operating system's content-aware file search. Take a look at the Microsoft Windows Search 3.x SDK.

Or you can utilize PLINQ to search an array of files in parallel. See this link:

File Content and Directory Search using Directory.GetFiles and PLINQ
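For reference, a minimal PLINQ sketch along the lines of that article, assuming .NET Core / .NET 5+ for the ordinal Contains overload; the folder and needle are placeholders:

```csharp
using System;
using System.IO;
using System.Linq;

var needle = "yourNeedleHere";               // the ~13-char string
var matches = Directory
    .EnumerateFiles(@"C:\texts", "*.txt", SearchOption.AllDirectories)
    .AsParallel()                            // scan files on all cores
    .Where(p => File.ReadAllText(p).Contains(needle, StringComparison.Ordinal))
    .ToList();

matches.ForEach(Console.WriteLine);
```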

answered Oct 05 '22 by Habib


Two options come to mind:

Read your text files into memory and search the whole string at once.

If that proves too slow or too memory hungry, use an indexer like Apache Lucene. There is a nice and easy SDK for it available for .NET, called Lucene.net.

Here is a small introduction for it: http://www.codeproject.com/Articles/29755/Introducing-Lucene-Net
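If you go this route, a minimal index-then-search sketch might look like the following, assuming the Lucene.Net 4.8 beta packages (Lucene.Net, Lucene.Net.Analysis.Common, Lucene.Net.QueryParser). Note that StandardAnalyzer tokenizes into words, so this matches whole tokens; arbitrary substrings would need something like the trigram scheme from the first answer:

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

const LuceneVersion Ver = LuceneVersion.LUCENE_48;
using var dir = FSDirectory.Open("lucene-index");
var analyzer = new StandardAnalyzer(Ver);

// Index every file once up front...
using (var writer = new IndexWriter(dir, new IndexWriterConfig(Ver, analyzer)))
{
    foreach (var path in System.IO.Directory.EnumerateFiles(@"C:\texts", "*.txt"))
    {
        writer.AddDocument(new Document
        {
            new StringField("path", path, Field.Store.YES),
            new TextField("content", System.IO.File.ReadAllText(path), Field.Store.NO)
        });
    }
}

// ...then every search is an index lookup instead of a full scan.
using var reader = DirectoryReader.Open(dir);
var searcher = new IndexSearcher(reader);
var query = new QueryParser(Ver, "content", analyzer).Parse("yourNeedleHere");
foreach (var hit in searcher.Search(query, 100).ScoreDocs)
    System.Console.WriteLine(searcher.Doc(hit.Doc).Get("path"));
```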

answered Oct 05 '22 by Wim Haanstra


If your computer can handle it, try loading all of the text files into memory (using the technique shown here) and then evaluate the text in memory.

If you can't handle all of the files at one time, then do this for the smallest files. File I/O is going to be your greatest expense here, so you want to minimize that as much as possible.
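A minimal sketch of that idea, assuming the whole corpus fits in RAM (and .NET Core / .NET 5+ for the ordinal Contains); loading smallest-first means a partial load still covers as many files as possible:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Slurp every file once, smallest first, then search in memory
// as many times as needed without touching the disk again.
var texts = new Dictionary<string, string>();
foreach (var f in new DirectoryInfo(@"C:\texts").GetFiles("*.txt")
                                                .OrderBy(fi => fi.Length))
    texts[f.FullName] = File.ReadAllText(f.FullName);

var hits = texts.Where(kv => kv.Value.Contains("yourNeedleHere",
                                               StringComparison.Ordinal))
                .Select(kv => kv.Key);
foreach (var path in hits)
    Console.WriteLine(path);
```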

answered Oct 05 '22 by Yaakov Ellis


You can use Microsoft's Indexing Service to search documents in folders that you add to the catalogue. Here is a very nice article which you can use to search your text files.

answered Oct 05 '22 by Vikram