
C#: searching a large text file

I am trying to speed up searching for a string in a large text file (300-600 MB). My current method takes too long.

Currently I use IndexOf to search each line, but building an index of every line that contains the string takes far too long (20 s).

How can I make the search faster? I've tried Contains(), but that is slow as well. I considered a regex match, but I don't expect it to give a significant speed boost. Maybe my search logic is flawed. Any suggestions?

Example:

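while ((line = myStream.ReadLine()) != null)
{
    // Case-insensitive ordinal search within the current line.
    if (line.IndexOf(CompareString, StringComparison.OrdinalIgnoreCase) >= 0)
    {
        // Record where the matching line is (CurrentPosition is maintained outside this excerpt).
        LineIndex.Add(CurrentPosition);
        LinesCounted += 1;
    }
}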
user1747467 asked Dec 19 '12 19:12




3 Answers

The brute-force algorithm you're using runs in O(nm) time in the worst case, where n is the length of the string being searched and m is the length of the substring/pattern you're trying to find. You need to use a proper string search algorithm:

  • Boyer-Moore is "the standard", I think: http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm

  • But there are lots more out there: http://www-igm.univ-mlv.fr/~lecroq/string/

  • including Morris-Pratt: http://www.stoimen.com/blog/2012/04/09/computer-algorithms-morris-pratt-string-searching/

  • and Knuth-Morris-Pratt: http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
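To give a feel for how this family of algorithms skips ahead, here is a minimal sketch of Boyer-Moore-Horspool, a simplified variant of Boyer-Moore that uses only the bad-character table. The class name `HorspoolSearch` is hypothetical, and the comparison is case-sensitive and ordinal, so it is not a drop-in replacement for the `OrdinalIgnoreCase` search in the question.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not taken from any of the linked pages.
public static class HorspoolSearch
{
    // Returns the index of the first occurrence of pattern in text, or -1 if absent.
    public static int IndexOf(string text, string pattern)
    {
        if (pattern.Length == 0) return 0;
        if (pattern.Length > text.Length) return -1;

        int m = pattern.Length;

        // Bad-character table: for every pattern char except the last, the distance
        // from its rightmost occurrence to the end of the pattern.
        var shift = new Dictionary<char, int>();
        for (int i = 0; i < m - 1; i++)
            shift[pattern[i]] = m - 1 - i;

        int pos = 0;
        while (pos <= text.Length - m)
        {
            // Compare the window against the pattern from right to left.
            int j = m - 1;
            while (j >= 0 && text[pos + j] == pattern[j]) j--;
            if (j < 0) return pos;

            // Shift by the table entry for the char under the window's last position,
            // or by the whole pattern length if that char never occurs in the pattern.
            char last = text[pos + m - 1];
            pos += shift.TryGetValue(last, out int s) ? s : m;
        }
        return -1;
    }
}
```

Calling `HorspoolSearch.IndexOf(line, CompareString)` per line would mirror the loop in the question. Whether it beats the built-in `IndexOf` in practice is something to measure rather than assume.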

However, a carefully crafted regular expression might be sufficient, depending on what you are trying to find. See Jeffrey Friedl's tome, Mastering Regular Expressions, for help on building efficient regular expressions (e.g., no backtracking).
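As a rough illustration of the regex route, the sketch below builds one compiled, case-insensitive regex for a literal search string and reuses it for every line. The class name `LiteralRegex` is hypothetical; `Regex.Escape` is used on the assumption that the search text should be matched literally rather than as a pattern.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration.
public static class LiteralRegex
{
    // Builds a compiled, case-insensitive regex that matches the given text literally.
    public static Regex Build(string literal)
    {
        return new Regex(
            Regex.Escape(literal),  // neutralize regex metacharacters such as '.' or '*'
            RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
    }
}
```

A typical use would be to call `LiteralRegex.Build(CompareString)` once before the read loop and test each line with `IsMatch(line)`. Constructing the regex inside the loop would throw away the benefit of compiling it.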

You might also want to consult a good algorithms text. I'm partial to Robert Sedgewick's Algorithms in its various incarnations (Algorithms in [C|C++|Java]).

Nicholas Carey answered Oct 06 '22 15:10



Unfortunately, I don't think there's a whole lot you can do in straight C#.

I have found the Boyer-Moore algorithm to be extremely fast for this task. But I found there was no way to make even that as fast as IndexOf. My assumption is that this is because IndexOf is implemented in hand-optimized assembler while my code ran in C#.

You can see my code and performance test results in the article Fast Text Search with Boyer-Moore.
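If you want to check a claim like this on your own data, a small timing harness is enough. The sketch below is not the test code from the article; it is a minimal example, with the hypothetical name `SearchTiming`, that times the same `IndexOf` loop as in the question using `Stopwatch`.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical helper for illustration; not the article's benchmark.
public static class SearchTiming
{
    // Counts the lines that contain needle (ordinal, case-insensitive)
    // and reports how long the scan took.
    public static int CountMatches(TextReader reader, string needle, out TimeSpan elapsed)
    {
        var watch = Stopwatch.StartNew();
        int count = 0;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.IndexOf(needle, StringComparison.OrdinalIgnoreCase) >= 0)
                count++;
        }
        watch.Stop();
        elapsed = watch.Elapsed;
        return count;
    }
}
```

Passing a `StreamReader` over the real file and swapping the `IndexOf` call for an alternative search gives a like-for-like comparison. Run each variant several times, since the first pass over a large file also pays for disk reads.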

Jonathan Wood answered Oct 06 '22 14:10



Have you seen these questions (and answers)?

  • Processing large text file in C#
  • Is there a way to read large text file in parts?
  • Matching a string in a Large text file?

Doing it the way you are now seems to be the way to go if all you want to do is read the text file. Other ideas:

  • If it is possible to pre-sort the data, such as when it gets inserted into the text file, that could help.

  • You could insert the data into a database and query it as needed.

  • You could use a hash table.
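One way to read the hash table idea is as a word index built once and queried many times. The sketch below is an assumption about what that could look like, not something spelled out above: the hypothetical `WordIndex.Build` maps each whitespace-separated word to the line numbers it appears on, so it only helps for whole-word lookups, not arbitrary substrings.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for illustration.
public static class WordIndex
{
    // Maps each word (case-insensitive) to the zero-based numbers of the lines containing it.
    public static Dictionary<string, List<int>> Build(TextReader reader)
    {
        var index = new Dictionary<string, List<int>>(StringComparer.OrdinalIgnoreCase);
        string line;
        int lineNumber = 0;
        while ((line = reader.ReadLine()) != null)
        {
            string[] words = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
            foreach (string word in words)
            {
                if (!index.TryGetValue(word, out List<int> lines))
                {
                    lines = new List<int>();
                    index[word] = lines;
                }
                // Record each line only once per word, even if the word repeats on it.
                if (lines.Count == 0 || lines[lines.Count - 1] != lineNumber)
                    lines.Add(lineNumber);
            }
            lineNumber++;
        }
        return index;
    }
}
```

Building the index still costs one full pass over the file, and for a 300-600 MB file the dictionary may use a lot of memory. The payoff is that each later lookup is a single `TryGetValue` call instead of another scan.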

Austin Henley answered Oct 06 '22 15:10
