C# Code/Algorithm to Search Text for Terms

Question

We have about 5 MB of typical text (just plain words) and 1,000 words/phrases to use as search terms against it.

What's the most efficient way to do this in .NET (ideally C#)?

Our ideas so far include regexes (either a single one or lots of them) and even plain String.Contains.

The input is a 2 MB to 5 MB text string, all text. Multiple hits are good: for each of the 1,000 terms that matches, we want to know about it. By performance we mean total time to execute; we don't care about memory footprint. The current algorithm takes 60+ seconds using naive string.Contains. We don't want 'cat' to match 'category' or even 'cats' (i.e. the entire term must hit as a whole word, no stemming).

We expect a <5% hit ratio in the text. The results would ideally just be the terms that matched (we don't need position or frequency just yet). We get a new 2-5 MB string every 10 seconds, so we can't assume we can index the input. The 1,000 terms are dynamic, although they change at a rate of only about one change an hour.

Mark Brackett · Accepted Answer

A naive string.Contains over War and Peace (3.13 MB), searching for 762 words (the final page), runs in about 10 seconds for me. Switching the terms to 1,000 GUIDs runs in about 5.5 seconds.
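For reference, a baseline of this kind might look roughly like the sketch below. It is not the benchmark code behind the numbers above; the class and method names (NaiveSearch, FindTerms) are made up for illustration. Note that string.Contains does substring matching, so 'cat' will also hit on 'category', which the question rules out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, named for illustration only.
public static class NaiveSearch
{
    // Returns every term that occurs anywhere in the text as a substring.
    // One full scan of the text per term, so cost grows with terms * text length.
    public static List<string> FindTerms(string text, IEnumerable<string> terms)
    {
        var hits = new List<string>();
        foreach (var term in terms)
        {
            // string.Contains(string) is an ordinal, case-sensitive comparison.
            if (text.Contains(term))
                hits.Add(term);
        }
        return hits;
    }
}
```

Wrapping the call in a System.Diagnostics.Stopwatch is enough to compare it against other approaches on your own data.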

Regex.IsMatch found the 762 words (many of which probably appear in earlier pages as well) in about 0.5 seconds, and ruled out the GUIDs in 2.5 seconds.
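A minimal sketch of a Regex.IsMatch approach that also honours the whole-word requirement from the question is shown below. The exact patterns used for the timings above are not given, so the `\b` word-boundary wrapping, the RegexOptions, and the names WholeWordSearch, BuildMatchers and FindTerms are all assumptions. It also assumes every term starts and ends with a letter or digit; `\b` will not behave as intended around a term such as "c#".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class WholeWordSearch
{
    // Builds one compiled regex per term. Since the terms change roughly
    // once an hour, these can be cached and rebuilt only on change.
    public static Dictionary<string, Regex> BuildMatchers(IEnumerable<string> terms)
    {
        return terms.Distinct().ToDictionary(
            t => t,
            // Regex.Escape keeps punctuation in a term from being read as regex syntax;
            // \b on both sides stops 'cat' from matching 'cats' or 'category'.
            t => new Regex(@"\b" + Regex.Escape(t) + @"\b",
                           RegexOptions.Compiled | RegexOptions.CultureInvariant));
    }

    // Returns the terms that occur at least once as whole words or phrases.
    // IsMatch stops at the first occurrence, which is all the question needs.
    public static List<string> FindTerms(string text, Dictionary<string, Regex> matchers)
    {
        return matchers.Where(m => m.Value.IsMatch(text))
                       .Select(m => m.Key)
                       .ToList();
    }
}
```

Matching here is case-sensitive; adding RegexOptions.IgnoreCase would change that. The order of the returned terms is not guaranteed.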

I'd suggest your problem lies elsewhere... or you just need some decent hardware.

Kent Boogaart · Answer

Why reinvent the wheel? Why not just leverage something like Lucene.NET?

Tags: c#, .net, algorithm, search

user47892
