Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

High performance "contains" search in list of strings in C#

I have a list of approx. 500,000 strings, each approx. 100 characters long. Given a search term, I want to identify all strings in the list that contain the search term. At the moment I am doing this with a plain old dataset using the Select method ("MATCH %term%"). This takes about 600ms on my laptop. I'd like to make it faster, maybe 100-200ms.

What would be a recommended approach?

Performance is critical so I can trade memory footprint for better performance if necessary (within reason). The list of strings will not change once initialised so calculating hashes would also be an option.

Does anyone have a recommendation and which C# data structures are best suited to the task?

like image 868
njr101 Avatar asked Sep 29 '11 16:09

njr101


2 Answers

I've heard good things about Lucene.NET when it comes to performing quick full-text searches. They've done the work to figure out the fastest data structures and such to use. I'd suggest giving that a shot.

Otherwise, you might just try something like this:

var matches = list.AsParallel().Where(s => s.Contains(searchTerm)).ToList();

But it probably won't get you down to 100ms.

like image 156
StriplingWarrior Avatar answered Sep 19 '22 15:09

StriplingWarrior


Have you tried the following?

list.FindAll(x => x.Contains("YourTerm")).ToList();

For some reason the List.AsParallel().Where(...) is slower than list.FindAll(...) on my PC.

list.AsParallel().Where(x => x.Contains("YourTerm")).ToList();

Hope this will help you.

like image 34
Fever Avatar answered Sep 19 '22 15:09

Fever