
Most Efficient way to test large number of strings against a large List<string>

I've looked at a number of similar questions, but the methods given are either too slow for what I'm trying to accomplish or test for partial matches, which I don't need and which should be slower anyway.

I have two large files filled with strings, I need to check each string in one list to see if it matches any of the strings in the second list. I don't need to check for partial matches, and everything should be properly escaped.

The second list (of strings to remove) contains 160,000 strings. I've loaded it into a List<String>, then read each line of the larger file and tested it using List<String>.Any(line.Contains).
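For reference, the approach described above looks roughly like this (the class and parameter names are hypothetical, not from my actual code):

```csharp
using System.Collections.Generic;
using System.Linq;

static class SlowScan
{
    // For each line of the large file, scan the entire remove list with
    // Any(line.Contains). Note that string.Contains is a *substring*
    // test, so this also reports partial matches, not just exact ones.
    public static bool IsMatch(string line, List<string> removeList)
        => removeList.Any(line.Contains);
}
```

Each call is O(n) in the size of the remove list, which is presumably why testing 40k lines against 160k entries takes so long.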

Even with only a small part of the first list (40k strings), this is taking a long time, probably over 20 minutes on my fast development computer.

Here's My Question

Is there a more efficient way — ideally, what is the most efficient way — to compare a large list of strings individually against another, larger list of strings, when no partial matches are needed?

Kin asked Jul 11 '11 15:07

3 Answers

First off, I think your logic is just plain wrong. Passing line.Contains as the predicate to Any does partial (substring) matches, and you have explicitly stated that you want only exact matches.

Leaving that aside, your performance problem is due to the nature of the list data structure; it was not designed to be efficiently searched via "Any".

The problem is that "Any" simply does a linear search, starting at the beginning of the list and blazing through it until it finds a match. If the list has 100K entries then every "miss" will do 100K string comparisons, and every "hit" will do on average 50K string comparisons.

That's terrible.

What you should do is convert the List into a HashSet of strings. The set takes up slightly more memory, but is extremely fast to search.
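A minimal sketch of that conversion (the names here are illustrative, not from the question):

```csharp
using System.Collections.Generic;
using System.Linq;

static class HashSetFilter
{
    // Build the set once; afterwards each exact-match lookup is O(1)
    // on average, instead of an O(n) scan of a List.
    public static List<string> RemoveMatches(
        IEnumerable<string> lines, IEnumerable<string> toRemove)
    {
        var removeSet = new HashSet<string>(toRemove);
        return lines.Where(line => !removeSet.Contains(line)).ToList();
    }
}
```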

Another possible optimization is to sort one of the lists -- which is an O(n lg n) operation -- and then binary search the sorted list, which is an O(lg n) operation.
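That can be sketched with List&lt;T&gt;.BinarySearch; note the same comparer must be used for both the sort and the search (StringComparer.Ordinal here is an assumption, not something the question specifies):

```csharp
using System;
using System.Collections.Generic;

static class SortedLookup
{
    // The list must already be sorted with StringComparer.Ordinal.
    // Each lookup is then an O(lg n) binary search.
    public static bool ContainsExact(List<string> sortedList, string value)
        => sortedList.BinarySearch(value, StringComparer.Ordinal) >= 0;
}
```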

A third possible optimization is to sort both lists, and then write a sorted list comparer. Obviously a sorted list comparer is much faster than an unsorted list comparer. You keep an index into each list, and advance only the index that is pointing to the "smaller" item. That is, if the lists are

A, B, C, D, G, I
B, D, E, H, I

Then you start with indexes pointing at A and B. "A" is smaller, so you advance the first index to "B". Now they are the same; you have a match. Advance both of them. The first index is at "C" and the second at "D". "C" is smaller, so advance it...
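The walk described above can be sketched like this (a sketch, assuming both lists are already sorted with an ordinal comparison):

```csharp
using System.Collections.Generic;

static class SortedMerge
{
    // Keep one index into each sorted list and advance whichever
    // points at the smaller element; equal elements are matches.
    public static List<string> CommonItems(List<string> a, List<string> b)
    {
        var matches = new List<string>();
        int i = 0, j = 0;
        while (i < a.Count && j < b.Count)
        {
            int cmp = string.CompareOrdinal(a[i], b[j]);
            if (cmp == 0) { matches.Add(a[i]); i++; j++; }
            else if (cmp < 0) i++; // a[i] is smaller, advance it
            else j++;              // b[j] is smaller, advance it
        }
        return matches;
    }
}
```

Running it on the two example lists above yields B, D, I, and the whole comparison is a single O(n + m) pass.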


More generally though, I think you are describing the problem at too low a level. I feel like you're asking a question about drills when you should be asking a question about holes. Maybe a drill isn't the right tool in the first place. Can you tell us why you are matching two big lists of strings? Perhaps there is an easier way to do what you want.

Eric Lippert answered Oct 19 '22 15:10


Instead of using a List<string>, use a HashSet<string>. It has O(1) lookups instead of O(n) like the list. You should see orders of magnitude speedup if you make this change.

It may also give you slightly better performance to use HashSet.Contains(), rather than LINQ's .Any() extension method.
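The change amounts to a couple of lines (a sketch; variable names are hypothetical):

```csharp
using System.Collections.Generic;
using System.Linq;

static class ExactLookup
{
    // Before: an O(n) list scan for every tested line.
    public static bool SlowContains(List<string> removeList, string line)
        => removeList.Any(s => s == line);

    // After: build the HashSet once up front; each Contains call is
    // then an O(1) hash lookup on average.
    public static bool FastContains(HashSet<string> removeSet, string line)
        => removeSet.Contains(line);
}
```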

recursive answered Oct 19 '22 16:10


Use LINQ's set operators to get the differences and similarities between your two lists: Except returns the elements of the first sequence that are not in the second (the differences), and Intersect returns the elements common to both (the similarities). Union, by contrast, merges both lists into one distinct set.

Except: http://msdn.microsoft.com/en-us/library/system.linq.enumerable.except.aspx

Union: http://msdn.microsoft.com/en-us/library/bb341731.aspx

Each operator returns a sequence containing the resulting data.
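A sketch of Except and Intersect used together (the operators are the real LINQ methods; everything else here is illustrative — note that both de-duplicate their output):

```csharp
using System.Collections.Generic;
using System.Linq;

static class SetOps
{
    // Except: elements of `lines` not in `toRemove` (the differences).
    // Intersect: elements present in both (the similarities).
    // Both build a hash set internally, so each runs in roughly linear
    // time; both also return *distinct* elements only.
    public static (List<string> Kept, List<string> Matched) Split(
        IEnumerable<string> lines, IEnumerable<string> toRemove)
    {
        var lineArr = lines.ToArray();
        var removeArr = toRemove.ToArray();
        return (lineArr.Except(removeArr).ToList(),
                lineArr.Intersect(removeArr).ToList());
    }
}
```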

Gage answered Oct 19 '22 16:10