Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Matching Two Large Sets of Strings in C#

Here is the situation:

I have a webpage that I have scraped as a string.

I have several fields in a MSSQL database. For example, car model, it has an ID and a Name, such as Mustang or Civic. It is pre-filled with most models of car.

I want to find any match for any row in my models table. So if I have Civic, Mustang and E350 in my Model Table I want to find any occurance of any of the three on the page I have scraped.

What is an efficient way to do this in C#. I am using LINQ to SQL to interface with the db.

Does creating a dictionary of all models, tokenizing the page and iterating through the tokens make sense? Or should I just iterate through the tokens and use a WHERE clause and ask the database if there is a match?

    //Dictionary dic contains all models from the DB, with the name being the key and the id being the value...
    foreach(string pageToken in pageTokens)
    {
         if(dic.ContainsKey(pageToken)) 
         {
              //Do what I need to do
         }
    }

Both of these methods seem terrible to me. Any suggestions on what I should do? Something with set intersection I would imagine might be nice?

Neither of these methods address what happens when a Model name is more than one word..like "F150 Extended Cab". Thoughts on that?

like image 610
Jason Avatar asked Feb 27 '23 18:02

Jason


1 Answers

Searching for multiple strings in a larger text is a well-understood problem, and signifigant research has been made into making it fast. The two most popular and effective methods for this are the Aho-Corasick Algorithm (I'd rcommend this one) and the Rabin-Karp Algorithm. They use a little preprocessing, but are orders of magnitude less complex & faster than the naieve method (the naieve method is worst-case O(m*n^2*p) where m is the length of the long string [the webpage you scraped] and n is the average length of the needles and p is the number of needles). Aho-Corsaik is linear. A C# implementation of it can be found at CodeProject for free.

Edit: Oops, I was wrong about the complexity of Aho-Corasick -- it's linear in the number & length of input strings + the size of the string being analyzed [the scraped text] plus the number of matches. But it's still linear and linear is a lot better than cubic :-).

like image 164
Robert Fraser Avatar answered Mar 07 '23 06:03

Robert Fraser