Here is the situation:
I have a webpage that I have scraped as a string.
I have several fields in a MSSQL database. For example, car model, it has an ID and a Name, such as Mustang or Civic. It is pre-filled with most models of car.
I want to find any match for any row in my models table. So if I have Civic, Mustang and E350 in my Model Table I want to find any occurance of any of the three on the page I have scraped.
What is an efficient way to do this in C#. I am using LINQ to SQL to interface with the db.
Does creating a dictionary of all models, tokenizing the page and iterating through the tokens make sense? Or should I just iterate through the tokens and use a WHERE clause and ask the database if there is a match?
//Dictionary dic contains all models from the DB, with the name being the key and the id being the value...
foreach(string pageToken in pageTokens)
{
if(dic.ContainsKey(pageToken))
{
//Do what I need to do
}
}
Both of these methods seem terrible to me. Any suggestions on what I should do? Something with set intersection I would imagine might be nice?
Neither of these methods address what happens when a Model name is more than one word..like "F150 Extended Cab". Thoughts on that?
Searching for multiple strings in a larger text is a well-understood problem, and signifigant research has been made into making it fast. The two most popular and effective methods for this are the Aho-Corasick Algorithm (I'd rcommend this one) and the Rabin-Karp Algorithm. They use a little preprocessing, but are orders of magnitude less complex & faster than the naieve method (the naieve method is worst-case O(m*n^2*p) where m is the length of the long string [the webpage you scraped] and n is the average length of the needles and p is the number of needles). Aho-Corsaik is linear. A C# implementation of it can be found at CodeProject for free.
Edit: Oops, I was wrong about the complexity of Aho-Corasick -- it's linear in the number & length of input strings + the size of the string being analyzed [the scraped text] plus the number of matches. But it's still linear and linear is a lot better than cubic :-).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With