
Search string algorithms


I am trying to extract contact information from the content pages of a set of websites (thousands of them), and I wanted to ask the experts here before scratching my head over it. All I need is the address, email addresses, phone numbers and, if available, the contact person's details.

I think you understand the problem already. Yes, it is the formatting: since there is no standard format that websites follow, it's really hard to pinpoint the exact information I need. Some websites have Flash contact-us pages, and others render the contact information as images with custom fonts.

Any hints, ideas or suggestions are most welcome.

Thank you.

Krishna asked Dec 20 '11 23:12



1 Answer

As you might expect, this is by no means a trivial task. Here is one way of approaching it:

  1. Use an inverted-index system such as Lucene/Solr or Sphinx to index the pages. You might need to write your own crawler/spider, although Apache Nutch and other crawlers offer spidering out of the box. If the content is fairly static, download the pages to your local system first.

  2. Once the content is indexed, you could query it for email addresses, telephone numbers, etc. by building a boolean query such as `Contents:@ AND (Contents:.COM OR Contents:.NET) OR Contents:"(" OR Contents:")"`. The first part targets email addresses, and the last two clauses target the parentheses in telephone numbers. Important: the foregoing code should not be taken literally. You could get even fancier by using Lucene's regex queries and span queries, which would let you build pretty sophisticated queries.

  3. Finally, on the result pages, (a) run a result highlighter to get the snippet(s) around the query terms, and (b) run a regex on those snippets to extract the fields of interest.

  4. If you have a North American address data set, you could run multiple passes to validate the extracted data: (i) check addresses against a mapping provider such as Bing Maps or Google Maps (as far as I know, USPS and others also offer address look-ups for a fee, which can validate US ZIP codes and Canadian postal codes), or (ii) run a DNS look-up, for example an MX record check, on the domain of each email address, and so on.
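
As a rough illustration of step 3(b), here is a minimal Python sketch that runs regexes over a highlighted snippet. Everything in it is an assumption made for illustration and is not part of the approach above: the function name `extract_contacts`, the patterns themselves (deliberately loose, North American formats only), and the idea of using a format check on ZIP and postal codes as a cheap pre-filter before any paid look-up. Expect to tune the patterns against your own pages.

```python
import re

# Loose illustrative patterns (assumptions, not complete grammars).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# North American style numbers such as (555) 123-4567 or 555-123-4567.
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b")
# US ZIP (with optional +4) and Canadian postal code formats.
US_ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")
CA_POSTAL_RE = re.compile(r"\b[A-Za-z]\d[A-Za-z][ -]?\d[A-Za-z]\d\b")


def extract_contacts(snippet: str) -> dict:
    """Pull candidate contact fields out of one text snippet.

    Results are de-duplicated and sorted so that repeated runs over
    the same snippet give the same output.
    """
    return {
        "emails": sorted({m.lower() for m in EMAIL_RE.findall(snippet)}),
        "phones": sorted(set(PHONE_RE.findall(snippet))),
        "us_zips": sorted(set(US_ZIP_RE.findall(snippet))),
        "ca_postal_codes": sorted(
            {m.upper() for m in CA_POSTAL_RE.findall(snippet)}
        ),
    }
```

These patterns only check the shape of a value. A string that looks like a ZIP code or a phone number can still be wrong, which is why the validation passes in step 4 remain necessary.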

That should get you started. As I said, there is no single best solution here; you will need to try multiple approaches and iterate to reach the accuracy level you want.

Hope this helps.

Mikos answered Nov 07 '22 16:11
