If I have a collection of random websites, how do I get specific information from each?

Say I have a collection of websites for accountants, like this:

http://www.johnvanderlyn.com
http://www.rubinassociatespa.com
http://www.taxestaxestaxes.com
http://janus-curran.com
http://ricksarassociates.com
http://www.condoaudits.com
http://www.krco-cpa.com
http://ci.boca-raton.fl.us

What I want to do is crawl each and get the names & emails of the partners. How should I approach this problem, at a high-level?

Assume I know how to actually crawl each site (and all subpages) & parse the HTML elements -- I am using Oga.

What I am struggling with is how to make sense of data that is presented in a wide variety of ways. For instance, the email address for the firm (and/or a partner) can be found in any of these ways:

  • On the About Us page, under the name of the partner.
  • On the About Us page, as a generic catch-all email.
  • On the Team page, under the name of the partner.
  • On the Contact Us page, as a generic catch-all email.
  • On a Partner's page, under the name of the partner.

Or it could be any other way.

One way I was thinking about approaching the email problem is just to search for all mailto anchor tags and filter from there.

The obvious downside to this is that there is no guarantee that the email will be for a partner and not some other employee.
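For concreteness, this is roughly the extraction I have in mind with Oga (a minimal sketch; the helper name is mine, and it assumes the page HTML has already been fetched):

    require 'oga'

    # Collect every mailto: address from one page's already-fetched HTML.
    def mailto_addresses(html)
      document = Oga.parse_html(html)
      document.xpath('//a[starts-with(@href, "mailto:")]').map do |link|
        # Strip the scheme and any ?subject=... query from the href.
        link.get('href').sub(/\Amailto:/i, '').split('?').first
      end.uniq
    end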

Another, more obvious issue is detecting the partners' names just from the markup. I was initially thinking I could just pull all the header tags and the text in them, but I have stumbled across a few sites that have the partner names in span tags.
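So any name collector probably has to cast a wide net across tag types, something like this sketch (the tag list is just a guess on my part):

    # Collect candidate name text from headings and spans alike, since
    # different sites mark up partner names with different elements.
    def candidate_name_text(document)
      %w[h1 h2 h3 h4 span].flat_map do |tag|
        document.css(tag).map { |node| node.text.strip }
      end.reject(&:empty?).uniq
    end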

I know SO is usually for specific programming questions, but I am not sure how to approach this or where to ask it. Is there another Stack Exchange site where this question would be more appropriate?

Any advice on specific direction you can give me would be great.

asked Oct 04 '16 by marcamillion



3 Answers

I looked at the http://ricksarassociates.com/ website and I can't find any partners listed at all, so first make sure you actually stand to gain something from this; if not, you had better look for some other venture.

I have done similar data scraping from time to time, and in Norway we have laws - or should I say "laws" - saying that you are not allowed to email individual people, but you are allowed to email the company - so in a way it is the same problem from another angle.

I wish I knew maths and algorithms by heart, because I am sure there is a fascinating solution hidden in AI and machine learning, but in my mind the only solution I can see is building a rule set that probably grows quite complex over time. Maybe you could also apply some Bayesian filtering - it works very well for email.
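To make the rule-set idea concrete, here is a hypothetical sketch (the class names and URL patterns are made up; real rules would accumulate one quirk at a time):

    # Each rule takes the page URL and the parsed document and returns
    # candidate partner names, or nil if it does not apply.
    RULES = [
      ->(url, doc) {
        # Only trust headings on pages whose URL hints at people.
        doc.css('h2').map(&:text) if url =~ /about|team|partner/i
      },
      ->(url, doc) {
        # Class names that commonly mark up staff listings.
        %w[.partner .team-member .staff].flat_map { |sel| doc.css(sel).map(&:text) }
      }
    ]

    def candidate_names(url, doc)
      RULES.flat_map { |rule| rule.call(url, doc) || [] }.map(&:strip).uniq
    end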

But - to be a little more productive here: one thing I know is important - you should start by creating the crawler environment and building the dataset. Keep a database of URLs so you can add more at any time, and start crawling what you already have, so that you do your testing by querying your own 100% local copy. This will save you enormous time compared to live scraping while you tweak.
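Something as simple as this disk cache is enough to start with (a sketch; the directory name and hashing choice are arbitrary):

    require 'digest'
    require 'fileutils'
    require 'open-uri'

    CACHE_DIR = 'crawl_cache'

    # Fetch a URL once and serve every later request from disk, so the
    # parsing can be tweaked and re-run without touching the live sites.
    def fetch_cached(url)
      FileUtils.mkdir_p(CACHE_DIR)
      path = File.join(CACHE_DIR, Digest::SHA1.hexdigest(url))
      return File.read(path) if File.exist?(path)

      html = URI.open(url).read # URI.open needs Ruby 2.5+; use Kernel#open on older Rubies
      File.write(path, html)
      html
    end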

I built my own search engine some years ago, scraping all .no domains, though I only needed the index file at the time. It took over a week just to scrape it all down, and I think it was 8 GB of data just for that single file, and I had to use several proxy servers as well to make it work, due to problems with too much DNS traffic. Lots of problems needed taking care of. I guess I am only saying: if you are crawling at a large scale, you might as well start getting the data down now if you want to work efficiently with the parsing later.

Good luck, and do post if you find a solution. I do not think it is possible without an algorithm or AI, though - people design websites the way they like and pull their templates out of thin air, so there are no rules to follow, and you will end up with bad data.

Do you have funding for this? If so, it's simpler. Then you could just crawl each site and build a profile for each one, and employ someone cheap to manually go through the parsed data and remove all the errors. This is probably how most people do it, unless someone has already done it and the database is for sale / available as a web service, so it can be queried.

answered Oct 07 '22 by Kim Steinhaug


The links you provide are mainly US sites, so I guess you are focusing on English names. In that case, instead of parsing HTML tags, I would just search the whole webpage for names (there are free databases of first names and last names). This may also work if you are doing this for companies elsewhere in Europe, but it would be a problem for companies from some countries. Take Chinese as an example: while there is a fixed set of last names, one may use basically any combination of Chinese characters as a first name, so this solution won't work for Chinese sites.
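A sketch of that name search (FIRST_NAMES is assumed to be loaded from one of those free name databases; the file name is made up):

    require 'set'

    FIRST_NAMES = File.readlines('first_names.txt', chomp: true)
                      .map(&:downcase).to_set

    # Scan the page's visible text for capitalised word pairs whose first
    # word appears in the known first-name list.
    def likely_names(text)
      text.scan(/\b([A-Z][a-z]+)[ \t]+([A-Z][a-z]+)\b/)
          .select { |first, _last| FIRST_NAMES.include?(first.downcase) }
          .map { |first, last| "#{first} #{last}" }
          .uniq
    end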

It is easy to find an email address on a webpage, as there is a fixed format of (username)@(domain name) with no spaces in between. Again, I would not treat it as an HTML tag but as a normal string, so that the email can be found whether it is in a mailto tag or in plain text. Then, to determine what kind of email it is (see the sketch after this decision tree):

Only one email on the page?
    Yes -> catch-all email.
    No  -> Is a name found on that page as well?
        No  -> catch-all emails (there can be more than one catch-all, perhaps for different purposes, like info + employment).
        Yes -> Each email should be attached to the name found right before it; it is normal for the name to appear before the email.
               Then it should be safe to assume that the name appearing first belongs to a more important member, e.g. a chairman or partner.
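In code, the tree above might look roughly like this (a sketch; it assumes you already have the page text and the list of names found on it):

    EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/

    # Classify each email on a page as belonging to a named person or as
    # a catch-all, following the decision tree above.
    def classify_emails(text, names)
      emails = text.scan(EMAIL_RE).uniq
      return { catch_all: emails, by_name: {} } if emails.size <= 1 || names.empty?

      by_name, catch_all = {}, []
      emails.each do |email|
        preceding = text[0...text.index(email)]
        # Attribute the email to the nearest name appearing before it.
        owner = names.select { |n| preceding.include?(n) }
                     .max_by { |n| preceding.rindex(n) }
        if owner
          by_name[owner] = email
        else
          catch_all << email
        end
      end
      { catch_all: catch_all, by_name: by_name }
    end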
answered Oct 07 '22 by cytsunny


I have done similar scraping for these types of pages, and it varies wildly from site to site. If you are trying to make one crawler that automatically finds the information, it will be difficult. However, at a high level it looks something like this.

  • For each site you check, look for element patterns. Divs will often have labels, IDs, and classes which easily let you grab information. Perhaps you find that many divs share a particular class name; check for this first.
  • It is often better to grab too much data from a particular page and boil it down on your side afterwards. You could, for instance, look for information by type (is it a link?) or by regex (is it an email?) to find formatted text. Names and occupations will be harder to find this way, but on many pages they are positionally related to other well-formatted items.
  • Names will often be affixed with honorifics (Mrs., Mr., Dr., JD, MD, etc.). You could build a bank of those and check any page you land on against them, as in the sketch after this list.
  • Finally, if you really wanted to make this process general-purpose, you could use some heuristics to improve your methods based on expected information; names, for example, most often appear within a list. If it were worth your time, you could check candidate text against a list of common names.
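A sketch of the honorific check (the particular titles and credentials listed are just a starting set):

    HONORIFICS = /\b(?:Mr|Mrs|Ms|Dr)\.?\s|,\s*(?:CPA|JD|MD|Esq)\b/

    # Keep only the lines of page text that carry a title or credential,
    # since those usually sit right next to a partner's name.
    def honorific_lines(text)
      text.lines.map(&:strip).select { |line| line =~ HONORIFICS }
    end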

What you mentioned in your initial question suggests you would get a lot of benefit from a general-purpose regular-expression crawler, and you could make improvements to it as you learn more about the sites you interact with.

answered Oct 07 '22 by Aaron Morefield