
Web scraping without knowledge of page structure

I'm trying to teach myself a concept by writing a script. Basically, I'm trying to write a Python script that, given a few keywords, will crawl web pages until it finds the data I need. For example, say I want to find a list of venomous snakes that live in the US. I might run my script with the keywords list, venomous, snakes, US, and I want to be able to trust with at least 80% certainty that it will return a list of venomous snakes in the US.

I already know how to implement the web spider part; I just want to learn how I can determine a web page's relevancy without knowing anything about the page's structure. I have researched web scraping techniques, but they all seem to assume knowledge of the page's HTML tag structure. Is there an algorithm out there that would allow me to pull data from the page and determine its relevancy?

Any pointers would be greatly appreciated. I am using Python with urllib and BeautifulSoup.

Harrison asked May 28 '14


2 Answers

Using a crawler framework like Scrapy (mostly just for handling concurrent downloads), you can write a simple spider like the one below; Wikipedia is a good starting point. The script is a complete example using Scrapy, NLTK and Whoosh: it never stops on its own, and it indexes the pages it visits with Whoosh for later searching. It's a small Google:

## Author: Farsheed Ashouri
import os
import sys
import re
## Spider libraries
from scrapy.spider import BaseSpider   ## old Scrapy API; newer versions (>=1.0) use scrapy.Spider
from scrapy.selector import Selector
from main.items import MainItem        ## item class defined in this Scrapy project's items.py
from scrapy.http import Request
from urlparse import urljoin           ## Python 2; on Python 3 this is urllib.parse.urljoin
## indexer libraries
from whoosh.index import create_in, open_dir
from whoosh.fields import *
## html to text conversion module
import nltk

def open_writer():
    if not os.path.isdir("indexdir"):
        os.mkdir("indexdir")
        schema = Schema(title=TEXT(stored=True), content=TEXT(stored=True))
        ix = create_in("indexdir", schema)
    else:
        ix = open_dir("indexdir")
    return ix.writer()

class Main(BaseSpider):
    name        = "main"
    allowed_domains = ["en.wikipedia.org"]
    start_urls  = ["http://en.wikipedia.org/wiki/Snakes"]
    
    def parse(self, response):
        writer = open_writer()  ## for indexing
        sel = Selector(response)
        email_validation = re.compile(r'^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$')  ## defined but not used below
        ## links already yielded from this response (note: this set is local to each parse() call)
        crawledLinks = set()
        titles = sel.xpath('//div[@id="content"]//h1[@id="firstHeading"]//span/text()').extract()
        contents = sel.xpath('//body/div[@id="content"]').extract()
        if not titles or not contents:
            return  ## skip pages without a usable title or body
        title = titles[0]
        content = contents[0]
        links = sel.xpath('//a/@href').extract()

        
        for link in links:
            # If it is a proper link and is not checked yet, yield it to the Spider
            url = urljoin(response.url, link)
            #print url
            ## skip namespaced pages such as /wiki/Talk:Company (their URLs contain ":")
            if url not in crawledLinks and re.match(r'http://en.wikipedia.org/wiki/[^:]+$', url):
                crawledLinks.add(url)
                yield Request(url, self.parse)
        item = MainItem()
        item["title"] = title
        print '*'*80
        print 'crawled: %s | it has %s links.' % (title, len(links))
        #print content
        print '*'*80
        item["links"] = list(crawledLinks)
        ## store only the text of the page body; note that nltk.clean_html() was removed
        ## in NLTK 3.0, so with a recent NLTK you would need another HTML-to-text step
        ## (for example BeautifulSoup's get_text()).
        writer.add_document(title=title, content=nltk.clean_html(content))
        #print crawledLinks
        writer.commit()
        yield item
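
Once the spider has run for a while (scrapy crawl main, assuming it lives inside a Scrapy project) and filled indexdir, you can query the index with Whoosh. The following is only a minimal sketch of such a search, assuming the default "indexdir" location and the field names from the Schema above; the query string is just an example:

## minimal search over the index built by the spider above
from whoosh.index import open_dir
from whoosh.qparser import QueryParser

ix = open_dir("indexdir")
with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse(u"venomous snakes list")
    for hit in searcher.search(query, limit=10):
        print(hit["title"])  ## "title" is a stored field, so it can be read back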
Farshid Ashouri answered Nov 15 '22


You're basically asking "how do I write a search engine." This is... not trivial.

The right way to do this is to just use Google's (or Bing's, or Yahoo!'s, or...) search API and show the top n results. But if you're working on a personal project to teach yourself some concepts (though I'm not sure exactly which ones those would be), here are a few suggestions:

  • Search the text content of the appropriate tags (<p>, <div>, and so forth) for the relevant keywords (duh). A rough scoring sketch combining the first three suggestions follows this list.
  • Use the relevant keywords to check for the presence of tags that might contain what you're looking for. For example, if you're looking for a list of things, then a page containing <ul>, <ol> or even <table> might be a good candidate.
  • Build a synonym dictionary and search each page for synonyms of your keywords too. Limiting yourself to "US" might mean an artificially low ranking for a page that only says "America".
  • Keep a list of words which are not in your set of keywords, and give a higher ranking to pages which contain the most of them; these pages are (arguably) more likely to contain the answer you're looking for.
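
To make the first three suggestions concrete, here is a rough sketch of a keyword-based relevance score, using urllib and BeautifulSoup as mentioned in the question (Python 3). The keyword set, synonym table and the +10 bonus are made-up examples to illustrate the idea, not a tuned formula:

## rough keyword-based relevance scoring -- a sketch of the ideas above, not a definitive algorithm
import re
from urllib.request import urlopen

from bs4 import BeautifulSoup

## example keyword set and synonym table (made up for illustration -- adjust for your own query)
KEYWORDS = {"list", "venomous", "snakes", "us"}
SYNONYMS = {"us": {"america", "united states"}, "venomous": {"poisonous"}}

def count_term(term, text):
    ## whole-word matches only, so "us" does not also match "bus" or "thus"
    return len(re.findall(r"\b%s\b" % re.escape(term), text))

def relevance(url):
    ## some sites reject requests without a User-Agent header; add one if needed
    html = urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")

    ## 1. keyword (and synonym) hits in the text of <p> and <div> tags
    text = " ".join(t.get_text(" ").lower() for t in soup.find_all(["p", "div"]))
    score = sum(count_term(term, text)
                for word in KEYWORDS
                for term in {word} | SYNONYMS.get(word, set()))

    ## 2. bonus when the page contains structures that usually hold "a list of things"
    if soup.find_all(["ul", "ol", "table"]):
        score += 10  ## arbitrary weight, an assumption rather than a tuned value

    return score

Crawl your candidate pages, compute relevance() for each, and keep only the highest-scoring ones for closer inspection; the 80% certainty the question asks for would come from tuning the weights, keywords and synonyms against pages you have judged by hand.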

good luck (you'll need it)!

Dan O answered Nov 15 '22