How can scrapy be used to extract the link graph of a website?

Given a starting URL start (and some rules on admissible domains etc) I would like to produce a directed graph (V, E) where the nodes in V are the pages reachable from start, and there is an arc (u,v) in E whenever there is a hyperlink on page u pointing to page v.

Is there a simple way to obtain such a graph with scrapy? I would also be happy using another open source tool if it can achieve the goal more easily/nicely.

asked Oct 09 '12 by mitchus

1 Answer

I don't know of any tool or contrib module that produces precisely what you want. You'll have to build a Scrapy spider to do that. I'll explain the necessary steps here:

  • Create a scrapy project and generate a default spider

    $ scrapy startproject sitegraph
    $ cd sitegraph
    $ scrapy genspider graphspider mydomain.com
    
  • This will create a project directory containing an items.py file. Add the following lines to this file:

    from scrapy.item import Item, Field
    
    class SitegraphItem(Item):
        url = Field()
        http_status = Field()  # HTTP status code of the crawled page
        linkedurls = Field()
    
  • In the spiders directory you will find graphspider.py. Replace its contents with the following (mydomain.com must of course be replaced; a version updated for current Scrapy releases is sketched after this list):

    from scrapy.selector import HtmlXPathSelector
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.utils.url import urljoin_rfc
    from sitegraph.items import SitegraphItem
    
    class GraphspiderSpider(CrawlSpider):
        name = 'graphspider'
        allowed_domains = ['mydomain.com']
        start_urls = ['http://mydomain.com/index.html']
    
        rules = (
            Rule(SgmlLinkExtractor(allow=r'/'), callback='parse_item', follow=True),
        )
    
        def parse_item(self, response):
            hxs = HtmlXPathSelector(response)
            i = SitegraphItem()
            i['url'] = response.url
            i['http_status'] = response.status
            llinks = []
            for anchor in hxs.select('//a[@href]'):
                href = anchor.select('@href').extract()[0]
                if not href.lower().startswith("javascript"):
                    llinks.append(urljoin_rfc(response.url, href))
            i['linkedurls'] = llinks
            return i
    
  • Then edit the settings.py file and add the following (change the file path as needed; on Scrapy 2.1+ the FEEDS setting replaces these two keys, see the note after this list):

    FEED_FORMAT="jsonlines"
    FEED_URI="file:///tmp/sitegraph.json"
    
  • Now you can run:

    $ scrapy crawl graphspider
    
  • This will generate a JSON Lines file that you can use to build the graph; an illustrative line follows.
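
Each line of that file is one JSON object describing one crawled page, with the fields defined in items.py. For example (the URLs here are placeholders):

    {"url": "http://mydomain.com/index.html", "http_status": 200, "linkedurls": ["http://mydomain.com/about.html"]}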

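The spider above uses APIs from the Scrapy of 2012 (HtmlXPathSelector, SgmlLinkExtractor, scrapy.contrib, urljoin_rfc), all of which have since been deprecated and removed. A rough equivalent for current Scrapy releases might look like the following untested sketch (mydomain.com is again a placeholder):

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    from sitegraph.items import SitegraphItem

    class GraphspiderSpider(CrawlSpider):
        name = 'graphspider'
        allowed_domains = ['mydomain.com']
        start_urls = ['http://mydomain.com/index.html']

        rules = (
            Rule(LinkExtractor(allow=r'/'), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            i = SitegraphItem()
            i['url'] = response.url
            i['http_status'] = response.status
            # response.urljoin() resolves relative hrefs against the page URL,
            # replacing the removed urljoin_rfc helper.
            i['linkedurls'] = [
                response.urljoin(href)
                for href in response.xpath('//a/@href').getall()
                if not href.lower().startswith('javascript')
            ]
            return i

Likewise, on Scrapy 2.1+ the feed export is configured with the FEEDS setting instead of FEED_FORMAT/FEED_URI:

    FEEDS = {'/tmp/sitegraph.json': {'format': 'jsonlines'}}
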
You can use a package like networkx to analyse the result, or pygraphviz to draw it (not recommended for large sites):

import json
import pygraphviz as pg

def loadgraph(fname):
    # Build a directed graph from the JSON Lines feed: one node per
    # crawled page, one edge per hyperlink found on that page.
    G = pg.AGraph(directed=True)
    for line in open(fname):
        j = json.loads(line)
        url = j["url"]
        G.add_node(url)
        for linked_url in j["linkedurls"]:
            G.add_edge(url, linked_url)
    return G

if __name__ == '__main__':
    G = loadgraph("/tmp/sitegraph.json")
    G.layout(prog='dot')   # Graphviz layout; slow on large graphs
    G.draw("sitegraph.png")
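
For the networkx route, the same feed loads naturally into a DiGraph. A minimal sketch, assuming the feed produced above sits at /tmp/sitegraph.json:

import json
import networkx as nx

def load_digraph(fname):
    # One node per crawled page; one arc (u, v) per hyperlink u -> v.
    G = nx.DiGraph()
    for line in open(fname):
        j = json.loads(line)
        G.add_node(j["url"])
        for linked_url in j["linkedurls"]:
            G.add_edge(j["url"], linked_url)
    return G

if __name__ == '__main__':
    G = load_digraph("/tmp/sitegraph.json")
    print(G.number_of_nodes(), "pages,", G.number_of_edges(), "links")
    # e.g. rank pages by in-degree (how often they are linked to)
    print(sorted(G.in_degree, key=lambda p: p[1], reverse=True)[:10])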

answered Oct 02 '22 by gvtech