Given a starting URL start (and some rules on admissible domains etc.), I would like to produce a directed graph (V, E) where the nodes in V are the pages reachable from start, and there is an arc (u, v) in E whenever there is a hyperlink on page u pointing to page v.
Is there a simple way to obtain such a graph with scrapy? I would also be happy using another open source tool if it can achieve the goal more easily/nicely.
I don't know of any tool or contrib module that produces precisely what you want. You'll have to build a scrapy spider to do that. Here are the necessary steps:
Create a scrapy project and generate a default spider:
$ scrapy startproject sitegraph
$ cd sitegraph
$ scrapy genspider graphspider mydomain.com
This will create a directory with an items.py file. Add the following lines to this file:
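from scrapy.item import Item, Field


class SitegraphItem(Item):
    url = Field()          # URL of the crawled page (a node of the graph)
    http_status = Field()  # HTTP status code, set by the spider
    linkedurls = Field()   # URLs this page links to (its outgoing arcs)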
In the spiders directory you will find graphspider.py. Replace its contents with the following (of course, mydomain.com needs to be replaced with your own domain):
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from sitegraph.items import SitegraphItem


class GraphspiderSpider(CrawlSpider):
    name = 'graphspider'
    allowed_domains = ['mydomain.com']
    start_urls = ['http://mydomain.com/index.html']

    # Follow every link inside the allowed domains and hand each page to parse_item
    rules = (
        Rule(LinkExtractor(allow=r'/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = SitegraphItem()
        i['url'] = response.url
        i['http_status'] = response.status
        llinks = []
        # Record the absolute URL of every anchor on the page, skipping javascript: links
        for href in response.xpath('//a/@href').getall():
            if not href.lower().startswith("javascript"):
                llinks.append(response.urljoin(href))
        i['linkedurls'] = llinks
        return i
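The link handling inside parse_item can be tried outside scrapy. The following is only a minimal sketch using the standard library: normalize_links is a hypothetical helper (not part of scrapy) that resolves relative hrefs against the page URL, keeps only http/https targets, and drops fragments and duplicates, which goes a little further than the javascript check above.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_links(base_url, hrefs):
    """Return absolute http(s) URLs for the given hrefs, in order, without duplicates.

    Hypothetical helper for illustration; the spider above does not call it.
    """
    links = []
    for href in hrefs:
        # Resolve relative references such as "b.html" or "../d.html"
        absolute = urljoin(base_url, href.strip())
        # Skip javascript:, mailto: and any other non-web scheme
        if urlparse(absolute).scheme not in ("http", "https"):
            continue
        # "#top" style fragments point at the same page, so drop them
        url, _fragment = urldefrag(absolute)
        if url not in links:
            links.append(url)
    return links
```

Whether fragments and duplicates should be removed depends on what you want an arc to mean; keeping duplicates would let you count how many times page u links to page v.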
Then edit the settings.py file and add the following (change the file name accordingly):
FEED_FORMAT="jsonlines"
FEED_URI="file:///tmp/sitegraph.json"
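As far as I know, Scrapy 2.1 and later prefer a single FEEDS setting over the FEED_FORMAT/FEED_URI pair. Assuming such a version is installed, an equivalent settings.py fragment might look like this (same output file, same JSON lines format):

```python
# settings.py fragment, assuming Scrapy 2.1 or later.
# Maps each output URI to its export options.
FEEDS = {
    "file:///tmp/sitegraph.json": {
        "format": "jsonlines",
    },
}
```

Use one style or the other, not both.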
Now you can run:
$ scrapy crawl graphspider
This will generate a JSON lines file (one JSON object per crawled page) that you can use to build a graph.
You can use a package like networkx to analyse it or pygraphviz to draw it (not recommended for large sites):
import json
import pygraphviz as pg


def loadgraph(fname):
    G = pg.AGraph(directed=True)
    # Each line of the feed is one JSON object describing a crawled page
    with open(fname) as feed:
        for line in feed:
            j = json.loads(line)
            url = j["url"]
            G.add_node(url)
            for linked_url in j["linkedurls"]:
                G.add_edge(url, linked_url)
    return G


if __name__ == '__main__':
    G = loadgraph("/tmp/sitegraph.json")
    G.layout(prog='dot')
    G.draw("sitegraph.png")
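If you only need the graph (V, E) itself rather than a drawing, the same feed can be loaded with the standard library alone. This is a rough sketch, not a replacement for networkx: load_edges and reachable are hypothetical names, and the sample records are made up to match the url/linkedurls fields written by the spider.

```python
import json
from collections import deque


def load_edges(lines):
    """Build {url: set of linked urls} from an iterable of JSON lines."""
    graph = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        url = record["url"]
        graph.setdefault(url, set())
        for linked_url in record["linkedurls"]:
            graph[url].add(linked_url)
            # A link target is a node even if it was never crawled itself
            graph.setdefault(linked_url, set())
    return graph


def reachable(graph, start):
    """Return the set of nodes reachable from start (breadth-first search)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


# Made-up sample records in the format produced by the spider
sample = [
    '{"url": "http://mydomain.com/", "linkedurls": ["http://mydomain.com/a"]}',
    '{"url": "http://mydomain.com/a", "linkedurls": ["http://mydomain.com/b"]}',
    '{"url": "http://mydomain.com/c", "linkedurls": ["http://mydomain.com/"]}',
]
graph = load_edges(sample)
V = reachable(graph, "http://mydomain.com/")
E = {(u, v) for u in V for v in graph[u]}
```

To run it on the real crawl, pass an open file instead of the sample list, for example load_edges(open("/tmp/sitegraph.json")).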