I'm trying to teach myself a concept by writing a script. Basically, I want to write a Python script that, given a few keywords, will crawl web pages until it finds the data I need. For example, say I want to find a list of venomous snakes that live in the US. I might run my script with the keywords list, venomous, snakes, US, and I want to be able to trust with at least 80% certainty that it will return a list of snakes in the US.
I already know how to implement the web spider part; I just want to learn how I can determine a web page's relevancy without knowing anything about the page's structure. I have researched web scraping techniques, but they all seem to assume knowledge of the page's HTML tag structure. Is there an algorithm out there that would let me pull data from a page and determine its relevancy?
Any pointers would be greatly appreciated. I am using Python with urllib and BeautifulSoup.
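As a very first pass at "relevancy without knowing the page structure", you can strip a page down to its visible text and score it by how many of your keywords actually show up. The sketch below only illustrates that idea, assuming Python 3 with urllib and BeautifulSoup; the 0.5 threshold and the scoring (fraction of keywords that occur at least once) are arbitrary placeholders, not a tested heuristic.

from urllib.request import urlopen
from bs4 import BeautifulSoup

def relevance_score(url, keywords):
    """Rough relevancy: fraction of keywords that appear at least once in the
    visible text of the page, plus the raw per-keyword counts."""
    html = urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):  # drop non-visible text
        tag.decompose()
    words = soup.get_text(" ", strip=True).lower().split()
    hits = {kw: words.count(kw.lower()) for kw in keywords}
    matched = sum(1 for count in hits.values() if count > 0)
    return matched / len(keywords), hits

score, hits = relevance_score("https://en.wikipedia.org/wiki/Snake",
                              ["list", "venomous", "snakes", "US"])
if score >= 0.5:  # arbitrary threshold -- tune against pages you have judged by hand
    print("looks relevant:", score, hits)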
Using a crawler like Scrapy (just for handling the concurrent downloads), you can write a simple spider like the one below; Wikipedia is probably a good starting point. The script is a complete example using scrapy, nltk and whoosh. It never stops on its own and indexes the crawled pages for later search using Whoosh.
It's a small Google:
## Author: Farsheed Ashouri
## Note: written for Python 2 and older Scrapy/NLTK APIs (BaseSpider, urlparse, nltk.clean_html).
import os
import sys
import re
## Spider libraries
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from main.items import MainItem
from scrapy.http import Request
from urlparse import urljoin
## Indexer libraries
from whoosh.index import create_in, open_dir
from whoosh.fields import *
## HTML-to-text conversion module
import nltk


def open_writer():
    if not os.path.isdir("indexdir"):
        os.mkdir("indexdir")
        schema = Schema(title=TEXT(stored=True), content=TEXT(stored=True))
        ix = create_in("indexdir", schema)
    else:
        ix = open_dir("indexdir")
    return ix.writer()


class Main(BaseSpider):
    name = "main"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Snakes"]

    def parse(self, response):
        writer = open_writer()  ## for indexing
        sel = Selector(response)
        email_validation = re.compile(r'^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$')  ## unused in this example
        ## We store already-crawled links in this set (note: it is re-created on every parse call)
        crawledLinks = set()
        titles = sel.xpath('//div[@id="content"]//h1[@id="firstHeading"]//span/text()').extract()
        contents = sel.xpath('//body/div[@id="content"]').extract()
        if contents:
            content = contents[0]
        if titles:
            title = titles[0]
        else:
            return
        links = sel.xpath('//a/@href').extract()
        for link in links:
            ## If it is a proper link and has not been checked yet, yield it to the spider
            url = urljoin(response.url, link)
            ## The URL must not contain a ":" character (skips links like /wiki/Talk:Company)
            if url not in crawledLinks and re.match(r'http://en.wikipedia.org/wiki/[^:]+$', url):
                crawledLinks.add(url)
                yield Request(url, self.parse)
        item = MainItem()
        item["title"] = title
        print '*' * 80
        print 'crawled: %s | it has %s links.' % (title, len(links))
        print '*' * 80
        item["links"] = list(crawledLinks)
        writer.add_document(title=title, content=nltk.clean_html(content))  ## only the text of the content is indexed
        writer.commit()
        yield item
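Once the spider has run for a while, the Whoosh index in indexdir can be queried from a separate script. A minimal sketch of such a query, assuming the same title/content schema created in open_writer() above (the query string is just an example):

from whoosh.index import open_dir
from whoosh.qparser import QueryParser

ix = open_dir("indexdir")
with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse("venomous snakes")
    for hit in searcher.search(query, limit=10):
        print(hit["title"])  # 'title' is stored in the schema, so it can be read back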
You're basically asking "how do I write a search engine." This is... not trivial.
The right way to do this is to just use Google's (or Bing's, or Yahoo!'s, or...) search API and show the top n results. But if you're just working on a personal project to teach yourself some concepts (not sure which ones those would be exactly though), then here are a few suggestions:
- Scan the text of the page (the stuff inside <p>, <div>, and so forth) for the relevant keywords (duh).
- Since you're after a list, a page that contains a <ul> or <ol> or even a <table> might be a good candidate (a rough sketch of both ideas follows below).

Good luck (you'll need it)!
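If it helps, here is a rough sketch of what those two heuristics could look like with BeautifulSoup; the flat +5 bonus for list-like markup and the simple hit count are made-up numbers, purely for illustration:

from bs4 import BeautifulSoup

def score_page(html, keywords):
    """Toy relevance score: keyword occurrences in the text of <p>/<div> blocks,
    plus a bonus if the page contains list-like markup (<ul>, <ol>, <table>)."""
    soup = BeautifulSoup(html, "html.parser")
    # Nested tags may be counted more than once; good enough for a toy score.
    text = " ".join(block.get_text(" ", strip=True).lower()
                    for block in soup.find_all(["p", "div"]))
    keyword_hits = sum(text.count(kw.lower()) for kw in keywords)
    structure_bonus = 5 if soup.find(["ul", "ol", "table"]) else 0  # arbitrary weight
    return keyword_hits + structure_bonus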