How to scrape all the content of each link with scrapy?

I am new to Scrapy and I would like to extract all the content of each advert from this website. So I tried the following:

from scrapy.spiders import Spider
from craigslist_sample.items import CraigslistSampleItem

from scrapy.selector import Selector
class MySpider(Spider):
    name = "craig"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/search/npo"]

    def parse(self, response):
        links = response.selector.xpath(".//*[@id='sortable-results']//ul//li//p")
        for link in links:
            content = link.xpath(".//*[@id='titletextonly']").extract()
            title = link.xpath("a/@href").extract()
            print(title,content)

items.py:

# Define here the models for your scraped items

from scrapy.item import Item, Field

class CraigslistSampleItem(Item):
    title = Field()
    link = Field()

However, when I run the crawler I get nothing:

$ scrapy crawl --nolog craig
[]
[]
[]
[]
...

Thus, my question is: how can I walk over each URL, get inside each link, and crawl the content and the title? And what is the best way to do this?

asked Nov 08 '16 by student


2 Answers

To scaffold a basic Scrapy project, you can use the command:

scrapy startproject craig
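
This generates Scrapy's standard scaffold; the exact files vary slightly between versions, but the layout should look roughly like this:

craig/
    scrapy.cfg
    craig/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

The paths below refer to files inside the inner craig package.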

Then add the spider and items:

craig/spiders/spider.py

from scrapy import Spider
from craig.items import CraigslistSampleItem
from scrapy.selector import Selector
from scrapy import Request
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

class CraigSpider(Spider):
    name = "craig"
    start_url = "https://sfbay.craigslist.org/search/npo"

    def start_requests(self):

        yield Request(self.start_url, callback=self.parse_results_page)


    def parse_results_page(self, response):

        sel = Selector(response)

        # Browse paging.
        page_urls = sel.xpath(""".//span[@class='buttons']/a[@class='button next']/@href""").getall()

        for page_url in page_urls + [response.url]:
            page_url = urljoin(self.start_url, page_url)

            # Yield a request for the next page of the list, with callback to this same function: self.parse_results_page().
            yield Request(page_url, callback=self.parse_results_page)

        # Browse items.
        item_urls = sel.xpath(""".//*[@id='sortable-results']//li//a/@href""").getall()

        for item_url in item_urls:
            item_url = urljoin(self.start_url, item_url)

            # Yield a request for each item page, with callback self.parse_item().
            yield Request(item_url, callback=self.parse_item)


    def parse_item(self, response):

        sel = Selector(response)

        item = CraigslistSampleItem()

        item['title'] = sel.xpath('//*[@id="titletextonly"]').extract_first()
        item['body'] = sel.xpath('//*[@id="postingbody"]').extract_first()
        item['link'] = response.url

        yield item

craig/items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items

from scrapy.item import Item, Field

class CraigslistSampleItem(Item):
    title = Field()
    body = Field()
    link = Field()

craig/settings.py

# -*- coding: utf-8 -*-

BOT_NAME = 'craig'

SPIDER_MODULES = ['craig.spiders']
NEWSPIDER_MODULE = 'craig.spiders'

ITEM_PIPELINES = {
   'craig.pipelines.CraigPipeline': 300,
}

craig/pipelines.py

from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from scrapy.exporters import CsvItemExporter

class CraigPipeline(object):

    def __init__(self):
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)
        self.files = {}

    def spider_opened(self, spider):
        file = open('%s_ads.csv' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
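
Note that scrapy.xlib.pydispatch is no longer available in recent Scrapy releases. If you are on a newer version, here is a minimal sketch of the same pipeline wired up through the from_crawler signals API instead (only the signal hookup changes; the exporter logic is the same as above):

from scrapy import signals
from scrapy.exporters import CsvItemExporter

class CraigPipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        # connect the pipeline to the spider_opened/spider_closed signals
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def __init__(self):
        self.files = {}

    def spider_opened(self, spider):
        file = open('%s_ads.csv' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        self.files.pop(spider).close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item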

From the root of your project, you can run the spider with the command:

scrapy runspider craig/spiders/spider.py

It should create a craig_ads.csv file in the root of your project.
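
As an alternative to the CSV pipeline, Scrapy's built-in feed exports can write the scraped items straight to a file, for example:

scrapy runspider craig/spiders/spider.py -o craig_ads.csv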

answered Oct 13 '22 by Ivan Chaer

I will try to answer your question.

First of all, you got blank results because of your incorrect XPath queries. With the XPath ".//*[@id='sortable-results']//ul//li//p" you locate the relevant <p> nodes correctly, although I am not fond of that expression. The following expressions, ".//*[@id='titletextonly']" and "a/@href", however, do not locate the link and the title as you expect. If what you want is the title text and the hyperlink of each posting, you need to learn XPath; I suggest starting with the HTML DOM.
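
For reference, a minimal sketch of what working relative queries against the list page could look like (these are essentially the same expressions used in the full spider further below):

def parse(self, response):
    rows = response.xpath(".//*[@id='sortable-results']//ul//li//p")
    for row in rows:
        # relative to each <p>: the anchor holds both the href and the title text
        title = row.xpath("a/text()").extract_first()
        link = row.xpath("a/@href").extract_first()
        print(title, link)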

I do not want to instruct you in how to write XPath queries, as there are lots of resources online. I would, however, like to mention some features of Scrapy's XPath selectors:

  1. The Scrapy XPath Selector is an improved wrapper around the standard XPath query.

A standard XPath query returns an array of the DOM nodes you queried. You can open your browser's developer tools (F12) and use the console command $x(x_exp) to test an expression. I highly suggest testing your XPath expressions this way: it gives you instant results and saves a lot of time. If you have time, get familiar with your browser's web development tools; they will help you quickly understand the page structure and locate the elements you are looking for.

Scrapy's response.xpath(x_exp), on the other hand, returns an array of Selector objects corresponding to the XPath query, which is actually a SelectorList object. In other words, XPath results are represented by a SelectorList, and both the Selector and SelectorList classes provide useful methods to operate on the results:

  • extract() returns a list of the matched nodes serialized to unicode strings
  • extract_first() returns a scalar, the first of the extract() results
  • re() returns a list, the regex matches over the extracted results
  • re_first() returns a scalar, the first of the re() results

These methods make your programming much more convenient. One example is that you can call the xpath method directly on a SelectorList object. If you have tried lxml before, you will see how useful this is: to run an XPath query on the results of a previous query in lxml, you have to iterate over the previous results yourself. Another example: when you are sure there is at most one element in the list, you can use extract_first() to get a scalar value instead of indexing the list (e.g. rlist[0]), which would raise an index error when nothing matched. Remember that there are always exceptions when you parse a web page, so be careful and make your code robust.
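
A small sketch of these methods inside a parse callback (the XPath expressions are only for illustration, based on the Craigslist list page discussed above):

def parse(self, response):
    # SelectorList: every result anchor on the list page
    links = response.xpath("//*[@id='sortable-results']//li//p/a")

    # extract(): a list of unicode strings, one per matched node
    all_hrefs = links.xpath("@href").extract()

    # extract_first(): the first match, or None if nothing matched
    first_title = links.xpath("text()").extract_first()

    # re(): apply a regex to the extracted text and return the captures
    post_ids = links.xpath("@href").re(r'/(\d+)\.html')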

  2. Absolute XPath vs. relative XPath

Keep in mind that if you are nesting XPathSelectors and use an XPath that starts with /, that XPath will be absolute to the document and not relative to the XPathSelector you’re calling it from.

When you call node.xpath(x_expr), if x_expr starts with /, it is an absolute query and XPath searches from the document root; if x_expr starts with ., it is a relative query. This is also noted in the standard's 2.5 Abbreviated Syntax:

. selects the context node

.//para selects the para element descendants of the context node

.. selects the parent of the context node

../@lang selects the lang attribute of the parent of the context node
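
A short sketch of the difference when nesting selectors (the element names are only for illustration):

rows = response.xpath("//*[@id='sortable-results']//li")
for row in rows:
    # relative: starts from this <li>, so each row gets its own title
    title_rel = row.xpath(".//p/a/text()").extract_first()

    # absolute: "//" restarts from the document root, so every row
    # would get the first matching <a> of the whole page -- usually a bug
    title_abs = row.xpath("//p/a/text()").extract_first()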

  3. How to follow the next page and when to stop following

For your application, you probably need to follow the next page. Here, the next-page node is easy to locate: there are "next" buttons. However, you also need to take care of when to stop following. Look carefully at your URL query parameters to work out the URL pattern of your application. Here, to determine when to stop following the next page, you can compare the current item range with the total number of items.

New edit

I was a little confused about the meaning of "the content of the link". Now I understand that @student wants to crawl each link to extract the AD content as well. The following is a solution.

  1. Send Request and attach its parser

As you may have noticed, I use Scrapy's Request class to follow the next page. Actually, the power of the Request class goes beyond that: you can attach the desired parse function to each request by setting the callback parameter.

callback (callable) – the function that will be called with the response of this request (once its downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a Request doesn’t specify a callback, the spider’s parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.

In step 3, I did not set a callback when sending the next-page requests, as those requests should be handled by the default parse function. Now we come to the specific AD page, a different page from the AD list page. Thus we need to define a new page parser function, say parse_ad, and attach this parse_ad function to each AD page request we send.

Let's go to the revised sample code that works for me:

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ScrapydemoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()


class AdItem(scrapy.Item):
    title = scrapy.Field()
    description = scrapy.Field()

The spider

# -*- coding: utf-8 -*-
from scrapy.spiders import Spider
from scrapy.http import Request
from scrapydemo.items import ScrapydemoItem
from scrapydemo.items import AdItem
try:
    from urllib.parse import urljoin
except ImportError:
    from urlparse import urljoin


class MySpider(Spider):
    name = "demo"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/search/npo"]

    def parse(self, response):
        # locate list of each item
        s_links = response.xpath("//*[@id='sortable-results']/ul/li")
        # locate next page and extract it
        next_page = response.xpath(
            '//a[@title="next page"]/@href').extract_first()
        next_page = urljoin(response.url, next_page)
        to = response.xpath(
            '//span[@class="rangeTo"]/text()').extract_first()
        total = response.xpath(
            '//span[@class="totalcount"]/text()').extract_first()
        # test end of following (guard against missing range/total values)
        if to and total and int(to) < int(total):
            # important, send request of next page
            # default parsing function is 'parse'
            yield Request(next_page)

        for s_link in s_links:
            # locate and extract
            title = s_link.xpath("./p/a/text()").extract_first()
            link = s_link.xpath("./p/a/@href").extract_first()
            if title is None or link is None:
                print('Warning: no title or link found: %s' % response.url)
            else:
                link = urljoin(response.url, link)
                yield ScrapydemoItem(title=title.strip(), link=link)
                # important, send request of ad page
                # parsing function is 'parse_ad'
                yield Request(link, callback=self.parse_ad)

    def parse_ad(self, response):
        ad_title = response.xpath(
            '//span[@id="titletextonly"]/text()').extract_first()
        ad_description = ''.join(response.xpath(
            '//section[@id="postingbody"]//text()').extract())
        if ad_title and ad_description:
            yield AdItem(title=ad_title.strip(), description=ad_description)
        else:
            print('Warning: no title or description found %s' % response.url)

Key Note

  • Two parse functions: parse for requests of the AD list page and parse_ad for requests of a specific AD page.
  • To extract the content of the AD post, you need some tricks; see "How can I get all the plain text from a website with Scrapy". A small cleanup sketch follows below.
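
For example, a minimal sketch of cleaning the posting body (the helper name and the exact noise string are assumptions, based on the output snapshot below):

def clean_posting_body(response):
    # collect every text node of the posting body
    pieces = response.xpath('//section[@id="postingbody"]//text()').extract()
    # drop blank fragments and the "QR Code Link to This Post" helper text
    lines = [p.strip() for p in pieces if p.strip()]
    lines = [l for l in lines if l != 'QR Code Link to This Post']
    return '\n'.join(lines)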

A snapshot of output:

2016-11-10 21:25:14 [scrapy] DEBUG: Scraped from <200 http://sfbay.craigslist.org/eby/npo/5869108363.html>
{'description': '\n'
                '        \n'
                '            QR Code Link to This Post\n'
                '            \n'
                '        \n'
                'Agency History:\n' ........
 'title': 'Staff Accountant'}
2016-11-10 21:25:14 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 39259,
 'downloader/request_count': 117,
 'downloader/request_method_count/GET': 117,
 'downloader/response_bytes': 711320,
 'downloader/response_count': 117,
 'downloader/response_status_count/200': 117,
 'finish_reason': 'shutdown',
 'finish_time': datetime.datetime(2016, 11, 11, 2, 25, 14, 878628),
 'item_scraped_count': 314,
 'log_count/DEBUG': 432,
 'log_count/INFO': 8,
 'request_depth_max': 2,
 'response_received_count': 117,
 'scheduler/dequeued': 116,
 'scheduler/dequeued/memory': 116,
 'scheduler/enqueued': 203,
 'scheduler/enqueued/memory': 203,
 'start_time': datetime.datetime(2016, 11, 11, 2, 24, 59, 242456)}
2016-11-10 21:25:14 [scrapy] INFO: Spider closed (shutdown)

Thanks. I hope this helps; have fun.

answered Oct 13 '22 by rojeeer