Scrapy only scrapes the first start url in a list of 15 start urls

Question

I am new to Scrapy and am attempting to teach myself the basics. I have compiled a code that goes to the Louisiana Department of Natural Resources website to retrieve the serial number for certain oil wells.

I have each well's link listed in the start URLs command, but scrappy only downloads data from the first url. What am I doing wrong?

import scrapy
from scrapy import Spider
from scrapy.selector import Selector
from mike.items import MikeItem

class SonrisSpider(Spider):
   name = "sspider"

   start_urls = [
    "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=207899",
    "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=971683",
    "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=214206",
    "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=159420",
    "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=243671",
    "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=248942",
    "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=156613",
    "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=972498",
    "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=215443",
    "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=248463",
    "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=195136",
    "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=179181",
    "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=199930",
    "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=203419",
    "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=220454",

]


    def parse(self, response):
       item = MikeItem()
       item['serial'] =    response.xpath('/html/body/table[1]/tr[2]/td[1]/text()').extract()[0]
       yield item

Thank you for any help you might be able to provide. If I have not explained my problem thoroughly, please let me know and I will attempt to clarify.

Jithin · Accepted Answer

I think this code might help,

By default scrapy prevent duplicate requests. Since only the parameters are different in your start-url scrapy will consider the rest of the urls in the start-url as duplicate request of the first one. That's why your spider stops after fetching the first url. In order to parse the rest of the urls we have enable dont_filter flag in the scrapy request. (chek the start_request())

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from mike.items import MikeItem


class SonrisSpider(scrapy.Spider):
    name = "sspider"
    allowed_domains = ["sonlite.dnr.state.la.us"]
    start_urls = [
                "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=207899",
                "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=971683",
                "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=214206",
                "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=159420",
                "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=243671",
                "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=248942",
                "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=156613",
                "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=972498",
                "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=215443",
                "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=248463",
                "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=195136",
                "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=179181",
                "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=199930",
                "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=203419",
                "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_WSN=220454",
            ]

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse_data, dont_filter=True)

    def parse_data(self, response):
        item = MikeItem()
        serial = response.xpath(
            '/html/body/table[1]/tr[2]/td[1]/text()').extract()
        serial = serial[0] if serial else 'n/a'
        item['serial'] = serial
        yield item

sample output returned by this spider is as follows,

{'serial': u'207899'}
{'serial': u'971683'}
{'serial': u'214206'}
{'serial': u'159420'}
{'serial': u'248942'}
{'serial': u'243671'}

Scrapy only scrapes the first start url in a list of 15 start urls

Tags:

python

web-scraping

scrapy

Josiah Hulsey

1 Answers

Jithin

Recent Activity

Donate For Us

Scrapy only scrapes the first start url in a list of 15 start urls

Tags:

python

web-scraping

scrapy

Josiah Hulsey

1 Answers

Jithin

Related questions

Recent Activity

Donate For Us