
Python data scraping with Scrapy

I want to scrape data from a website that has text fields, buttons, etc. My requirement is to fill in the text fields, submit the form to get the results, and then scrape the data points from the results page.

I want to know whether Scrapy has this feature, or if anyone can recommend another Python library to accomplish this task.

Edit:
I want to scrape the data from the following website:
http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentType

My requirement is to select values from the combo boxes, hit the search button, and scrape the data points from the results page.

P.S. I'm using the selenium Firefox driver to scrape data from another website, but that solution is not good because the driver depends on Firefox's executable, i.e. Firefox must be installed before running the scraper.

The selenium Firefox driver consumes around 100 MB of memory per instance, and my requirement is to run many instances at a time to make the scraping process quick, so there is a memory limitation as well.

Firefox sometimes crashes during execution of the scraper; I don't know why. I also need windowless (headless) scraping, which is not possible with the selenium Firefox driver.

My ultimate goal is to run the scrapers on Heroku, where I have a Linux environment, so the selenium Firefox driver won't work there. Thanks

asked May 28 '13 by Sibtain Norain




2 Answers

Basically, you have plenty of tools to choose from:

  • scrapy
  • beautifulsoup
  • lxml
  • mechanize
  • requests (and grequests)
  • selenium
  • ghost.py

These tools have different purposes but they can be mixed together depending on the task.

Scrapy is a powerful and very smart tool for crawling websites and extracting data. But when it comes to manipulating the page - clicking buttons, filling forms - it becomes more complicated:

  • sometimes it's easy to simulate filling/submitting forms by making the underlying form request directly in scrapy (see the sketch after this list)
  • sometimes you have to use other tools to help scrapy, like mechanize or selenium
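
For the first case, here is a minimal sketch of submitting a form directly in scrapy with FormRequest.from_response; the URL and field name here are hypothetical placeholders, not taken from your site:

from scrapy.http import FormRequest
from scrapy.spider import BaseSpider


class FormSubmitSpider(BaseSpider):
    name = "form_submit"
    start_urls = ['http://example.com/search']  # hypothetical form page

    def parse(self, response):
        # from_response() reads the <form> on the page, keeps its hidden
        # fields, and overrides only the fields given in formdata
        return FormRequest.from_response(
            response,
            formdata={'query': 'some value'},  # hypothetical field name
            callback=self.parse_results)

    def parse_results(self, response):
        # response here is the page returned after the form submission
        pass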

If you make your question more specific, it'll help to understand what kind of tools you should use or choose from.

Take a look at an example of an interesting scrapy & selenium mix. Here, selenium's task is to click the button and provide data for scrapy items:

import time

from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from selenium import webdriver


class ElyseAvenueItem(Item):
    name = Field()


class ElyseAvenueSpider(BaseSpider):
    name = "elyse"
    allowed_domains = ["ehealthinsurance.com"]
    start_urls = [
        'http://www.ehealthinsurance.com/individual-family-health-insurance?action=changeCensus&census.zipCode=48341&census.primary.gender=MALE&census.requestEffectiveDate=06/01/2013&census.primary.month=12&census.primary.day=01&census.primary.year=1971']

    def __init__(self):
        # one shared browser instance for the whole spider
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # load the page in the real browser so its javascript runs
        self.driver.get(response.url)

        # click the "go" button to trigger the search
        el = self.driver.find_element_by_xpath("//input[contains(@class,'btn go-btn')]")
        el.click()

        # crude wait for the results to render; an explicit wait would be more robust
        time.sleep(10)

        # read the rendered results out of the browser and emit scrapy items
        plans = self.driver.find_elements_by_class_name("plan-info")
        for plan in plans:
            item = ElyseAvenueItem()
            item['name'] = plan.find_element_by_class_name('primary').text
            yield item

        self.driver.close()
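
Since you need windowless scraping and lower memory usage, note that selenium can also drive a headless browser instead of Firefox. A minimal sketch using PhantomJS (an assumption on my part - it requires the phantomjs binary to be installed and on PATH):

from selenium import webdriver

# PhantomJS is a headless WebKit browser - no Firefox window or executable needed
driver = webdriver.PhantomJS()
driver.get('http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentType')
print(driver.title)
driver.quit()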

UPDATE:

Here's an example of how to use scrapy in your case:

from scrapy.http import FormRequest
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector

from scrapy.spider import BaseSpider


class AcrisItem(Item):
    borough = Field()
    block = Field()
    doc_type_name = Field()


class AcrisSpider(BaseSpider):
    name = "acris"
    allowed_domains = ["a836-acris.nyc.gov"]
    start_urls = ['http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentType']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        # every <option> in the document-type dropdown on the search form
        document_classes = hxs.select('//select[@name="combox_doc_doctype"]/option')

        # anti-forgery token that the form expects to be posted back with the data
        form_token = hxs.select('//input[@name="__RequestVerificationToken"]/@value').extract()[0]
        # submit the search form once for every document type in the dropdown
        for document_class in document_classes:
            if document_class:
                doc_type = document_class.select('.//@value').extract()[0]
                doc_type_name = document_class.select('.//text()').extract()[0]
                formdata = {'__RequestVerificationToken': form_token,
                            'hid_selectdate': '7',
                            'hid_doctype': doc_type,
                            'hid_doctype_name': doc_type_name,
                            'hid_max_rows': '10',
                            'hid_ISIntranet': 'N',
                            'hid_SearchType': 'DOCTYPE',
                            'hid_page': '1',
                            'hid_borough': '0',
                            'hid_borough_name': 'ALL BOROUGHS',
                            'hid_ReqID': '',
                            'hid_sort': '',
                            'hid_datefromm': '',
                            'hid_datefromd': '',
                            'hid_datefromy': '',
                            'hid_datetom': '',
                            'hid_datetod': '',
                            'hid_datetoy': '', }
                # POST the form to the results endpoint, passing the readable
                # type name along in meta so parse_page can attach it to items
                yield FormRequest(url="http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentTypeResult",
                                  method="POST",
                                  formdata=formdata,
                                  callback=self.parse_page,
                                  meta={'doc_type_name': doc_type_name})

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)

        # each row of the results table on the search results page
        rows = hxs.select('//form[@name="DATA"]/table/tbody/tr[2]/td/table/tr')
        for row in rows:
            item = AcrisItem()
            borough = row.select('.//td[2]/div/font/text()').extract()
            block = row.select('.//td[3]/div/font/text()').extract()

            # skip header/separator rows that have no data in these cells
            if borough and block:
                item['borough'] = borough[0]
                item['block'] = block[0]
                item['doc_type_name'] = response.meta['doc_type_name']

                yield item

Save it as spider.py, run it via scrapy runspider spider.py -o output.json, and in output.json you will see:

{"doc_type_name": "CONDEMNATION PROCEEDINGS ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CERTIFICATE OF REDUCTION ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "COLLATERAL MORTGAGE ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CERTIFIED COPY OF WILL ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CONFIRMATORY DEED ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CERT NONATTCHMENT FED TAX LIEN ", "borough": "Borough", "block": "Block"}
...

Hope that helps.

answered Nov 15 '22 by alecxe

If you simply want to submit the form and extract data from the resulting page, I'd go for:

  • requests to send the POST request
  • BeautifulSoup to extract the chosen data from the result page

Scrapy's added value really lies in its ability to follow links and crawl a website; I don't think it is the right tool for the job if you know precisely what you are looking for. A minimal sketch of that approach follows below.
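
Here is a sketch reusing the endpoint and some of the form fields from the answer above; the exact payload would need checking against the live form, and the document type code here is a hypothetical placeholder (the live form also expects a __RequestVerificationToken scraped from the search page first):

import requests
from bs4 import BeautifulSoup

# same results endpoint as in the scrapy example above
url = 'http://a836-acris.nyc.gov/DS/DocumentSearch/DocumentTypeResult'
payload = {
    'hid_selectdate': '7',
    'hid_doctype': 'AGMT',        # hypothetical document type code
    'hid_max_rows': '10',
    'hid_SearchType': 'DOCTYPE',
    'hid_page': '1',
    'hid_borough': '0',
}

response = requests.post(url, data=payload)
soup = BeautifulSoup(response.text, 'html.parser')

# pull the text out of every row of the results table
for row in soup.find_all('tr'):
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    if cells:
        print(cells)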

answered Nov 15 '22 by icecrime