 

How to submit a form in Scrapy?

I tried to use Scrapy to log in and collect my project's commit count. Here is the code:

from scrapy.item import Item, Field
from scrapy.http import FormRequest
from scrapy.spider import Spider
from scrapy.utils.response import open_in_browser


class GitSpider(Spider):
    name = "github"
    allowed_domains = ["github.com"]
    start_urls = ["https://www.github.com/login"]

    def parse(self, response):
        formdata = {'login': 'username',
                    'password': 'password'}
        yield FormRequest.from_response(response,
                                        formdata=formdata,
                                        clickdata={'name': 'commit'},
                                        callback=self.parse1)

    def parse1(self, response):
        open_in_browser(response)

After running the code with

scrapy runspider github.py

it should show me the result page of the form submission, which should be a failed login on the same page, since the username and password are fake. However, it shows me the search page instead. The log file is located in pastebin.

How should the code be fixed? Thanks in advance.



2 Answers

Your problem is that FormRequest.from_response() picks up a different form by default: the "search" form. You wanted it to use the "log in" form instead. Provide a formnumber argument (forms are indexed from 0, so formnumber=1 selects the second form on the page, which is the login form):

yield FormRequest.from_response(response,
                                formnumber=1,
                                formdata=formdata,
                                clickdata={'name': 'commit'},
                                callback=self.parse1)
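
Alternatively, if your Scrapy version supports it (formxpath was added around Scrapy 0.22), you can select the form by an XPath expression instead of by position, which keeps working even if the page's forms are reordered. The XPath below, matching the form that contains the password input, is an illustrative assumption, not something from the original answer:

yield FormRequest.from_response(response,
                                # select the form containing the password
                                # input instead of relying on its position
                                formxpath='//form[.//input[@name="password"]]',
                                formdata=formdata,
                                clickdata={'name': 'commit'},
                                callback=self.parse1)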

After applying the change (tested with a fake user), open_in_browser() shows the failed-login response instead of the search page, as expected.
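
As a sanity check (a sketch, not part of the original answer), the callback can verify whether the login actually succeeded, for example by looking for an error marker in the response body; the exact marker text is an assumption about GitHub's failed-login page:

def parse1(self, response):
    # Heuristic: GitHub's failed-login page shows an error banner;
    # the exact wording is an assumption and may change over time.
    if b"Incorrect username or password" in response.body:
        self.log("Login failed")
        return
    self.log("Login succeeded: %s" % response.url)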



An alternative solution using Selenium webdriver:

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import time
from scrapy.contrib.spiders import CrawlSpider


class GitSpider(CrawlSpider):

    name = "gitscrape"
    allowed_domains = ["github.com"]
    start_urls = ["https://www.github.com/login"]

    def __init__(self):
        super(GitSpider, self).__init__()  # keep CrawlSpider initialisation
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        login_form = self.driver.find_element_by_name('login')
        password_form = self.driver.find_element_by_name('password')
        commit = self.driver.find_element_by_name('commit')
        login_form.send_keys("yourlogin")
        password_form.send_keys("yourpassword")
        actions = ActionChains(self.driver)
        actions.click(commit)
        actions.perform()
        # by this point you are logged in to github and have access
        # to all data in the main menu
        time.sleep(3)
        self.driver.close()
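
If you want Scrapy itself to continue crawling with the authenticated session instead of doing everything in the browser, one common pattern is to copy the Selenium session cookies into a Scrapy request. Here is a minimal sketch of that idea; the dashboard URL, the spider name, and parse_dashboard are illustrative, not part of the answer above:

from selenium import webdriver
import time
from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider


class GitSessionSpider(CrawlSpider):
    name = "gitsession"
    allowed_domains = ["github.com"]
    start_urls = ["https://www.github.com/login"]

    def __init__(self):
        super(GitSessionSpider, self).__init__()
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # log in through the real browser, as in the answer above
        self.driver.get(response.url)
        self.driver.find_element_by_name('login').send_keys("yourlogin")
        self.driver.find_element_by_name('password').send_keys("yourpassword")
        self.driver.find_element_by_name('commit').click()
        time.sleep(3)  # crude wait for the login redirect to finish
        # get_cookies() returns the browser session's cookies as dicts
        cookies = {c['name']: c['value'] for c in self.driver.get_cookies()}
        self.driver.close()
        # continue the crawl authenticated, e.g. from the dashboard
        yield Request("https://github.com/dashboard",
                      cookies=cookies,
                      callback=self.parse_dashboard)

    def parse_dashboard(self, response):
        # the commit count could be extracted here with response.xpath(...)
        self.log("Fetched %s as a logged-in user" % response.url)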