I am scraping http://www.starcitygames.com/buylist/ with scrapy-splash. I have to log in to get the data I need, and that part works fine. However, the data is not accessible until the Display button is clicked. An earlier answer told me that I cannot simply click the button and scrape what shows up, and that I need to scrape the JSON page associated with that information instead. I am concerned that scraping the JSON will be a red flag to the owners of the site: most people never open the JSON data page, and it would take a human several minutes to find it, whereas a computer would be much faster. So my question is: is there any way to scrape the page by clicking Display and going from there, or do I have no choice but to scrape the JSON page? This is what I have so far, but it is not clicking the button.
import scrapy
from ..items import NameItem

class LoginSpider(scrapy.Spider):
    name = "LoginSpider"
    start_urls = ["http://www.starcitygames.com/buylist/"]

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formcss='#existing_users form',
            formdata={'ex_usr_email': '[email protected]', 'ex_usr_pass': 'password'},
            callback=self.after_login
        )

    def after_login(self, response):
        item = NameItem()
        display_button = response.xpath('//a[contains(., "Display>>")]/@href').get()
        yield response.follow(display_button, self.parse)
        item["Name"] = response.css("div.bl-result-title::text").get()
        return item
You cannot click a button with Scrapy. You can only send requests and receive responses. It is up to you to interpret the response with a separate JavaScript engine.
Selenium only automates web browser interaction, whereas Scrapy is a whole web crawling framework that downloads HTML, processes the data, and saves it. For scraping I would recommend Scrapy. If the problem is JavaScript, Scrapy already has its own official project for that, called scrapy-splash.
You can use your browser's developer tools to find the request that the click event triggers. The response comes in a convenient JSON format, and the request needs no cookie (no login):
http://www.starcitygames.com/buylist/search?search-type=category&id=5061
The only thing you need to fill in for this request is the category id (the id parameter). It can be extracted from the HTML and declared in your code.
Category name:
//*[@id="bl-category-options"]/option/text()
Category id:
//*[@id="bl-category-options"]/option/@value
Working with JSON is much simpler than parsing HTML.
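As a rough sketch of that extraction step, the snippet below pulls the id/name pairs out of the option elements and builds one search URL per category. It uses only the standard library so it runs without a Scrapy project; inside a spider, the two XPath expressions above would do the same job through response.xpath. The helper names (CategoryOptionParser, category_urls) and the sample HTML are made up for illustration, and the real page markup may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Endpoint taken from the request URL shown above.
SEARCH_URL = "http://www.starcitygames.com/buylist/search"


class CategoryOptionParser(HTMLParser):
    """Collect {category id: category name} from the <option> tags
    under the element whose id is "bl-category-options"."""

    def __init__(self):
        super().__init__()
        self.container_tag = None  # tag name of the matched container, once seen
        self.current_id = None     # value attribute of the <option> being read
        self.categories = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("id") == "bl-category-options":
            self.container_tag = tag
        elif tag == "option" and self.container_tag:
            self.current_id = attrs.get("value")

    def handle_endtag(self, tag):
        if tag == "option":
            self.current_id = None
        elif tag == self.container_tag:
            self.container_tag = None

    def handle_data(self, data):
        name = data.strip()
        # Options with an empty or missing value (placeholders) are skipped.
        if self.current_id and name:
            self.categories[self.current_id] = name


def category_urls(html):
    """Return {category name: JSON search URL} for every option found."""
    parser = CategoryOptionParser()
    parser.feed(html)
    return {
        name: SEARCH_URL + "?" + urlencode({"search-type": "category", "id": cid})
        for cid, name in parser.categories.items()
    }


# Hypothetical markup, only to exercise the helper.
SAMPLE_HTML = """
<select id="bl-category-options">
  <option value="">Select a category</option>
  <option value="5061">Example Set A</option>
  <option value="1234">Example Set B</option>
</select>
"""

if __name__ == "__main__":
    for name, url in category_urls(SAMPLE_HTML).items():
        print(name, url)
```

In a spider, each of these URLs could be handed to a regular scrapy.Request, with the callback decoding the body via json.loads(response.text).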
I tried emulating the click with scrapy-splash, using a Lua script. It works; you just have to integrate it with Scrapy and process the content. Here is the script; integrating it with Scrapy is the remaining step.
function main(splash)
    -- Log in through the login page
    local url = 'https://www.starcitygames.com/login'
    assert(splash:go(url))
    assert(splash:wait(0.5))
    assert(splash:runjs('document.querySelector("#ex_usr_email_input").value = "[email protected]"'))
    assert(splash:runjs('document.querySelector("#ex_usr_pass_input").value = "your_password"'))
    splash:wait(0.5)
    assert(splash:runjs('document.querySelector("#ex_usr_button_div button").click()'))
    splash:wait(3)
    -- Open the buylist, pick a category, then click the display button
    splash:go('https://www.starcitygames.com/buylist/')
    splash:wait(2)
    assert(splash:runjs('document.querySelectorAll(".bl-specific-name")[1].click()'))
    splash:wait(1)
    assert(splash:runjs('document.querySelector("#bl-search-category").click()'))
    splash:wait(3)
    -- Enlarge the viewport so the screenshot covers the results
    splash:set_viewport_size(1200,2000)
    return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
    }
end
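One way to check the script before wiring it into a spider is to send it straight to Splash's HTTP API. The sketch below builds (but does not send) a POST to the /execute endpoint with the script as lua_source, and reads the html field from the JSON reply. It assumes a Splash instance listening on localhost:8050; the function names and the timeout value are made up here. In a scrapy-splash spider, the equivalent should be a SplashRequest with endpoint='execute' and args={'lua_source': script}.

```python
import json
from urllib.request import Request

# Assumption: Splash runs locally on its default port.
SPLASH_EXECUTE_URL = "http://localhost:8050/execute"


def build_execute_request(lua_source, timeout=60):
    """Build a POST request that asks Splash to run the given Lua script."""
    payload = {"lua_source": lua_source, "timeout": timeout}
    return Request(
        SPLASH_EXECUTE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_html(raw_body):
    """The script returns a table with html, png and har keys, which Splash
    sends back as a JSON object; this picks out the rendered HTML."""
    return json.loads(raw_body)["html"]


if __name__ == "__main__":
    # Placeholder script; paste the full main(splash) function from above instead.
    script = "function main(splash) return {html = splash:html()} end"
    request = build_execute_request(script)
    print(request.get_method(), request.full_url)
    # Sending it needs a running Splash instance, e.g.:
    #   from urllib.request import urlopen
    #   html = extract_html(urlopen(request).read())
```

The rendered HTML can then be parsed for div.bl-result-title, the selector used in the question's spider.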