
Using Scrapy in Jupyter notebook / accessing response directly

I want to interact directly with a Scrapy response object in a Jupyter notebook, the same way you can after entering the Scrapy shell by typing scrapy shell "some-url" on the command line.

In a notebook, I can run these commands without error:

import scrapy
request = scrapy.Request("some-url")
response = scrapy.http.Response("some-url")

But request and response both have an empty body property. According to the docs:

Typically, Request objects are generated in the spiders and pass across the system until they reach the Downloader, which executes the request and returns a Response object which travels back to the spider that issued the request.

It seems I'm missing the step where "the Downloader" executes a request object and returns a Response object. I can't figure out how that works.

Does anyone know what happens when you run scrapy shell "some-url" on the command line, so I can replicate those steps in a Jupyter notebook?

Note: A very similar question was posted here, and the given answer works for me, but using the additional third-party "Requests" library seems unnecessary and non-ideal.

Asked by Dustin Michels, Apr 18 '18 20:04



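1 Answer

You can approach the problem this way: fetch the page with Requests, then wrap the result in a Scrapy TextResponse so that selectors work on it as they do in the shell.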

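import requests
from scrapy.http import TextResponse

# Download the page with Requests, outside of Scrapy's Downloader.
res = requests.get('some-url')
# Wrap the decoded HTML so that response.css() and response.xpath() are available.
response = TextResponse(res.url, body=res.text, encoding='utf-8')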
Answered by BT Einstein, Oct 10 '22 05:10