Recently, I've been trying to get to grips with Scrapy. I feel that if I had a better understanding of the architecture, I'd move a lot faster. The current, concrete problem I have is this: I want to store all of the links that Scrapy extracts in a database, not the responses, just the links. This is for sanity checking.
My initial thought was to use the process_links parameter on a rule and generate items in the function that it points to. However, whereas the callback parameter points to a function that is an item generator, the process_links parameter works more like a filter. In the callback function you yield items and they are automatically collected and put in the pipeline. In the process_links function you return a list of links. You don't generate items.
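For reference, here is roughly how the two hooks sit side by side on a rule. This is only a sketch; the spider name, record_links and parse_page are made-up names, not anything Scrapy requires.

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class LinkCheckSpider(CrawlSpider):
        # Hypothetical spider showing the two hooks on a Rule.
        name = "linkcheck"
        start_urls = ["http://example.com"]

        rules = (
            Rule(
                LinkExtractor(),
                callback="parse_page",          # item generator: yields items/requests
                process_links="record_links",   # filter: takes and returns a list of links
                follow=True,
            ),
        )

        def record_links(self, links):
            # Receives the list of Link objects the extractor found and must
            # return a (possibly filtered) list; it does not yield items.
            for link in links:
                self.logger.info("extracted link: %s", link.url)
            return links

        def parse_page(self, response):
            # Receives the downloaded response for each followed link.
            yield {"url": response.url}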
I could just make a database connection in the process_links function and write directly to the database, but that doesn't feel like the right way to go when Scrapy already does its I/O asynchronously through Twisted.
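For context, the usual non-blocking pattern seems to be twisted.enterprise.adbapi inside a pipeline rather than inside process_links. A rough sketch, where the links.db file, the links table and the item field are all assumptions of mine:

    from twisted.enterprise import adbapi

    class LinkStoragePipeline(object):
        """Hypothetical pipeline: writes each item's URL without blocking the reactor."""

        def open_spider(self, spider):
            # adbapi runs the blocking DB-API calls in a thread pool and hands
            # back Deferreds, so the crawl keeps running while rows are written.
            # A links(url TEXT) table is assumed to exist already.
            self.dbpool = adbapi.ConnectionPool(
                "sqlite3", "links.db", check_same_thread=False
            )

        def close_spider(self, spider):
            self.dbpool.close()

        def process_item(self, item, spider):
            d = self.dbpool.runOperation(
                "INSERT INTO links (url) VALUES (?)", (item["url"],)
            )
            # Log failures instead of letting them kill the crawl.
            d.addErrback(lambda failure: spider.logger.error(failure))
            return item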
I could try to pass items from the process_links function to the callback function, but I'm not sure about the relationship between the two functions. One is used to generate items, while the other receives a list and has to return a list.
In trying to think this through, I keep coming up against the fact that I don't understand the control loop within Scrapy. What is the process that reads the items yielded by the callback function? What is the process that supplies the links to, and receives the links from, the process_links function? And the one that takes requests and returns responses?
From my point of view, I write code in a spider which generates items. The items are automatically read and moved through a pipeline. I can put code in the pipeline and the items will be automatically passed into, and taken out of, that code. What's missing is my understanding of exactly how these items get moved through the pipeline.
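As far as I can tell, the wiring is small: whatever a callback yields is handed to every class registered in ITEM_PIPELINES, lowest number first. A sketch with made-up project and class names:

    # settings.py -- register the pipeline; lower numbers run first (convention: 0-1000).
    ITEM_PIPELINES = {
        "myproject.pipelines.SanityCheckPipeline": 300,
    }

    # pipelines.py -- every yielded item is passed through process_item.
    class SanityCheckPipeline(object):
        def process_item(self, item, spider):
            spider.logger.debug("pipeline saw item from %s: %r", spider.name, item)
            return item  # returning the item passes it on to the next pipeline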
Looking through the code, I can see that the base code for a spider is hiding away in a corner, as all good spiders should, going under the name of __init__.py. It contains the start_requests() and make_requests_from_url() functions, which according to the docs are the starting points. But it isn't a controlling loop; it's being called by something else.
Going from the opposite direction, I can see that when I execute the command scrapy crawl... I'm calling crawl.py, which in turn calls self.crawler_process.start() in crawler.py. That starts a Twisted reactor. There is also core/engine.py, which is another collection of functions that look as though they are designed to control the operation of the spiders.
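Running a crawl from a plain script shows that same chain in condensed form. A sketch, with a hypothetical spider of my own:

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class TinySpider(scrapy.Spider):
        name = "tiny"
        start_urls = ["http://example.com"]

        def parse(self, response):
            yield {"url": response.url}

    # scrapy crawl ... ends up doing roughly this:
    process = CrawlerProcess()
    process.crawl(TinySpider)  # the crawler builds an execution engine for the spider
    process.start()            # starts the Twisted reactor and blocks until the crawl ends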
Despite looking through the code, I don't have a clear mental image of the entire process. I realise that the idea of a framework is that it hides much of the complexity, but I feel that with a better understanding of what is going on, I could make better use of the framework.
Sorry for the long post. If anyone can give me an answer to my specific problem of saving links to the database, that would be great. And if you were able to give a brief overview of the architecture, that would be extremely helpful.
This is how Scrapy works in short:
- You give the spider its starting URLs; the requests for them come from the start_requests method.
- You can define a callback for the requests made in the start_requests method. If you don't, Scrapy will use the parse method as the callback.
- The response object you get in the parse callback allows you to extract the data using CSS selectors or XPath.
- You can construct Items and yield them. If you need to go to another page, you can yield scrapy.Request.
- If you yield an Item object, Scrapy will send it through the registered pipelines. If you yield scrapy.Request, the request will be further processed and the response will be fed back to a callback. Again, you can define a separate callback or use the default one.
- The yielded items (each Item) go through the pipeline processors. In the pipelines you can store them in a database or do whatever else you want.

So in short:

- In the parse method, or in any method inside the spider, we extract and yield our data so it is sent through the pipelines.
- In the pipelines, you do the actual processing.
Here's a simple spider and pipeline example: https://gist.github.com/masnun/e85b38a00a74737bb3eb
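In case that link ever goes stale, here is a comparable sketch of the same shape (my own names and selectors, not necessarily what the gist contains): a spider that yields items and follow-up requests, plus a pipeline that receives the items.

    import scrapy

    class QuoteSpider(scrapy.Spider):
        # Hypothetical spider: yields items for the pipeline and requests for more pages.
        name = "quotes"
        start_urls = ["http://quotes.toscrape.com/"]

        def parse(self, response):
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").extract_first(),
                    "author": quote.css("small.author::text").extract_first(),
                }
            next_page = response.css("li.next a::attr(href)").extract_first()
            if next_page:
                yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    class PrintPipeline(object):
        # Registered via ITEM_PIPELINES; receives every yielded item.
        def process_item(self, item, spider):
            spider.logger.info("got item: %r", item)
            return item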
I started using Scrapy not so long ago and I had some of the same doubts myself (not least because I was new to Python overall), but now it works for me, so don't get discouraged – it's a nice framework.
First, I would not get too worried at this stage about the details behind the framework, but rather start writing some basic spiders yourself.
Some of the really key concepts are:
- start_urls – this defines the initial URL (or URLs) where you will look either for text or for further links to crawl. Let's say you want to start from e.g. http://x.com
- the parse(self, response) method – this is the first method that gets called, and it receives the Response for http://x.com (basically its HTML markup). You can use XPath or CSS selectors to extract information from this markup, e.g. a = response.xpath('//div[@class="foo"]/a/@href') will extract the link to a page (e.g. http://y.com).
- If you only want to extract the text of the link, so literally "http://y.com", you just yield (return) an item within the parse(self, response) method, so your final statement in this method will be yield item. If you want to go deeper and drill down into http://y.com, your final statement will be yield scrapy.Request(a, callback=self.parse_final), parse_final being here an example callback bound to a parse_final(self, response) method.
- Then you can extract the elements of the HTML of http://y.com as the final step in the parse_final(self, response) method, or keep repeating the process to dig for further links in the page structure.
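Putting those pieces together, a rough sketch (x.com, y.com, the XPath and parse_final are just the placeholder names from above):

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"
        start_urls = ["http://x.com"]

        def parse(self, response):
            # First callback: runs on the response for http://x.com.
            a = response.xpath('//div[@class="foo"]/a/@href').extract_first()
            if a:
                # Follow the extracted link (e.g. http://y.com) to a second callback.
                yield scrapy.Request(response.urljoin(a), callback=self.parse_final)

        def parse_final(self, response):
            # Second callback: runs on the response for http://y.com.
            yield {
                "url": response.url,
                "title": response.xpath("//title/text()").extract_first(),
            }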
Pipelines are for processing items. When items get yielded, by default they are just logged to the console. In pipelines you can redirect them to a CSV file, a database, etc.
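For example, a pipeline that appends each item to a CSV file could look roughly like this (the file name and field names are assumptions):

    import csv

    class CsvExportPipeline(object):
        # Hypothetical pipeline: writes each yielded item as a CSV row.

        def open_spider(self, spider):
            self.file = open("items.csv", "w", newline="")
            self.writer = csv.writer(self.file)

        def close_spider(self, spider):
            self.file.close()

        def process_item(self, item, spider):
            self.writer.writerow([item.get("url"), item.get("title")])
            return item

Scrapy's built-in feed exports (e.g. scrapy crawl myspider -o items.csv) also cover the simple CSV case, so a custom pipeline is mainly for anything more involved.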
The whole process gets more complex when you start collecting more links in each of the methods and calling different callbacks based on various conditions, and so on. I think you should get this concept down first, before moving on to pipelines. The examples from Scrapy are somewhat difficult to follow at first, but once you get the idea it is really nice and not that complicated in the end.