How to work with Scrapy contracts?

Scrapy Contracts Problem

I started working with the Scrapy framework and implemented some spiders for extraction, but I am not able to write a unit test case for a spider because the contracts documentation provided by Scrapy doesn't describe a proper procedure for writing test cases. Please help me with this.

asked Sep 10 '14 by bhadram


2 Answers

Yes, the Spiders Contracts documentation is far from clear and detailed.

I'm not an expert in writing spider contracts (I actually wrote them only once, while working on a web-scraping tutorial at newcoder.io). But whenever I needed to write tests for Scrapy spiders, I preferred to follow the approach suggested here - create a fake response from a local HTML file. It is arguable whether this is still a unit testing procedure, but it gives you far more flexibility and robustness.
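
A minimal sketch of that approach, assuming the page has been saved to a local fixture file (the file path, spider module, and class names below are placeholders for illustration): build a fake HtmlResponse from the saved HTML and feed it straight to the callback.

import os

from scrapy.http import HtmlResponse, Request


def fake_response_from_file(file_name, url='http://www.example.com'):
    """Create a Scrapy HtmlResponse from a local HTML fixture file."""
    path = os.path.abspath(file_name)
    with open(path, 'rb') as f:
        body = f.read()
    request = Request(url=url)
    return HtmlResponse(url=url, request=request, body=body, encoding='utf-8')

A unittest-style test can then call the callback directly and make assertions on whatever it yields:

import unittest

from myproject.spiders.quotes_spider import QuotesSpider  # placeholder import


class QuotesSpiderTest(unittest.TestCase):

    def test_parse(self):
        spider = QuotesSpider()
        response = fake_response_from_file('fixtures/quotes.html',
                                           url='http://quotes.toscrape.com/')
        results = list(spider.parse(response))
        self.assertTrue(results)  # the callback extracted at least one thing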

Note that you can still write contracts, but you will quickly feel the need to extend them and write custom contracts, which is pretty much OK.

Relevant links:

  • Scrapy Unit Testing
  • Scrapy Contracts Evolution
answered Nov 15 '22 by alecxe


Scrapy Contracts

Testing spiders

The two most basic questions in testing a spider might be:

  1. will/did my code change break the spider?
  2. will/did the spider break because the page I'm scraping changed?

Contracts

Scrapy offers a means for testing spiders: contracts.

Contracts can look a bit magical. They live in multi-line docstrings. The contract "syntax" is: @contract_name <arg>. You can create your own contracts, which is pretty neat.

To use a contract, you prepend an @ to the name of a contract. The name of a contract is specified by the .name attribute on the given contract subclass. These contract subclasses are either built-in or custom ones that you create.

Finally, the above-mentioned docstring must live in the callbacks of your spiders. Here's an example of some basic contracts living in the parse callback, the default callback.

def parse(self, response):
  """This function gathers the author and the quote text.

  @url http://quotes.toscrape.com/
  @returns items 1 16
  @returns requests 0 0
  @scrapes author quote_text
  """

You can run this contract via scrapy check; alternatively, list your contracts with scrapy check -l.
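
For context, here is a minimal spider that this docstring could live in. Only the docstring contracts come from the example above; the spider name and the CSS selectors are assumptions added for illustration.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'  # assumed name; checked with `scrapy check quotes`
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        """This function gathers the author and the quote text.

        @url http://quotes.toscrape.com/
        @returns items 1 16
        @returns requests 0 0
        @scrapes author quote_text
        """
        # The selectors assume the markup used on quotes.toscrape.com
        for quote in response.css('div.quote'):
            yield {
                'author': quote.css('small.author::text').get(),
                'quote_text': quote.css('span.text::text').get(),
            }

When you run scrapy check, Scrapy requests the @url, runs parse on the response, and validates the output against the @returns and @scrapes lines.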

Contracts in more depth

The contracts above are checked using three built-in contract classes:

  • scrapy.contracts.default.UrlContract
  • scrapy.contracts.default.ReturnsContract
  • scrapy.contracts.default.ScrapesContract

The UrlContract is mandatory and isn't really a contract as it is not used for validation. The @url contract is used to set the URL that the spider will crawl when testing the spider via scrapy check. In this case, we're specifying http://quotes.toscrape.com/. But we could've specified http://127.0.0.1:8080/home-11-05-2019-1720.html which is the local version of quotes.toscrape.com that I saved with the scrapy view http://quotes.toscrape.com/ command.
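
As a sketch of that swap (assuming something is serving the saved file at 127.0.0.1:8080, e.g. python -m http.server 8080 run from the directory that holds it), only the @url line changes:

def parse(self, response):
  """Same contracts as before, but checked against the locally saved copy.

  @url http://127.0.0.1:8080/home-11-05-2019-1720.html
  @returns items 1 16
  @returns requests 0 0
  @scrapes author quote_text
  """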

The ReturnsContract is used to check the output of the callback you're testing. As you can see, the contract is called twice, with different args. You can't just put any ol' arg in there though. Under the hood, there is a dictionary of expected args:

objects = {
  'request': Request,
  'requests': Request,
  'item': (BaseItem, dict),
  'items': (BaseItem, dict),
}

Our contract specifies that our spider @returns items 1 16. The 1 and the 16 are lower and upper bounds. The upper bound is optional; under the hood it is set to infinity if not specified 😆.

try:
    self.max_bound = int(self.args[2])
except IndexError:
    self.max_bound = float('inf')

But yeah, the @returns contract helps you know whether your spider returns the expected number of items or requests.
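
Roughly, the general form (as described in ReturnsContract's own docstring) is @returns request(s)/item(s) [min [max]]; for example:

# Illustrative variations of the @returns contract:
#   @returns items          # at least one item (min defaults to 1)
#   @returns items 4        # at least four items, no upper bound
#   @returns requests 0 10  # between zero and ten requests
#   @returns requests 0 0   # no requests at all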

Finally, the @scrapes contract is the last built-in. It is used to check the presence of fields in scraped items. It just goes through the items output by your callback and builds a list of missing fields:

class ScrapesContract(Contract):
    """ Contract to check presence of fields in scraped items
        @scrapes page_name page_body
    """

    name = 'scrapes'

    def post_process(self, output):
        for x in output:
            if isinstance(x, (BaseItem, dict)):
                missing = [arg for arg in self.args if arg not in x]
                if missing:
                    raise ContractFail(
                        "Missing fields: %s" % ", ".join(missing))

Running contracts

Run: scrapy check

If all goes well, you see:

...
----------------------------------------------------------------------
Ran 3 contracts in 0.140s

OK

If something explodes, you see:

F..
======================================================================
FAIL: [example] parse (@returns post-hook)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/adnauseum/.virtualenvs/scrapy_testing-CfFR3tdG/lib/python3.7/site-packages/scrapy/contracts/__init__.py", line 151, in wrapper
    self.post_process(output)
  File "/Users/adnauseum/.virtualenvs/scrapy_testing-CfFR3tdG/lib/python3.7/site-packages/scrapy/contracts/default.py", line 90, in post_process
    (occurrences, self.obj_name, expected))
scrapy.exceptions.ContractFail: Returned 10 items, expected 0

----------------------------------------------------------------------

Custom contracts

Let's say you want a @has_header X-CustomHeader contract. This will check that the responses your spider processes carry the X-CustomHeader header. Scrapy contracts are just classes that have three overridable methods: adjust_request_args, pre_process, and post_process. From there, you'll need to raise ContractFail from pre_process or post_process whenever expectations are not met.

from scrapy.contracts import Contract
from scrapy.exceptions import ContractFail

class HasHeaderContract(Contract):
  """Demo contract which checks the presence of a custom header
  @has_header X-CustomHeader
  """
  name = 'has_header' # the name used after @ in contract docstrings

  def pre_process(self, response):
    for header in self.args:
      if header not in response.headers:
        raise ContractFail(f"{header} not present")
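
For scrapy check to pick the custom contract up, it also has to be registered via the SPIDER_CONTRACTS setting; the dotted path below is a placeholder for wherever the class actually lives.

# settings.py
SPIDER_CONTRACTS = {
    'myproject.contracts.HasHeaderContract': 500,
}

After that, @has_header can be used in a callback docstring just like the built-ins:

def parse(self, response):
  """
  @url http://quotes.toscrape.com/
  @has_header X-CustomHeader
  """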

Why are contracts useful?

It looks like contracts can be useful for helping you know two things:

  1. your code changes didn't break things

    • Seems like it might be a good idea to run the spider against local copies of the page you're scraping and use contracts to validate that your code changes didn't break anything. In this case, you're controlling the page being scraped and you know it is unchanged. Thus, if your contracts fail, you know that it was your code change.
    • In this approach, it might be useful to name these HTML fixtures with some kind of timestamp, for record keeping, e.g. Site-Page-07-14-2019.html. You can save these pages by running scrapy view <url>. Scrapy will open the page in your browser, but will also save an HTML file with everything you need.
  2. the page you're scraping didn't change (in ways that affect you)

    • Then you could also run your spider against the real thing and let the contracts tell you that what you're scraping has changed.

Though contracts are useful, you'll likely have to do more to thoroughly test your spider. For instance, the number of items you're scraping isn't guaranteed to be constant. In that case, you might consider crawling a mock server and running tests against the items collected. There's a dearth of documentation and best practices, it seems.
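
One rough sketch of that idea (QuotesSpider is the assumed spider from the earlier example; pointing it at a mock server, for example by overriding start_urls, is left out): collect every scraped item via the item_scraped signal and assert on the result afterwards.

from scrapy import signals
from scrapy.crawler import CrawlerProcess

collected_items = []

def collect_item(item, response, spider):
    collected_items.append(item)

process = CrawlerProcess()
crawler = process.create_crawler(QuotesSpider)
crawler.signals.connect(collect_item, signal=signals.item_scraped)
process.crawl(crawler)
process.start()  # blocks until the crawl finishes

# Assertions on what actually came back
assert len(collected_items) > 0
assert all('author' in item for item in collected_items)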

Finally, there is a project made by Scrapinghub, Spidermon, which is useful for monitoring your spider while it's running: https://spidermon.readthedocs.io/en/latest/getting-started.html

You can validate scraped items according to model definitions and get stats on your spider (current num items scraped, num items that don't meet validation, etc).

answered Nov 15 '22 by adnauseam