
Test Driven Development (TDD) for Web Scraping

Summary

I have a Python-based web scraping pet project in which I'm trying to practice TDD, but I quickly ran into a problem: the unit tests require an internet connection and have to download HTML text. I understand that the actual parsing can be tested against a local file, but some methods simply redefine the URL and query the website again. This seems to break some TDD best practices (Clean Code by Robert Martin claims that tests should be runnable in any environment). While this is a Python project, I ran into a similar issue using R for Yahoo Finance scraping, and I'm sure this kind of thing is language agnostic. At the very least, it seems to violate a major TDD guideline: tests should run fast.

TL;DR: Are there any best practices for handling network connections in TDD?

Reproducible Example

AbstractScraper.py

from urllib.request import urlopen
from bs4 import BeautifulSoup


class AbstractScraper:

    def __init__(self, url):
        self.url = url
        self.dataDictionary = None

    def makeDataDictionary(self):
        html = urlopen(self.url)
        text = html.read().decode("utf-8")
        soup = BeautifulSoup(text, "lxml")
        self.dataDictionary = {"html": html, "text": text, "soup": soup}

    def writeSoup(self, path):
        with open(path, "w") as outfile:
            outfile.write(self.dataDictionary["soup"].prettify())

TestAbstractScraper.py

import unittest
from http.client import HTTPResponse
from bs4 import BeautifulSoup
from CrackedScrapeProject.scrape.AbstractScraper import AbstractScraper
from io import StringIO


class TestAbstractScraperMethods(unittest.TestCase):

    def setUp(self):
        self.scraper = AbstractScraper("https://docs.python.org/2/library/unittest.html")
        self.scraper.makeDataDictionary()

    def test_dataDictionaryContents(self):
        self.assertTrue(isinstance(self.scraper.dataDictionary, dict))
        self.assertTrue(isinstance(self.scraper.dataDictionary["html"], HTTPResponse))
        self.assertTrue(isinstance(self.scraper.dataDictionary["text"], str))
        self.assertTrue(isinstance(self.scraper.dataDictionary["soup"], BeautifulSoup))
        self.assertSetEqual(set(self.scraper.dataDictionary.keys()), set(["text", "soup", "html"]))

    def test_writeSoup(self):
        filePath = "C:/users/athompson/desktop/testFile.html"
        self.scraper.writeSoup(filePath)
        # Read the file back in a with-block so the handle is closed promptly.
        with open(filePath, "r") as infile:
            self.writtenData = infile.read()
        self.assertEqual(self.writtenData, self.scraper.dataDictionary["soup"].prettify())

if __name__ == '__main__':
    suite = unittest.TestLoader().loadTestsFromTestCase(TestAbstractScraperMethods)
    unittest.TextTestRunner(verbosity=2).run(suite)

Alex Thompson asked Dec 25 '16 21:12


1 Answer

As you said, tests run during TDD must be fast, and they must also be deterministic (what happens if the connection breaks?). As was mentioned in the comments, this typically means that you have to use mocks for these troublesome dependencies.
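As a minimal sketch of what mocking the network dependency could look like, the example below replaces urlopen with a fake via unittest.mock. PageFetcher is a hypothetical, pared-down stand-in for AbstractScraper (it skips BeautifulSoup so that only the standard library is needed), and the URL is a placeholder that is never contacted.

```python
import io
import unittest
import urllib.request
from unittest import mock


class PageFetcher:
    """Hypothetical stand-in for AbstractScraper, without BeautifulSoup."""

    def __init__(self, url):
        self.url = url
        self.text = None

    def fetch(self):
        # Same first two steps as makeDataDictionary: open the URL, decode the body.
        response = urllib.request.urlopen(self.url)
        self.text = response.read().decode("utf-8")


class TestPageFetcher(unittest.TestCase):

    def test_fetch_decodes_body_without_network(self):
        # BytesIO offers read(), which is all that fetch() needs from the response.
        fake_response = io.BytesIO(b"<html><body>hello</body></html>")
        with mock.patch("urllib.request.urlopen", return_value=fake_response) as fake_urlopen:
            fetcher = PageFetcher("http://example.invalid/page")
            fetcher.fetch()
        fake_urlopen.assert_called_once_with("http://example.invalid/page")
        self.assertEqual(fetcher.text, "<html><body>hello</body></html>")
```

Note that the question's module uses `from urllib.request import urlopen`. In that case the name has to be patched where it is looked up, which would presumably be "CrackedScrapeProject.scrape.AbstractScraper.urlopen" given the import path in the test file.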

There is, however, one underlying assumption here: namely, that the code you are writing can be sensibly tested with unit tests. What does this mean? It means that there is a reasonably high chance that unit testing will find a bug. In other words, if unit testing is extremely unlikely ever to find a bug in a piece of code, unit testing is not the right tool for that code.

Regarding your function makeDataDictionary, it consists mostly of calls to dependencies. Thus, it seems likely that integration tests (that is, tests that check how your code interacts with the real libraries it uses) will help find bugs: Does your code call the library correctly with the right arguments? Does the library actually provide the results in the form you expect? Is the order of interactions correct? Mocks of the libraries will not answer these questions: if your assumptions about a library are wrong, you will implement your mocks based on those wrong assumptions.
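One way to keep such an integration test from slowing down the fast TDD cycle is to make it opt-in. The sketch below is only an assumption about how that could be arranged: the RUN_NETWORK_TESTS environment variable is a made-up name, and the test is skipped unless it is set. The import path and the URL are taken from the question.

```python
import os
import unittest

# Hypothetical switch: the network test only runs when this is set to "1".
RUN_NETWORK_TESTS = os.environ.get("RUN_NETWORK_TESTS") == "1"


@unittest.skipUnless(RUN_NETWORK_TESTS, "set RUN_NETWORK_TESTS=1 to run network tests")
class TestMakeDataDictionaryIntegration(unittest.TestCase):

    def test_real_page_is_fetched_and_parsed(self):
        # Imported here so this module still loads where bs4 or the project is absent.
        from http.client import HTTPResponse
        from bs4 import BeautifulSoup
        from CrackedScrapeProject.scrape.AbstractScraper import AbstractScraper

        # No mocks: the real urlopen and the real BeautifulSoup are exercised.
        scraper = AbstractScraper("https://docs.python.org/2/library/unittest.html")
        scraper.makeDataDictionary()
        data = scraper.dataDictionary
        self.assertIsInstance(data["html"], HTTPResponse)
        self.assertIsInstance(data["text"], str)
        self.assertIsInstance(data["soup"], BeautifulSoup)
```

Run normally, this reports one skipped test; with the variable set, it contacts the real site and is therefore slow and dependent on the connection, which is acceptable for a test that runs outside the red-green-refactor loop.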

On the other hand, if you mock away all dependencies from makeDataDictionary, what bugs do you expect to find? Possibly (in the last line of the function) the creation of the data dictionary itself could be wrong (like, wrong names for the keys). Thus, from my perspective, this line is the only part of makeDataDictionary where actual unit-testing makes sense.

Consequently, my recommendation in such scenarios is to first separate the code with pure logic (algorithmic code) from the code that is dominated by interactions. For example, create a helper method _makeDataDictionary(html, text, soup) which does nothing but return {"html": html, "text": text, "soup": soup}. Then, apply unit-testing to _makeDataDictionary, but not to makeDataDictionary. In contrast, test makeDataDictionary with integration tests.

This saves a lot of effort on mocking as well: For unit-testing _makeDataDictionary, no mocks are needed. For integration-testing makeDataDictionary, mocks make no sense. For code that calls makeDataDictionary and shall be unit-tested, you are better off stubbing the call to makeDataDictionary as a whole instead of replacing its individual dependencies anyway.
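To illustrate stubbing the call as a whole, here is a sketch with a hypothetical caller, countWords, which is not part of the question's code. The AbstractScraper below is a stand-in with the same interface whose makeDataDictionary raises, so the test would fail loudly if the stub were not in place.

```python
import unittest
from unittest import mock


class AbstractScraper:
    """Stand-in with the same interface as the class in the question."""

    def __init__(self, url):
        self.url = url
        self.dataDictionary = None

    def makeDataDictionary(self):
        raise RuntimeError("network access is not expected in unit tests")


def countWords(scraper):
    """Hypothetical caller: refresh the scraper, then count the words in its text."""
    scraper.makeDataDictionary()
    return len(scraper.dataDictionary["text"].split())


class TestCountWords(unittest.TestCase):

    def test_counts_words_from_stubbed_dictionary(self):
        scraper = AbstractScraper("http://example.invalid/")

        def fake_make():
            # The stub fills in the result directly; urlopen is never involved.
            scraper.dataDictionary = {"html": None, "text": "one two three", "soup": None}

        with mock.patch.object(scraper, "makeDataDictionary", side_effect=fake_make) as stub:
            self.assertEqual(countWords(scraper), 3)
        stub.assert_called_once_with()
```

Only one method is replaced here, instead of separate fakes for urlopen, the response object, and BeautifulSoup.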

In a TDD context, however, this is somewhat difficult to handle: TDD seems not to have a notion of code for which unit-testing is not appropriate. But, with the right amount of thinking ahead (also known as a design phase), you can recognise early whether you should separate algorithmic code from interaction-dominated code. This is another example of why one should not be misled into believing that TDD eliminates the need for proper design work.

Dirk Herrmann answered Sep 24 '22 16:09