Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python Design Pattern: using class attributes to store data vs. local function variables

Often I find myself running into the same question. A common pattern is that I create a class that performs some operations. Eg. Loads data, transforms/cleans data, saves data. The question then arises how to pass/save intermediate data. Look at the following 2 options:

import read_csv_as_string, store_data_to_database

class DataManipulator:
    ''' Intermediate data states are saved in self.results'''

    def __init__(self):
        self.results = None

    def load_data(self):
        '''do stuff to load data, set self.results'''
        self.results = read_csv_as_string('some_file.csv')

    def transform(self):
        ''' transforms data, eg get first 10 chars'''
        transformed = self.results[:10]
        self.results = transformed

    def save_data(self):
        ''' stores string to database'''
        store_data_to_database(self.results)

    def run(self):
        self.load_data()
        self.transform()
        self.save_data()

DataManipulator().run()

class DataManipulator2:
    ''' Intermediate data states are not saved but passed along'''


    def load_data(self):
        ''' do stuff to load data, return results'''
        return read_csv_as_string('some_file.csv')

    def transform(self, results):
        ''' transforms data, eg get first 10 chars'''
        return results[:10]

    def save_data(self, data):
        ''' stores string to database'''
        store_data_to_database(data)

    def run(self):
        results = self.load_data()
        trasformed_results = self.transform(results)
        self.save_data(trasformed_results)

DataManipulator2().run()

Now for writing tests, I find DataManipulator2 better since functions can be tested more easily in isolation. At the same time I also like the clean run function of DataManipulator. What is the most pythonic way?

like image 586
tvgriek Avatar asked Feb 03 '23 18:02

tvgriek


1 Answers

Unlike what was said in the other answers, I don't think this is a matter of personal taste.

As you wrote, DataManipulator2 seems, at first sight, easier to test. (But as @AliFaizan stated, it's not so easy to unit test a function that needs a database connection.) And it seems easier to test because it's stateless. A stateless class is not automatically easier to test, but it is easier to understand: for one input, you always get the same output.

But that's not the only point: with DataManipulator2, the order of the actions in run can't be wrong, because each function passes some data to the next one, and the next one can't proceed without this data. That would be more obvious with a statically (and strongly) typed language, because you can't even compile a erroneous run function.

On the contrary, DataManipulator is not easily testable, stateful and doesn't ensure the order of the actions. That's why the method DataManipulator.run is so clean. It's event too clean because its implementation hides something very important: function calls are ordered.

Hence, my answer: prefer the DataManipulator2 implementation to the DataManipulator implementation.

But is the DataManipulator2 perfect? Yes and no. For a quick and dirty implementation, that's the way to go. But let's try to go further.

You need the function run to be public, but load_data, save_data and transform have no reason to be public (by "public" I mean: not marked as implementation detail with an underscore). If you mark them with an underscore, they are not part of the contract anymore and you are not comfortable with testing them. Why? Because the implementation may change without breaking the class contract although there may be tests failures. That's a cruel dilemma: either your class DataManipulator2 has the correct API or it is not fully testable.

Nevertheless, these functions should be testable, but as part of the API of another class. Think of a 3-tier architecture:

  • load_data and save_data are in the data layer
  • transform is in the business layer.
  • the run call is in the presentation layer

Let's try to implement this:

class DataManipulator3:
    def __init__(self, data_store, transformer):
        self._data_store = data_store
        self._transformer = transformer

    def run(self):
        results = self._data_store.load()
        trasformed_results = self._transformer.transform(results)
        self._data_store.save(transformed_results)

class DataStore:
    def load(self):
        ''' do stuff to load data, return results'''
        return read_csv_as_string('some_file.csv')

    def save(self, data):
        ''' stores string to database'''
        store_data_to_database(data)

class Transformer:
    def transform(self, results):
        ''' transforms data, eg get first 10 chars'''
        return results[:10]

DataManipulator3(DataStore(), Transformer()).run()

That's not bad, and the Transformer is easy to test. But:

  • the DataStore is not handy: the file to read is buried in the code and the database too.
  • the DataManipulator should be able to run a Transformer on multiple data samples.

Hence another version that adresses these issues:

class DataManipulator4:
    def __init__(self, transformer):
        self._transformer = transformer

    def run(self, data_sample):
        data = data_sample.load()
        results = self._transformer.transform(data)
        self.data_sample.save(results)

class DataSample:
    def __init__(self, filename, connection)
        self._filename = filename
        self._connection = connection

    def load(self):
        ''' do stuff to load data, return results'''
        return read_csv_as_string(self._filename)

    def save(self, data):
        ''' stores string to database'''
        store_data_to_database(self._connection, data)

with get_db_connection() as conn:
    DataManipulator4(Transformer()).run(DataSample('some_file.csv', conn))

There's one more point: the filename. Try to prefer file-like object to filenames as arguments, because you can test your code with the io module:

class DataSample2:
    def __init__(self, file, connection)
        self._file = file
        self._connection = connection

    ...

dm = DataManipulator4(Transformer())
with get_db_connection() as conn, open('some_file.csv') as f:
    dm.run(DataSample2(f, conn))

With mock objects, it's now very easy to test the behaviour of the classes.

Let's summarize the advantages of this code:

  • the order of actions is ensured (as in DataManipulator2)
  • the run method is as clean as it should be (as in DataManipulator2)
  • the code is modular: you can create a new Transformer, or a new DataSample (load from a DB and save to a csv file for instance)
  • the code is testable: every method is public (in Python sense) but the APIs remain simple.

Of course, this is really (old style) Java-like. In python, you can simply pass the function transform instead of an instance of the Transformer class. But as soon as your transform begins to be complex, a class is a good solution.

like image 89
jferard Avatar answered Feb 06 '23 09:02

jferard