I have a question about how to make a good design for my program. My program is quite simple but I want to have good architecture and make my program easily extensible in the future.
My program needs to fetch data from external data sources (XML), extract information from that data, and finally prepare SQL statements to import the information into a database. So for every external data source, current and future, the application follows the same simple 'flow': fetch, extract and load.
I was thinking about creating generic classes called: DataFetcher, DataExtractor and DataLoader and then write specific ones that will inherit from them. I suppose I will need some factory design pattern, but which? FactoryMethod or Abstract Factory?
I also want to avoid code like this:
    if data_source == 'X':
        fetcher = XDataFetcher()
    elif data_source == 'Y':
        fetcher = YDataFetcher()
    ....
Ideally (I am not sure whether this is easily possible), I would like to write a new 'data source processor', add one or two lines to the existing code, and have my program load data from the new data source.
How can I make use of design patterns to accomplish these goals? Examples in Python would be great.
If the fetchers all have the same interface, you can use a dictionary:
    fetcher_dict = {'X': XDataFetcher, 'Y': YDataFetcher}
    data_source = ...
    fetcher = fetcher_dict[data_source]()
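As a minimal sketch of how that lookup might be wrapped so that an unknown data source fails with a clear message: the two stub fetcher classes, the FETCHERS name and the get_fetcher helper below are all made up for illustration and are not part of the question's actual code.

```python
class XDataFetcher:
    """Hypothetical stub standing in for a real fetcher."""
    def fetch(self):
        return '<data source="X"/>'


class YDataFetcher:
    """Hypothetical stub standing in for a real fetcher."""
    def fetch(self):
        return '<data source="Y"/>'


# Supporting a new data source means adding one entry here.
FETCHERS = {'X': XDataFetcher, 'Y': YDataFetcher}


def get_fetcher(data_source):
    """Instantiate the fetcher registered for data_source (hypothetical helper)."""
    try:
        fetcher_class = FETCHERS[data_source]
    except KeyError:
        raise ValueError('Unknown data source: {!r}'.format(data_source)) from None
    return fetcher_class()
```

With this shape, the "one or two lines" needed for a new source are the new class itself plus its dictionary entry.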
As for keeping things flexible: just write clean, idiomatic code. I tend to like the "You ain't gonna need it" (YAGNI) philosophy. If you spend too much time trying to look into the future to figure out what you're going to need, your code will end up too bloated and complex to make the simple adjustments when you find out what you actually need. If the code is clean up front, it should be easy enough to refactor later to suit your needs.
You have neglected to talk about the most important part, i.e. the shape of your data. That's really the most important thing here. "Design Patterns" are a distraction: many of these patterns exist because of language limitations that Python doesn't have, and they introduce unnecessary rigidity.
For example, the interface for an "extractor" might be "an iterable that yields XML strings". Note that this could be either a generator or a class with an __iter__ and a __next__ method (next() in Python 2)! No need to define an abstract class and subclass it!
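To illustrate, here is a minimal sketch of the same "iterable that yields XML strings" interface satisfied in two ways. The names xml_from_list, XmlSource and count_documents are invented for the example; the consumer relies only on being able to iterate over its argument.

```python
def xml_from_list(documents):
    """Generator function: yields one XML string at a time."""
    for doc in documents:
        yield doc


class XmlSource:
    """Class-based iterator offering the same interface via __iter__ and __next__."""

    def __init__(self, documents):
        self._documents = list(documents)
        self._index = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self._index >= len(self._documents):
            raise StopIteration
        doc = self._documents[self._index]
        self._index += 1
        return doc


def count_documents(source):
    """Consumer that only assumes `source` is iterable."""
    return sum(1 for _ in source)


docs = ['<a/>', '<b/>']
```

Both xml_from_list(docs) and XmlSource(docs) can be passed to count_documents without either one inheriting from a shared base class.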
What kind of configurable polymorphism you add to your data depends on the exact shape of your data. For example, you could rely on a naming convention:
    # persisters.py
    def persist_foo(data):
        pass

    # main.py
    import persisters

    data = {'type': 'foo', 'values': {'field1': 'a', 'field2': [1, 2]}}
    try:
        # Look up the persister by naming convention: persist_<type>
        foo_persister = getattr(persisters, 'persist_' + data['type'])
    except AttributeError:
        # no 'foo' persister is available!
        foo_persister = None
Or if you need further abstraction (maybe you need to add new modules you can't control), you could use a registry (which is just a dict) and a module convention:
    # registry.py
    def register(registry, method, type_):
        """Return a decorator that registers a callable in a registry for the method and type."""
        def register_decorator(callable_):
            registry.setdefault(method, {})[type_] = callable_
            return callable_
        return register_decorator

    def merge_registries(r1, r2):
        """Merge the entries of r2 into r1, method by method."""
        for method, callables in r2.items():
            r1.setdefault(method, {}).update(callables)

    def get_callable(registry, method, type_):
        try:
            return registry[method][type_]
        except KeyError:
            raise KeyError('No {} method for type {} in registry'.format(method, type_))

    def retrieve_registry(module):
        """Return the module's registry, or an empty one if it has no get_registry()."""
        try:
            return module.get_registry()
        except AttributeError:
            return {}

    def add_module_registry(yourregistry, *modules):
        for module in modules:
            merge_registries(yourregistry, retrieve_registry(module))

    # extractors.py
    from registry import register

    _REGISTRY = {}

    def get_registry():
        return _REGISTRY

    @register(_REGISTRY, 'extract', 'foo')
    def foo_extractor(abc):
        print('extracting_foo')

    # main.py
    import extractors, registry

    my_registry = {}
    registry.add_module_registry(my_registry, extractors)
    foo_extracter = registry.get_callable(my_registry, 'extract', 'foo')
You can easily build a global registry on top of this structure if you want (although you should avoid global state even if it's a little less convenient.)
If you are building a public framework, need maximum extensibility and formalism, and are willing to pay for it in complexity, you can look at something like zope.interface (which is used by Pyramid).
Rather than roll your own extract-transform-load app, have you considered Scrapy? Using Scrapy you would write a "Spider" which is given a string and returns sequences of Items (your data) or Requests (requests for more strings, e.g. URLs to fetch). The Items are sent down a configurable item pipeline which does whatever it wants with the items it receives (e.g. persist in a DB) before passing them along.
Even if you don't use Scrapy, you should adopt a data-centric pipeline-like design and prefer thinking in terms of abstract "callable" and "iterable" interfaces instead of concrete "classes" and "patterns".
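As a rough sketch of what such a pipeline could look like for the fetch, extract, load flow from the question: each stage is a plain function that takes an iterable and returns an iterable. The XML layout, the items table, the qmark-style placeholders (as used by sqlite3) and all function names are assumptions made up for this example, and fetching is faked with an in-memory list.

```python
import xml.etree.ElementTree as ET


def fetch(sources):
    """Fetch stage: yield raw XML strings (here the 'sources' already are the XML)."""
    for xml_text in sources:
        yield xml_text


def extract(xml_strings):
    """Extract stage: yield one dict per <item> element."""
    for xml_text in xml_strings:
        root = ET.fromstring(xml_text)
        for item in root.iter('item'):
            yield {'name': item.get('name'), 'value': item.text}


def load(records):
    """Load stage: yield (sql, params) pairs ready for a DB-API cursor.execute()."""
    for record in records:
        yield ('INSERT INTO items (name, value) VALUES (?, ?)',
               (record['name'], record['value']))


def run_pipeline(sources, stages):
    """Chain the stages: the output iterable of one is the input of the next."""
    data = sources
    for stage in stages:
        data = stage(data)
    return data


sample = ['<data><item name="a">1</item><item name="b">2</item></data>']
statements = list(run_pipeline(sample, [fetch, extract, load]))
```

Supporting a new data source would then mean passing a different list of stage callables to run_pipeline, with no class hierarchy involved.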