I have an assortment of similar (but not identical) databases, and would like to use SQLAlchemy as a way to "standardize" access. The databases can differ very slightly, such as having a unique prefix on the column names, or they can differ more dramatically and be missing columns (or for old databases, missing entire tables).
What I'm looking for help on isn't so much an SQLAlchemy problem as it is a Python/Organizational one. How can I have multiple databases setup that can be reused in projects easily?
I've read about SQLAlchemy sessions, but can't see a way of using those without instantiating each of them every project.
My question is this: How can I go about making a module/package that will contain many database model setups to be used in SQLAlchemy that can be easily imported/used in another python project?
I'm not so worried about dealing with the missing columns/tables as of yet. I can tackle that problem later, but it is something that needs to be kept in mind as I can't use the exact same model for each database.
Any resources, pointers, or reading material on this topic would be truly appreciated. Thanks in advance, and I'm sorry if this has been answered elsewhere, searches didn't show anything relating to this.
EDIT: I've left the original intact, and am adding more content based on Paul's advice.
RE: SA ORM - Yes, I plan on using the SQLAlchemy ORM. For what are likely obvious reasons, I can't provide real databases. However, assume these three fictitious databases, aptly named DB1, DB2, and DB3 (we'll assume one table in each, with only a few columns, real world would have significantly more of both).
Each database has a user table with a few columns in each. The following is some SQL notation for the tables/columns:
DB1.user --> DB1.user.id, DB1.user.username, DB1.user.email DB2.user --> DB2.user.id, DB2.user.user_name, DB2.user.email DB3._user --> DB3._user.userid, DB3._user.username, DB3.user.email_address
At present, I am trying to separate out these databases to 'modular', and be able to just add additional databases as I go.
I've consider a couple different file organization aspects (assume __init__.py exists where needed, but omitted for brevity's sake), including:
Databases | Databases | Databases DB1.py | DB1 | DB1 DB2.py | models.py | models DB3.py | DB2 | user.py | models.py | anothertable.py | DB2 | ... | models.py | DB3 | | models | | user.py | | anothertable.py
I'd love to be able to access these with the SA ORM, and do so with as little importing/declarations as possible when it comes time to use these databases in a python file. Needing to do something similar to:
from sqlalchemy import create_engine from sqlalchemy.orm import sessionmaker from Database import DB1, ..., DB[N] db1_engine = create_engine('connect_string/db1') db1_session = sessionmaker(bind=db1_engine)() ... db3_engine = create_engine('connect_string/db3') db3_session = sessionmaker(bind=db3_engine)()
would be incredibly cumbersome as I'll be dealing with far more than just three databases. I'd much rather have that already taken care of for me (in the __init__.py file, maybe?)
Being able to access and use it similar to:
import Databases Databases.DB1.session.query('blahblah')
would be infinitely better.
EDIT2: I also know how to get around the variants in the naming conventions of the databases/columns when setting up my models. That isn't an issue, but I did mention it so that it was known I can't just use one model set for multiple databases.
I hope by expanding this I didn't muddy the waters or make this too confusing. Thanks for taking the time to read it!
EDIT3: I've managed to spend a little more time on this. I've set up the project in the following way:
Databases __init__.py databases.py settings.py DB1 __init__.py models.py ... DB3 __init__.py models.py
At present, I have a tupple of databases that are 'installed' in the settings.py file. Each entry is would like like INSTALLED_DATABASES = ('DB1', ..., 'DB3')
. As I complete more models, and they get added to the list of tupples. This allows me to add or remove content as I go.
I have the engine and sessios setup inside the models.py file, and have the init.py file for each database setup to from models import *
.
In the databases.py file I have the following
class Databases(object): def __init__(self): for database in INSTALLED_DATABASES: setattr(self, database, __import__(database))
I can now use these via:
from databases import Databases db = Databases() for qr in db.DB1.query(db.DB1.User): print qr.userid, qr.username
SQLAlchemy is allowing me to manually specify column names when defining the models, which is a huge bonus to the standardization I wanted.
I've got a lot of work ahead of me still. I would like to create objects that force model validation (ie, is a field present? does a non-present field have a default value? etc.) and better incorporate how this works with my IDE (it kind of doesn't at present). But I'm off to the right track. I figured I'd update this for anyone who might by chance be wondering how to do the same thing I was.
Sorry this has become so long!
Cheers!
One of the key aspects of any data science workflow is the sourcing, cleaning, and storing of raw data in a form that can be used upstream. This process is commonly referred to as “Extract-Transform-Load,” or ETL for short.
Sqlite is a database storage engine, which can be better compared with things such as MySQL, PostgreSQL, Oracle, MSSQL, etc. It is used to store and retrieve structured data from files. SQLAlchemy is a Python library that provides an object relational mapper (ORM).
SQLAlchemy is great because it provides a good connection / pooling infrastructure; a good Pythonic query building infrastructure; and then a good ORM infrastructure that is capable of complex queries and mappings (as well as some pretty stone-simple ones).
As per the requests to my initial question, I've taken my third edit and made it my answer. As I'm not sure of the proper protocols, I've left the third edit in place above. If you've already read EDIT3, then you've read what I have as an answer.
I've managed to spend a little more time on this. I've set up the project in the following way:
Databases __init__.py databases.py settings.py DB1 __init__.py models.py ... DB3 __init__.py models.py
At present, I have a tuple of databases that are 'installed' in the settings.py file. Each DB entry is specified in INSTALLED_DATABASES = ('DB1', ..., 'DB3')
. As I complete more models, they get added to the list of tuples. This allows me to add or remove content as I go.
I have the engine and sessions setup inside the models.py
file, and have the __init.py__
file for each database set to from models import *
.
In the databases.py file I have the following
class Databases(object): def __init__(self): for database in INSTALLED_DATABASES: setattr(self, database, __import__(database))
I can now use these via:
from databases import Databases db = Databases() for qr in db.DB1.query(db.DB1.User): print qr.userid, qr.username
SQLAlchemy is allowing me to manually specify column names when defining the models, which is a huge bonus to the standardization I wanted.
I've got a lot of work ahead of me still. I would like to create objects that force model validation (ie, is a field present? does a non-present field have a default value? etc.) and better incorporate how this works with my IDE (it kind of doesn't at present). But I'm off to the right track. I figured I'd update this for anyone who might by chance be wondering how to do the same thing I was.
Sorry this has become so long!
Cheers!
Your solution looks pretty good. Here's what I did.
I have a package named connectors, and in it a module for each db as well as a settings file.
Each of these connector modules creates its connection string and its engine, along with the declarative base and classes for the tables.
Then there is a method loadSession that returns the session (this one I got from a tutorial or another post here somewhere, cant recall exactly) and another one I added that returns the engine in case I want to do something with that.
So then in some other module of the program, I would do something like this
from connectors import x, y, z x_ses = x.loadSession() y_ses = y.loadSession() z_ses = z.loadSession() xq = x_ses.query(...) yq = y_ses.query(...)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With