
Python: interact with complex data warehouse

We've worked hard to build a full dimensional database model of our problem, and now it's time to start coding. Our previous projects have used hand-crafted queries constructed by string manipulation.

Is there any best/standard practice for interfacing between Python and a complex database layout?

I've briefly evaluated SQLAlchemy, SQLObject, and Django-ORM, but (I may easily be missing something) they seem tuned for tiny web-type (OLTP) transactions, whereas I'm doing high-volume analytical (OLAP) work.

Some of my requirements, that may be somewhat different than usual:

  1. load large amounts of data relatively quickly
  2. update/insert small amounts of data quickly and easily
  3. handle large numbers of rows easily (300 entries per minute over 5 years)
  4. allow for modifications in the schema, for future requirements

Writing these queries is easy, but writing the code to get the data all lined up is tedious, especially as the schema evolves. This seems like something that a computer might be good at?

asked Sep 23 '10 by bukzor

2 Answers

Don't get confused by your requirements. One size does not fit all.

load large amounts of data relatively quickly

Why not use the database's native loaders for this? Use Python to prepare files, but use database tools to load. You'll find that this is amazingly fast.
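A minimal sketch of that split, with made-up table and file names: Python does the formatting, and a native bulk loader (PostgreSQL's `\copy` is shown as one example) does the actual loading.

```python
import csv

# Hypothetical sample data; in practice this would come from your feed.
rows = [
    ("2010-09-23", "sensor_a", 19.5),
    ("2010-09-23", "sensor_b", 21.1),
]

# Python's job: prepare a clean flat file.
with open("facts.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# The database's job: bulk-load it, outside Python. For example:
#   psql -c "\copy fact_table FROM 'facts.csv' WITH (FORMAT csv)"
# MySQL has LOAD DATA INFILE; Oracle has SQL*Loader.
```

The same pattern works with any RDBMS; only the loader command changes.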

update/insert small amounts of data quickly and easily

That starts to bend the rules of a data warehouse, unless you're talking about Master Data Management to update the reporting attributes of a dimension.

That's what ORMs and web frameworks are for.

handle large numbers of rows easily (300 entries per minute over 5 years)

Again, that's why you use a pipeline of Python front-end processing, but the actual INSERTs are done by database tools, not Python.
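Such a front-end pipeline might look like this (field names and validation rules are invented for illustration): Python generators clean and reshape each record, and the result is a staging file for the bulk loader, not a stream of INSERT statements.

```python
import csv

def parse(lines):
    # Split a hypothetical pipe-delimited feed into typed fields.
    for line in lines:
        ts, sensor, raw = line.rstrip("\n").split("|")
        yield ts, sensor.strip().lower(), float(raw)

def valid(records):
    # Drop obviously bad readings (negative values, in this sketch).
    for ts, sensor, value in records:
        if value >= 0:
            yield ts, sensor, value

raw_feed = [
    "2010-09-23 20:09|SENSOR_A|19.5\n",
    "2010-09-23 20:10|SENSOR_B|-1\n",
]

with open("staging.csv", "w", newline="") as out:
    csv.writer(out).writerows(valid(parse(raw_feed)))

# staging.csv is then handed to COPY / LOAD DATA / SQL*Loader --
# the database tool does the INSERTs, not executemany() in Python.
```

Because the stages are generators, the pipeline streams records one at a time and never holds the full 5-year dataset in memory.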

alter schema (along with the Python interface) easily, for future requirements

You have almost no use for automating this. It's certainly your lowest priority task for "programming". You'll often do this manually in order to preserve data properly.

BTW, "hand-crafted queries constructed by string manipulation" is probably the biggest mistake ever. These are hard on the RDBMS parser: every distinct string must be re-parsed, so they're slower than queries that use bind variables, and they invite SQL injection.
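The difference is visible even in the standard DB-API. This sketch uses sqlite3 and an invented `dim_sensor` table, but the same placeholder style applies to any DB-API driver (the marker varies: `?`, `%s`, or `:name`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_sensor (id INTEGER PRIMARY KEY, name TEXT)")

name = "sensor_a"

# Fragile: the statement text changes with every value, so the parser
# re-parses each one, and a quote inside `name` breaks (or injects
# into) the SQL.
#   conn.execute("INSERT INTO dim_sensor (name) VALUES ('%s')" % name)

# With bind variables the statement text is constant, so it can be
# parsed once and reused, and quoting is handled by the driver.
conn.execute("INSERT INTO dim_sensor (name) VALUES (?)", (name,))
row = conn.execute(
    "SELECT name FROM dim_sensor WHERE name = ?", (name,)
).fetchone()
```
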

answered Sep 23 '22 by S.Lott


I'm using SQLAlchemy with a pretty big data warehouse, and I'm using it for the full ETL process with success, especially for sources with complex transformation rules or heterogeneous sources such as web services. I'm not using the SQLAlchemy ORM but rather its SQL Expression Language, because I don't really need to map anything to objects in the ETL process.

Worth noting: when I'm bringing in a verbatim copy of a source, I'd rather use the database tools for that, such as PostgreSQL's dump utility. You can't beat them.

The SQL Expression Language is the closest you will get with SQLAlchemy (or any ORM, for that matter) to handwriting SQL, but since you can programmatically generate the SQL from Python, you will save time, especially if you have some really complex transformation rules to follow.
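A small sketch of that Expression Language style, using an invented star-schema fragment and an in-memory SQLite database (any SQLAlchemy-supported backend works the same way): statements are built from Python objects rather than concatenated strings.

```python
from sqlalchemy import (MetaData, Table, Column, Integer, Numeric,
                        String, create_engine, insert, select)

# Hypothetical fact table; names are made up for the example.
metadata = MetaData()
fact_reading = Table(
    "fact_reading", metadata,
    Column("id", Integer, primary_key=True),
    Column("sensor", String(32)),
    Column("value", Numeric(10, 2)),
)

engine = create_engine("sqlite://")  # in-memory DB for the sketch
metadata.create_all(engine)

# Statements are Python data structures, so complex transformation
# rules can generate SQL programmatically -- no string manipulation,
# and bind parameters come for free.
stmt = insert(fact_reading).values(sensor="sensor_a", value=19.5)
with engine.begin() as conn:
    conn.execute(stmt)
    result = conn.execute(
        select(fact_reading.c.sensor).where(fact_reading.c.value > 0)
    ).fetchall()
```

Because the schema lives in the `Table` metadata, adding a column later means changing one definition instead of hunting down every hand-written SQL string.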

One thing though: I'd rather modify my schema by hand. I don't trust any tool with that job.

answered Sep 23 '22 by Mariano