Mapping lots of similar tables in SQLAlchemy

I have many (~2000) locations with time series data, and each time series has millions of rows. I would like to store these in a Postgres database.

My current approach is to have one table per location time series, plus a meta table which stores information about each location (coordinates, elevation, etc.). I am using Python/SQLAlchemy to create and populate the tables. I would like to have a relationship between the meta table and each time series table so that I can run queries like "select all locations that have data between date A and date B" and "select all data for date A and export a CSV with coordinates".

What is the best way to create many tables with the same structure (only the name differs) and relate them to a meta table? Or should I use a different database design?

Currently I am using this type of approach to generate a lot of similar mappings:

from sqlalchemy import create_engine, MetaData
from sqlalchemy.types import Float, String, DateTime, Integer
from sqlalchemy import Column, ForeignKey
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker, relationship, backref

Base = declarative_base()


def make_timeseries(name):
    class TimeSeries(Base):

        __tablename__ = name
        table_name = Column(String(50), ForeignKey('locations.table_name'))
        datetime = Column(DateTime, primary_key=True)
        value = Column(Float)

        location = relationship('Location', backref=backref('timeseries',
                                lazy='dynamic'))

        def __init__(self, table_name, datetime, value):
            self.table_name = table_name
            self.datetime = datetime
            self.value = value

        def __repr__(self):
            return "{}: {}".format(self.datetime, self.value)

    return TimeSeries


class Location(Base):

    __tablename__ = 'locations'
    id = Column(Integer, primary_key=True)
    table_name = Column(String(50), unique=True)
    lon = Column(Float)
    lat = Column(Float)

if __name__ == '__main__':
    connection_string = 'postgresql://user:pw@localhost/location_test'
    engine = create_engine(connection_string)
    metadata = MetaData(bind=engine)
    Session = sessionmaker(bind=engine)
    session = Session()

    TS1 = make_timeseries('ts1')
    # TS2 = make_timeseries('ts2')   # this breaks because of the foreign key
    Base.metadata.create_all(engine)
    session.add(TS1("ts1", "2001-01-01", 999))
    session.add(TS1("ts1", "2001-01-02", -555))

    qs = session.query(Location).first()
    print(qs.timeseries.all())

This approach has some problems, most notably that the foreign key breaks as soon as I create more than one TimeSeries class. I have used some workarounds before, but it all feels like a big hack and I suspect there must be a better way of doing this. How should I organise and access my data?

asked Mar 28 '14 by bananafish


2 Answers

Two parts:

Only use two tables

There's no need to have dozens or hundreds of identical tables. Just have one table for location and one for location_data, where every entry has a foreign key onto location. Also create an index on location_data covering location_id, so searches stay efficient.
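
A minimal sketch of that two-table layout (the class and column names here are illustrative, chosen to mirror the question's meta table, not something prescribed by this answer):

from sqlalchemy import (Column, DateTime, Float, ForeignKey, Index, Integer,
                        String)
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()


class Location(Base):
    __tablename__ = 'location'
    id = Column(Integer, primary_key=True)
    name = Column(String(50), unique=True)
    lon = Column(Float)
    lat = Column(Float)

    # all measurements for a location hang off a single relationship
    data = relationship('LocationData', backref='location', lazy='dynamic')


class LocationData(Base):
    __tablename__ = 'location_data'
    id = Column(Integer, primary_key=True)
    location_id = Column(Integer, ForeignKey('location.id'), nullable=False)
    datetime = Column(DateTime, nullable=False)
    value = Column(Float)

    # composite index so "data for location X between date A and date B" is fast
    __table_args__ = (
        Index('ix_location_data_location_datetime', 'location_id', 'datetime'),
    )

With this layout both queries from the question become plain joins, e.g. session.query(Location).join(LocationData).filter(LocationData.datetime.between(date_a, date_b)).distinct() for the first one.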

Don't use SQLAlchemy to create this

I love SQLAlchemy and use it every day. It's great for managing your database and adding some rows, but you don't want to use it for the initial load of millions of rows. You want to generate a file that is compatible with Postgres' COPY statement [ http://www.postgresql.org/docs/9.2/static/sql-copy.html ]. COPY will let you pull in a ton of data fast; it's what is used during dump/restore operations.

SQLAlchemy will be great for querying this and adding rows as they come in. For bulk operations, use COPY.
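
As a rough illustration, one way to stream rows through COPY from Python is psycopg2's copy_expert; the copy_timeseries helper, the table/column names, and the connection parameters below are assumptions following the two-table sketch above, not part of the original answer:

import csv
import io
from datetime import datetime

import psycopg2


def copy_timeseries(conn, location_id, rows):
    """Bulk-load an iterable of (datetime, value) tuples for one location."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for dt, value in rows:
        writer.writerow([location_id, dt.isoformat(), value])
    buf.seek(0)
    with conn.cursor() as cur:
        # COPY ... FROM STDIN reads the CSV straight from the buffer
        cur.copy_expert(
            "COPY location_data (location_id, datetime, value) "
            "FROM STDIN WITH (FORMAT csv)",
            buf,
        )
    conn.commit()


if __name__ == '__main__':
    conn = psycopg2.connect(dbname='location_test', user='user',
                            password='pw', host='localhost')
    copy_timeseries(conn, 1, [(datetime(2001, 1, 1), 999.0),
                              (datetime(2001, 1, 2), -555.0)])

For really large loads you would write the CSV to disk (or pipe it) instead of building it in memory, but the COPY statement stays the same.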

answered by Jonathan Vanasco

Alternative-1: Table Partitioning

Partitioning immediately comes to mind as soon as I read "exactly the same table structure". I am not a DBA and do not have much production experience using it (even more so on PostgreSQL), but please read the PostgreSQL - Partitioning documentation. Table partitioning seeks to solve exactly the problem you have, but over 1K tables/partitions sounds challenging; therefore please do more research on forums/SO for scalability-related questions on this topic.

Given that the datetime component is central to both of your most-used search criteria, you need a solid indexing strategy on it. If you decide to go down the partitioning route, the obvious strategy would be to partition on date ranges. That would let you keep older data in different chunks than the most recent data; assuming old data is (almost) never updated, those physical layouts would stay dense and efficient, while you could employ another strategy for the more "recent" data.
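
For concreteness, a hedged sketch of partitioning a single data table by year, using the declarative partitioning syntax available since PostgreSQL 10 (the foreign key and the parent-level index additionally need 11+); the answer itself predates this and links to the older inheritance-based docs, and the table/column names (location_data, locations) are illustrative, following the question's meta table and the other answer's single data table:

from sqlalchemy import create_engine, text

engine = create_engine('postgresql://user:pw@localhost/location_test')

statements = [
    # parent table, partitioned on the datetime column
    """CREATE TABLE location_data (
           location_id integer NOT NULL REFERENCES locations (id),
           datetime    timestamp NOT NULL,
           value       double precision
       ) PARTITION BY RANGE (datetime)""",
    # one partition per year: old partitions stay dense and rarely change
    """CREATE TABLE location_data_2001 PARTITION OF location_data
           FOR VALUES FROM ('2001-01-01') TO ('2002-01-01')""",
    """CREATE TABLE location_data_2002 PARTITION OF location_data
           FOR VALUES FROM ('2002-01-01') TO ('2003-01-01')""",
    # index is propagated to every partition
    "CREATE INDEX ON location_data (location_id, datetime)",
]

with engine.begin() as conn:
    for stmt in statements:
        conn.execute(text(stmt))

Queries that filter on datetime then only touch the relevant partitions (partition pruning), which gives much of the benefit of per-location tables without the mapping headache.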

Alternative-2: trick SQLAlchemy

This basically makes your sample code work by tricking SA into assuming that all those TimeSeries classes are children of one entity, using Concrete Table Inheritance. The code below is self-contained and creates 50 tables with a minimum of data in them. If you already have a database, it should let you check the performance rather quickly, so that you can decide whether this is even a viable option.

from datetime import date, datetime

from sqlalchemy import create_engine, Column, String, Integer, DateTime, Float, ForeignKey, func
from sqlalchemy.orm import sessionmaker, relationship, configure_mappers, joinedload
from sqlalchemy.ext.declarative import declarative_base, declared_attr
from sqlalchemy.ext.declarative import AbstractConcreteBase, ConcreteBase


engine = create_engine('sqlite:///:memory:', echo=True)
Session = sessionmaker(bind=engine)
session = Session()
Base = declarative_base(engine)


# MODEL
class Location(Base):
    __tablename__ = 'locations'
    id = Column(Integer, primary_key=True)
    table_name = Column(String(50), unique=True)
    lon = Column(Float)
    lat = Column(Float)


class TSBase(AbstractConcreteBase, Base):
    @declared_attr
    def table_name(cls):
        return Column(String(50), ForeignKey('locations.table_name'))


def make_timeseries(name):
    class TimeSeries(TSBase):
        __tablename__ = name
        __mapper_args__ = { 'polymorphic_identity': name, 'concrete':True}

        datetime = Column(DateTime, primary_key=True)
        value = Column(Float)

        def __init__(self, datetime, value, table_name=name ):
            self.table_name = table_name
            self.datetime = datetime
            self.value = value

    return TimeSeries


def _test_model():
    _NUM = 50
    # 0. generate classes for all tables
    TS_list = [make_timeseries('ts{}'.format(1+i)) for i in range(_NUM)]
    TS1, TS2, TS3 = TS_list[:3] # just to have some named ones
    Base.metadata.create_all()
    print('-'*80)

    # 1. configure mappers
    configure_mappers()

    # 2. define relationship
    Location.timeseries = relationship(TSBase, lazy="dynamic")
    print('-'*80)

    # 3. add some test data
    session.add_all([Location(table_name='ts{}'.format(1+i), lat=5+i, lon=1+i*2)
        for i in range(_NUM)])
    session.commit()
    print('-'*80)

    session.add(TS1(datetime(2001,1,1,3), 999))
    session.add(TS1(datetime(2001,1,2,2), 1))
    session.add(TS2(datetime(2001,1,2,8), 33))
    session.add(TS2(datetime(2002,1,2,18,50), -555))
    session.add(TS3(datetime(2005,1,3,3,33), 8))
    session.commit()


    # Query-1: get all timeseries of one Location
    #qs = session.query(Location).first()
    qs = session.query(Location).filter(Location.table_name == "ts1").first()
    print(qs)
    print(qs.timeseries.all())
    assert 2 == len(qs.timeseries.all())
    print('-'*80)


    # Query-2: select all location with data between date-A and date-B
    dateA, dateB = date(2001,1,1), date(2003,12,31)
    qs = (session.query(Location)
            .join(TSBase, Location.timeseries)
            .filter(TSBase.datetime >= dateA)
            .filter(TSBase.datetime <= dateB)
            ).all()
    print(qs)
    assert 2 == len(qs)
    print('-'*80)


    # Query-3: select all data (including coordinates) for date A
    dateA = date(2001,1,1)
    qs = (session.query(Location.lat, Location.lon, TSBase.datetime, TSBase.value)
            .join(TSBase, Location.timeseries)
            .filter(func.date(TSBase.datetime) == dateA)
            ).all()
    print(qs)
    # @note: qs is list of tuples; easy export to CSV
    assert 1 == len(qs)
    print('-'*80)


if __name__ == '__main__':
    _test_model()

Alternative-3: a-la BigData

If you do get into performance problems using the database, I would probably try the following:

  • still keep the data in separate tables/databases/schemas as you do right now
  • bulk-import the data using the "native" solutions provided by your database engine
  • use MapReduce-like analysis:
    • here I would stay with Python and SQLAlchemy and implement my own distributed query and aggregation (or find something existing); see the rough sketch after this list. This obviously only works if you do not have a requirement to produce those results directly in the database.
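
A rough sketch of that fan-out idea, keeping one table per location as in the question and using a thread pool; the helper names (daily_mean, daily_means) and the aggregate itself are made up for illustration:

from concurrent.futures import ThreadPoolExecutor

from sqlalchemy import create_engine, text

engine = create_engine('postgresql://user:pw@localhost/location_test',
                       pool_size=10)


def daily_mean(table_name, day):
    # "map" step: one small aggregate per time-series table; table_name
    # comes from the locations meta table, so formatting it in is assumed safe
    sql = ('SELECT avg(value) FROM "{}" '
           'WHERE datetime >= CAST(:day AS timestamp) '
           "AND datetime < CAST(:day AS timestamp) + interval '1 day'"
           ).format(table_name)
    with engine.connect() as conn:
        row = conn.execute(text(sql), {'day': day}).first()
    return table_name, row[0]


def daily_means(table_names, day):
    # "reduce" step: gather the per-table results into one dict
    with ThreadPoolExecutor(max_workers=10) as pool:
        return dict(pool.map(lambda name: daily_mean(name, day), table_names))


if __name__ == '__main__':
    print(daily_means(['ts1', 'ts2', 'ts3'], '2001-01-01'))

Each worker checks its own connection out of the engine's pool, so the per-table queries run concurrently on the server.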

edit-1: Alternative-4: TimeSeries databases

I have no experience using time-series databases at a large scale, but they are definitely an option worth considering.

It would be fantastic if you could later share your findings and the whole decision-making process.

answered by van