Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQLAlchemy - create table from yaml or a dictionary?

Tags:

Is there a way to create a dynamic table from a dictionary specified in a yaml file? I define a lot of the ETL config in the yaml file, so I was curious if I could also add the table creation aspect to it as well so I do not have to modify a separate .sql file in a separate directory.

database:
  table: 'schema.fact_stuff'
  create_columns: [
    {}
  ] #not sure how this section should be

I found a solution on stackoverflow which zipped some lists together which is something similar, but I would prefer to define each column explicitly.

{'column_name': 'id', 'column_type': Integer, 'primary_key': False, 'nullable': True}

I ended up getting it to work with this:

from sqlalchemy.types import (Integer, NUMERIC, TEXT, BOOLEAN, TIMESTAMP, DATE)

sql_types = {'integer': Integer,
        'numeric': NUMERIC,
        'text': TEXT,
        'date': DATE,
        'timestamp': TIMESTAMP(timezone=False),
        'timestamptz': TIMESTAMP(timezone=True)}

exclude_list = ['original_name']
table_dict = [{k: v for k, v in d.items() if k not in exclude_list} for d in c[variable]['load']['columns']]
for column in table_dict:
    for key, val in column.copy().items():
        if key == 'type_':
            column[key] = sql_types[val]
        elif key == 'default':
            column[key] = dt.datetime.utcnow

metadata = sa.MetaData(schema=c[variable]['load']['schema'])
metadata.reflect(bind=engine, autoload=True)
fact = sa.Table(c[variable]['load']['table'], metadata, extend_existing=True,
        *(sa.Column(**kwargs) for kwargs in table_dict))
fact.create_all(engine, checkfirst=True)

But then I moved over to having pandas determine the dtypes instead of defining them in a yaml file. This creates the sql with a jinja2 template and I loop through all of my data sources to create the DDL.

def pandas_to_postgres(df):
    dtype_dict = {
      'i': 'integer',
      'O': 'text',
      'f': 'real',
      'b': 'boolean',
      'datetime64[ns]': 'timestamp',
      'datetime64[ns, UTC]': 'timestampz',
    }
    column_list = []
    column_dict = {}
    for k, v in df.dtypes.items():
        column_dict['name'] = k
        column_dict['dtype'] = dtype_dict.get(v.kind, 'text')
        column_list.append(column_dict.copy())
    return column_list


def generate_create_table(df, schema, table, table_type, columns, constraint, unique_columns):
    """ Returns a dictionary of coefs from training """
    query = Template(
        template
    ).render(
        schema_name=schema,
        table_name=table,
        table_type=table_type,
        columns=columns,
        constraint=constraint,
        constraint_columns=unique_columns
    )
    print(query)
like image 581
trench Avatar asked Jul 26 '18 21:07

trench


1 Answers

Today's release of SQLAthanor (v.0.3.0) supports exactly that. Using SQLAthanor, you can programmatically generate a SQLAlchemy Table object with the following code (assuming metadata contains your SQLAlchemy MetaData object):

from sqlathanor import Table

my_table = Table.from_yaml('yaml_file.yaml', 
                           'my_table_name', 
                           metadata, 
                           primary_key = 'id')

ETA: Note that you can also create Table objects using Table.from_json(), Table.from_dict(), and Table.from_csv() as well.

Here's the documentation (generally) for how it works: https://sqlathanor.readthedocs.io/en/latest/using.html#generating-sqlalchemy-tables-from-serialized-data

And here's a link to the documentation of the specific Table.from_yaml() method: https://sqlathanor.readthedocs.io/en/latest/api.html#sqlathanor.schema.Table.from_yaml

(I recommend reviewing the method documentation - it goes into some of the "gotcha's" of programmatically constructing a Table object from serialized data)


ETA:

Basically, the way programmatic Table generation works is that SQLAthanor:

  1. First converts a serialized string (or file) into a Python dict. For YAML, the default de-serializer is PyYAML. For JSON, the default de-serializer is simplejson (both default deserializers can be overriden using the deserialize_function argument).

  2. Once a Python dict has been generated, SQLAthanor reads each of the keys in that dict to determine the column names. It reads the values for each key, and based on the data type of the value tries to "guess" at the SQLAlchemy data type.

  3. Given what it found in step 2, it creates a Table object with Column objects where each Column object corresponds to a key in the de-serialized dict.

If you need more precise control over each Column, you can:

  • override its SQLAlchemy data type using the type_mapping argument (type_mapping receives a dict where top-level keys correspond to a column name, and each value is the data type to apply to the Column)
  • pass additional keyword arguments to the Column constructor using the column_kwargs argument (column_kwargs receives a dict where top-level keys corresponds to the column name, and each value is a dict with keyword arguments that will be supplied to that column's constructor.

By default, Table.from_<format>() does not support nested data structures. By default, skip_nested is set to True, which means that a key in the de-serialized dict that contains a nested object (either an iterable or a dict) will be skipped (i.e. won't receive a corresponding Column). If your Table needs to store nested data, you can set skip_nested to False and activate default_to_str to True. This will convert nested data (iterables or dict objects) to strings and thus persist them in a Text column (unless overridden by type_mapping).


Table.from_dict() example

The following is an example dict that can be supplied to Table.from_dict():

sample_dict = {
    'id': 123,
    'some_column_name': 'Some Column Value',
    'created_datetime': datetime.utcnow()
}

my_table = Table.from_dict(sample_dict, 
                           'my_table', 
                           metadata, 
                           primary_key = 'id')

When supplied to Table.from_dict() this dict will produce a Table object with database table name my_table that contains three columns:

  • id which will have type Integer that is set to be the table's primary key
  • some_column_name which will have type Text
  • created_datetime which will have type DateTime

Table.from_yaml() example

The following is the same example, but using a YAML string/document instead that can be supplied to Table.from_yaml():

sample_yaml = """
    id: 123
    some_column_name: Test Value
    created_timestamp: 2018-01-01T01:23:45.67890
"""

my_table = Table.from_yaml(sample_yaml, 
                           'my_table', 
                           metadata, 
                           primary_key = 'id')

When supplied to Table.from_yaml() this will first de-serialize sample_yaml into a dict just like in the earlier example, and then produce a Table object with database table name my_table that contains three columns:

  • id which will have type Integer that is set to be the table's primary key
  • some_column_name which will have type Text
  • created_datetime which will have type DateTime

Hope this helps!

like image 88
Chris Modzelewski Avatar answered Sep 28 '22 18:09

Chris Modzelewski