Is there a way to create a dynamic table from a dictionary specified in a yaml file? I define a lot of the ETL config in the yaml file, so I was curious if I could also add the table creation aspect to it as well so I do not have to modify a separate .sql file in a separate directory.
database:
table: 'schema.fact_stuff'
create_columns: [
{}
] #not sure how this section should be
I found a solution on stackoverflow which zipped some lists together which is something similar, but I would prefer to define each column explicitly.
{'column_name': 'id', 'column_type': Integer, 'primary_key': False, 'nullable': True}
I ended up getting it to work with this:
from sqlalchemy.types import (Integer, NUMERIC, TEXT, BOOLEAN, TIMESTAMP, DATE)
sql_types = {'integer': Integer,
'numeric': NUMERIC,
'text': TEXT,
'date': DATE,
'timestamp': TIMESTAMP(timezone=False),
'timestamptz': TIMESTAMP(timezone=True)}
exclude_list = ['original_name']
table_dict = [{k: v for k, v in d.items() if k not in exclude_list} for d in c[variable]['load']['columns']]
for column in table_dict:
for key, val in column.copy().items():
if key == 'type_':
column[key] = sql_types[val]
elif key == 'default':
column[key] = dt.datetime.utcnow
metadata = sa.MetaData(schema=c[variable]['load']['schema'])
metadata.reflect(bind=engine, autoload=True)
fact = sa.Table(c[variable]['load']['table'], metadata, extend_existing=True,
*(sa.Column(**kwargs) for kwargs in table_dict))
fact.create_all(engine, checkfirst=True)
But then I moved over to having pandas determine the dtypes instead of defining them in a yaml file. This creates the sql with a jinja2 template and I loop through all of my data sources to create the DDL.
def pandas_to_postgres(df):
dtype_dict = {
'i': 'integer',
'O': 'text',
'f': 'real',
'b': 'boolean',
'datetime64[ns]': 'timestamp',
'datetime64[ns, UTC]': 'timestampz',
}
column_list = []
column_dict = {}
for k, v in df.dtypes.items():
column_dict['name'] = k
column_dict['dtype'] = dtype_dict.get(v.kind, 'text')
column_list.append(column_dict.copy())
return column_list
def generate_create_table(df, schema, table, table_type, columns, constraint, unique_columns):
""" Returns a dictionary of coefs from training """
query = Template(
template
).render(
schema_name=schema,
table_name=table,
table_type=table_type,
columns=columns,
constraint=constraint,
constraint_columns=unique_columns
)
print(query)
Today's release of SQLAthanor (v.0.3.0) supports exactly that. Using SQLAthanor, you can programmatically generate a SQLAlchemy Table
object with the following code (assuming metadata
contains your SQLAlchemy MetaData
object):
from sqlathanor import Table
my_table = Table.from_yaml('yaml_file.yaml',
'my_table_name',
metadata,
primary_key = 'id')
ETA: Note that you can also create Table
objects using Table.from_json()
, Table.from_dict()
, and Table.from_csv()
as well.
Here's the documentation (generally) for how it works: https://sqlathanor.readthedocs.io/en/latest/using.html#generating-sqlalchemy-tables-from-serialized-data
And here's a link to the documentation of the specific Table.from_yaml()
method: https://sqlathanor.readthedocs.io/en/latest/api.html#sqlathanor.schema.Table.from_yaml
(I recommend reviewing the method documentation - it goes into some of the "gotcha's" of programmatically constructing a Table
object from serialized data)
ETA:
Basically, the way programmatic Table
generation works is that SQLAthanor:
First converts a serialized string (or file) into a Python dict
. For YAML, the default de-serializer is PyYAML. For JSON, the default de-serializer is simplejson (both default deserializers can be overriden using the deserialize_function
argument).
Once a Python dict
has been generated, SQLAthanor reads each of the keys in that dict
to determine the column names. It reads the values for each key, and based on the data type of the value tries to "guess" at the SQLAlchemy data type.
Given what it found in step 2, it creates a Table
object with Column
objects where each Column
object corresponds to a key in the de-serialized dict
.
If you need more precise control over each Column
, you can:
type_mapping
argument (type_mapping
receives a dict
where top-level keys correspond to a column name, and each value is the data type to apply to the Column
)Column
constructor using the column_kwargs
argument (column_kwargs
receives a dict
where top-level keys corresponds to the column name, and each value is a dict
with keyword arguments that will be supplied to that column's constructor.By default, Table.from_<format>()
does not support nested data structures. By default, skip_nested
is set to True
, which means that a key in the de-serialized dict
that contains a nested object (either an iterable or a dict
) will be skipped (i.e. won't receive a corresponding Column
). If your Table
needs to store nested data, you can set skip_nested
to False
and activate default_to_str
to True
. This will convert nested data (iterables or dict
objects) to strings and thus persist them in a Text
column (unless overridden by type_mapping
).
Table.from_dict()
example
The following is an example dict
that can be supplied to Table.from_dict()
:
sample_dict = {
'id': 123,
'some_column_name': 'Some Column Value',
'created_datetime': datetime.utcnow()
}
my_table = Table.from_dict(sample_dict,
'my_table',
metadata,
primary_key = 'id')
When supplied to Table.from_dict()
this dict
will produce a Table
object with database table name my_table
that contains three columns:
id
which will have type Integer
that is set to be the table's primary keysome_column_name
which will have type Text
created_datetime
which will have type DateTime
Table.from_yaml()
example
The following is the same example, but using a YAML string/document instead that can be supplied to Table.from_yaml()
:
sample_yaml = """
id: 123
some_column_name: Test Value
created_timestamp: 2018-01-01T01:23:45.67890
"""
my_table = Table.from_yaml(sample_yaml,
'my_table',
metadata,
primary_key = 'id')
When supplied to Table.from_yaml()
this will first de-serialize sample_yaml
into a dict
just like in the earlier example, and then produce a Table
object with database table name my_table
that contains three columns:
id
which will have type Integer
that is set to be the table's primary keysome_column_name
which will have type Text
created_datetime
which will have type DateTime
Hope this helps!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With