I have several pandas DataFrames that I wish to write into a SQL database. However, because the existing SQL table might not have every column that appears in a pandas DataFrame, I get an error message saying that a column in the table was not found, and so the data cannot be appended.
# Example:
df1
out= column1, column2, column3, column4
value1, value2, value3, value4
df2
out= columnA, columnB, columnC
valueA, valueB, valueC
# Initially I concat the df together and save it into SQL
combined_data = pandas.concat([df1, df2], axis=1, join='inner')
combined_data.to_sql(name='table1', con=engine, if_exists='append', index=False)
However, because this table has already been created with all of its columns, if df2 later comes with additional columns, I get an error message.
df2
out= columnA, columnB, columnC, columnD, columnE, columnF
valueA, valueB, valueC, valueD, valueE, valueF
How do I structure the code so that it creates new columns in the existing SQL table, named after the columns that are missing from the table but present in the pandas DataFrame?
I think I can add new columns with the SQL below:
connection.execute("ALTER TABLE table1 ADD COLUMN new_column INTEGER DEFAULT 0")
But how do I make sure that the new_column being added takes its name from the corresponding column in df2?
I had a similar problem and took the following approach:
1) Get a list of the columns from the database table. This can be done several ways, but I was using Postgres instead of SQLite. See this SE question for getting the column names of a table from PostgreSQL; this question seems to answer how to do it for SQLite.
db_columns = list(engine.execute("SELECT column_name FROM information_schema.columns WHERE table_schema = 'public' AND table_name = 'my_table'"))
This returns a list of tuples, so take the first element of each tuple:
db_columns = [x[0] for x in db_columns]
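Since the question itself uses SQLite, a rough equivalent there (a sketch, assuming the question's table name table1) is PRAGMA table_info, where the second field of each returned row is the column name:
db_columns = [row[1] for row in engine.execute("PRAGMA table_info(table1)")]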
You could load the table into pandas and then use the DataFrame's columns instead. This will obviously take more resources:
db_columns = pd.read_sql_query("SELECT * FROM my_table", connection).columns
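If the table is large, a variation of the same idea (assuming your database supports LIMIT) fetches only the header and no rows:
db_columns = pd.read_sql_query("SELECT * FROM my_table LIMIT 0", connection).columns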
2) Get the difference between the columns of the database table and the columns of the df. I like using sets because I find them intuitive; however, they do not preserve order:
new_columns = set(df1.columns) - set(db_columns)
If order matters then you can use a filter:
new_columns = list(filter(lambda x: x not in db_columns, df1.columns))
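A quick illustration of the difference between the two approaches, using made-up names so as not to clobber the real df1:
import pandas as pd

existing = ['columnA', 'columnB', 'columnC']
incoming = pd.DataFrame(columns=['columnA', 'columnB', 'columnD', 'columnE'])

set(incoming.columns) - set(existing)                        # {'columnD', 'columnE'} - order not guaranteed
list(filter(lambda x: x not in existing, incoming.columns))  # ['columnD', 'columnE'] - keeps the DataFrame's order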
3) Iterate over the new columns and prepare the ALTER TABLE statements. Table and column names are identifiers, which most drivers will not accept as bound parameters, so they have to be formatted into the statement directly:
queries = []
for column in new_columns:
    queries.append('ALTER TABLE my_table ADD COLUMN "{}" text;'.format(column))
In this example I used "text", but you might want to replace that with the primitive data type that corresponds to the pandas/numpy dtype. np.asscalar(value) is one way of converting numpy types to Python types (newer NumPy versions drop asscalar in favor of value.item()). See this SO question for more on converting numpy to python types.
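If you want to derive the SQL type automatically, one possible sketch (my own mapping, not a standard one) inspects the numpy dtype of each new column and can be used in place of the hard-coded "text" above:
import numpy as np

def sql_type_for(dtype):
    # Hypothetical mapping; extend it to match your database's type names.
    if np.issubdtype(dtype, np.integer):
        return "integer"
    if np.issubdtype(dtype, np.floating):
        return "real"
    if np.issubdtype(dtype, np.datetime64):
        return "timestamp"
    return "text"

# e.g. sql_type_for(df1[column].dtype) inside the loop above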
Finally add all the columns to the table, executing the statements one at a time, since some drivers (SQLite's in particular) only accept a single statement per execute call:
for query in queries:
    result = connection.execute(query)
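Once the missing columns exist, the original append should go through. A minimal sketch tying this back to the question's names (table1 and df2; it assumes db_columns was fetched for table1 as in step 1):
for column in set(df2.columns) - set(db_columns):
    connection.execute('ALTER TABLE table1 ADD COLUMN "{}" text;'.format(column))
df2.to_sql(name='table1', con=engine, if_exists='append', index=False)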