
Create new columns in existing SQL table, with extra columns from pandas DataFrame

I have several pandas DataFrames that I wish to write into a SQL database. However, because the existing SQL table might not contain a particular column that is in the DataFrame, I get an error message saying that the column was not found in the table, so I am unable to append the data.

# Example:

df1 
out= column1, column2, column3, column4
     value1,  value2,  value3,  value4

df2
out= columnA, columnB, columnC
     valueA,  valueB,  valueC

# Initially I concat the df together and save it into SQL
combined_data = pandas.concat([df1, df2], axis=1,
                               join='inner')
combined_data.to_sql(name='table1', con=engine,
                     if_exists='append', index=False)

However, because this table has already been created with all of its columns, if df2 were to have additional columns, I get an error message.

df2
out= columnA, columnB, columnC, columnD, columnE, columnF
     valueA,  valueB,  valueC,  valueD,  valueE,  valueF      

How do I structure code that would create new columns in the existing SQL table, named after the missing column names from the pandas DataFrame?

I think I can add new columns with the SQL statement below:

connection.execute("ALTER TABLE table1 ADD COLUMN new_column INTEGER DEFAULT 0")

But how do I make sure that the columns that get added follow the column names in df2?

asked Oct 18 '22 by jake wong

1 Answer

I had a similar problem and took the following approach:

1) Get a list of the columns from the database table. This can be done in several ways, but I was using Postgres rather than SQLite. See this SE question for getting the column names of a table from PostgreSQL; this question seems to answer how to do it for SQLite.

db_columns = list(engine.execute("SELECT column_name FROM information_schema.columns WHERE table_schema = 'public' AND table_name = 'my_table'")) 

This returns a list of tuples so get the first one of every tuple:

db_columns = [x[0] for x in db_columns]

You could instead load the table into pandas and use the DataFrame's columns. This will obviously take more resources:

db_columns = pd.read_sql_query("SELECT * FROM my_table", connection).columns
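If your database is SQLite rather than Postgres, or you want something backend-agnostic, a minimal sketch (my addition, not part of the original answer; "my_table" is just the example name) could use SQLAlchemy's inspector or SQLite's PRAGMA table_info:

from sqlalchemy import inspect

# Backend-agnostic: SQLAlchemy's inspector returns one dict per column.
db_columns = [col["name"] for col in inspect(engine).get_columns("my_table")]

# SQLite only: PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
db_columns = [row[1] for row in engine.execute("PRAGMA table_info(my_table)")]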

2) Get the difference between the columns of the database table and the columns of the df. I like using sets because I find them intuitive; however, they do not preserve order:

new_columns = set(df1.columns) - set(db_columns)

If order matters then you can use a filter:

new_columns = list(filter(lambda x: x not in db_columns, df1.columns))
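As a quick illustration with the column names from the question (hypothetical values, assuming the table was created from df2's original three columns):

db_columns = ['columnA', 'columnB', 'columnC']
df2_columns = ['columnA', 'columnB', 'columnC', 'columnD', 'columnE', 'columnF']

new_columns = set(df2_columns) - set(db_columns)
# {'columnD', 'columnE', 'columnF'} -- order not guaranteed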

3) Iterate over the new columns and prepare to add them to the table:

query = ''
for column in new_columns:
    # Table and column names are identifiers, so they cannot be bound as
    # query parameters; build them into the statement directly instead.
    query += 'ALTER TABLE my_table ADD COLUMN "{}" text;'.format(column)

In this example I used "text", but you might want to replace that with the primitive data type that corresponds to the pandas/numpy dtype. np.asscalar(value) is one way of converting numpy types to python types (in newer NumPy versions it has been removed in favour of value.item()). See this SO question for more on converting numpy to python types. Finally, add all the columns to the table:

result = connection.execute(query)
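If "text" is too generic, one option (my own sketch, not from the original answer) is to map each DataFrame dtype to a SQL type while building the statements, and execute them one at a time so it works across drivers:

import pandas as pd

def sql_type(dtype):
    # Rough pandas-dtype-to-SQL mapping; adjust the type names to your database.
    if pd.api.types.is_integer_dtype(dtype):
        return "integer"
    if pd.api.types.is_float_dtype(dtype):
        return "double precision"
    if pd.api.types.is_datetime64_any_dtype(dtype):
        return "timestamp"
    return "text"

for column in new_columns:
    ddl = 'ALTER TABLE my_table ADD COLUMN "{}" {}'.format(column, sql_type(df1[column].dtype))
    connection.execute(ddl)

Once the missing columns exist, the to_sql(..., if_exists='append') call from the question should append the data without the unknown-column error.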
answered Oct 22 '22 by Albert Rothman