I'm building an application that passes data from a CSV to an MS SQL database. This database serves as the repository for all of my enterprise's records of this type (phone calls). When I run the application, it reads the CSV into a pandas dataframe, which I then append to my table in SQL using SQLAlchemy and pyodbc.
However, due to the nature of the content I'm working with, there is often data that has already been imported into the table. I'm looking for a way to check whether my primary key (a column in both my SQL table and my dataframe) already exists before appending each record to the table.
import sqlalchemy as sql

# save dataframe to mssql DB
engine = sql.create_engine('mssql+pyodbc://CTR-HV-DEVSQL3/MasterCallDb')
df.to_sql('Calls', engine, if_exists='append')
My CSV is imported as a pandas dataframe (the primary key is FileName, which is always unique), then passed to MS SQL. This is my dataframe (df):
+---+------------+-------------+
| | FileName | Name |
+---+------------+-------------+
| 1 | 123.flac | Robert |
| 2 | 456.flac | Michael |
| 3 | 789.flac | Joesph |
+---+------------+-------------+
Any ideas? Thanks!
Assuming you have no memory constraints and you're not inserting null values, you could:
sql = "SELECT pk_1, pk_2, pk_3 FROM my_table"
sql_df = pd.read_sql(sql=sql, con=con)
df = pd.concat((df, sql_df)).drop_duplicates(subset=['pk_1', 'pk_2', 'pk_3'], keep=False)
df = df.dropna()
df.to_sql('my_table', con=con, if_exists='append')
Depending on the application you could also reduce the size of sql_df by changing the query.
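For example, here is a sketch that pulls only the keys that could collide with the incoming batch, using the single-column FileName key from the question (the expanding IN parameter assumes SQLAlchemy 1.2+, and con is the same connectable as above):
import sqlalchemy as sa

keys = df['FileName'].unique().tolist()
# The expanding bindparam turns "IN :keys" into "IN (?, ?, ...)" at execute time.
query = sa.text("SELECT FileName FROM Calls WHERE FileName IN :keys").bindparams(
    sa.bindparam('keys', expanding=True)
)
sql_df = pd.read_sql(sql=query, con=con, params={'keys': keys})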
Update - Better overall and can insert null values:
sql = "SELECT pk_1, pk_2, pk_3 FROM my_table"
sql_df = pd.read_sql(sql=sql, con=con)
df = df.loc[df[pks].merge(sql_df[pks], on=pks, how='left', indicator=True)['_merge'] == 'left_only']
# df = df.drop_duplicates(subset=pks) # add it if you want to drop any duplicates that you may insert
df.to_sql('my_table', con=con, if_exists='append')
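Because the merge compares only the key columns, genuine nulls elsewhere in df are left alone, which is why this version, unlike the concat/dropna trick above, can insert null values.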
What if you iterated through the rows with DataFrame.iterrows() and, on each iteration, used an upsert keyed on FileName so the row is not added again? (ON DUPLICATE KEY UPDATE is MySQL syntax; on MS SQL Server the per-row equivalent is an IF NOT EXISTS guard or a MERGE statement.)
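A rough sketch of that per-row idea on SQL Server, using an IF NOT EXISTS guard (Calls, FileName, Name, df and engine are the names from the question; this round-trips once per row, so it is only practical for small batches):
import sqlalchemy as sa

insert_if_missing = sa.text(
    "IF NOT EXISTS (SELECT 1 FROM Calls WHERE FileName = :fname) "
    "INSERT INTO Calls (FileName, Name) VALUES (:fname, :name)"
)

with engine.begin() as conn:  # one transaction for the whole batch
    for _, row in df.iterrows():
        conn.execute(insert_if_missing, {'fname': row['FileName'], 'name': row['Name']})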