 

How can I check if a record exists when passing a dataframe to SQL in pandas?

Background

I'm building an application that passes data from a CSV to a MS SQL database. This database is being used as a repository for all my enterprise's records of this type (phone calls). When I run the application, it reads the CSV and converts it to a Pandas dataframe, which I then use SQLAlchemy and pyodbc to append the records to my table in SQL.

However, due to the nature of the content I'm working with, there is oftentimes data that we already have imported to the table. I am looking for a way to check if my primary key exists (a column in my SQL table and in my dataframe) before appending each record to the table.

Current code

# save dataframe to mssql DB
engine = sql.create_engine('mssql+pyodbc://CTR-HV-DEVSQL3/MasterCallDb')
df.to_sql('Calls', engine, if_exists='append')

Sample data

My CSV is imported as a pandas dataframe (the primary key is FileName, which is always unique), then passed to MS SQL. This is my dataframe (df):

+---+------------+-------------+
|   |  FileName  |    Name     |
+---+------------+-------------+
| 1 | 123.flac   | Robert      |
| 2 | 456.flac   | Michael     |
| 3 | 789.flac   | Joesph      |
+---+------------+-------------+

Any ideas? Thanks!

asked Jul 23 '14 by rchav9



2 Answers

Assuming you have no memory constraints and you're not inserting null values, you could:

# Pull the existing primary keys from the table
sql = "SELECT pk_1, pk_2, pk_3 FROM my_table"
sql_df = pd.read_sql(sql=sql, con=con)
# Concatenate, then drop every row whose keys appear in both frames
df = pd.concat((df, sql_df)).drop_duplicates(subset=['pk_1', 'pk_2', 'pk_3'], keep=False)
# Rows that came from sql_df have NaN in the non-key columns, so drop them
df = df.dropna()
df.to_sql('my_table', con=con, if_exists='append')

Depending on the application, you could also reduce the size of sql_df by changing the query.
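For instance, if each incoming batch is small, you could query only the keys you are about to insert rather than every key in the table. A hedged sketch, assuming a FileName key and a Calls table as in the question (the query result is stubbed out here with an in-memory frame):

```python
import pandas as pd

df = pd.DataFrame({'FileName': ['123.flac', '456.flac', '789.flac'],
                   'Name': ['Robert', 'Michael', 'Joesph']})

# Ask the server only about the keys in this batch, instead of selecting
# every key in the table (placeholders keep the query injection-safe)
placeholders = ', '.join(['?'] * len(df))
sql = f"SELECT FileName FROM Calls WHERE FileName IN ({placeholders})"
# sql_df = pd.read_sql(sql, con, params=list(df['FileName']))

# Stand-in for the query result: suppose the table already has 123.flac
sql_df = pd.DataFrame({'FileName': ['123.flac']})

# Keep only the rows whose key is not already in the table
new_rows = df[~df['FileName'].isin(sql_df['FileName'])]
```

new_rows then holds only 456.flac and 789.flac, and can be passed to to_sql with if_exists='append'.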

Update - Better overall and can insert null values:

pks = ['pk_1', 'pk_2', 'pk_3']
sql = "SELECT pk_1, pk_2, pk_3 FROM my_table"
sql_df = pd.read_sql(sql=sql, con=con)
# Keep only the rows whose keys do not already exist in the table
df = df.loc[df[pks].merge(sql_df[pks], on=pks, how='left', indicator=True)['_merge'] == 'left_only']
# df = df.drop_duplicates(subset=pks)  # add this if df itself may contain duplicates
df.to_sql('my_table', con=con, if_exists='append')
answered Nov 15 '22 by Lucas

What if you iterated through the rows with DataFrame.iterrows() and, on each iteration, used ON DUPLICATE for your key value FileName so it is not added again?
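Note that ON DUPLICATE KEY is MySQL syntax; on MS SQL you would instead guard each insert with an existence check or use a MERGE statement. A minimal sketch of the row-by-row idea, using SQLite's INSERT OR IGNORE as a stand-in for the duplicate-key guard:

```python
import sqlite3
import pandas as pd

# In-memory table with FileName as the primary key, pre-seeded with one row
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE Calls (FileName TEXT PRIMARY KEY, Name TEXT)')
con.execute("INSERT INTO Calls VALUES ('123.flac', 'Robert')")

df = pd.DataFrame({'FileName': ['123.flac', '456.flac'],
                   'Name': ['Robert', 'Michael']})

# Insert row by row; rows whose key already exists are silently skipped
for _, row in df.iterrows():
    con.execute('INSERT OR IGNORE INTO Calls VALUES (?, ?)',
                (row['FileName'], row['Name']))
con.commit()

count = con.execute('SELECT COUNT(*) FROM Calls').fetchone()[0]
```

After the loop the table holds two rows, since the duplicate 123.flac was ignored. Per-row inserts are much slower than the set-based approaches above, so this fits best for small batches.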

answered Nov 15 '22 by Benloper