I'm building an application that passes data from a CSV to an MS SQL database. This database serves as the repository for all of my enterprise's records of this type (phone calls). When I run the application, it reads the CSV into a pandas dataframe, which I then append to my table in SQL using SQLAlchemy and pyodbc.
However, due to the nature of the content I'm working with, there is often data that has already been imported into the table. I'm looking for a way to check whether my primary key (a column in both my SQL table and my dataframe) already exists before appending each record to the table.
import sqlalchemy as sql

# save dataframe to mssql DB
engine = sql.create_engine('mssql+pyodbc://CTR-HV-DEVSQL3/MasterCallDb')
df.to_sql('Calls', engine, if_exists='append')
My CSV is imported as a pandas dataframe (the primary key is FileName, which is always unique), then passed to MS SQL. This is my dataframe (df):
+---+------------+-------------+
| | FileName | Name |
+---+------------+-------------+
| 1 | 123.flac | Robert |
| 2 | 456.flac | Michael |
| 3 | 789.flac | Joesph |
+---+------------+-------------+
Any ideas? Thanks!
Assuming you have no memory constraints and you're not inserting null values, you could:
sql = "SELECT pk_1, pk_2, pk_3 FROM my_table"
sql_df = pd.read_sql(sql=sql, con=con)
df = pd.concat((df, sql_df)).drop_duplicates(subset=['pk_1', 'pk_2', 'pk_3'], keep=False)
df = df.dropna()
df.to_sql('my_table', con=con, if_exists='append')
Depending on the application you could also reduce the size of sql_df by changing the query.
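For example, here is a sketch that pulls only the keys that could collide with the incoming batch, using the single-column FileName key from the question (the expanding IN parameter assumes SQLAlchemy 1.2+, and con is the same connectable as above):
import sqlalchemy as sa

keys = df['FileName'].unique().tolist()
# The expanding bindparam turns "IN :keys" into "IN (?, ?, ...)" at execute time.
query = sa.text("SELECT FileName FROM Calls WHERE FileName IN :keys").bindparams(
    sa.bindparam('keys', expanding=True)
)
sql_df = pd.read_sql(sql=query, con=con, params={'keys': keys})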
Update - Better overall and can insert null values:
sql = "SELECT pk_1, pk_2, pk_3 FROM my_table"
sql_df = pd.read_sql(sql=sql, con=con)
df = df.loc[df[pks].merge(sql_df[pks], on=pks, how='left', indicator=True)['_merge'] == 'left_only']
# df = df.drop_duplicates(subset=pks) # add it if you want to drop any duplicates that you may insert
df.to_sql('my_table', con=con, if_exists='append')
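Because the merge compares only the key columns, genuine nulls elsewhere in df are left alone, which is why this version, unlike the concat/dropna trick above, can insert null values.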
What if you iterated through the rows with DataFrame.iterrows() and, on each iteration, used an upsert keyed on FileName so the row is not added again? (ON DUPLICATE KEY UPDATE is MySQL syntax; on MS SQL Server the per-row equivalent is an IF NOT EXISTS guard or a MERGE statement.)
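A rough sketch of that per-row idea on SQL Server, using an IF NOT EXISTS guard (Calls, FileName, Name, df and engine are the names from the question; this round-trips once per row, so it is only practical for small batches):
import sqlalchemy as sa

insert_if_missing = sa.text(
    "IF NOT EXISTS (SELECT 1 FROM Calls WHERE FileName = :fname) "
    "INSERT INTO Calls (FileName, Name) VALUES (:fname, :name)"
)

with engine.begin() as conn:  # one transaction for the whole batch
    for _, row in df.iterrows():
        conn.execute(insert_if_missing, {'fname': row['FileName'], 'name': row['Name']})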