I have a Pandas DataFrame (called df), which I would like to upload to a MySQL database.
The dataframe has columns [A, B, C] and the table in the database has columns [ID, A, B, C]. The ID column in the database is the auto-incrementing primary key.
I can upload the dataframe to the database using the df.to_sql('table_name', engine) command. However, this does not give me any information about the values that the database assigned to the ID column of the incoming data. The only way I have of getting this information is by querying the database using the values for columns A, B, C:
select
ID, A, B, C
from db_table
where (A, B, C) in ((x1, y1, z1), (x2, y2, z2), ...)
However, this query takes a very long time when I am inserting a lot of data.
Is there a simpler and quicker way of getting the values that the database assigned to the ID column of the incoming data?
Edit 1: I can assign the ID column myself, as per user3364098's answer below. However, my job is part of a pipeline that is run in parallel. If I assign the ID column myself, there is a chance that I may assign the same ID values to different dataframes that are uploaded at the same time. This is why I would like to delegate the ID assignment task to the database.
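To make the collision concrete, here is a minimal sketch of two parallel workers that both read the same max(ID) before either has uploaded. The helper next_ids and the numbers are hypothetical, chosen only for illustration:

```python
def next_ids(max_id, n_rows):
    """IDs a worker would assign to n_rows new rows, given the max ID it read.

    next_ids is a hypothetical helper, not part of pandas or MySQL.
    """
    return list(range(max_id + 1, max_id + n_rows + 1))


# Both workers query max(ID) before either has uploaded, so both see 10.
ids_worker_1 = next_ids(10, 3)
ids_worker_2 = next_ids(10, 2)

# Any ID present in both lists would violate the primary key on upload.
collisions = sorted(set(ids_worker_1) & set(ids_worker_2))
print(collisions)  # [11, 12]
```

Without a lock or database-side assignment, nothing stops the two ranges from overlapping.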
Solution: I ended up assigning the ID column myself, and issuing a lock on the table while uploading the data in order to guarantee that no other process uploads data with the same id value. Basically:
try:
    # Block other writers while the IDs are computed and the rows are uploaded
    engine.execute('lock tables `table_name` write')
    # ifnull covers an empty table, where max(ID) would be NULL
    max_id_query = 'select ifnull(max(ID), 0) FROM `table_name`'
    max_id = int(pd.read_sql_query(max_id_query, engine).iloc[0, 0])
    df['ID'] = range(max_id + 1, max_id + len(df) + 1)
    df.to_sql('table_name', engine, if_exists='append', index=False)
finally:
    engine.execute('unlock tables')
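The ID arithmetic in the snippet above can be checked without a database. This is a minimal sketch, assuming a hypothetical helper assign_id_range (not part of pandas) and treating an empty table, where max(ID) comes back as NULL, as a maximum of 0:

```python
import pandas as pd


def assign_id_range(df, max_id):
    """Return a copy of df with an ID column continuing after max_id.

    assign_id_range is a hypothetical helper for illustration. max_id is the
    current maximum ID in the table, or None/NaN if the table is empty.
    """
    start = (0 if pd.isna(max_id) else int(max_id)) + 1
    out = df.copy()
    # One consecutive ID per row: start, start + 1, ..., start + len(df) - 1
    out['ID'] = range(start, start + len(out))
    return out


df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
with_ids = assign_id_range(df, max_id=41)
print(with_ids['ID'].tolist())  # [42, 43, 44]
```

The arithmetic is only safe while the table lock is held. Without the lock, another process could read the same maximum between the query and the upload.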
You can assign the ID yourself:
import pandas as pd
next_id = int(pd.read_sql_query('select ifnull(max(id),0)+1 from db_table', cnx).iloc[0, 0])
df['ID'] = range(next_id, next_id + len(df))
where cnx is your connection, and then upload your df.
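For a version of this answer that runs end to end, here is a rough sketch using an in-memory SQLite database as a stand-in for MySQL. The stand-in, the table layout, and the sample rows are assumptions made for illustration; with MySQL, cnx would be your own connection or engine:

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the MySQL connection `cnx` (an assumption
# made so the sketch runs without a database server).
cnx = sqlite3.connect(':memory:')
cnx.execute(
    'create table db_table (ID integer primary key, A integer, B integer, C integer)'
)
cnx.execute('insert into db_table (A, B, C) values (0, 0, 0)')  # existing row, ID 1

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})

# Next free ID: max(ID) + 1, or 1 if the table is empty.
next_id = int(
    pd.read_sql_query('select ifnull(max(ID), 0) + 1 from db_table', cnx).iloc[0, 0]
)
df['ID'] = range(next_id, next_id + len(df))

# Upload with the IDs already filled in, so they are known on the client side.
df.to_sql('db_table', cnx, if_exists='append', index=False)

stored = pd.read_sql_query('select ID, A, B, C from db_table order by ID', cnx)
print(stored['ID'].tolist())  # [1, 2, 3]
```

The sketch runs in a single process. It does nothing to prevent two concurrent uploaders from computing the same next_id.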