I've been working to develop a product that centers on the daily execution of a data analysis Python 3.7.0 script. Every day at midnight it will process a huge amount of data and then export the result to two MySQL tables. The first one will only contain the data relative to the current day, while the other table will contain the concatenated data of all executions.
To exemplify what I currently have, see the code below, supposing df would be the final DataFrame generated from the data analysis:
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine(r"mysql+pymysql://user:psswd@localhost/pathToMyDB")
df = pd.DataFrame({'Something': ['a', 'b', 'c']})

df.to_sql('DReg', engine, index=True, if_exists='replace')   # daily database
df.to_sql('AReg', engine, index=False, if_exists='append')   # annual database
As you can see in the parameters of my second to_sql call, I'm not setting an index for the annual database. However, my manager asked me to do so, creating an index based on a simple rule: it should be an auto-incrementing numeric index that automatically assigns a number to every row saved in the database, corresponding to its position.
So basically, the first time I saved df, the database should look like:
index Something
0 a
1 b
2 c
And in my second execution:
index Something
0 a
1 b
2 c
3 a
4 b
5 c
However, when I set my index to True in the second df.to_sql command (turning it into df.to_sql('AReg', engine, index=True, if_exists='append')), after two executions my database ends up looking like:
index Something
0 a
1 b
2 c
0 a
1 b
2 c
I did some research but could not find a way to get this auto-increment to work on the index. I considered reading the annual database at every execution and then adapting my DataFrame's index to it, but my database can easily get REALLY huge, which would make the execution absurdly slow (and also prevent me from simultaneously running the same data analysis on two computers without compromising my index).
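For reference, a rough sketch of the idea I considered (using the same AReg table and column from my example above); the full read of AReg is exactly the part I'm worried about:

# Read the whole annual table just to learn how many rows it already has,
# then offset the new index by that count.
existing = pd.read_sql('SELECT * FROM AReg', engine)  # this read gets huge over time
df.index = range(len(existing), len(existing) + len(df))
df.index.name = 'index'
df.to_sql('AReg', engine, index=True, if_exists='append')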
So what is the best solution to make this index work? What am I missing here?
Even though Pandas has a lot of export options, its main purpose is not to serve as a database management API. Managing indexes is typically something the database should take care of.
I would suggest setting index=False, if_exists='append' and creating the table with an auto-increment index:
CREATE TABLE AReg (
  id INT NOT NULL AUTO_INCREMENT,
  # your fields here
  PRIMARY KEY (id)
);
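On the Python side the call then stays almost the same as in the question; a minimal sketch, assuming the AReg table above already exists and its columns match the DataFrame:

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine(r"mysql+pymysql://user:psswd@localhost/pathToMyDB")
df = pd.DataFrame({'Something': ['a', 'b', 'c']})

# Only the data columns are sent; MySQL fills in the auto-incrementing id itself.
df.to_sql('AReg', engine, index=False, if_exists='append')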
Here is my solution: SQL + Python.
The next id needs to be read from the database (a sequence, or the current max id) to ensure unique ids in multi-user/multi-session cases.
In MySQL we have to read the max id manually; in Oracle or PostgreSQL we can get the next sequence ids with a single SQL statement.
import pandas as pd
from pprint import pprint
from sqlalchemy import create_engine

db_name = 'temp'
table_name = 'tmp_table'
index_name = 'id'
mysql_url = f'mysql+mysqlconnector://root:[email protected]:13306/{db_name}'
engine = create_engine(mysql_url)
def to_sql_seq(df, table_name=table_name, engine=engine):
    # MySQL has no sequences, so read the current max id manually and continue from there.
    # On Oracle or PostgreSQL you could instead fetch the next ids from a sequence, e.g.:
    #   select your_sequence.nextval as id
    #   from dual
    #   connect by level <= <number of rows to insert>
    max_id = pd.read_sql(f'select max({index_name}) as id from {table_name}', engine)['id'].iloc[0]
    start = 1 if pd.isna(max_id) else int(max_id) + 1
    df.index = range(start, start + df.shape[0])
    df.index.name = index_name
    df.to_sql(table_name, engine, if_exists='append')
# Check the current database records
current_table = pd.read_sql(f'select * from {table_name}', engine)
pprint(current_table)

# Simulate the new data
new_data = [1, 2, 3, 4]
new_table = pd.DataFrame(new_data, columns=['value'])
to_sql_seq(new_table)

# Show the auto-incremented index result
inserted_table = pd.read_sql(f'select * from {table_name}', engine)
pprint(inserted_table)
And the output:
id value
0 1 123
1 2 234
2 3 1
3 4 2
4 5 3
5 6 4
id value
0 1 123
1 2 234
2 3 1
3 4 2
4 5 3
5 6 4
6 7 1
7 8 2
8 9 3
9 10 4