 

Pandas to_sql - Increase table's index when appending DataFrame

I've been working on a product that centers on the daily execution of a Python 3.7.0 data-analysis script. Every day at midnight it processes a huge amount of data and then exports the result to two MySQL tables. The first contains only the data for the current day, while the other contains the concatenated data of all executions.

To exemplify what I currently have, see the code below, where df stands in for the final DataFrame generated by the data analysis:

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine(r"mysql+pymysql://user:psswd@localhost/pathToMyDB")

df = pd.DataFrame({'Something':['a','b','c']})

df.to_sql('DReg', engine, index=True, if_exists='replace')  # daily database
df.to_sql('AReg', engine, index=False, if_exists='append')  # annual database

As you can see in the parameters of my second to_sql call, I'm not setting an index on the annual database. However, my manager asked me to add one, following a simple rule: an auto-incrementing numeric index that automatically assigns every row saved to the database a number corresponding to its position.

So basically, the first time I saved df, the database should look like:

index   Something
0       a
1       b
2       c

And in my second execution:

index   Something
0       a
1       b
2       c
3       a
4       b
5       c

However, when I set index to True in the second df.to_sql call (turning it into df.to_sql('AReg', engine, index=True, if_exists='append')), after two executions my database ends up looking like:

index   Something
0       a
1       b
2       c
0       a
1       b
2       c

I did some research, but could not find a way to make the index auto-increment. I considered reading the annual table at every execution and then adapting my DataFrame's index to it, but that table can easily get REALLY huge, which would make the execution absurdly slow (and would also prevent me from running the same data analysis simultaneously on two computers without compromising the index).

So what is the best solution to make this index work? What am I missing here?

Pedro Martins de Souza asked Feb 21 '19

2 Answers

Even though Pandas has a lot of export options, its main purpose is not to serve as a database-management API. Managing indexes is typically something the database itself should take care of.

I would suggest setting index=False, if_exists='append' and creating the table with an auto-increment index:

CREATE TABLE AReg (
     id INT NOT NULL AUTO_INCREMENT,
     # your fields here
     PRIMARY KEY (id)
);
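The Python side then stays exactly as in the question, just with index=False on both calls. A minimal, self-contained sketch of the idea (it uses an in-memory SQLite database instead of MySQL so it runs anywhere; the AUTOINCREMENT syntax differs slightly, but the to_sql usage is the same):

```python
import pandas as pd
import sqlalchemy
from sqlalchemy import text

# In-memory SQLite stand-in for the MySQL database; swap in your
# mysql+pymysql URL in production.
engine = sqlalchemy.create_engine("sqlite://")

# Create the table with an auto-increment primary key, as in the answer.
with engine.begin() as conn:
    conn.execute(text("""
        CREATE TABLE AReg (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            Something TEXT
        )
    """))

df = pd.DataFrame({'Something': ['a', 'b', 'c']})

# index=False: the database assigns the ids, so they keep
# incrementing across executions.
df.to_sql('AReg', engine, index=False, if_exists='append')
df.to_sql('AReg', engine, index=False, if_exists='append')

result = pd.read_sql('select * from AReg', engine)
print(result['id'].tolist())  # ids continue across the two appends
```

Because the id is generated inside the database, two machines appending at the same time cannot produce duplicate ids, which also resolves the concurrency concern from the question.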
tvgriek answered Oct 20 '22


Here is my solution: SQL + Python.

Use SQL to get the max index id instead of reading the whole table; it is fast and puts only a light load on both the database and Python.

To keep ids unique in multi-user/multi-session cases, the new ids need to be read from a database sequence.

It is best to design the table with an auto-incrementing id. If that is not possible, all new ids should come from a sequence instance in the database: a sequence guarantees the ids stay unique even when multiple users or sessions read it concurrently.

In MySQL we have to get the max id manually; in Oracle or PostgreSQL we can draw ids from a sequence with a single SQL statement.
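The MySQL "get max id manually" variant mentioned above might look like the sketch below (table and column names are hypothetical; an in-memory SQLite database is used only so the sketch is self-contained, since the same query works on MySQL). Note the caveat from above still applies: unlike a sequence, MAX(id) is not safe against two sessions inserting at the same time.

```python
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("sqlite://")

# Seed the table with some existing rows (ids 1..3).
pd.DataFrame({'id': [1, 2, 3], 'value': ['a', 'b', 'c']}).to_sql(
    'AReg', engine, index=False, if_exists='replace')

new_rows = pd.DataFrame({'value': ['d', 'e']})

# One cheap scalar query instead of reading the whole table.
max_id = pd.read_sql('select coalesce(max(id), 0) as id from AReg',
                     engine)['id'].iloc[0]

# Continue the numbering from the current maximum.
new_rows.insert(0, 'id', range(max_id + 1, max_id + 1 + len(new_rows)))
new_rows.to_sql('AReg', engine, index=False, if_exists='append')

print(pd.read_sql('select id from AReg', engine)['id'].tolist())
```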

import pandas as pd
from pprint import pprint
from sqlalchemy import create_engine


db_name = 'temp'
table_name = 'tmp_table'
index_name = 'id'
mysql_url = f'mysql+mysqlconnector://root:<password>@127.0.0.1:13306/{db_name}'
engine = create_engine(mysql_url)

def to_sql_seq(df, table_name=table_name, engine=engine):
    # Draw one id per row from a database sequence (Oracle-style syntax;
    # on MySQL use the commented-out MAX(id) query below instead).
    get_seq_id_sql = f"""
                       select your_sequence.nextval as id
                        from dual
                         connect by level <= {df.shape[0]}
                     """

    # sql_get_max_id = f'select max({index_name}) as id from {table_name}'

    s_id = pd.read_sql(get_seq_id_sql, engine)

    df.index = s_id['id'].values
    df.index.name = index_name
    df.to_sql(table_name, engine, if_exists='append')
    return

# Check the current database records
current_table = pd.read_sql(f"select * from {table_name}", engine)
pprint(current_table)

# Simulate the new data
new_data = [1, 2, 3, 4]
new_table = pd.DataFrame(new_data, columns=['value'])
to_sql_seq(new_table)

# Show the auto-incremented index result
inserted_table = pd.read_sql(f'select * from {table_name}', engine)
pprint(inserted_table)

And the output:

   id  value
0   1    123
1   2    234
2   3      1
3   4      2
4   5      3
5   6      4
   id  value
0   1    123
1   2    234
2   3      1
3   4      2
4   5      3
5   6      4
6   7      1
7   8      2
8   9      3
9  10      4
Yong Wang answered Oct 20 '22