Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

REPLACE rows in mysql database table with pandas DataFrame

Python Version - 2.7.6

Pandas Version - 0.17.1

MySQLdb Version - 1.2.5

In my database ( PRODUCT ) , I have a table ( XML_FEED ). The table XML_FEED is huge ( Millions of record ) I have a pandas.DataFrame() ( PROCESSED_DF ). The dataframe has thousands of rows.

Now I need to run this

REPLACE INTO TABLE PRODUCT.XML_FEED
(COL1, COL2, COL3, COL4, COL5),
VALUES (PROCESSED_DF.values)

Question:-

Is there a way to run REPLACE INTO TABLE in pandas? I already checked pandas.DataFrame.to_sql() but that is not what I need. I do not prefer to read XML_FEED table in pandas because it very huge.

like image 841
Yogesh Yadav Avatar asked Jan 07 '16 17:01

Yogesh Yadav


People also ask

Can I use Pandas instead of SQL?

Both Pandas and SQL are essential tools for data scientists and analysts. There are, of course, alternatives for both but they are the predominant ones in the field. Since both Pandas and SQL operate on tabular data, similar operations or queries can be done using both.


3 Answers

With the release of pandas 0.24.0, there is now an official way to achieve this by passing a custom insert method to the to_sql function.

I was able to achieve the behavior of REPLACE INTO by passing this callable to to_sql:

def mysql_replace_into(table, conn, keys, data_iter):
    from sqlalchemy.dialects.mysql import insert
    from sqlalchemy.ext.compiler import compiles
    from sqlalchemy.sql.expression import Insert

    @compiles(Insert)
    def replace_string(insert, compiler, **kw):
        s = compiler.visit_insert(insert, **kw)
        s = s.replace("INSERT INTO", "REPLACE INTO")
        return s

    data = [dict(zip(keys, row)) for row in data_iter]

    conn.execute(table.table.insert(replace_string=""), data)

You would pass it like so:

df.to_sql(db, if_exists='append', method=mysql_replace_into)

Alternatively, if you want the behavior of INSERT ... ON DUPLICATE KEY UPDATE ... instead, you can use this:

def mysql_replace_into(table, conn, keys, data_iter):
    from sqlalchemy.dialects.mysql import insert

    data = [dict(zip(keys, row)) for row in data_iter]

    stmt = insert(table.table).values(data)
    update_stmt = stmt.on_duplicate_key_update(**dict(zip(stmt.inserted.keys(), 
                                               stmt.inserted.values())))

    conn.execute(update_stmt)

Credits to https://stackoverflow.com/a/11762400/1919794 for the compile method.

like image 138
devnull Avatar answered Oct 24 '22 11:10

devnull


Till this version (0.17.1) I am unable find any direct way to do this in pandas. I reported a feature request for the same. I did this in my project with executing some queries using MySQLdb and then using DataFrame.to_sql(if_exists='append')

Suppose

1) product_id is my primary key in table PRODUCT

2) feed_id is my primary key in table XML_FEED.

SIMPLE VERSION

import MySQLdb
import sqlalchemy
import pandas

con = MySQLdb.connect('localhost','root','my_password', 'database_name')
con_str = 'mysql+mysqldb://root:my_password@localhost/database_name'
engine = sqlalchemy.create_engine(con_str) #because I am using mysql
df = pandas.read_sql('SELECT * from PRODUCT', con=engine)
df_product_id = df['product_id']
product_id_str = (str(list(df_product_id.values))).strip('[]')
delete_str = 'DELETE FROM XML_FEED WHERE feed_id IN ({0})'.format(product_id_str)
cur = con.cursor()
cur.execute(delete_str)
con.commit()
df.to_sql('XML_FEED', if_exists='append', con=engine)# you can use flavor='mysql' if you do not want to create sqlalchemy engine but it is depreciated

Please note:- The REPLACE [INTO] syntax allows us to INSERT a row into a table, except that if a UNIQUE KEY (including PRIMARY KEY) violation occurs, the old row is deleted prior to the new INSERT, hence no violation.

like image 25
Yogesh Yadav Avatar answered Oct 24 '22 11:10

Yogesh Yadav


I needed a generic solution to this problem, so I built on shiva's answer--maybe it will be helpful to others. This is useful in situations where you grab a table from a MySQL database (whole or filtered), update/add some rows, and want to perform a REPLACE INTO statement with df.to_sql().

It finds the table's primary keys, performs a delete statement on the MySQL table with all keys from the pandas dataframe, and then inserts the dataframe into the MySQL table.

def to_sql_update(df, engine, schema, table):
    df.reset_index(inplace=True)
    sql = ''' SELECT column_name from information_schema.columns
              WHERE table_schema = '{schema}' AND table_name = '{table}' AND
                    COLUMN_KEY = 'PRI';
          '''.format(schema=schema, table=table)
    id_cols = [x[0] for x in engine.execute(sql).fetchall()]
    id_vals = [df[col_name].tolist() for col_name in id_cols]
    sql = ''' DELETE FROM {schema}.{table} WHERE 0 '''.format(schema=schema, table=table)
    for row in zip(*id_vals):
        sql_row = ' AND '.join([''' {}='{}' '''.format(n, v) for n, v in zip(id_cols, row)])
        sql += ' OR ({}) '.format(sql_row)
    engine.execute(sql)
    
    df.to_sql(table, engine, schema=schema, if_exists='append', index=False)
like image 2
dbc Avatar answered Oct 24 '22 11:10

dbc