duplicate key value violates unique constraint - postgres error when trying to create sql table from dask dataframe

Following on from this question, when I try to create a PostgreSQL table from a dask.dataframe with more than one partition, I get the following error:

IntegrityError: (psycopg2.IntegrityError) duplicate key value violates unique constraint "pg_type_typname_nsp_index"
DETAIL:  Key (typname, typnamespace)=(test1, 2200) already exists.
 [SQL: '\nCREATE TABLE test1 (\n\t"A" BIGINT, \n\t"B" BIGINT, \n\t"C" BIGINT, \n\t"D" BIGINT, \n\t"E" BIGINT, \n\t"F" BIGINT, \n\t"G" BIGINT, \n\t"H" BIGINT, \n\t"I" BIGINT, \n\t"J" BIGINT, \n\tidx BIGINT\n)\n\n']

You can recreate the error with the following code:

import numpy as np
import dask.dataframe as dd
import dask
import pandas as pd
import sqlalchemy_utils as sqla_utils
import sqlalchemy as sqla
DATABASE_CONFIG = {
    'driver': '',
    'host': '',
    'user': '',
    'password': '',
    'port': 5432,
}
DBNAME = 'dask'
url = '{driver}://{user}:{password}@{host}:{port}/'.format(
        **DATABASE_CONFIG)
db_url = url.rstrip('/') + '/' + DBNAME
# create db if non-existent
if not sqla_utils.database_exists(db_url):
    print('Creating database \'{}\''.format(DBNAME))
    sqla_utils.create_database(db_url)
conn = sqla.create_engine(db_url)
# create pandas df with random numbers
df = pd.DataFrame(np.random.randint(0,40,size=(100, 10)), columns=list('ABCDEFGHIJ'))
# add index so that it can be used as primary key later on
df['idx'] = df.index
# create dask df
ddf = dd.from_pandas(df, npartitions=4)
# Write to psql
dto_sql = dask.delayed(pd.DataFrame.to_sql)
out = [dto_sql(d, 'test', db_url, if_exists='append', index=False, index_label='idx')
       for d in ddf.to_delayed()]
dask.compute(*out)

The code doesn't produce an error if npartitions is set to 1, so I'm guessing that Postgres can't handle parallel requests that create and write to the same SQL table. How can I fix this?

asked Jan 24 '19 16:01

Ludo

3 Answers

I was reading this. It seems this error arises when several parallel processes create or update the same table at the same time. As I understand it, the cause is this (as explained in the Google Group discussion).

So I think it comes from PostgreSQL itself, not from the connection driver or the module used for multiprocessing.

The only way I found to solve it is to make the chunks big enough that writing a chunk takes longer than computing it. With bigger chunks the error doesn't occur.
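A possible alternative to enlarging the chunks, offered only as a sketch: pandas' to_sql creates the table when it doesn't exist yet, so several partitions that start at once can each try to create it. Creating the table once, serially, and letting the parallel workers only append should remove that race. The example below illustrates the pattern with the standard-library sqlite3 module standing in for PostgreSQL so that it runs without a server; the two-column schema and the helper names create_table_once and append_chunk are hypothetical.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

TABLE = "test"  # table name taken from the question


def create_table_once(db_path):
    """Step 1, run serially: only this function ever issues CREATE TABLE."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(f"DROP TABLE IF EXISTS {TABLE}")
            conn.execute(f"CREATE TABLE {TABLE} (idx INTEGER, a INTEGER)")
    finally:
        conn.close()


def append_chunk(db_path, rows):
    """Step 2, run in parallel: append only, the table already exists."""
    # timeout makes a writer wait if another connection holds the write lock
    conn = sqlite3.connect(db_path, timeout=30)
    try:
        with conn:
            conn.executemany(f"INSERT INTO {TABLE} (idx, a) VALUES (?, ?)", rows)
    finally:
        conn.close()


# Hypothetical demo data: 100 rows split into four chunks, like npartitions=4
db_path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
rows = [(i, i % 40) for i in range(100)]
chunks = [rows[i:i + 25] for i in range(0, 100, 25)]

create_table_once(db_path)
with ThreadPoolExecutor(max_workers=4) as pool:
    # list() forces evaluation so that any worker exception is re-raised here
    list(pool.map(lambda chunk: append_chunk(db_path, chunk), chunks))

conn = sqlite3.connect(db_path)
row_count = conn.execute(f"SELECT COUNT(*) FROM {TABLE}").fetchone()[0]
conn.close()
```

With the code from the question, the equivalent of step 1 would presumably be a single zero-row write such as df.head(0).to_sql('test', db_url, if_exists='append', index=False), run once before dask.compute(*out). I have not tested that against PostgreSQL.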

answered Oct 03 '22 08:10

Glori P.

In PostgreSQL, this helped me:

set enable_parallel_hash=off;

Afterwards you can turn it back on:

set enable_parallel_hash=on;

answered Oct 03 '22 10:10

Sergey Nakonechny

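I had the same error with ponyORM on PostgreSQL on Heroku. I solved it by holding a lock so that only one thread at a time executes the DB operation. In my case:

import threading

# Create the lock once and share it between all threads;
# a lock created separately in each thread would protect nothing.
lock = threading.Lock()
with lock:
    PonyOrmEntity(name='my_name', description='description')
    PonyOrmEntity.get(lambda u: u.name == 'another_name')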
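To show why the lock helps, here is a minimal, self-contained sketch of the "check whether the table exists, then create it" sequence that parallel writers race on. FakeCatalog and DuplicateTable are hypothetical in-memory stand-ins for the database and its IntegrityError, not part of ponyORM, pandas or psycopg2. The sleep only widens the window between the check and the create.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class DuplicateTable(Exception):
    """Hypothetical stand-in for the IntegrityError raised by the database."""


class FakeCatalog:
    """Hypothetical in-memory stand-in for a database's table catalog."""

    def __init__(self):
        self.tables = {}
        self.create_calls = 0

    def exists(self, name):
        return name in self.tables

    def create(self, name):
        time.sleep(0.01)  # widen the gap between "check" and "create"
        if name in self.tables:
            raise DuplicateTable(name)
        self.tables[name] = []
        self.create_calls += 1

    def append(self, name, rows):
        self.tables[name].extend(rows)


# One lock object, created once and shared by every worker thread
write_lock = threading.Lock()


def write_chunk(catalog, name, rows):
    # The whole check-then-create-then-append sequence runs under the lock,
    # so no second thread can slip in between the check and the create.
    with write_lock:
        if not catalog.exists(name):
            catalog.create(name)
        catalog.append(name, rows)


catalog = FakeCatalog()
chunks = [list(range(i, i + 25)) for i in range(0, 100, 25)]
with ThreadPoolExecutor(max_workers=4) as pool:
    # list() forces evaluation so that any worker exception is re-raised here
    list(pool.map(lambda chunk: write_chunk(catalog, "test", chunk), chunks))
```

Note that a threading.Lock only coordinates threads inside one process. It would not serialize workers that run in separate processes or on separate machines.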

answered Oct 03 '22 09:10

genchev

Tags:

python

pandas

postgresql

dask

pandas-to-sql