
How to get pandas dataframe.to_sql to use default values defined by database DDL for NaN?

In my database ddl (SQL Server) I have fields which have default values:

my_database_field varchar(255)  NOT NULL 
     DEFAULT  'NVT'

However, when I try to insert data into this table using DataFrame.to_sql() I get the following error:

Cannot insert the value NULL into column 'my_database_field', table 'MYDB.dbo.my_table'; column does not allow nulls. INSERT fails. 

Here is the code:

with engine.begin() as conn:
    dataframe.to_sql(table_name, conn, index=False, if_exists='append')

The column 'my_database_field' has NaN values:

data['my_database_field']
0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
        ..
4154   NaN
4155   NaN
4156   NaN
4157   NaN
4158   NaN

This seems to be because the to_sql method replaces NaN values with NULLs (which is probably how it should behave).

I want the option to insert NaN values and have the database use the DEFAULT value defined in the schema instead of inserting NULL directly when using the to_sql method.

I've tried replacing all the NaNs in the dataframe with empty strings, but that suffers from a similar problem: the empty strings are propagated to the database and that is what ends up stored, not the default value specified in the DDL.

The only way I can get this to work is to avoid the pandas dataframe.to_sql method entirely and instead load the data from a csv file and write it line by line into the database using basic SQLAlchemy SQL Expressions.
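Roughly what that workaround looks like (a sketch only; the connection string and csv path are placeholders):

import csv

from sqlalchemy import MetaData, Table, create_engine

# Placeholder connection string and csv path.
engine = create_engine("mssql+pyodbc://user:password@my_dsn")
metadata = MetaData()
my_table = Table('my_table', metadata, autoload_with=engine)

with engine.begin() as conn, open("data.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Leave out empty values so those columns are omitted from the
        # INSERT and the database applies the DDL defaults.
        values = {k: v for k, v in row.items() if v not in ("", None)}
        conn.execute(my_table.insert().values(**values))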

I've opened an issue in the pandas github: https://github.com/pandas-dev/pandas/issues/42408

Edit: As per @Larnu's comment, the key bit of information I was missing was that:

A DEFAULT value is only used when you don't specify the column in the INSERT. If you (or the application) explicitly supplies a NULL value, then a NULL value will be inserted. DEFAULT isn't a "use this value when NULL is supplied" property; it's a "use this value when the column is omitted" property.
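To make that concrete, here is a minimal sketch against my table (placeholder connection string; my_id is the key column from the model further down):

from sqlalchemy import create_engine, text

engine = create_engine("mssql+pyodbc://user:password@my_dsn")  # placeholder

with engine.begin() as conn:
    # Column omitted -> SQL Server applies the DEFAULT 'NVT'.
    conn.execute(text("INSERT INTO my_table (my_id) VALUES ('row1')"))

    # Column explicitly supplied as NULL -> the DEFAULT is ignored and the
    # NOT NULL constraint rejects the row; this is what to_sql ends up doing.
    conn.execute(text(
        "INSERT INTO my_table (my_id, my_database_field) VALUES ('row2', NULL)"
    ))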

The problem now becomes how to elegantly handle inserting a large dataframe where some columns contain some NaNs. I'd like to avoid having to go row by row to insert the data with a check to see if any values in that row are NaN.

Edit 2: I solved this problem by using the ORM as follows:

import logging

import numpy as np
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class MyModel(Base):
    __tablename__ = 'my_table'

    my_id = Column(String, primary_key=True)
    my_integer_field = Column(Integer)
    my_database_field = Column(String, server_default='NVT')

    @staticmethod
    def from_dict(input_dict: dict):
        for k, v in input_dict.items():
            logging.debug(f"Key: {k} - Value: {v}")
            if type(v) is str:
                continue
            try:
                if np.isnan(v):
                    input_dict[k] = None
            except TypeError as e:
                logging.debug(e)
                logging.debug(f"Could not test if: {type(v)}:{v} was NaN")

        return __class__(**input_dict)

As you can see, I have added a custom static method called 'from_dict' which builds my model from a dictionary. For each value I check whether it is np.nan and, if it is, I set it to None instead. This seems to make the ORM leave that field out of the insert under the hood and therefore lets the database use the default value. The np.isnan() method only works for numbers, not strings, so for those I just catch the exception, log some info and move on.

For string values it seems I must set the 'server_default' in the model itself. If anyone knows more about what's going on here, please feel free to share.
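For completeness, this is roughly how I push the dataframe through the model (a sketch assuming the engine and dataframe from the question):

from sqlalchemy.orm import Session

with Session(engine) as session:
    # Convert each dataframe row to a dict and let from_dict replace NaNs
    # with None so the ORM falls back to the server defaults.
    for record in dataframe.to_dict(orient="records"):
        session.add(MyModel.from_dict(record))
    session.commit()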

1 Answer

As @Larnu mentions in his comment, the default value will only be used when that column is not named in the INSERT statement. Pandas sees the column name in the dataframe and includes it in the INSERT, so you need to split up your data and drop the columns where you want the defaults applied.

With small data, you can iterate through the rows, call dropna on each row and then call to_sql, as in the sketch below.
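For example (assuming the same df, table_name and conn used in the rest of this answer):

# Drop the NaN entries from each row so those columns are left out of the
# INSERT and the database fills in its defaults.
for _, row in df.iterrows():
    row.dropna().to_frame().T.to_sql(
        table_name, conn, index=False, if_exists='append')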

For larger datasets you could split up your data so that you upload the rows with no NaNs first. Then, for every combination of columns, take the rows where all of those columns are null (and the remaining columns are not), drop the null columns and upload only the columns that do have data. For example, with a dataframe with columns A and B: upload all rows where both values are not null, then upload a dataframe of column A only for the rows where B is null, and vice versa.

import numpy as np
import pandas as pd
import itertools

#create random dataframe of integers
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), 
    columns=['A','B','C','D'])
#turn ~20% of them into NaN's
df = df.mask(np.random.random(df.shape) < .2)

#upload the no nulls data first
df.dropna().to_sql(table_name, conn, index=False, if_exists='append')

#get column names into list
cols = list(df.columns)
#create powerset of column names
p_set = itertools.chain.from_iterable(
    itertools.combinations(cols,r) for r in range(1,len(cols)))

#iterate through all combinations of column names
for null_cols in p_set:
    #determine what the columns are where we DO want values
    not_null_cols = list(set(cols) - set(null_cols))

    #filter original dataframe to rows where ALL the null_cols are 
    # null AND ALL the other cols have a non-null value
    sub_df = df[df[list(null_cols)].isnull().all(1) &
                df[list(not_null_cols)].notnull().all(1)]

    #if this dataframe has values, send it to database
    if len(sub_df) > 0:
        #sub_df still has all columns at this point, only keep the
        #  ones with values
        sub_df = sub_df[not_null_cols]
        sub_df.to_sql(table_name, conn, index=False, if_exists='append')