How does Pandas to_sql determine what dataframe column is placed into what database field?

I'm currently using Pandas to_sql in order to place a large dataframe into an SQL database. I'm using sqlalchemy in order to connect with the database and part of that process is defining the columns of the database tables.

My question is, when I'm running to_sql on a dataframe, how does it know what column from the dataframe goes into what field in the database? Is it looking at column names in the dataframe and looking for the same fields in the database? Is it the order that the variables are in?

Here's some example code to facilitate discussion:

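import pandas as pd
from sqlalchemy import Column, Integer, MetaData, Numeric, Table, create_engine

engine = create_engine('sqlite:///store_data.db')
meta = MetaData()

# Define the target table up front so the column types are explicit
table_pop = Table('xrf_str_geo_ta4_1511', meta,
    Column('TDLINX', Integer, nullable=True, index=True),
    Column('GEO_ID', Integer, nullable=True),
    Column('PERCINCL', Numeric, nullable=True)
)

meta.create_all(engine)

# `file` is the path to the CSV, defined elsewhere.
# Read it in chunks and append each chunk to the table defined above;
# index=False keeps the dataframe index out of the table.
for df in pd.read_csv(file, chunksize=50000, iterator=True, encoding='utf-8', sep=','):
    df.to_sql('xrf_str_geo_ta4_1511', engine, if_exists='append', index=False)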

The dataframe in question has 3 columns: TDLINX, GEO_ID, and PERCINCL.

Alexander Moore asked Jan 13 '16 15:01


1 Answer

The answer is indeed what you suggest: it looks at the column names. So matching column names is important; the order does not matter.
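A minimal sketch of that name-based matching, assuming an in-memory SQLite database and made-up values (neither is from the question). The dataframe columns are deliberately in a different order than the table's fields, and each value should still end up in the field with the same name:

```python
import sqlite3

import pandas as pd

# In-memory database, used here only for illustration
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE xrf_str_geo_ta4_1511 ("
    "TDLINX INTEGER, GEO_ID INTEGER, PERCINCL NUMERIC)"
)

# Made-up rows; column order differs from the table definition on purpose
df = pd.DataFrame({
    "PERCINCL": [0.25, 0.75],
    "TDLINX": [1001, 1002],
    "GEO_ID": [7, 8],
})

# Append to the existing table without writing the dataframe index
df.to_sql("xrf_str_geo_ta4_1511", con, if_exists="append", index=False)

# Read the rows back in the table's own field order
rows = con.execute(
    "SELECT TDLINX, GEO_ID, PERCINCL FROM xrf_str_geo_ta4_1511 ORDER BY TDLINX"
).fetchall()
print(rows)
```

Reading the rows back in table order is just a quick way to check that nothing was placed by position.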

To be fully correct, pandas does not actually check this itself. Under the hood, to_sql executes an insert statement in which the data to insert is provided as a dict, and it is then up to the database driver to handle this.
This also means that pandas will not check the dtypes or the number of columns (e.g. if not all fields of the database are present as columns in the dataframe, these will be filled with a default value in the database for these rows).
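To illustrate the idea without pandas, here is a rough sketch using only the standard-library sqlite3 module. It is not pandas' actual internals, only the same kind of insert: rows are dicts keyed by column name, and a field that is absent from the data falls back to the database default. The table name demo, the values, and the DEFAULT 0 are assumptions made for the example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical table; PERCINCL gets a default so the fallback is visible
con.execute(
    "CREATE TABLE demo ("
    "TDLINX INTEGER, GEO_ID INTEGER, PERCINCL NUMERIC DEFAULT 0)"
)

# One dict per row, keyed by column name; PERCINCL is missing on purpose
records = [
    {"GEO_ID": 7, "TDLINX": 1001},
    {"GEO_ID": 8, "TDLINX": 1002},
]

# Build an INSERT that names only the columns present in the data
columns = list(records[0])
sql = "INSERT INTO demo ({}) VALUES ({})".format(
    ", ".join(columns),
    ", ".join(":" + name for name in columns),
)
con.executemany(sql, records)

result = con.execute(
    "SELECT TDLINX, GEO_ID, PERCINCL FROM demo ORDER BY TDLINX"
).fetchall()
print(result)
```

Because the statement names its columns, the database places each value by name and fills PERCINCL with its default; nothing on the Python side validates the match.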

joris answered Oct 19 '22 11:10