Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reading from SQL Server with params: pandas (or pyodbc) not functioning properly

I'm using a query in SQL Server that requires a range to check whether a number is in that range (e.g. in the below to check if DemographicGroupDimID is either (1,2 or 3) . After doing some googling the only way I found to be able to do that was the below:

SQL

DECLARE @adults table (Id int)
INSERT INTO @adults VALUES (1), (2), (3)

SELECT [date], [station], [impression] = SUM([impressions]) / COUNT(DISTINCT [datetime] )
       FROM 
       (SELECT [datetime] = DATEADD(minute,td.Minute,DATEADD(hour,td.NielsenLocalHour,CONVERT(smalldatetime, ddt.DateKey))), [date] = ddt.DateKey, [station] = nd.Name, [impressions] = SUM(naf.Impression)
       FROM [Nielsen].[dbo].[NielsenAnalyticsFact] as naf
       LEFT JOIN [dbo].[DateDim] AS ddt
       ON naf.StartDateDimID = ddt.DateDimID
       LEFT JOIN [dbo].NetworkDim as nd
       ON naf.NetworkDimID = nd.NetworkDimID
       LEFT JOIN [dbo].TimeDim as td
       ON naf.QuarterHourDimID = td.TimeDimID
       WHERE (naf.NielsenMarketDimID = 1
                     AND naf.RecordTypeDimID = 2
                     AND naf.AudienceEstimateTypeDimID = 1
                     AND naf.DailyOrWeeklyDimID = 1
                     AND naf.RecordSequenceCodeDimID = 5
                     AND naf.ViewingTypeDimID = 4
                     AND naf.QuarterHourDimID IS NOT NULL
                     AND naf.DemographicGroupDimID < 31
                     AND nd.Affiliation = 'Cable'
                     AND naf.NetworkDimID != 1278
                     AND naf.DemographicGroupDimID in (SELECT Id FROM @adults))
       GROUP BY DATEADD(minute,td.Minute,DATEADD(hour,td.NielsenLocalHour,CONVERT(smalldatetime, ddt.DateKey))), nd.Name, ddt.DateKey) 
AS grouped_table
GROUP BY [date], [station]
ORDER BY [date], [station]

If I need to dynamically do this, with different ranges, this fails, like so:

Pandas query

from queries import DB_CREDENTIALS
import pyodbc
import pandas as pd

sql_ = """DECLARE @adults table (Id int)
INSERT INTO @adults VALUES ?

SELECT [date], [station], [impression] = SUM([impressions]) / COUNT(DISTINCT [datetime] )
       FROM
       (SELECT [datetime] = DATEADD(minute,td.Minute,DATEADD(hour,td.NielsenLocalHour,CONVERT(smalldatetime, ddt.DateKey))), [date] = ddt.DateKey, [station] = nd.Name, [impressions] = SUM(naf.Impression)
       FROM [Nielsen].[dbo].[NielsenAnalyticsFact] as naf
       LEFT JOIN [dbo].[DateDim] AS ddt
       ON naf.StartDateDimID = ddt.DateDimID
       LEFT JOIN [dbo].NetworkDim as nd
       ON naf.NetworkDimID = nd.NetworkDimID
       LEFT JOIN [dbo].TimeDim as td
       ON naf.QuarterHourDimID = td.TimeDimID
       WHERE (naf.NielsenMarketDimID = 1
                     AND naf.RecordTypeDimID = 2
                     AND naf.AudienceEstimateTypeDimID = 1
                     AND naf.DailyOrWeeklyDimID = 1
                     AND naf.RecordSequenceCodeDimID = 5
                     AND naf.ViewingTypeDimID = 4
                     AND naf.QuarterHourDimID IS NOT NULL
                     AND naf.DemographicGroupDimID < 31
                     AND nd.Affiliation = 'Cable'
                     AND naf.NetworkDimID != 1278
                     AND naf.DemographicGroupDimID in (SELECT Id FROM @adults))
       GROUP BY DATEADD(minute,td.Minute,DATEADD(hour,td.NielsenLocalHour,CONVERT(smalldatetime, ddt.DateKey))), nd.Name, ddt.DateKey)
AS grouped_table
GROUP BY [date], [station]
ORDER BY [date], [station]"""

with pyodbc.connect(DB_CREDENTIALS) as cnxn:
    df = pd.read_sql(sql=sql_, con=cnxn, params=['(30)'])

The error:

---------------------------------------------------------------------------
DatabaseError                             Traceback (most recent call last)
<ipython-input-5-4b63847d007f> in <module>()
      1 with pyodbc.connect(DB_CREDENTIALS) as cnxn:
----> 2     df = pd.read_sql(sql=sql_, con=cnxn, params=['(30)'])

C:\Users\mburke\AppData\Local\Continuum\Anaconda64\lib\site-packages\pandas\io\sql.pyc in read_sql(sql, con, index_col, coerce_float, params, parse_dates, columns, chunksize)
    497             sql, index_col=index_col, params=params,
    498             coerce_float=coerce_float, parse_dates=parse_dates,
--> 499             chunksize=chunksize)
    500 
    501     try:

C:\Users\mburke\AppData\Local\Continuum\Anaconda64\lib\site-packages\pandas\io\sql.pyc in read_query(self, sql, index_col, coerce_float, params, parse_dates, chunksize)
   1593 
   1594         args = _convert_params(sql, params)
-> 1595         cursor = self.execute(*args)
   1596         columns = [col_desc[0] for col_desc in cursor.description]
   1597 

C:\Users\mburke\AppData\Local\Continuum\Anaconda64\lib\site-packages\pandas\io\sql.pyc in execute(self, *args, **kwargs)
   1570             ex = DatabaseError(
   1571                 "Execution failed on sql '%s': %s" % (args[0], exc))
-> 1572             raise_with_traceback(ex)
   1573 
   1574     @staticmethod

C:\Users\mburke\AppData\Local\Continuum\Anaconda64\lib\site-packages\pandas\io\sql.pyc in execute(self, *args, **kwargs)
   1558                 cur.execute(*args, **kwargs)
   1559             else:
-> 1560                 cur.execute(*args)
   1561             return cur
   1562         except Exception as exc:

DatabaseError: Execution failed on sql 'DECLARE @adults table (Id int)
INSERT INTO @adults VALUES ?

SELECT [date], [station], [impression] = SUM([impressions]) / COUNT(DISTINCT [datetime] )
       FROM
       (SELECT [datetime] = DATEADD(minute,td.Minute,DATEADD(hour,td.NielsenLocalHour,CONVERT(smalldatetime, ddt.DateKey))), [date] = ddt.DateKey, [station] = nd.Name, [impressions] = SUM(naf.Impression)
       FROM [Nielsen].[dbo].[NielsenAnalyticsFact] as naf
       LEFT JOIN [dbo].[DateDim] AS ddt
       ON naf.StartDateDimID = ddt.DateDimID
       LEFT JOIN [dbo].NetworkDim as nd
       ON naf.NetworkDimID = nd.NetworkDimID
       LEFT JOIN [dbo].TimeDim as td
       ON naf.QuarterHourDimID = td.TimeDimID
       WHERE (naf.NielsenMarketDimID = 1
                     AND naf.RecordTypeDimID = 2
                     AND naf.AudienceEstimateTypeDimID = 1
                     AND naf.DailyOrWeeklyDimID = 1
                     AND naf.RecordSequenceCodeDimID = 5
                     AND naf.ViewingTypeDimID = 4
                     AND naf.QuarterHourDimID IS NOT NULL
                     AND naf.DemographicGroupDimID < 31
                     AND nd.Affiliation = 'Cable'
                     AND naf.NetworkDimID != 1278
                     AND naf.DemographicGroupDimID in (SELECT Id FROM @adults))
       GROUP BY DATEADD(minute,td.Minute,DATEADD(hour,td.NielsenLocalHour,CONVERT(smalldatetime, ddt.DateKey))), nd.Name, ddt.DateKey)
AS grouped_table
GROUP BY [date], [station]
ORDER BY [date], [station]': ('42000', "[42000] [Microsoft][ODBC SQL Server Driver][SQL Server]Incorrect syntax near '@P1'. (102) (SQLExecDirectW); [42000] [Microsoft][ODBC SQL Server Driver][SQL Server]Statement(s) could not be prepared. (8180)")

Is this because the declare statement needs to be within the bounds of the select statement itself? I'm not sure how pandas handles the pyodbc cursor object so I'm unsure where this error stems from.

Edit: Just to note, the param I passed in this instance was (30) just to use the simple case of when there is only one number in the range which fails. It of course also fails for more complex strings like (1), (2), (3) as was the case with the example above.

like image 231
mburke05 Avatar asked May 05 '16 13:05

mburke05


1 Answers

If you use prepared statements in your SQL, you can't put multiple values for one placeholder/parameter/bind variable!

Beside this you can use placeholders/parameters/bind variables only in place of literals, you can't use it for part of SQL statement which is not a literal.

In your case you tried to put ( and ) which is part of SQL, but not a literal as parameters.

Using parameters/prepared statements/bind variable will also protect you from some SQL injections.

that said, try to change your code as follows:

change

INSERT INTO @adults VALUES ?

to

INSERT INTO @adults VALUES (?)

and

df = pd.read_sql(sql=sql_, con=cnxn, params=['(30)'])

to

df = pd.read_sql(sql=sql_, con=cnxn, params=['30'])

UPDATE:

you can prepare your SQL this way:

In [9]: vals = [20,30,40]

In [32]: vals
Out[32]: [20, 30, 40]

In [33]: ' (?)' * len(vals)
Out[33]: ' (?) (?) (?)'

then:

In [14]: sql_ = """DECLARE @adults table (Id int)
   ....: INSERT INTO @adults VALUES {}
   ....:
   ....: SELECT [date],
   ....: """

In [15]: sql_.format(' (?)' * len(vals))
Out[15]: 'DECLARE @adults table (Id int)\nINSERT INTO @adults VALUES (?) (?) (?)\n\nSELECT [date],\n'

Pay attention at generated (?) (?) (?)

and finally call your SQL:

df = pd.read_sql(sql=sql_.format(' (?)' * len(vals)), con=cnxn, params=vals)
like image 94
MaxU - stop WAR against UA Avatar answered Oct 21 '22 18:10

MaxU - stop WAR against UA