I am reading data from Redshift using pandas. I have one bigint (int8) column which is coming through in exponential notation. I tried the following approaches, but I am getting data truncation in each case.
A sample value in that column is: 635284328055690862. It is being read as 6.352843e+17.
I tried converting it to int64 in Python:
import numpy as np
df["column_name"] = df["column_name"].astype(np.int64)
Output in this case is: 635284328055690880. Here I am losing data; the last digits are being rounded away so the value ends in 0.
Expected Output: 635284328055690862
I am getting the same result even if I do this:
pd.set_option('display.float_format', lambda x: '%.0f' % x)
Output: 635284328055690880
Expected Output: 635284328055690862
It seems like this is normal pandas behavior. I even tried creating a DataFrame from a list and still get the same result.
import pandas as pd
import numpy as np
pd.set_option('display.float_format', lambda x: '%.0f' % x)
sample_data = [[635284328055690862, 758364950923147626], [np.NaN, np.NaN], [1, 3]]
df = pd.DataFrame(sample_data)
Output:
0 635284328055690880 758364950923147648
1 nan nan
2 1 3
What I have noticed is that whenever there is a nan in the DataFrame, this issue occurs.
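Here is a minimal check of that: the presence of nan upcasts the whole column to float64, and the float64 conversion is what rounds the value.
import pandas as pd
import numpy as np
# Without nan, the column keeps the exact int64 value.
print(pd.DataFrame([635284328055690862, 1]).dtypes)       # int64
# With nan, the column is upcast to float64, which is where the rounding happens.
print(pd.DataFrame([635284328055690862, np.nan]).dtypes)  # float64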
I am using the code below to fetch data from Redshift.
from sqlalchemy import create_engine
import pandas as pd
connstr = 'redshift+psycopg2://<username>:<password>@<cluster_name>/<db_name>'
engine = create_engine(connstr)
with engine.connect() as conn, conn.begin():
    df = pd.read_sql('''select * from schema.table_name''', conn)
print(df)
Please help me fix this. Thanks in advance.
This happens because standard integer datatypes do not provide a way to represent missing data. Since floating point datatypes do provide nan, the old way of handling this was to convert numerical columns with missing data to float.
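To see why that float detour changes this particular value: float64 has a 53-bit significand, so integers above 2**53 cannot all be represented exactly and are rounded to the nearest representable value. A quick demonstration:
x = 635284328055690862
print(x > 2**53)      # True: beyond the range where every integer fits in a float
print(int(float(x)))  # 635284328055690880 -- exactly the rounded value in the question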
To correct this, pandas has introduced a nullable integer data type. If you were doing something as simple as reading a csv, you could explicitly specify this type in your call to read_csv like so:
>>> pandas.read_csv('sample.csv', dtype="Int64")
column_a column_b
0 635284328055690880 45564
1 <NA> 45
2 1 <NA>
3 1 5
However, the problem persists! It seems that even though 635284328055690862 can be represented as a 64-bit integer, at some point pandas still passes the value through a floating-point conversion step, changing the value. This is pretty odd, and might even be worth raising as an issue with the pandas developers.
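For what it's worth, the nullable type itself holds the value exactly when built directly from Python integers, which suggests the parsing step, not Int64, is at fault. A small sketch:
import pandas
# Int64 (capital I) stores real 64-bit integers plus a separate mask for
# missing values, so <NA> no longer forces a conversion to float.
arr = pandas.array([635284328055690862, None], dtype="Int64")
print(arr)  # [635284328055690862, <NA>] -- the exact value survives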
The best workaround I see in this scenario is to use the "object" datatype, like so:
>>> pandas.read_csv('sample.csv', dtype="object")
column_a column_b
0 635284328055690862 45564
1 NaN 45
2 1 NaN
3 1 5
This preserves the exact value of the large integer and also allows for NaN values. However, because these are now arrays of Python objects, there will be a significant performance hit for compute-intensive tasks. Furthermore, on closer examination, it appears that these are Python str objects, so we still need another conversion step. To my surprise, there was no straightforward approach. This was the best I could do:
import pandas

def col_to_intNA(col):
    # Convert each value to a real Python int, mapping nulls to pandas.NA.
    return {ix: pandas.NA if pandas.isnull(v) else int(v)
            for ix, v in col.to_dict().items()}

sample = {col: col_to_intNA(sample[col])
          for col in sample.columns}
sample = pandas.DataFrame(sample, dtype="Int64")
This gives the desired result:
>>> sample
column_a column_b
0 635284328055690862 45564
1 <NA> 45
2 1 <NA>
3 1 5
>>> sample.dtypes
column_a Int64
column_b Int64
dtype: object
So that solves one problem. But a second problem arises, because to read from a Redshift database, you would normally use read_sql, which doesn't provide any way to specify data types.
So we'll roll our own! This is based on the code you posted, as well as some code from the pandas_redshift library. It uses psycopg2 directly, rather than sqlalchemy, because I am not sure sqlalchemy provides a cursor_factory parameter that accepts a RealDictCursor. Caveat: I have not tested this at all because I am too lazy to set up a postgres database just to test a StackOverflow answer! I think it should work but I am not certain. Please let me know whether it works and/or what needs to be corrected.
import psycopg2
from psycopg2.extras import RealDictCursor  # Turn rows into proper dicts.
import pandas

def row_null_to_NA(row):
    return {col: pandas.NA if pandas.isnull(val) else val
            for col, val in row.items()}

# psycopg2 takes a plain libpq connection URL, not the sqlalchemy dialect string.
connstr = 'postgresql://<username>:<password>@<cluster_name>/<db_name>'

conn = psycopg2.connect(connstr, cursor_factory=RealDictCursor)
try:  # `with conn:` only closes the transaction, not the connection
    cursor = conn.cursor()
    cursor.execute('''select * from schema.table_name''')
    # The DataFrame constructor accepts generators of dictionary rows.
    df = pandas.DataFrame(
        (row_null_to_NA(row) for row in cursor.fetchall()),
        dtype="Int64"
    )
finally:
    conn.close()

print(df)
Note that this assumes that all your columns are integer columns. You might need to load the data column-by-column if not.
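If not all columns are integers, a column-by-column variant could look like this (same untested caveat; int_cols is a hypothetical list naming the bigint columns):
import pandas

def frame_with_int_cols(rows, int_cols):
    # Build the frame as plain objects first so nothing is coerced to float,
    # then convert only the named columns to the nullable Int64 type.
    df = pandas.DataFrame(rows, dtype="object")
    for col in int_cols:
        df[col] = pandas.array(
            [pandas.NA if pandas.isnull(v) else int(v) for v in df[col]],
            dtype="Int64")
    return df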
One fix can be: instead of doing select * from schema.table_name, you can list all the columns separately and cast the particular column (to varchar, so pandas reads it as a string and keeps every digit). Let's say you have 5 columns in the table and col2 is the bigint (int8) column. Then you can read it like below:
from sqlalchemy import create_engine
import pandas as pd
connstr = 'redshift+psycopg2://<username>:<password>@<cluster_name>/<db_name>'
engine = create_engine(connstr)
with engine.connect() as conn, conn.begin():
    df = pd.read_sql('''select col1, cast(col2 as varchar) as col2, col3, col4, col5... from schema.table_name''', conn)
print(df)
P.S.: I am not sure this is the smartest solution, but logically, if Python is not able to cast to int64 properly, then we can read the casted value from SQL itself.
Further, I would like to try casting int columns dynamically if their length is more than 17 digits.
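A sketch of what that dynamic casting could look like, assuming the table is visible in information_schema.columns (untested; the schema and table names are placeholders):
from sqlalchemy import create_engine
import pandas as pd

connstr = 'redshift+psycopg2://<username>:<password>@<cluster_name>/<db_name>'
engine = create_engine(connstr)
with engine.connect() as conn, conn.begin():
    # Look up each column's type, then cast bigint columns to varchar so
    # pandas receives them as strings and keeps every digit.
    cols = pd.read_sql('''select column_name, data_type
                          from information_schema.columns
                          where table_schema = 'schema' and table_name = 'table_name'
                          order by ordinal_position''', conn)
    select_list = ', '.join(
        'cast(%s as varchar) as %s' % (name, name) if dtype == 'bigint' else name
        for name, dtype in zip(cols.column_name, cols.data_type))
    df = pd.read_sql('select %s from schema.table_name' % select_list, conn)
print(df)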