Reading bigint (int8) column data from Redshift without Scientific Notation using Pandas

I am reading data from Redshift using Pandas. One bigint (int8) column is coming back in exponential (scientific) notation. I tried the following approaches, but they all truncate the data.

A sample value in that column is 635284328055690862. It is being read as 6.352843e+17.

I tried to convert it to int64 in Python:

import numpy as np
df["column_name"] = df["column_name"].astype(np.int64)

The output in this case is 635284328055690880. I am losing data here; the last digits are being rounded.

Expected Output: 635284328055690862

I get the same result even if I do this:

pd.set_option('display.float_format', lambda x: '%.0f' % x)

Output: 635284328055690880

Expected Output: 635284328055690862

It seems like this is normal Pandas behavior. I even tried creating a DataFrame from a list and still get the same result:

import pandas as pd
import numpy as np

pd.set_option('display.float_format', lambda x: '%.0f' % x)
sample_data = [[635284328055690862, 758364950923147626], [np.nan, np.nan], [1, 3]]
df = pd.DataFrame(sample_data)


Output:
0 635284328055690880 758364950923147648
1                nan                nan
2                  1                  3

What I have noticed is that whenever there is a NaN in the DataFrame, this issue appears.
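A quick check in my environment seems to confirm that it is the NaN that forces the whole column to float64:

import numpy as np
import pandas as pd

# Without a NaN the column keeps the int64 dtype and the exact value:
print(pd.DataFrame([[635284328055690862]]).dtypes)            # int64
# As soon as a NaN appears, the column is upcast to float64 and precision is lost:
print(pd.DataFrame([[635284328055690862], [np.nan]]).dtypes)  # float64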

I am using the code below to fetch the data from Redshift:

from sqlalchemy import create_engine 
import pandas as pd  
connstr = 'redshift+psycopg2://<username>:<password>@<cluster_name>/<db_name>' 
engine = create_engine(connstr) 
with engine.connect() as conn, conn.begin():     
    df = pd.read_sql('''select * from schema.table_name''', conn)
print(df)

Please help me fix this. Thanks in advance.


2 Answers

This happens because standard integer datatypes do not provide a way to represent missing data. Since floating point datatypes do provide nan, the old way of handling this was to convert numerical columns with missing data to float.
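To see why the digits change: a 64-bit float has only a 53-bit significand, so integers of this size cannot all be represented exactly and get rounded to the nearest representable value:

>>> 2 ** 53   # beyond this, not every integer has an exact float64 representation
9007199254740992
>>> int(float(635284328055690862))   # the round trip through float64 rounds the value
635284328055690880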

To correct this, pandas introduced a nullable integer data type, Int64 (note the capital I). If you were doing something as simple as reading a CSV, you could explicitly specify this type in your call to read_csv like so:

>>> pandas.read_csv('sample.csv', dtype="Int64")
             column_a  column_b
0  635284328055690880     45564
1                <NA>        45
2                   1      <NA>
3                   1         5

However, the problem persists! It seems that even though 635284328055690862 can be represented as a 64-bit integer, at some point, pandas still passes the value through a floating-point conversion step, changing the value. This is pretty odd, and might even be worth raising as an issue with the pandas developers.

The best workaround I see in this scenario is to use the "object" datatype, like so:

>>> pandas.read_csv('sample.csv', dtype="object")
             column_a column_b
0  635284328055690862    45564
1                 NaN       45
2                   1      NaN
3                   1        5 

This preserves the exact value of the large integer and also allows for NaN values. However, because these are now arrays of Python objects, there will be a significant performance hit for compute-intensive tasks. Furthermore, on closer examination, it appears that the values are Python str objects, so another conversion step is still needed. To my surprise, I found no straightforward way to do that; this was the best I could come up with:

# `sample` is the frame read with the "object" dtype above.
sample = pandas.read_csv('sample.csv', dtype="object")

def col_to_intNA(col):
    # Convert one column of strings to Python ints, mapping nulls to pandas.NA.
    return {ix: pandas.NA if pandas.isnull(v) else int(v)
            for ix, v in col.to_dict().items()}

sample = {col: col_to_intNA(sample[col])
          for col in sample.columns}
sample = pandas.DataFrame(sample, dtype="Int64")

This gives the desired result:

>>> sample
             column_a  column_b
0  635284328055690862     45564
1                <NA>        45
2                   1      <NA>
3                   1         5
>>> sample.dtypes
column_a    Int64
column_b    Int64
dtype: object

So that solves one problem. But a second problem arises, because to read from a Redshift database, you would normally use read_sql, which doesn't provide any way to specify data types.

So we'll roll our own! This is based on the code you posted, as well as some code from the pandas_redshift library. It uses psycopg2 directly, rather than using sqlalchemy, because I am not sure sqlalchemy provides a cursor_factory parameter that accepts a RealDictCursor. Caveat: I have not tested this at all because I am too lazy to set up a postgres database just to test a StackOverflow answer! I think it should work but I am not certain. Please let me know whether it works and/or what needs to be corrected.

import psycopg2
from psycopg2.extras import RealDictCursor  # Turn rows into proper dicts.

import pandas

def row_null_to_NA(row):
    return {col: pandas.NA if pandas.isnull(val) else val
            for col, val in row.items()}

# psycopg2 expects a plain libpq connection string or URI here,
# not the SQLAlchemy-style 'redshift+psycopg2://...' URL.
connstr = 'postgresql://<username>:<password>@<cluster_name>:<port>/<db_name>'

conn = psycopg2.connect(connstr, cursor_factory=RealDictCursor)
try:  # `with conn:` only closes the transaction, not the connection
    cursor = conn.cursor()
    cursor.execute('''select * from schema.table_name''')

    # The DataFrame constructor accepts generators of dictionary rows.
    df = pandas.DataFrame(
        (row_null_to_NA(row) for row in cursor.fetchall()),
        dtype="Int64"
    )
finally:
    conn.close()

print(df)

Note that this assumes all of your columns are integer columns. If they are not, you might need to handle the conversion column by column.
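For example, here is a rough, untested sketch of that column-by-column variant; it would replace the DataFrame construction inside the try block above, and the column names in the loop are placeholders for whichever of your columns are bigint:

rows = [row_null_to_NA(row) for row in cursor.fetchall()]
# Let pandas infer the dtypes first; columns containing pandas.NA stay "object",
# so the big integers are not pushed through float64.
df = pandas.DataFrame(rows)
for col in ("col2",):  # <- placeholder: list your bigint column names here
    df[col] = df[col].astype("Int64")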


One possible fix is, instead of doing select * from schema.table_name, to list all of the columns explicitly and cast the problematic column in SQL.

Let's say you have 5 columns in the table and col2 is the bigint (int8) column. You can cast it to varchar so that pandas reads the exact digits as a string instead of going through float:

from sqlalchemy import create_engine 
import pandas as pd  
connstr = 'redshift+psycopg2://<username>:<password>@<cluster_name>/<db_name>' 
engine = create_engine(connstr) 
with engine.connect() as conn, conn.begin():     
    df = pd.read_sql('''select col1, cast(col2 as varchar) as col2, col3, col4, col5... from schema.table_name''', conn)
print(df)

P.S.: I am not sure this is the smartest solution, but logically, if Python is not able to keep the value as int64 without losing precision, we can read an already-cast value from SQL itself.

Further, I would like to try casting integer columns dynamically whenever their length is more than 17 digits.
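For example, here is a rough, untested sketch of a related variant: look up which columns are bigint from information_schema and cast those to varchar in the generated SELECT (the schema and table names are placeholders):

from sqlalchemy import create_engine
import pandas as pd

connstr = 'redshift+psycopg2://<username>:<password>@<cluster_name>/<db_name>'
engine = create_engine(connstr)
with engine.connect() as conn, conn.begin():
    # Which columns are bigint (int8)?
    cols = pd.read_sql('''select column_name, data_type
                          from information_schema.columns
                          where table_schema = 'schema' and table_name = 'table_name'
                          order by ordinal_position''', conn)
    # Cast only the bigint columns; everything else is selected as-is.
    select_list = ', '.join(
        'cast({0} as varchar) as {0}'.format(c) if t == 'bigint' else c
        for c, t in zip(cols.column_name, cols.data_type)
    )
    df = pd.read_sql('select {} from schema.table_name'.format(select_list), conn)
print(df)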
