
Pyspark Dataframe Creation DecimalType issue

Tags:

pyspark

I am trying to create a PySpark dataframe from a list of dicts and a defined schema. One column in the schema is a DecimalType. When I create the dataframe, I get an error:

TypeError: field b: DecimalType(38,18) can not accept object 0.1 in type <class 'float'>

test_data = [{"a": "1", "b": 0.1}, {"a": "2", "b": 0.2}]
schema = StructType(
    [
        StructField("a", StringType()),
        StructField("b", DecimalType(38, 18)),
    ]
)
# Create a dataframe
df = spark.createDataFrame(data = test_data,
                           schema = schema)

Could someone help with this issue? How can I pass DecimalType data in a list?

Chakra asked Aug 31 '25 11:08

2 Answers

If you can tolerate some loss of accuracy, you can change the type to FloatType, as Bala suggested.
You can also use DoubleType if you need more accuracy: FloatType stores each value in 4 bytes, while DoubleType uses 8 bytes (see here).
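As a rough illustration of that size difference (plain Python, no Spark needed, using the standard `struct` module to emulate 4-byte storage), round-tripping a value through a 4-byte float discards digits that the 8-byte double keeps:

```python
import struct

value = 1.9868968969869869  # Python stores this as an 8-byte double

# Round-trip through a 4-byte IEEE-754 float, the storage FloatType uses
as_float32 = struct.unpack("f", struct.pack("f", value))[0]

print(value)       # full double precision (~16 significant digits)
print(as_float32)  # only ~7 significant digits survive
```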

If you need maximum accuracy, you can use the Decimal class from Python's decimal module, whose default context carries 28 significant digits of precision:

from pyspark.sql.types import *

from decimal import Decimal

# Pass strings, not floats, to Decimal so the values stay exact
test_data = [{"a": "1", "b": Decimal("0.1")}, {"a": "2", "b": Decimal("0.2")}]
schema = StructType(
    [
        StructField("a", StringType()),
        StructField("b", DecimalType(38, 18)),
    ]
)
# Create a dataframe
df = spark.createDataFrame(data = test_data,
                           schema = schema)
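One caveat worth noting (plain Python, no Spark required): constructing a Decimal from a float carries over the float's binary rounding error, so passing the value as a string keeps it exact:

```python
from decimal import Decimal

# From a float: inherits the binary approximation of 0.1
print(Decimal(0.1))    # 0.1000000000000000055511151231257827021181583404541015625

# From a string: exactly 0.1
print(Decimal("0.1"))  # 0.1
```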

If for example we run this code:

from pyspark.sql.types import *
from decimal import Decimal

test_data = [
  (1.9868968969869869045652421846, 1.9868968969869869045652421846, Decimal("1.9868968969869869045652421846")),
]

schema = StructType(
    [
        StructField("float_col", FloatType()),
        StructField("double_col", DoubleType()),
        StructField("decimal_col", DecimalType(38, 28)),
    ]
)
# Create a dataframe
df = spark.createDataFrame(data = test_data,
                           schema = schema)

we would see how many digits of the same value each column retains (the original answer illustrated this with a screenshot of the resulting dataframe).

walking answered Sep 03 '25 23:09

Change it to FloatType

test_data = [{"a": "1", "b": 0.1}, {"a": "2", "b": 0.2}]
schema2 = StructType(
    [
        StructField("a", StringType()),
        StructField("b", FloatType()),
    ]
)

df = spark.createDataFrame(data=test_data, schema=schema2)
df.show()

+---+---+
|  a|  b|
+---+---+
|  1|0.1|
|  2|0.2|
+---+---+
Bala answered Sep 04 '25 01:09