I am trying to read a flat file, assign column names to the DataFrame, and cast a few columns as required. However, casting a column to Decimal gives me an error in Polars. I implemented the same logic successfully in Spark, but I need help doing the equivalent in Polars.
Sample data:
data = b"""
D|120277|Patricia|167.2|26/12/1982
D|757773|Charles|167.66|08/04/2019
D|248498|Katrina|168.68|20/11/2016
D|325561|Christina|170.86|05/10/1998
D|697464|Joshua|171.41|07/09/1970
D|244750|Zachary|169.43|08/12/2014
""".strip()
Polars script:
import polars as pl
cols_dict = {'column_1': 'rtype', 'column_2': 'EMP_ID', 'column_3': 'EMP_NAME', 'column_4': 'SALARY', 'column_5': 'DOB'}
df = pl.read_csv(data, separator='|', has_header=False)
df = df.select(pl.all().name.map(lambda col_name: cols_dict.get(col_name)))
df = df.with_columns(
pl.col('EMP_ID').cast(pl.Decimal(scale=6, precision=0)))
It throws this exception:
InvalidOperationError: conversion from `i64` to `decimal[1,6]` failed in column 'EMP_ID' for 5 out of 5 values: [120277, 757773, … 697464]
Spark script:
rdd = sparkContext.textFile(core_data_file).filter(lambda x: x[0] == "D").map(
    lambda x: x.split('|'))
sparkDf = rdd.toDF(schema=["rtype"] + list(cols_dict.values()))
sparkDf = sparkDf.withColumn(col_name, coalesce(
    col(col_name).cast(DecimalType(data_length, data_scale)), lit(0)))
sparkDf.show()
+-----+------+---------+------+-------------------+
|rtype|EMP_ID|EMP_NAME |SALARY|DOB |
+-----+------+---------+------+-------------------+
|D |120277|Patricia |167.20|1982-12-26 00:00:00|
|D |757773|Charles |167.66|2019-04-08 00:00:00|
|D |248498|Katrina |168.68|2016-11-20 00:00:00|
|D |325561|Christina|170.86|1998-10-05 00:00:00|
|D |697464|Joshua |171.41|1970-09-07 00:00:00|
|D |244750|Zachary |169.43|2014-12-08 00:00:00|
+-----+------+---------+------+-------------------+
In Polars, precision and scale are reversed relative to most other SQL-like definitions.
In your Spark code, DecimalType expects precision first, then scale:
col(col_name).cast(DecimalType(data_length, data_scale))
However, because the Polars Decimal has a default of None for precision, and no default for scale, the argument order is reversed: scale comes first. Your cast to pl.Decimal(scale=6, precision=0) therefore produced decimal[1,6], which cannot hold a six-digit integer.
# A decimal[6,0]
pl.Decimal(0, 6)
pl.Decimal(scale=0, precision=6)
If the decimal configuration is enabled, and precision and scale are set appropriately, the cast will work:
pl.Config(activate_decimals=True)
df = df.with_columns(pl.col("EMP_ID").cast(pl.Decimal(precision=6, scale=0)))
print(df)
shape: (6, 5)
┌───────┬──────────────┬───────────┬────────┬─────────────┐
│ rtype ┆ EMP_ID ┆ EMP_NAME ┆ SALARY ┆ DOB │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ decimal[6,0] ┆ str ┆ f64 ┆ str │
╞═══════╪══════════════╪═══════════╪════════╪═════════════╡
│ D ┆ 120277 ┆ Patricia ┆ 167.2 ┆ 26/12/1982 │
│ D ┆ 757773 ┆ Charles ┆ 167.66 ┆ 08/04/2019 │
│ D ┆ 248498 ┆ Katrina ┆ 168.68 ┆ 20/11/2016 │
│ D ┆ 325561 ┆ Christina ┆ 170.86 ┆ 05/10/1998 │
│ D ┆ 697464 ┆ Joshua ┆ 171.41 ┆ 07/09/1970 │
│ D ┆ 244750 ┆ Zachary ┆ 169.43 ┆ 08/12/2014 │
└───────┴──────────────┴───────────┴────────┴─────────────┘