Casting DataFrame columns with validation in Spark

I need to cast the columns of a DataFrame, which all contain string values, to the data types defined in a schema. While doing the cast, the corrupt records (values that do not match the target data type) need to be put into a separate column.

Example DataFrame:

+---+----------+-----+
|id |name      |class|
+---+----------+-----+
|1  |abc       |21   |
|2  |bca       |32   |
|3  |abab      | 4   |
|4  |baba      |5a   |
|5  |cccca     |     |
+---+----------+-----+

JSON schema of the file:

 {"definitions":{},"$schema":"http://json-schema.org/draft-07/schema#","$id":"http://example.com/root.json","type":["object","null"],"required":["id","name","class"],"properties":{"id":{"$id":"#/properties/id","type":["integer","null"]},"name":{"$id":"#/properties/name","type":["string","null"]},"class":{"$id":"#/properties/class","type":["integer","null"]}}}

Here row 4 is the corrupt record, because the class column should be of type integer but contains 5a. Only that record should end up in the corrupt-record column, not row 5, whose class is simply empty.
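
For reference, the example DataFrame can be reproduced with all columns read as strings like this (a minimal sketch, assuming a Spark shell / SparkSession named spark; row 5's blank class is represented as null here, which is an assumption):

import spark.implicits._

val df = Seq(
  ("1", "abc",   "21"),
  ("2", "bca",   "32"),
  ("3", "abab",  "4"),
  ("4", "baba",  "5a"),
  ("5", "cccca", null)   // blank class from the example, assumed to be null
).toDF("id", "name", "class")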


1 Answer

Just check whether the value is not NULL before the cast but NULL after the cast:

import org.apache.spark.sql.functions.when
import spark.implicits._  // for the $"..." column syntax (spark is the SparkSession)

df
  // cast the raw string column; values that cannot be parsed become null
  .withColumn("class_integer", $"class".cast("integer"))
  // keep the original value only when it was present but failed the cast
  .withColumn(
    "class_corrupted",
    when($"class".isNotNull and $"class_integer".isNull, $"class"))

Repeat for each column / cast you need.
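
If you prefer not to write that by hand for every field, one way is to drive the same check from the target schema. This is only a sketch: the StructType below is a manual mapping of the JSON schema above, and the helper name castWithValidation and the _corrupt_record column are just illustrative.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat_ws, when}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// target types corresponding to the JSON schema above (assumed mapping)
val targetSchema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("class", IntegerType, nullable = true)
))

def castWithValidation(df: DataFrame, schema: StructType): DataFrame = {
  // apply the same "not null before the cast, null after the cast" check to every column
  val casted = schema.fields.foldLeft(df) { (acc, field) =>
    acc
      .withColumn(s"${field.name}_casted", col(field.name).cast(field.dataType))
      .withColumn(
        s"${field.name}_corrupted",
        when(col(field.name).isNotNull and col(s"${field.name}_casted").isNull, col(field.name)))
  }
  // collect the per-column corrupt values into a single column; concat_ws skips nulls,
  // so rows with no corrupt value end up with an empty string here
  casted.withColumn(
    "_corrupt_record",
    concat_ws(",", schema.fields.map(f => col(s"${f.name}_corrupted")): _*))
}

val validated = castWithValidation(df, targetSchema)

Rows where _corrupt_record is non-empty (row 4 in the example) are the ones that failed at least one cast.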
