 

Filter a column with two different schemas in Spark Scala

I have a dataframe with three columns, ID, CO_ID and DATA, where the DATA column has two different schemas, shown below:

|ID  |CO_ID |Data
|130 |NA    |[{"NUMBER":"AW9F","ADDRESS":"PLOT NO. 230, JAIPUR RJ","PHONE":999999999,"NAME":"SACHIN"}]
|536 |NA    |[{"NUMBER":"AW9F","ADDRESS":"PLOT NO. 230, JAIPUR RJ","PHONE":999999999,"NAME":"SACHIN"}]
|518 |NA    |null
|938 |611   |{"NUMBER":"AW9F","ADDRESS":"PLOT NO. 230, JAIPUR RJ","PHONE":999999999,"NAME":"SACHIN"}
|742 |NA    |{"NUMBER":"AW9F","ADDRESS":"PLOT NO. 230, JAIPUR RJ","PHONE":999999999,"NAME":"SACHIN"}

Now I want to create a dataframe with the columns ID, CO_ID, NUMBER, ADDRESS and NAME. Where there is no value, NUMBER, ADDRESS and NAME should be filled with null.

First of all, how can I filter the above dataframe by the two different schemas?
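
For reference, one simple way to tell the two shapes apart is a prefix check on the raw string, as in the minimal sketch below (assuming the dataframe is named df and array-shaped values always start with "["):

import org.apache.spark.sql.functions.col

val arrayShaped  = df.filter(col("Data").startsWith("["))                         // rows like [{"NUMBER": ...}]
val objectShaped = df.filter(!col("Data").startsWith("[") || col("Data").isNull)  // rows like {"NUMBER": ...} and nulls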

asked Feb 02 '26 by Mohammad Sunny


1 Answer

Here is one approach:

import spark.implicits._ // for toDF and the $"..." column syntax (already in scope in spark-shell)

val df = Seq(
  (130, "NA",  """[{"NUMBER":"AW9F","ADDRESS":"PLOT NO. 231, JAIPUR RJ","PHONE":999999999,"NAME":"SACHIN"}]"""),
  (536, "NA",  """[{"NUMBER":"AW9F","ADDRESS":"PLOT NO. 232, JAIPUR RJ","PHONE":999999999,"NAME":"SACHIN"}]"""),
  (518, "NA",  null),
  (938, "611", """{"NUMBER":"AW9F","ADDRESS":"PLOT NO. 233, JAIPUR RJ","PHONE":999999999,"NAME":"SACHIN"}"""),
  (742, "NA",  """{"NUMBER":"AW9F","ADDRESS":"PLOT NO. 234, JAIPUR RJ","PHONE":999999999,"NAME":"SACHIN"}"""))
  .toDF("ID", "CO_ID", "Data")


import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.functions.{from_json, when, length, lit}

// Schema of the JSON payload inside the Data column
val schema = (new StructType)
  .add("NUMBER", "string", true)
  .add("ADDRESS", "string", true)
  .add("PHONE", "string", true)
  .add("NAME", "string", true)

val df_ar = df
  .withColumn("json",
    when($"data".startsWith("[{") && $"data".endsWith("}]"),
         $"data".substr(lit(2), length($"data") - 2))  // array-shaped rows: strip the surrounding [ and ]
      .otherwise($"data"))                             // object-shaped rows (and nulls): keep as-is
  .withColumn("json", from_json($"json", schema))      // convert the JSON string into a struct using the schema above
  .withColumn("number", $"json.NUMBER")
  .withColumn("address", $"json.ADDRESS")
  .withColumn("name", $"json.NAME")

df_ar.select("ID", "CO_ID", "number", "address", "name").show(false)

This solution first strips the surrounding [] from the array-shaped JSON strings, then applies the given schema with from_json, converting each JSON string into a StructType column from which the individual fields are selected.

Output:

+---+-----+------+-----------------------+------+
|ID |CO_ID|number|address                |name  |
+---+-----+------+-----------------------+------+
|130|NA   |AW9F  |PLOT NO. 231, JAIPUR RJ|SACHIN|
|536|NA   |AW9F  |PLOT NO. 232, JAIPUR RJ|SACHIN|
|518|NA   |null  |null                   |null  |
|938|611  |AW9F  |PLOT NO. 233, JAIPUR RJ|SACHIN|
|742|NA   |AW9F  |PLOT NO. 234, JAIPUR RJ|SACHIN|
+---+-----+------+-----------------------+------+
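
For reference, the same result can also be reached without the string surgery, by parsing Data with both an array schema and a struct schema and keeping whichever parse succeeds. A minimal sketch, reusing df and schema from above and assuming Spark 2.4+ (for element_at) with the $-syntax in scope:

import org.apache.spark.sql.types.ArrayType
import org.apache.spark.sql.functions.{from_json, coalesce, element_at}

val parsed = df
  .withColumn("asArray",  element_at(from_json($"Data", ArrayType(schema)), 1)) // first struct of an array-shaped row
  .withColumn("asStruct", from_json($"Data", schema))                           // plain object-shaped row
  .withColumn("json", coalesce($"asArray", $"asStruct"))                        // whichever parse produced a value
  .select($"ID", $"CO_ID",
          $"json.NUMBER".as("number"), $"json.ADDRESS".as("address"), $"json.NAME".as("name"))

parsed.show(false)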
answered Feb 04 '26 by abiratsis


