Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

UnFlatten Dataframe to a specific structure

I have a flat dataframe (df) with the structure as below:

root
 |-- first_name: string (nullable = true)
 |-- middle_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- title: string (nullable = true)
 |-- start_date: string (nullable = true)
 |-- end_Date: string (nullable = true)
 |-- city: string (nullable = true)
 |-- zip_code: string (nullable = true)
 |-- state: string (nullable = true)
 |-- country: string (nullable = true)
 |-- email_name: string (nullable = true)
 |-- company: struct (nullable = true)
 |-- org_name: string (nullable = true)
 |-- company_phone: string (nullable = true)
 |-- partition_column: string (nullable = true)

And I need to convert this dataframe into a structure like (as my next data will be in this format):

root
 |-- firstName: string (nullable = true)
 |-- middleName: string (nullable = true)
 |-- lastName: string (nullable = true)
 |-- currentPosition: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- title: string (nullable = true)
 |    |    |-- startDate: string (nullable = true)
 |    |    |-- endDate: string (nullable = true)
 |    |    |-- address: struct (nullable = true)
 |    |    |    |-- city: string (nullable = true)
 |    |    |    |-- zipCode: string (nullable = true)
 |    |    |    |-- state: string (nullable = true)
 |    |    |    |-- country: string (nullable = true)
 |    |    |-- emailName: string (nullable = true)
 |    |    |-- company: struct (nullable = true)
 |    |    |    |-- orgName: string (nullable = true)
 |    |    |    |-- companyPhone: string (nullable = true)
 |-- partitionColumn: string (nullable = true)

So far I have implemented this:

case class IndividualCompany(orgName: String,
                             companyPhone: String)

case class IndividualAddress(city: String,
                   zipCode: String,
                   state: String,
                   country: String)

case class IndividualPosition(title: String,
                              startDate: String,
                              endDate: String,
                              address: IndividualAddress,
                              emailName: String,
                              company: IndividualCompany)

case class Individual(firstName: String,
                     middleName: String,
                     lastName: String,
                     currentPosition: Seq[IndividualPosition],
                     partitionColumn: String)


val makeCompany = udf((orgName: String, companyPhone: String) => IndividualCompany(orgName, companyPhone))
val makeAddress = udf((city: String, zipCode: String, state: String, country: String) => IndividualAddress(city, zipCode, state, country))

val makePosition = udf((title: String, startDate: String, endDate: String, address: IndividualAddress, emailName: String, company: IndividualCompany) 
                    => List(IndividualPosition(title, startDate, endDate, address, emailName, company)))


val selectData = df.select(
      col("first_name").as("firstName"),
      col("middle_name).as("middleName"),
      col("last_name").as("lastName"),
      makePosition(col("job_title"),
        col("start_date"),
        col("end_Date"),
        makeAddress(col("city"),
          col("zip_code"),
          col("state"),
          col("country")),
        col("email_name"),
        makeCompany(col("org_name"),
          col("company_phone"))).as("currentPosition"),
      col("partition_column").as("partitionColumn")
    ).as[Individual]

select_data.printSchema()
select_data.show(10)

I can see a proper schema generated for select_data, but it gives an error on the last line where I am trying to get some actual data. I am getting an error saying failed to execute user defined function.

 org.apache.spark.SparkException: Failed to execute user defined function(anonfun$4: (string, string, string, struct<city:string,zipCode:string,state:string,country:string>, string, struct<orgName:string,companyPhone:string>) => array<struct<title:string,startDate:string,endDate:string,address:struct<city:string,zipCode:string,state:string,country:string>,emailName:string,company:struct<orgName:string,companyPhone:string>>>)

Is there any better way to achieve this?

like image 910
Harshad_Pardeshi Avatar asked Oct 16 '25 19:10

Harshad_Pardeshi


1 Answers

The problem here is that an udf can't take IndividualAddress and IndividualCompany directly as input. These are represented as structs in Spark and to use them in an udf the correct input type is Row. That means you need to change the declaration of makePosition to:

val makePosition = udf((title: String, 
                        startDate: String, 
                        endDate: String, 
                        address: Row, 
                        emailName: String, 
                        company: Row)

Inside the udf you now need to use e.g. address.getAs[String]("city") to access the case class elements, and to use the class as a whole you need to create it again.

The easier and better alternative would be to do everything in a single udf as follows:

val makePosition = udf((title: String, 
    startDate: String, 
    endDate: String, 
    city: String, 
    zipCode: String, 
    state: String, 
    country: String,
    emailName: String, 
    orgName: String, 
    companyPhone: String) => 
        Seq(
          IndividualPosition(
            title, 
            startDate, 
            endDate, 
            IndividualAddress(city, zipCode, state, country),
            emailName, 
            IndividualCompany(orgName, companyPhone)
          )
        )
)
like image 138
Shaido Avatar answered Oct 18 '25 12:10

Shaido



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!