Schema:
|-- c0: string (nullable = true)
|-- c1: struct (nullable = true)
| |-- c2: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- orangeID: string (nullable = true)
| | | |-- orangeId: string (nullable = true)
I am trying to flatten the schema above in Spark.
Code:
var df = data.select($"c0", $"c1.*").select($"c0", explode($"c2")).select($"c0", $"col.orangeID", $"col.orangeId")
The flattening code works fine. The problem is in the last select, where the two column names differ only in the case of one letter (orangeID vs. orangeId). Hence I am getting this error:
Error:
org.apache.spark.sql.AnalysisException: Ambiguous reference to fields StructField(orangeID,StringType,true), StructField(orangeId,StringType,true);
Any suggestions on how to avoid this ambiguity would be great.
Turn on the Spark SQL case-sensitivity configuration and try again:
spark.sql("set spark.sql.caseSensitive=true")
With this setting enabled, the analyzer treats orangeID and orangeId as distinct field names, so the final select no longer raises the ambiguous-reference error.