Is there an easy way how use explode
on array column on SparkSQL DataFrame
? It's relatively simple in Scala, but this function seems to be unavailable (as mentioned in javadoc) in Java.
An option is to use SQLContext.sql(...)
and explode
function inside the query, but I'm looking for a bit better and especially cleaner way. DataFrame
s are loaded from parquet files.
Spark SQL is not a database but a module that is used for structured data processing. It majorly works on DataFrames which are the programming abstraction and usually act as a distributed SQL query engine.
You can visualize a Spark dataframe in Jupyter notebooks by using the display(<dataframe-name>) function. The display() function is supported only on PySpark kernels. The Qviz framework supports 1000 rows and 100 columns. By default, the dataframe is visualized as a table.
Spark Check if Column Exists in DataFrame Spark DataFrame has an attribute columns that returns all column names as an Array[String] , once you have the columns, you can use the array function contains() to check if the column present. Note that df. columns returns only top level columns but not nested struct columns.
Spark function explode(e: Column) is used to explode or create array or map columns to rows. When an array is passed to this function, it creates a new default column “col1” and it contains all array elements.
I solved it in this manner: say that you have an array column containing job descriptions named "positions", for each person with "fullName".
Then you get from initial schema :
root
|-- fullName: string (nullable = true)
|-- positions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- companyName: string (nullable = true)
| | |-- title: string (nullable = true)
...
to schema:
root
|-- personName: string (nullable = true)
|-- companyName: string (nullable = true)
|-- positionTitle: string (nullable = true)
by doing:
DataFrame personPositions = persons.select(persons.col("fullName").as("personName"),
org.apache.spark.sql.functions.explode(persons.col("positions")).as("pos"));
DataFrame test = personPositions.select(personPositions.col("personName"),
personPositions.col("pos").getField("companyName").as("companyName"), personPositions.col("pos").getField("title").as("positionTitle"));
It seems it is possible to use a combination of org.apache.spark.sql.functions.explode(Column col)
and DataFrame.withColumn(String colName, Column col)
to replace the column with the exploded version of it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With