SparkSQL and explode on DataFrame in Java

Is there an easy way to use explode on an array column of a SparkSQL DataFrame? It's relatively simple in Scala, but this function seems to be unavailable in Java (as noted in the Javadoc).

An option is to use SQLContext.sql(...) with the explode function inside the query, but I'm looking for a better and, especially, cleaner way. The DataFrames are loaded from Parquet files.

JiriS asked Aug 06 '15 15:08


2 Answers

I solved it this way: say you have an array column named "positions" containing job descriptions, one entry per person identified by "fullName".

Then you go from the initial schema:

root
 |-- fullName: string (nullable = true)
 |-- positions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- companyName: string (nullable = true)
 |    |    |-- title: string (nullable = true)
...

to the schema:

root
 |-- personName: string (nullable = true)
 |-- companyName: string (nullable = true)
 |-- positionTitle: string (nullable = true)

by doing:

    // Explode "positions" into one row per array element, aliased as "pos".
    DataFrame personPositions = persons.select(
            persons.col("fullName").as("personName"),
            org.apache.spark.sql.functions.explode(persons.col("positions")).as("pos"));

    // Pull the struct fields out of the exploded column.
    DataFrame test = personPositions.select(
            personPositions.col("personName"),
            personPositions.col("pos").getField("companyName").as("companyName"),
            personPositions.col("pos").getField("title").as("positionTitle"));
marilena.oita answered Oct 30 '22 04:10


It seems it is possible to combine org.apache.spark.sql.functions.explode(Column col) with DataFrame.withColumn(String colName, Column col) to replace the column with its exploded version.
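A minimal sketch of that approach, assuming the same persons DataFrame and "positions" array column as in the other answer (code fragment, not a complete class):

    // Replace the "positions" array column in place with one exploded row
    // per array element; the column becomes a struct after the call.
    DataFrame exploded = persons.withColumn("positions",
            org.apache.spark.sql.functions.explode(persons.col("positions")));

    // The struct fields can then be projected out as flat columns.
    DataFrame flat = exploded.select(
            exploded.col("fullName"),
            exploded.col("positions").getField("companyName").as("companyName"),
            exploded.col("positions").getField("title").as("positionTitle"));

Compared with the select-based version, this keeps the original column name and the rest of the schema untouched, at the cost of an extra step if you want the struct fields as top-level columns.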

JiriS answered Oct 30 '22 05:10