
How to deal with array<String> in spark dataframe?

I have a JSON dataset, formatted as:

val data = spark.read.json("user.json").select("user_id", "friends")
data.show()
+--------------------+--------------------+
|             user_id|             friends|
+--------------------+--------------------+
|18kPq7GPye-YQ3LyK...|[rpOyqD_893cqmDAt...|
|rpOyqD_893cqmDAtJ...|[18kPq7GPye-YQ3Ly...|
|4U9kSBLuBDU391x6b...|[18kPq7GPye-YQ3Ly...|
|fHtTaujcyKvXglE33...|[18kPq7GPye-YQ3Ly...|
+--------------------+--------------------+
data: org.apache.spark.sql.DataFrame = [user_id: string, friends: array<string>]

How can I transform it to [user_id: String, friend: String], e.g.:

+--------------------+--------------------+
|             user_id|              friend|
+--------------------+--------------------+
|18kPq7GPye-YQ3LyK...| rpOyqD_893cqmDAt...|
|18kPq7GPye-YQ3LyK...| 18kPq7GPye-YQ3Ly...|
|4U9kSBLuBDU391x6b...| 18kPq7GPye-YQ3Ly...|
|fHtTaujcyKvXglE33...| 18kPq7GPye-YQ3Ly...|
+--------------------+--------------------+

How can I get this dataframe?

Pi Pi asked Jul 04 '17 12:07

People also ask

What is ArrayType in spark?

Spark ArrayType is a collection data type that extends the DataType class, the superclass of all types in Spark. All elements of an ArrayType column must be of the same type.

How do I cast an array to a string in SQL?

In order to convert an array to a string, Spark SQL provides the built-in function concat_ws(), which takes a delimiter of your choice as its first argument and an array column (of type Column) as its second argument. To use concat_ws(), import it from org.apache.spark.sql.functions.
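Outside of a Spark session, the per-row effect of concat_ws can be sketched with plain Scala's mkString (an illustrative analogue with made-up IDs, not the Spark API itself):

```scala
// Plain-Scala sketch of what concat_ws(sep, arrayCol) does to each row:
// join the array's elements with the separator into a single string.
val friends = Seq("rpOyqD_893", "18kPq7GPye", "4U9kSBLuBD")

// what concat_ws(",", col("friends")) would yield for this row
val joined = friends.mkString(",")

println(joined) // rpOyqD_893,18kPq7GPye,4U9kSBLuBD
```

On a DataFrame the equivalent is data.withColumn("friends", concat_ws(",", col("friends"))).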

How do you split a string in PySpark DataFrame?

PySpark SQL provides the split() function to convert a delimiter-separated string into an array (StringType to ArrayType) column on a DataFrame. The string column can be split on a delimiter such as a space, comma, or pipe.
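split goes the other way. Its row-level behavior matches plain Scala's String.split, sketched here with made-up values rather than the real Spark call:

```scala
// Plain-Scala sketch of Spark's split(col, pattern):
// a delimiter-separated string becomes an array of strings.
val csv = "rpOyqD_893,18kPq7GPye,4U9kSBLuBD"

val parts = csv.split(",")

println(parts.mkString(" | ")) // rpOyqD_893 | 18kPq7GPye | 4U9kSBLuBD
```

Note that the pattern argument is a regular expression (both in Spark's split and in Java/Scala String.split), so delimiters like "|" must be escaped.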


1 Answer

You can use the concat_ws function to concatenate the array of strings into a single string:

data.withColumn("friends", concat_ws("", col("friends")))

concat_ws(java.lang.String sep, Column... exprs) Concatenates multiple input string columns together into a single string column, using the given separator.

Or you can use a simple UDF to convert the array to a string, as below:

import org.apache.spark.sql.functions._

val value = udf((arr: Seq[String]) => arr.mkString(" "))

val newDf = data.withColumn("friends", value($"friends"))

If you want one row per array element for each user (as in your expected output), use the explode function:

data.withColumn("friends", explode($"friends"))

explode(Column e) Creates a new row for each element in the given array or map column.
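To see why explode produces exactly the [user_id, friend] shape asked for, here is a plain-Scala analogue using flatMap over (user_id, friends) pairs (illustrative data, not the real IDs):

```scala
// Plain-Scala sketch of explode: one output row per array element,
// with the other columns repeated alongside each element.
val rows = Seq(
  ("18kPq7GPye", Seq("rpOyqD_893", "4U9kSBLuBD")),
  ("fHtTaujcyK", Seq("18kPq7GPye"))
)

val exploded = rows.flatMap { case (userId, friends) =>
  friends.map(friend => (userId, friend))
}
// exploded: Seq((18kPq7GPye,rpOyqD_893), (18kPq7GPye,4U9kSBLuBD), (fHtTaujcyK,18kPq7GPye))
```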

If you only need a single value then, as @ramesh suggested, you can take the first element of the array:

data.withColumn("friends", $"friends"(0))
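$"friends"(0) indexes into the array column; Spark yields null when the array is empty or shorter than the index. In plain Scala the safe analogue is lift, sketched here with made-up values:

```scala
// Plain-Scala sketch of $"friends"(0): index into the array.
// lift returns an Option instead of throwing on short arrays,
// mirroring Spark's null for out-of-range indexes.
val friends = Seq("rpOyqD_893", "18kPq7GPye")

val first = friends.lift(0)             // Some(rpOyqD_893)
val missing = Seq.empty[String].lift(0) // None
```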

Hope this helps!

koiralo answered Sep 26 '22 02:09