I have a JSON dataset, and it is formatted as:
val data = spark.read.json("user.json").select("user_id","friends")
data.show()
+--------------------+--------------------+
| user_id| friends|
+--------------------+--------------------+
|18kPq7GPye-YQ3LyK...|[rpOyqD_893cqmDAt...|
|rpOyqD_893cqmDAtJ...|[18kPq7GPye-YQ3Ly...|
|4U9kSBLuBDU391x6b...|[18kPq7GPye-YQ3Ly...|
|fHtTaujcyKvXglE33...|[18kPq7GPye-YQ3Ly...|
+--------------------+--------------------+
data: org.apache.spark.sql.DataFrame = [user_id: string, friends: array<string>]
How can I transform it to [user_id: String, friend: String], e.g.:
+--------------------+--------------------+
| user_id| friend|
+--------------------+--------------------+
|18kPq7GPye-YQ3LyK...| rpOyqD_893cqmDAt...|
|18kPq7GPye-YQ3LyK...| 18kPq7GPye-YQ3Ly...|
|4U9kSBLuBDU391x6b...| 18kPq7GPye-YQ3Ly...|
|fHtTaujcyKvXglE33...| 18kPq7GPye-YQ3Ly...|
+--------------------+--------------------+
How can I get this dataframe?
Spark's ArrayType is a collection data type that extends DataType, the superclass of all types in Spark. All elements of an ArrayType column must have the same type.
To convert an array to a string, Spark SQL provides the built-in function concat_ws(), which takes a delimiter of your choice as the first argument and the array column (type Column) as the second. To use concat_ws(), import it from org.apache.spark.sql.functions.
PySpark SQL provides the split() function to convert a delimiter-separated string into an array (StringType to ArrayType) column on a DataFrame. It splits the string column on a delimiter such as a space, comma, or pipe.
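The same split() function is also available in Scala through org.apache.spark.sql.functions. The following is a minimal sketch, assuming a local SparkSession and a made-up friends_csv column with short placeholder ids (neither is part of the original dataset):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split}

// Assumption: a local session for experimenting; in spark-shell this returns the existing one
val spark = SparkSession.builder().master("local[*]").appName("split-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data: friends stored as one comma-separated string
val csvDf = Seq(("u1", "u2,u3"), ("u2", "u1")).toDF("user_id", "friends_csv")

// split() treats its second argument as a regex; a plain comma needs no escaping
val arrDf = csvDf.withColumn("friends", split(col("friends_csv"), ","))

arrDf.printSchema() // the new friends column is an array of strings
```

Because the second argument is a regular expression, a delimiter such as a pipe would need escaping ("\\|").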
You can use the concat_ws function to join the array of strings into a single string:
import org.apache.spark.sql.functions.{col, concat_ws}
data.withColumn("friends", concat_ws("", col("friends")))
concat_ws(java.lang.String sep, Column... exprs)
Concatenates multiple input string columns together into a single string column, using the given separator.
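Note that with an empty separator, as in the call above, the ids run together into one unbroken string. The following is a minimal sketch with a visible separator, assuming a local SparkSession and short made-up ids standing in for the real user.json data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws}

// Assumption: a local session for experimenting; in spark-shell this returns the existing one
val spark = SparkSession.builder().master("local[*]").appName("concat-ws-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for user.json
val data = Seq(("u1", Seq("u2", "u3")), ("u2", Seq("u1"))).toDF("user_id", "friends")

// Join each array into one comma-separated string; the column type becomes string
val joined = data.withColumn("friends", concat_ws(",", col("friends")))

joined.show()
```

This still yields one row per user, so it does not produce the [user_id, friend] shape asked for in the question; it only flattens the array into text.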
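Or you can use a simple UDF to convert the array to a string, as below:
import org.apache.spark.sql.functions._
// Join the array elements with a space; the result is a plain string
val value = udf((arr: Seq[String]) => arr.mkString(" "))
// $"..." needs spark.implicits._ (already in scope in spark-shell)
val newDf = data.withColumn("hobbies", value($"friends"))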
If you are trying to get each value of the array as its own row for every user, you can use the explode function:
data.withColumn("friends", explode($"friends"))
explode(Column e)
Creates a new row for each element in the given array or map column.
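To get exactly the [user_id, friend] shape from the question, the exploded column can be renamed in the same step. The following is a minimal sketch, assuming a local SparkSession and short made-up ids standing in for the real user.json data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

// Assumption: a local session for experimenting; in spark-shell this returns the existing one
val spark = SparkSession.builder().master("local[*]").appName("explode-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for user.json
val data = Seq(("u1", Seq("u2", "u3")), ("u2", Seq("u1"))).toDF("user_id", "friends")

// One output row per array element, with the new column named "friend"
val pairs = data.select(col("user_id"), explode(col("friends")).as("friend"))

pairs.printSchema()
pairs.show()
```

Keep in mind that explode drops rows whose array is null or empty. If users without friends should be kept (with a null friend), explode_outer may be the better fit.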
If you only need a single value, then, as @ramesh suggested, you can get the first element:
data.withColumn("friends", $"friends"(0))
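For array columns, $"friends"(0) should behave the same as calling getItem(0) on the column. The following is a minimal sketch of that spelled-out form, assuming a local SparkSession and short made-up ids standing in for the real user.json data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for experimenting; in spark-shell this returns the existing one
val spark = SparkSession.builder().master("local[*]").appName("first-element-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for user.json; every array here is non-empty
val data = Seq(("u1", Seq("u2", "u3")), ("u2", Seq("u1"))).toDF("user_id", "friends")

// Keep only the first array element, as a string column named "friend"
val firstFriend = data.select(col("user_id"), col("friends").getItem(0).as("friend"))

firstFriend.show()
```

How an out-of-range index on an empty array is handled can depend on the Spark version and ANSI settings, so it is worth checking that case against your own data.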
Hope this helps!