 

How do I convert a WrappedArray column in a Spark DataFrame to Strings?

I am trying to convert a column that contains Array[String] to String, but I consistently get this error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 78.0 failed 4 times, most recent failure: Lost task 0.3 in stage 78.0 (TID 1691, ip-******): java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String; 

Here's the piece of code:

val mkString = udf((arrayCol: Array[String]) => arrayCol.mkString(","))
val dfWithString = df.select($"arrayCol").withColumn("arrayString", mkString($"arrayCol"))
asked Dec 30 '15 by bdguy


People also ask

How do you convert a column into a string in PySpark?

In order to convert an array to a string, PySpark SQL provides a built-in function concat_ws(), which takes the delimiter of your choice as the first argument and the array column (type Column) as the second argument. To use concat_ws(), you need to import it from pyspark.sql.functions.

How do I change the DataType of a column in spark data frame?

To change a Spark SQL DataFrame column from one data type to another, use the cast() function of the Column class; it can be used with withColumn(), select(), selectExpr(), and SQL expressions.
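A minimal Scala sketch of cast() in each of those places, assuming a DataFrame df with a numeric column "age" (both names are hypothetical) and spark.implicits._ in scope for the $ syntax:

import org.apache.spark.sql.types.StringType

// cast via withColumn, select, and selectExpr; all three are equivalent
val viaWithColumn = df.withColumn("age", $"age".cast(StringType))
val viaSelect     = df.select($"age".cast("string").as("age"))
val viaSelectExpr = df.selectExpr("CAST(age AS STRING) AS age")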

How do I extract a column in spark?

To convert a Spark DataFrame column to a List, first select() the column you want, then use the map() transformation to convert each Row to a String, and finally collect() the data to the driver, which returns an Array[String].
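A minimal sketch of those three steps, assuming df has a string column "name" (the column name is hypothetical); it goes through the RDD API so no implicit Encoder is needed:

// select the column, map each Row to its String value, then collect to the driver
val names: Array[String] = df.select("name")
  .rdd
  .map(row => row.getString(0))
  .collect()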


2 Answers

WrappedArray is not an Array (which is a plain old Java array, not a native Scala collection). You can either change the signature to:

import scala.collection.mutable.WrappedArray

(arrayCol: WrappedArray[String]) => arrayCol.mkString(",")

or use one of its supertypes, such as Seq:

(arrayCol: Seq[String]) => arrayCol.mkString(",")
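Putting the Seq variant back into the original snippet, a corrected version might look like this:

import org.apache.spark.sql.functions.udf

// Spark passes array columns to Scala UDFs as WrappedArray, which is a Seq
val mkString = udf((arrayCol: Seq[String]) => arrayCol.mkString(","))
val dfWithString = df.select($"arrayCol").withColumn("arrayString", mkString($"arrayCol"))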

In recent Spark versions you can use concat_ws instead:

import org.apache.spark.sql.functions.concat_ws

df.select(concat_ws(",", $"arrayCol"))
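Since concat_ws is a built-in SQL function, it avoids the serialization overhead of a Scala UDF and can be optimized by Catalyst, which makes it preferable when all you need is a delimiter join.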
answered Sep 22 '22 by zero323


This code works for me:

import scala.collection.mutable.WrappedArray

// "wifi_ids" is an array<array<string>> column; take the first element of each inner array
df.select("wifi_ids").rdd.map(row =>
  row.get(0).asInstanceOf[WrappedArray[WrappedArray[String]]].toSeq.map(x => x.toSeq.apply(0)))

In your case, I guess it is:

val mkString = udf((arrayCol: Seq[String]) => arrayCol.asInstanceOf[WrappedArray[String]].toArray.mkString(","))
val dfWithString = df.select($"arrayCol").withColumn("arrayString", mkString($"arrayCol"))
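Note that once the parameter is typed as Seq[String], the asInstanceOf cast is redundant; the accepted answer's Seq signature does the same job without it.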
answered Sep 19 '22 by Burt