 

Spark: Convert column of string to an array


How do you convert a column that has been read as a string into a column of arrays? That is, convert from the schema below:

scala> test.printSchema
root
 |-- a: long (nullable = true)
 |-- b: string (nullable = true)

+---+---+
|  a|  b|
+---+---+
|  1|2,3|
|  2|4,5|
+---+---+

To:

scala> test1.printSchema
root
 |-- a: long (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: long (containsNull = true)

+---+-----+
|  a|    b|
+---+-----+
|  1|[2,3]|
|  2|[4,5]|
+---+-----+

Please share both Scala and Python implementations if possible. On a related note, how do I take care of it while reading from the file itself? I have data with ~450 columns, a few of which I want to specify in this format. Currently I am reading it in PySpark as below:

# Read a pipe-delimited CSV file with a header row, letting Spark infer the types
df = spark.read.format('com.databricks.spark.csv').options(
    header='true', inferschema='true', delimiter='|').load(input_file)

Thanks.

Nikhil Utane asked Jun 22 '17 04:06

People also ask

How do you convert a column into a string in PySpark?

In order to convert an array to a string, PySpark SQL provides the built-in function concat_ws(), which takes a delimiter of your choice as the first argument and an array column (type Column) as the second argument. To use concat_ws(), import it from pyspark.sql.functions.
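
For example (a minimal sketch, assuming a hypothetical DataFrame df with an array column b):

from pyspark.sql.functions import concat_ws

# Join the elements of the array column "b" into one comma-separated string
df = df.withColumn("b", concat_ws(",", "b"))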

How do I change the dataType of a column in Spark?

To change a Spark SQL DataFrame column from one data type to another, use the cast() function of the Column class; it can be used with withColumn(), select(), selectExpr(), or a SQL expression.
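
For example (a sketch, assuming a hypothetical DataFrame df with a string column age):

from pyspark.sql.functions import col

# Cast the string column "age" to an integer column
df = df.withColumn("age", col("age").cast("int"))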

How do you convert a comma-separated string to a list in PySpark?

In PySpark SQL, the split() function converts a delimiter-separated string to an array. It splits the string on a delimiter such as a space or comma and stacks the pieces into an array.
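
For example (a minimal sketch, assuming a DataFrame df with a comma-separated string column b):

from pyspark.sql.functions import split

# Split the comma-separated string column "b" into an array of strings
df = df.withColumn("b", split(df["b"], ","))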

How to convert an array to a string in Spark?

In order to convert an array to a string, Spark SQL provides the built-in function concat_ws(), which takes a delimiter of your choice as the first argument and an array column (type Column) as the second argument:

concat_ws(sep: scala.Predef.String, exprs: org.apache.spark.sql.Column*): org.apache.spark.sql.Column

How to convert a list to a data frame in Spark?

First, convert the list to a DataFrame in Spark: the JSON is read into a DataFrame through sqlContext. At that stage, the column attr_2 is string type instead of an array of structs, because its value is a JSON array serialized as a string.
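
For example (a minimal sketch with a hypothetical list of JSON strings):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# attr_2 holds a JSON array serialized as a string
rows = ['{"attr_1": 1, "attr_2": "[{\\"a\\": 1}, {\\"a\\": 2}]"}']
df = spark.read.json(spark.sparkContext.parallelize(rows))
df.printSchema()  # attr_2 comes back as string, not array<struct>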

How do I convert a string to an array in PySpark?

PySpark Convert String to Array Column. PySpark SQL provides the split() function to convert a delimiter-separated string to an array (StringType to ArrayType) column on a DataFrame. This is done by splitting a string column on a delimiter such as a space, comma, or pipe, and converting it into ArrayType.
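
A quick way to confirm the type change (a sketch on the same hypothetical df as above):

from pyspark.sql.functions import split

# StringType -> ArrayType: split "b" on the comma, then inspect the schema
df = df.withColumn("b", split(df["b"], ","))
df.printSchema()  # b is now array (element: string)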

What are ArrayType columns in a Spark DataFrame?

Spark DataFrame columns support arrays, which are great for data sets that have an arbitrary length. This blog post will demonstrate Spark methods that return ArrayType columns, describe how to create your own ArrayType columns, and explain when to use arrays in your analyses.


1 Answer

There are various methods.

The best way is to use the split function and cast to array<long>:

import org.apache.spark.sql.functions.{col, split}

// Split on the comma, then cast the resulting array<string> to array<long>
data.withColumn("b", split(col("b"), ",").cast("array<long>"))
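
A PySpark equivalent (a sketch under the same assumptions as the Scala line):

from pyspark.sql.functions import col, split

# Split on the comma, then cast the resulting array<string> to array<long>
data = data.withColumn("b", split(col("b"), ",").cast("array<long>"))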

You can also create a simple UDF to convert the values:

import org.apache.spark.sql.functions.udf

// Split on the comma and convert each piece to Long
val tolong = udf((value: String) => value.split(",").map(_.toLong))

data.withColumn("newB", tolong(data("b"))).show
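
A PySpark sketch of the same UDF approach (the names to_long and newB are illustrative):

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, LongType

# Split on the comma and convert each piece to a long
to_long = udf(lambda s: [int(x) for x in s.split(",")], ArrayType(LongType()))

data.withColumn("newB", to_long(data["b"])).show()

Note that the built-in split plus cast is generally preferable to a UDF for performance. As for handling this at read time: the CSV source does not support array column types, so splitting after the read, as above, is the usual approach for the ~450-column file mentioned in the question.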

Hope this helps!

koiralo answered Sep 20 '22 15:09