How to convert a column that has been read as a string into a column of arrays? i.e. convert from the schema below:
scala> test.printSchema
root
 |-- a: long (nullable = true)
 |-- b: string (nullable = true)

+---+---+
|  a|  b|
+---+---+
|  1|2,3|
|  2|4,5|
+---+---+
To:
scala> test1.printSchema
root
 |-- a: long (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: long (containsNull = true)

+---+-----+
|  a|    b|
+---+-----+
|  1|[2,3]|
|  2|[4,5]|
+---+-----+
Please share both Scala and Python implementations if possible. On a related note, how do I take care of this while reading from the file itself? I have data with ~450 columns, and a few of them I want to specify in this format. Currently I am reading in pyspark as below:
df = spark.read.format('com.databricks.spark.csv').options(
    header='true', inferschema='true', delimiter='|').load(input_file)
Thanks.
To convert an array to a string, PySpark SQL provides the built-in function concat_ws(), which takes a delimiter of your choice as the first argument and the array column (type Column) as the second argument. To use concat_ws(), you need to import it from pyspark.sql.functions.
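As an illustration of that direction (array back to string), here is a minimal sketch, assuming a SparkSession named spark and a small hypothetical DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat_ws

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame whose column b is an array of strings
df = spark.createDataFrame([(1, ["2", "3"]), (2, ["4", "5"])], ["a", "b"])

# Join the array elements with a comma; b becomes the plain string "2,3" / "4,5"
df.withColumn("b", concat_ws(",", "b")).show()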
To change a Spark SQL DataFrame column from one data type to another, use the cast() function of the Column class; it can be used with withColumn(), select(), selectExpr(), and SQL expressions.
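For example, a minimal sketch of cast() with withColumn() and selectExpr(), again assuming a SparkSession named spark:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame where column a arrives as a string
df = spark.createDataFrame([("1", "2,3"), ("2", "4,5")], ["a", "b"])

# cast() on the Column class, applied via withColumn()
df.withColumn("a", col("a").cast("long")).printSchema()

# The same cast expressed as a SQL expression via selectExpr()
df.selectExpr("CAST(a AS LONG) AS a", "b").printSchema()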
In PySpark SQL, the split() function converts a delimiter-separated string into an array. It does this by splitting the string on a delimiter such as a space, comma, or pipe and stacking the pieces into an array.
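A minimal sketch of split() on the question's data, assuming a SparkSession named spark (note that split() alone yields array<string>; casting to array<long> is covered further down):

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "2,3"), (2, "4,5")], ["a", "b"])

# Split the comma-separated string; b becomes an array<string> column
df.withColumn("b", split(col("b"), ",")).printSchema()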
The Scala API offers the same built-in concat_ws() function for turning an array back into a string, taking the delimiter of your choice as the first argument and the array column (type Column) as the second argument: concat_ws(sep: scala.Predef.String, exprs: org.apache.spark.sql.Column*): org.apache.spark.sql.Column
A related situation arises with JSON data: after JSON is read into a data frame through sqlContext, a column such as attr_2 can come in as a string holding a JSON array, when the schema you actually want for it is an array of struct.
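A minimal sketch of handling that case with from_json(), where attr_2 and its fields name and value are hypothetical stand-ins for whatever the real JSON elements contain:

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json
from pyspark.sql.types import ArrayType, StructType, StructField, StringType, LongType

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: attr_2 holds a JSON array as a plain string
df = spark.createDataFrame(
    [(1, '[{"name": "x", "value": 2}, {"name": "y", "value": 3}]')],
    ["attr_1", "attr_2"])

# Assumed element schema; replace with the actual fields of your JSON
element = StructType([
    StructField("name", StringType()),
    StructField("value", LongType())])

# Parse the JSON array string into an array of structs
df.withColumn("attr_2", from_json("attr_2", ArrayType(element))).printSchema()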
PySpark Convert String to Array Column. PySpark SQL provides the split() function to convert a delimiter-separated string into an array (StringType to ArrayType) column on a DataFrame. This is done by splitting a string column on a delimiter like a space, comma, or pipe, etc., and converting it into ArrayType.
Spark DataFrame columns support arrays, which are great for data sets that have an arbitrary length. This blog post will demonstrate Spark methods that return ArrayType columns, describe how to create your own ArrayType columns, and explain when to use arrays in your analyses.
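As a small illustration of why the array form is useful, here is a sketch (assuming a SparkSession named spark) that builds an array column with split() and then applies array-oriented functions to it:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col, explode, array_contains

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "2,3"), (2, "4,5")], ["a", "b"])
arrays = df.withColumn("b", split(col("b"), ","))

# Array columns unlock array-specific operations:
arrays.select("a", explode("b").alias("element")).show()            # one row per element
arrays.select("a", array_contains("b", "4").alias("has_4")).show()  # membership test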
There are various methods; the best way is to use the split function and cast to array<long>:
import org.apache.spark.sql.functions.{col, split, udf}

// split on "," and cast the resulting array<string> to array<long>
data.withColumn("b", split(col("b"), ",").cast("array<long>"))
You can also create a simple UDF to convert the values:
val tolong = udf((value: String) => value.split(",").map(_.toLong))
data.withColumn("newB", tolong(data("b"))).show
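Since the question also asks for Python, here is a rough PySpark equivalent of the same two approaches (a sketch; data and b mirror the names used above):

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col, udf
from pyspark.sql.types import ArrayType, LongType

spark = SparkSession.builder.getOrCreate()
data = spark.createDataFrame([(1, "2,3"), (2, "4,5")], ["a", "b"])

# 1) split + cast, mirroring the Scala one-liner above
data.withColumn("b", split(col("b"), ",").cast("array<long>")).printSchema()

# 2) a simple UDF, mirroring the Scala tolong UDF
to_long = udf(lambda value: [int(x) for x in value.split(",")], ArrayType(LongType()))
data.withColumn("newB", to_long(data["b"])).show()

As for handling this at read time: the CSV source does not support array columns in the schema directly, so even with ~450 columns the usual pattern is to read those few columns as strings and apply the split/cast right after load().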
Hope this helps!