I have another question that is related to the split function. I am new to Spark/Scala.
Below is the sample DataFrame -
+-------------------+---------+
|             VALUES|Delimiter|
+-------------------+---------+
|       50000.0#0#0#|        #|
|          0@1000.0@|        @|
|                 1$|        $|
|1000.00^Test_string|        ^|
+-------------------+---------+
and I want the output to be -
+-------------------+---------+----------------------+
|VALUES             |Delimiter|split_values          |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0, ]     |
|0@1000.0@          |@        |[0, 1000.0, ]         |
|1$                 |$        |[1, ]                 |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+
I tried to split this manually -
dept.select(split(col("VALUES"), "#|@|\\$|\\^")).show()
and the output is -
+-----------------------+
|split(VALUES,#|@|\$|\^)|
+-----------------------+
|      [50000.0, 0, 0, ]|
|          [0, 1000.0, ]|
|                  [1, ]|
|   [1000.00, Test_st...|
+-----------------------+
But I want to pick up the delimiter from the Delimiter column automatically, since this has to work on a large dataset.
You need to use expr with split() to make the split dynamic:
from pyspark.sql import functions as F

df = spark.createDataFrame([("50000.0#0#0#", "#"), ("0@1000.0@", "@")], ["VALUES", "Delimiter"])
# expr() evaluates a SQL expression, so the delimiter can be read from another column
df = df.withColumn("split", F.expr("split(VALUES, Delimiter)"))
df.show()
+------------+---------+-----------------+
|      VALUES|Delimiter|            split|
+------------+---------+-----------------+
|50000.0#0#0#|        #|[50000.0, 0, 0, ]|
|   0@1000.0@|        @|    [0, 1000.0, ]|
+------------+---------+-----------------+
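Since the question mentions Spark/Scala, here is a minimal, untested sketch of the same expr-based approach in Scala (assuming a DataFrame named df with the VALUES and Delimiter columns from the question). Note that split() treats the delimiter as a regular expression, so delimiters like $ and ^ would still need escaping, as the answers below discuss.

// Sketch: same dynamic split via expr() in Scala (df assumed to match the question's data)
import org.apache.spark.sql.functions.expr

val result = df.withColumn("split_values", expr("split(VALUES, Delimiter)"))
result.show(false)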
EDIT: Please check the bottom of the answer for the Scala version.
You can use a custom user-defined function (pyspark.sql.functions.udf) to achieve this.
from typing import List

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, ArrayType


def split_col(value: str, delimiter: str) -> List[str]:
    # Plain Python str.split, so the delimiter is treated literally (no regex)
    return str(value).split(str(delimiter))


udf_split = udf(lambda x, y: split_col(x, y), ArrayType(StringType()))

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    ('50000.0#0#0#', '#'), ('0@1000.0@', '@'), ('1$', '$'), ('1000.00^Test_string', '^')
], schema='VALUES String, Delimiter String')

df = df.withColumn("split_values", udf_split(df['VALUES'], df['Delimiter']))
df.show(truncate=False)
Output
+-------------------+---------+----------------------+
|VALUES             |Delimiter|split_values          |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0, ]     |
|0@1000.0@          |@        |[0, 1000.0, ]         |
|1$                 |$        |[1, ]                 |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+
Note that the split_values column contains a list of strings. You can also update the split_col function to apply further transformations to the values.
EDIT: Scala version
import org.apache.spark.sql.functions.udf
import spark.implicits._

val data = Seq(("50000.0#0#0#", "#"), ("0@1000.0@", "@"), ("1$", "$"), ("1000.00^Test_string", "^"))
var df = data.toDF("VALUES", "Delimiter")

// Note: String.split treats its argument as a regex; see Edit 2 below for $ and ^ delimiters
val udf_split_col = udf { (x: String, y: String) => x.split(y) }
df = df.withColumn("split_values", udf_split_col(df.col("VALUES"), df.col("Delimiter")))
df.show(false)
Edit 2
To avoid issues with special regex characters, you can pass a Char instead of a String to the split() method, as follows.
val udf_split_col = udf { (x: String, y: String) => x.split(y.charAt(0)) }
This is another way of handling this, using Spark SQL:
df.createOrReplaceTempView("test")
spark.sql("""select VALUES,delimiter,split(values,case when delimiter in ("$","^") then concat("\\",delimiter) else delimiter end) as split_value from test""").show(false)
Note that I included the case when statement to add escape characters for the '$' and '^' delimiters; otherwise the split doesn't happen.
+-------------------+---------+----------------------+
|VALUES             |delimiter|split_value           |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0, ]     |
|0@1000.0@          |@        |[0, 1000.0, ]         |
|1$                 |$        |[1, ]                 |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+
This is my latest solution:
import java.util.regex.Pattern
val split_udf = udf((value: String, delimiter: String) => value.split(Pattern.quote(delimiter), -1))
val solution = dept.withColumn("split_values", split_udf(col("VALUES"),col("Delimiter")))
solution.show(truncate = false)
It handles special characters in the Delimiter column. The other answers do not work for
("50000.0\\0\\0\\", "\\")
and linusRian's answer requires the special characters to be added manually.
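For example, a quick hypothetical check of that backslash row using the Pattern.quote UDF above:

// Hypothetical check of the backslash case with the Pattern.quote based UDF from above
import java.util.regex.Pattern
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._

val split_udf = udf((value: String, delimiter: String) => value.split(Pattern.quote(delimiter), -1))
val tricky = Seq(("50000.0\\0\\0\\", "\\")).toDF("VALUES", "Delimiter")
tricky.withColumn("split_values", split_udf(col("VALUES"), col("Delimiter"))).show(false)
// expected split_values: [50000.0, 0, 0, ]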