
How do I split a column by using delimiters from another column in Spark/Scala

I have another question that is related to the split function. I am new to Spark/Scala.

Below is the sample DataFrame -


+-------------------+---------+
|             VALUES|Delimiter|
+-------------------+---------+
|       50000.0#0#0#|        #|
|          0@1000.0@|        @|
|                 1$|        $|
|1000.00^Test_string|        ^|
+-------------------+---------+

and I want the output to be -

+-------------------+---------+----------------------+
|VALUES             |Delimiter|split_values          |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0, ]     |
|0@1000.0@          |@        |[0, 1000.0, ]         |
|1$                 |$        |[1, ]                 |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+

I tried to split this manually -

dept.select(split(col("VALUES"),"#|@|\\$|\\^")).show()

and the output is -

+-----------------------+
|split(VALUES,#|@|\$|\^)|
+-----------------------+
|      [50000.0, 0, 0, ]|
|          [0, 1000.0, ]|
|                  [1, ]|
|   [1000.00, Test_st...|
+-----------------------+


But for a large dataset, I want the delimiter to be picked up automatically from the Delimiter column.

asked Jul 14 '21 by Glarixon




4 Answers

You need to use expr with split() to make the split dynamic:

from pyspark.sql import functions as F

df = spark.createDataFrame([("50000.0#0#0#", "#"), ("0@1000.0@", "@")], ["VALUES", "Delimiter"])
df = df.withColumn("split", F.expr("""split(VALUES, Delimiter)"""))
df.show()

+------------+---------+-----------------+
|      VALUES|Delimiter|            split|
+------------+---------+-----------------+
|50000.0#0#0#|        #|[50000.0, 0, 0, ]|
|   0@1000.0@|        @|    [0, 1000.0, ]|
+------------+---------+-----------------+
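
The same expression works from Scala, which the question asks about. A minimal sketch, assuming a SparkSession named spark is already in scope (variable names are illustrative):

import org.apache.spark.sql.functions.expr
import spark.implicits._

val scalaDf = Seq(("50000.0#0#0#", "#"), ("0@1000.0@", "@")).toDF("VALUES", "Delimiter")
// split() treats the second argument as a regex, so plain delimiters like # and @ are fine,
// but metacharacters such as $ or ^ still need escaping (see the other answers below).
scalaDf.withColumn("split", expr("split(VALUES, Delimiter)")).show(false)
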
answered Nov 11 '22 by dsk


EDIT: Please check the bottom of the answer for the Scala version.

You can use a custom user-defined function (pyspark.sql.functions.udf) to achieve this.

from typing import List

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, ArrayType


def split_col(value: str, delimiter: str) -> List[str]:
    return str(value).split(str(delimiter))


udf_split = udf(lambda x, y: split_col(x, y), ArrayType(StringType()))

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    ('50000.0#0#0#', '#'), ('0@1000.0@', '@'), ('1$', '$'), ('1000.00^Test_string', '^')
], schema='VALUES String, Delimiter String')

df = df.withColumn("split_values", udf_split(df['VALUES'], df['Delimiter']))

df.show(truncate=False)

Output

+-------------------+---------+----------------------+
|VALUES             |Delimiter|split_values          |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0, ]     |
|0@1000.0@          |@        |[0, 1000.0, ]         |
|1$                 |$        |[1, ]                 |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+

Note that the split_values column contains a list of strings. You can also update the split_col function to apply further transformations to the values.

EDIT: Scala version

import org.apache.spark.sql.functions.udf

import spark.implicits._

val data = Seq(("50000.0#0#0#", "#"), ("0@1000.0@", "@"), ("1$", "$"), ("1000.00^Test_string", "^"))
var df = data.toDF("VALUES", "Delimiter")

val udf_split_col = udf { (x: String, y: String) => x.split(y) }

df = df.withColumn("split_values", udf_split_col(df.col("VALUES"), df.col("Delimiter")))

df.show(false)

Edit 2

To avoid issues with special regex characters, you can pass a Char instead of a String to the split() method, so the delimiter is treated literally, as follows.

val udf_split_col = udf { (x: String, y: String) => x.split(y.charAt(0)) }
answered Nov 11 '22 by Pubudu Sitinamaluwa


This is another way of handling this, using Spark SQL:

df.createOrReplaceTempView("test")

spark.sql("""select VALUES,delimiter,split(values,case when delimiter in ("$","^") then concat("\\",delimiter) else delimiter end) as split_value from test""").show(false)

Note that I included the case when statement to add escape characters for '$' and '^'; otherwise the split doesn't work.

+-------------------+---------+----------------------+
|VALUES             |delimiter|split_value           |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0, ]     |
|0@1000.0@          |@        |[0, 1000.0, ]         |
|1$                 |$        |[1, ]                 |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+
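
The same case when escaping can also be expressed through the DataFrame API rather than a temp view. A hedged sketch in Scala, assuming the question's dept DataFrame:

import org.apache.spark.sql.functions.expr

dept.withColumn(
  "split_value",
  // escape '$' and '^' so split() does not treat them as regex anchors
  expr("""split(VALUES, case when Delimiter in ('$', '^') then concat('\\', Delimiter) else Delimiter end)""")
).show(false)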

answered Nov 11 '22 by linusRian


This is my latest solution:

import java.util.regex.Pattern
import org.apache.spark.sql.functions.{col, udf}

val split_udf = udf((value: String, delimiter: String) => value.split(Pattern.quote(delimiter), -1))
val solution = dept.withColumn("split_values", split_udf(col("VALUES"), col("Delimiter")))
solution.show(truncate = false)

It handles special characters in the Delimiter column by treating them literally. The other answers don't work for

("50000.0\\0\\0\\", "\\")

and linusRian's answer needs the special characters to be added manually.
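
To illustrate that, here is a minimal Scala sketch of the Pattern.quote approach applied to the backslash row, assuming a SparkSession named spark is in scope (variable names are illustrative):

import java.util.regex.Pattern
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._

val split_udf = udf((value: String, delimiter: String) => value.split(Pattern.quote(delimiter), -1))

// Pattern.quote makes the delimiter literal, so a regex metacharacter like "\" does not break the split,
// and the -1 limit keeps trailing empty strings.
val tricky = Seq(("50000.0\\0\\0\\", "\\")).toDF("VALUES", "Delimiter")
tricky.withColumn("split_values", split_udf(col("VALUES"), col("Delimiter"))).show(false)
// 50000.0\0\0\ with delimiter \ -> [50000.0, 0, 0, ]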

answered Nov 11 '22 by Apollo