I have another question that is related to the split function. I am new to Spark/Scala.
Below is the sample DataFrame -
+-------------------+---------+
|             VALUES|Delimiter|
+-------------------+---------+
|       50000.0#0#0#|        #|
|          0@1000.0@|        @|
|                 1$|        $|
|1000.00^Test_string|        ^|
+-------------------+---------+
and I want the output to be -
+-------------------+---------+----------------------+
|VALUES             |Delimiter|split_values          |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0, ]     |
|0@1000.0@          |@        |[0, 1000.0, ]         |
|1$                 |$        |[1, ]                 |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+
I tried to split this manually -
dept.select(split(col("VALUES"), "#|@|\\$|\\^")).show()
and the output is -
+-----------------------+
|split(VALUES,#|@|\$|\^)|
+-----------------------+
|      [50000.0, 0, 0, ]|
|          [0, 1000.0, ]|
|                  [1, ]|
|   [1000.00, Test_st...|
+-----------------------+
But I want to pick up the delimiter from the Delimiter column automatically, since this has to work on a large dataset.
You need to use expr with split() to make the split dynamic:
from pyspark.sql import functions as F

df = spark.createDataFrame([("50000.0#0#0#", "#"), ("0@1000.0@", "@")], ["VALUES", "Delimiter"])
# expr() evaluates a SQL expression, so the delimiter can be read from another column
df = df.withColumn("split", F.expr("split(VALUES, Delimiter)"))
df.show()
+------------+---------+-----------------+
|      VALUES|Delimiter|            split|
+------------+---------+-----------------+
|50000.0#0#0#|        #|[50000.0, 0, 0, ]|
|   0@1000.0@|        @|    [0, 1000.0, ]|
+------------+---------+-----------------+
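Since the question mentions Spark/Scala, here is a minimal, untested sketch of the same expr-based approach in Scala (assuming a DataFrame named df with the VALUES and Delimiter columns from the question). Note that split() treats the delimiter as a regular expression, so delimiters like $ and ^ would still need escaping, as the answers below discuss.

// Sketch: same dynamic split via expr() in Scala (df assumed to match the question's data)
import org.apache.spark.sql.functions.expr

val result = df.withColumn("split_values", expr("split(VALUES, Delimiter)"))
result.show(false)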
EDIT: Please check the bottom of the answer for the Scala version.
You can use a custom user-defined function (pyspark.sql.functions.udf) to achieve this.
from typing import List

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, ArrayType


def split_col(value: str, delimiter: str) -> List[str]:
    # Plain Python str.split, so the delimiter is treated literally (no regex)
    return str(value).split(str(delimiter))


udf_split = udf(lambda x, y: split_col(x, y), ArrayType(StringType()))

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    ('50000.0#0#0#', '#'), ('0@1000.0@', '@'), ('1$', '$'), ('1000.00^Test_string', '^')
], schema='VALUES String, Delimiter String')

df = df.withColumn("split_values", udf_split(df['VALUES'], df['Delimiter']))
df.show(truncate=False)
Output
+-------------------+---------+----------------------+
|VALUES             |Delimiter|split_values          |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0, ]     |
|0@1000.0@          |@        |[0, 1000.0, ]         |
|1$                 |$        |[1, ]                 |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+
Note that the split_values column contains a list of strings. You can also update the split_col function to apply further transformations to the values.
EDIT: Scala version
import org.apache.spark.sql.functions.udf
import spark.implicits._

val data = Seq(("50000.0#0#0#", "#"), ("0@1000.0@", "@"), ("1$", "$"), ("1000.00^Test_string", "^"))
var df = data.toDF("VALUES", "Delimiter")

// Note: String.split treats its argument as a regex; see Edit 2 below for $ and ^ delimiters
val udf_split_col = udf { (x: String, y: String) => x.split(y) }
df = df.withColumn("split_values", udf_split_col(df.col("VALUES"), df.col("Delimiter")))
df.show(false)
Edit 2
To avoid issues with special regex characters, you can pass a Char instead of a String to the split() method, as follows.
val udf_split_col = udf { (x: String, y: String) => x.split(y.charAt(0)) }
This is another way of handling this, using Spark SQL:
df.createOrReplaceTempView("test")
spark.sql("""select VALUES,delimiter,split(values,case when delimiter in ("$","^") then concat("\\",delimiter) else delimiter end) as split_value from test""").show(false)
Note that I included the case when statement to add escape characters for the '$' and '^' delimiters; otherwise the split doesn't happen.
+-------------------+---------+----------------------+
|VALUES             |delimiter|split_value           |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0, ]     |
|0@1000.0@          |@        |[0, 1000.0, ]         |
|1$                 |$        |[1, ]                 |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+
This is my latest solution:
import java.util.regex.Pattern
val split_udf = udf((value: String, delimiter: String) => value.split(Pattern.quote(delimiter), -1))
val solution = dept.withColumn("split_values", split_udf(col("VALUES"),col("Delimiter")))
solution.show(truncate = false)
It handles special characters in the Delimiter column. The other answers do not work for
("50000.0\\0\\0\\", "\\")
and linusRian's answer requires the special characters to be added manually.
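For example, a quick hypothetical check of that backslash row using the Pattern.quote UDF above:

// Hypothetical check of the backslash case with the Pattern.quote based UDF from above
import java.util.regex.Pattern
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._

val split_udf = udf((value: String, delimiter: String) => value.split(Pattern.quote(delimiter), -1))
val tricky = Seq(("50000.0\\0\\0\\", "\\")).toDF("VALUES", "Delimiter")
tricky.withColumn("split_values", split_udf(col("VALUES"), col("Delimiter"))).show(false)
// expected split_values: [50000.0, 0, 0, ]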