 

Pyspark filter using startswith from list

I have a list of elements that may start some of the strings recorded in an RDD. If I have an element list of yes and no, they should match yes23 and no3 but not 35yes or 41no. Using PySpark, how can I filter with startswith using any element in a list or tuple?

An example DF would be:

+-----+------+
|index| label|
+-----+------+
|    1|yes342|
|    2| 45yes|
|    3| no123|
|    4|  75no|
+-----+------+

When I try:

Element_List = ['yes','no']
filter_DF = DF.where(DF.label.startswith(tuple(Element_List)))

The resulting df should look something like:

+-----+------+
|index| label|
+-----+------+
|    1|yes342|
|    3| no123|
+-----+------+

Instead I get the error:

Py4JError: An error occurred while calling o250.startsWith. Trace:
py4j.Py4JException: Method startsWith([class java.util.ArrayList]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)

is raised, so it looks like startswith can't be used with any kind of list or tuple. Is there a simple workaround?
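
For reference, plain Python's str.startswith does accept a tuple of prefixes, which is the behaviour I was expecting:

"yes342".startswith(('yes', 'no'))  # True
"45yes".startswith(('yes', 'no'))   # False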

asked Jan 31 '18 by Jenks

2 Answers

Compose the expression like this:

from pyspark.sql.functions import col, lit
from functools import reduce

element_list = ['yes', 'no']

df = spark.createDataFrame(
    ["yes23", "no3", "35yes", """41no["maybe"]"""],
    "string"
).toDF("location")

# OR together one startswith() test per prefix; lit(False) is the identity
starts_with = reduce(
    lambda x, y: x | y,
    [col("location").startswith(s) for s in element_list],
    lit(False))

df.where(starts_with).show()
# +--------+
# |location|
# +--------+
# |   yes23|
# |     no3|
# +--------+
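
Applied to the DataFrame from the question (a sketch, assuming DF and Element_List are defined as in the question), the same pattern reads:

from functools import reduce
from operator import or_
from pyspark.sql.functions import col, lit

# OR together one startswith() test per prefix; lit(False) is the identity element
starts_with = reduce(or_, [col("label").startswith(s) for s in Element_List], lit(False))

filter_DF = DF.where(starts_with)  # keeps rows 1 (yes342) and 3 (no123)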
answered Nov 01 '22 by Alper t. Turker

I feel the best way to achieve this is with a native PySpark function like rlike(). startswith() is meant for matching a single static string; it can't accept dynamic content such as a list. If you want to take the keywords from a list dynamically, the best bet is to build a regular expression from the list, as below.

# list of allowed prefixes
li = ['yes', 'no']
# build the regex from the list:
# here, strings starting with yes/no, i.e. "^(yes|no)"
reg_str = r"^(" + "|".join(li) + ")"

Custom UDFs or RDD-level functions might very well work too, but custom UDFs tend to hinder performance (a brief UDF sketch is included at the end of this answer for comparison).

Below is the complete working example using rlike().

# test DataFrame
df = spark.createDataFrame(
    ["yes23", "no3", "35yes"],
    "string"
).toDF("label")

# list of allowed prefixes
li = ['yes', 'no']
# build the regex from the list:
# here, strings starting with yes/no, i.e. "^(yes|no)"
reg_str = r"^(" + "|".join(li) + ")"

# filter the DataFrame with the regex
df.filter(df.label.rlike(reg_str)).show()

# +-----+
# |label|
# +-----+
# |yes23|
# |  no3|
# +-----+
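
For comparison, here is a minimal UDF-based sketch (illustrative only, reusing the df defined above; expect it to be slower than the native rlike() expression):

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

li = ['yes', 'no']

# Python's built-in str.startswith accepts a tuple of prefixes,
# so the whole check can live inside a single UDF
starts_with_any = udf(lambda s: s is not None and s.startswith(tuple(li)), BooleanType())

df.filter(starts_with_any(df.label)).show()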
answered Nov 01 '22 by Kris