 

Pyspark filter using startswith from list

I have a list of elements that may start some of the strings recorded in an RDD. If I have an element list of yes and no, they should match yes23 and no3 but not 35yes or 41no. Using PySpark, how can I filter with startswith using any element in a list or tuple?

An example DF would be:

+-----+------+
|index| label|
+-----+------+
|    1|yes342|
|    2| 45yes|
|    3| no123|
|    4|  75no|
+-----+------+

When I try:

Element_List = ['yes','no']
filter_DF = DF.where(DF.label.startswith(tuple(Element_List)))

The resulting df should look something like:

+-----+------+
|index| label|
+-----+------+
|    1|yes342|
|    3| no123|
+-----+------+

Instead I get the error:

Py4JError: An error occurred while calling o250.startsWith. Trace:
py4j.Py4JException: Method startsWith([class java.util.ArrayList]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)

is raised, so it looks like startswith can't be used with any kind of list or tuple. Is there a simple workaround?
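
For reference, plain Python's str.startswith does accept a tuple of prefixes, which is the behaviour I was expecting:

"yes342".startswith(('yes', 'no'))  # True
"45yes".startswith(('yes', 'no'))   # False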

asked Jan 31 '18 by Jenks

2 Answers

Compose the expression like this:

from pyspark.sql.functions import col, lit
from functools import reduce

element_list = ['yes', 'no']

df = spark.createDataFrame(
    ["yes23", "no3", "35yes", """41no["maybe"]"""],
    "string"
).toDF("location")

# OR together one startswith() test per prefix; lit(False) is the identity
starts_with = reduce(
    lambda x, y: x | y,
    [col("location").startswith(s) for s in element_list],
    lit(False))

df.where(starts_with).show()
# +--------+
# |location|
# +--------+
# |   yes23|
# |     no3|
# +--------+
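
Applied to the DataFrame from the question (a sketch, assuming DF and Element_List are defined as in the question), the same pattern reads:

from functools import reduce
from operator import or_
from pyspark.sql.functions import col, lit

# OR together one startswith() test per prefix; lit(False) is the identity element
starts_with = reduce(or_, [col("label").startswith(s) for s in Element_List], lit(False))

filter_DF = DF.where(starts_with)  # keeps rows 1 (yes342) and 3 (no123)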
answered Nov 01 '22 by Alper t. Turker

I feel the best way to achieve this is with a native PySpark function like rlike(). startswith() is meant for matching a single static string; it can't accept dynamic content such as a list. If you want to take the keywords from a list dynamically, the best bet is to build a regular expression from the list, as below.

# list of allowed prefixes
li = ['yes', 'no']
# build the regex from the list:
# here, strings starting with yes/no, i.e. "^(yes|no)"
reg_str = r"^(" + "|".join(li) + ")"

Custom UDFs or RDD-level functions might very well work too, but custom UDFs tend to hinder performance (a brief UDF sketch is included at the end of this answer for comparison).

Below is the complete working example using rlike().

# test DataFrame
df = spark.createDataFrame(
    ["yes23", "no3", "35yes"],
    "string"
).toDF("label")

# list of allowed prefixes
li = ['yes', 'no']
# build the regex from the list:
# here, strings starting with yes/no, i.e. "^(yes|no)"
reg_str = r"^(" + "|".join(li) + ")"

# filter the DataFrame with the regex
df.filter(df.label.rlike(reg_str)).show()

# +-----+
# |label|
# +-----+
# |yes23|
# |  no3|
# +-----+
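
For comparison, here is a minimal UDF-based sketch (illustrative only, reusing the df defined above; expect it to be slower than the native rlike() expression):

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

li = ['yes', 'no']

# Python's built-in str.startswith accepts a tuple of prefixes,
# so the whole check can live inside a single UDF
starts_with_any = udf(lambda s: s is not None and s.startswith(tuple(li)), BooleanType())

df.filter(starts_with_any(df.label)).show()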
answered Nov 01 '22 by Kris