I have a list of elements that may start some of the string records in an RDD. If I have an element list of yes and no, they should match yes23 and no3, but not 35yes or 41no. Using PySpark, how can I use startswith with any element in a list or tuple?
An example DF would be:
+-----+------+
|index| label|
+-----+------+
|    1|yes342|
|    2| 45yes|
|    3| no123|
|    4|  75no|
+-----+------+
When I try:
Element_List = ['yes','no']
filter_DF = DF.where(DF.label.startswith(tuple(Element_List)))
The resulting df should look something like:
+-----+------+
|index| label|
+-----+------+
|    1|yes342|
|    3| no123|
+-----+------+
Instead I get the error:
Py4JError: An error occurred while calling o250.startsWith. Trace:
py4j.Py4JException: Method startsWith([class java.util.ArrayList]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
is prompted. The trace shows the tuple arriving on the JVM side as a java.util.ArrayList, for which no startsWith overload exists, so it looks like startsWith can't be used with any kind of list. Is there a simple workaround?
Compose an expression like this:
from pyspark.sql.functions import col, lit
from functools import reduce

element_list = ['yes', 'no']

df = spark.createDataFrame(
    ["yes23", "no3", "35yes", """41no["maybe"]"""],
    "string"
).toDF("location")

# OR together one startswith condition per list element,
# seeding the fold with a False literal
starts_with = reduce(
    lambda x, y: x | y,
    [col("location").startswith(s) for s in element_list],
    lit(False))

df.where(starts_with).show()
# +--------+
# |location|
# +--------+
# |   yes23|
# |     no3|
# +--------+
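A small variation (my addition, not part of the original answer): the same fold can be written with operator.or_. Note that the lit(False) seed above also makes the expression safe when element_list is empty; without an initializer, reduce raises on an empty list.

from functools import reduce
from operator import or_
from pyspark.sql.functions import col

element_list = ['yes', 'no']

# same expression as above, folded with operator.or_ instead of a lambda;
# assumes element_list is non-empty since no initializer is given
starts_with = reduce(
    or_,
    [col("location").startswith(s) for s in element_list])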
I feel the best way to achieve this is with a native PySpark function like rlike(). startswith() is meant for filtering against static strings and can't accept a dynamic list. If you want to take the keywords from a list dynamically, the best bet is to build a regular expression from the list, as below.
# List
li = ['yes', 'no']
# frame RegEx from the List
# in this case strings starting with yes/no i.e. "^(yes|no)"
reg_str = r"^(" + "|".join(li) + ")"
Custom UDFs or RDD functions might very well work, but custom UDFs can hinder performance; a sketch of that alternative follows for comparison.
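A minimal sketch of the UDF alternative mentioned above (my illustration, not part of the original answer). Each row is shipped to a Python worker for the prefix test, which is the performance cost being warned about:

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

li = ['yes', 'no']

df = spark.createDataFrame(
    ["yes23", "no3", "35yes"],
    "string"
).toDF("label")

# Python-side prefix test; runs per row in a Python worker,
# so it is typically slower than the native rlike() filter
starts_with_any = udf(lambda s: s is not None and s.startswith(tuple(li)),
                      BooleanType())

df.filter(starts_with_any(df.label)).show()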
Below is the complete working example of the rlike() approach.
# Test DataFrame
df = spark.createDataFrame(
    ["yes23", "no3", "35yes"],
    "string"
).toDF("label")
# List
li = ['yes', 'no']
# frame RegEx from the List
# in this case strings starting with yes/no i.e. "^(yes|no)"
reg_str = r"^(" + "|".join(li) + ")"
#Filter dataframe with RegEx
df.filter(df.label.rlike(reg_str)).show()
# +-----+
# |label|
# +-----+
# |yes23|
# |  no3|
# +-----+
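One caveat worth adding (my note, not part of the original answer): the pattern joins the raw keywords, so a list element containing regex metacharacters would change the pattern's meaning. Escaping each element keeps every keyword a literal prefix match:

import re

li = ['yes', 'no']

# escape each keyword so any regex metacharacters are matched literally
reg_str = r"^(" + "|".join(re.escape(s) for s in li) + ")"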