I am trying to filter my PySpark data frame in the following way: I have one column which contains long_text and one column which contains a number. If the long text contains the number, I want to keep the row.
I am trying to use the SQL LIKE statement, but it seems I can't apply it to another column (here number).
My code is the following:
from pyspark.sql.functions import concat, lit

PN_in_NC = (df
    .filter(df.long_text.like(concat(lit("%"), df.number, lit("%")))))
I get the following error: Method like([class org.apache.spark.sql.Column]) does not exist.
I tried multiple things to fix it (such as creating the '%number%' string as a column before the filter, not using lit, and using '%' + number + '%'), but nothing worked. If LIKE can't be applied to another column, is there another way to do this?
You can use the contains function.
from pyspark.sql.functions import col

df1 = spark.createDataFrame(
    [("hahaha the 3 is good", 3), ("i dont know about 3", 2),
     ("what is 5 doing?", 5), ("ajajaj 123", 2), ("7 dwarfs", 1)],
    ["long_text", "number"])
# Keep rows whose long_text contains the number (cast to a string)
df1.filter(col("long_text").contains(col("number"))).show()
A row is kept when the string in long_text contains the value of number (cast to a string for the comparison). Note that contains does a plain substring match, which is why the row ("ajajaj 123", 2) is kept: "2" occurs inside "123".
Output:
+--------------------+------+
| long_text|number|
+--------------------+------+
|hahaha the 3 is good| 3|
| what is 5 doing?| 5|
| ajajaj 123| 2|
+--------------------+------+
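If you specifically want LIKE semantics (for instance, to use the % and _ wildcards), a possible workaround is to write the condition as a SQL expression with expr(), where the LIKE pattern can be built from another column. A minimal sketch, assuming the column names from the question and an explicit cast of number to a string:

from pyspark.sql.functions import expr

# Build the LIKE pattern from the number column inside a SQL expression;
# CAST makes the int-to-string conversion explicit.
PN_in_NC = df.filter(expr("long_text LIKE concat('%', CAST(number AS STRING), '%')"))

For plain substring matching, contains is the simpler choice; expr is mainly useful if you need the wildcard behavior of LIKE.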