Pyspark: Filter data frame if column contains string from another column (SQL LIKE statement)

I am trying to filter my PySpark data frame in the following way: I have one column which contains long_text and one column which contains a number. If the long text contains the number, I want to keep the row. I am trying to use the SQL LIKE statement, but it seems I can't apply it to another column (here number). My code is the following:

from pyspark.sql.functions import concat, lit

PN_in_NC = df.filter(df.long_text.like(concat(lit("%"), df.number, lit("%"))))

I get the following error: Method like([class org.apache.spark.sql.Column]) does not exist.

I tried multiple things to fix it (such as creating the '%number%' string as a column before the filter, not using lit, and using '%' + number + '%'), but nothing worked. If LIKE can't be applied to another column, is there another way to do this?

asked Feb 25 '19 by LN_P


People also ask

How to filter Dataframe rows in pyspark?

Filter based on starts with, ends with, contains: you can also filter DataFrame rows by using the startswith(), endswith(), and contains() methods of the Column class. For more examples on the Column class, refer to PySpark Column Functions.
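As a minimal sketch of those three predicates (the sample rows and the name column here are invented for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data, purely for illustration.
df = spark.createDataFrame([("alpha one",), ("one beta",), ("gamma",)], ["name"])

df.filter(col("name").startswith("alpha")).show()  # name begins with "alpha"
df.filter(col("name").endswith("beta")).show()     # name ends with "beta"
df.filter(col("name").contains("one")).show()      # name contains "one"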

How to filter Dataframe column contains in a string in Python?

The contains() method checks whether a DataFrame column string contains the string given as an argument (it matches on part of the string), returning true if the substring exists and false if not. The example below returns all rows from the DataFrame whose name column contains the string mes.
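A short sketch of that example, with invented sample rows:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("James",), ("Anna",), ("Ramesh",)], ["name"])

# True where name contains "mes" -> keeps James and Ramesh.
df.filter(df.name.contains("mes")).show()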

What is the parameter used by the like function in pyspark?

The parameter to the like() function is the pattern on which we want to filter the data. LIKE is a simple expression used to find or match characters in PySpark SQL or DataFrame code. It recognizes two special wildcard characters that can be used to match elements: % (any sequence of characters) and _ (exactly one character).
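A minimal sketch, with a made-up status column, showing both wildcards:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("order 42 shipped",), ("pending",)], ["status"])

# % matches any sequence of characters, _ matches exactly one character.
df.filter(col("status").like("%42%")).show()     # rows containing "42"
df.filter(col("status").like("pend___")).show()  # "pend" plus exactly three characters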

How to filter columns with multiple conditions in a Dataframe?

filter() is a function which filters rows based on a SQL expression or condition. To filter on multiple conditions, use the col function from pyspark.sql.functions, which refers to a column of the DataFrame by name. Syntax: col(column_name). A sketch with multiple conditions follows below.
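A minimal sketch with hypothetical name and age columns; each comparison needs its own parentheses because & binds more tightly than the comparison operators:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice", 34), ("bob", 19), ("carol", 45)], ["name", "age"])

# & is logical AND, | is logical OR; wrap each comparison in parentheses.
df.filter((col("age") > 20) & (col("name") != "carol")).show()  # keeps alice only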


1 Answer

You can use the contains function.

from pyspark.sql.functions import col

df1 = spark.createDataFrame(
    [("hahaha the 3 is good", 3), ("i dont know about 3", 2),
     ("what is 5 doing?", 5), ("ajajaj 123", 2), ("7 dwarfs", 1)],
    ["long_text", "number"])

# Keep rows where long_text contains the string form of number.
df1.filter(col("long_text").contains(col("number"))).show()

This keeps only the rows where the long_text column contains the value of the number column.

Output:

+--------------------+------+
|           long_text|number|
+--------------------+------+
|hahaha the 3 is good|     3|
|    what is 5 doing?|     5|
|          ajajaj 123|     2|
+--------------------+------+
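If you need full LIKE semantics with wildcards rather than a plain substring match, a sketch that should also work on recent Spark versions (assuming the SQL parser accepts a column-based LIKE pattern) is to push the condition into a SQL expression with expr:

from pyspark.sql.functions import expr

# Build the '%<number>%' pattern in SQL; LIKE here takes a non-literal pattern.
df1.filter(expr("long_text LIKE concat('%', number, '%')")).show()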
answered Oct 05 '22 by gaw