I am trying to filter my PySpark data frame in the following way: I have one column which contains long_text and one column which contains a number. If the long text contains the number, I want to keep the row.
I am trying to use the SQL LIKE statement, but it seems I can't apply it to another column (here number).
My code is the following:
from pyspark.sql.functions import concat, lit

PN_in_NC = (df
    .filter(df.long_text.like(concat(lit("%"), df.number, lit("%")))))
I get the following error: Method like([class org.apache.spark.sql.Column]) does not exist.
I tried multiple things to fix it (such as creating the '%number%' string as a column before the filter, not using lit, and using '%' + number + '%'), but nothing worked. If LIKE can't be applied to another column, is there another way to do this?
You can use the contains function.
from pyspark.sql.functions import col

df1 = spark.createDataFrame(
    [("hahaha the 3 is good", 3), ("i dont know about 3", 2),
     ("what is 5 doing?", 5), ("ajajaj 123", 2), ("7 dwarfs", 1)],
    ["long_text", "number"])
# Keep rows whose long_text contains the number (cast to a string)
df1.filter(col("long_text").contains(col("number"))).show()
A row is kept when the string in long_text contains the value of number (cast to a string for the comparison). Note that contains does a plain substring match, which is why the row ("ajajaj 123", 2) is kept: "2" occurs inside "123".
Output:
+--------------------+------+
| long_text|number|
+--------------------+------+
|hahaha the 3 is good| 3|
| what is 5 doing?| 5|
| ajajaj 123| 2|
+--------------------+------+
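If you specifically want LIKE semantics (for instance, to use the % and _ wildcards), a possible workaround is to write the condition as a SQL expression with expr(), where the LIKE pattern can be built from another column. A minimal sketch, assuming the column names from the question and an explicit cast of number to a string:

from pyspark.sql.functions import expr

# Build the LIKE pattern from the number column inside a SQL expression;
# CAST makes the int-to-string conversion explicit.
PN_in_NC = df.filter(expr("long_text LIKE concat('%', CAST(number AS STRING), '%')"))

For plain substring matching, contains is the simpler choice; expr is mainly useful if you need the wildcard behavior of LIKE.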