 

Filter df when values matches part of a string in pyspark

I have a large pyspark.sql.dataframe.DataFrame and I want to keep (i.e., filter for) all rows where the URL stored in the location column contains a pre-determined string, e.g. 'google.com'.

I have tried:

import pyspark.sql.functions as sf
df.filter(sf.col('location').contains('google.com')).show(5)

but this throws a

TypeError: 'Column' object is not callable

How do I filter my df properly? Many thanks in advance!

asked Jan 27 '17 by gaatjeniksaan

People also ask

How do I filter string values in PySpark?

In Spark and PySpark, the contains() function matches rows where a column value contains a literal string (i.e., it matches on part of the string); it is mostly used to filter rows of a DataFrame.

How do you select part of a string in PySpark?

In PySpark, the substring() function extracts a substring from a DataFrame string column, given the starting position and the length of the slice you want.

How do I find a word in PySpark?

You can use the contains() function in Spark and PySpark to match DataFrame column values that contain a literal string.

How do you select distinct in PySpark?

In PySpark, there are two ways to get the count of distinct values. You can chain the distinct() and count() methods of DataFrame, or you can use the SQL countDistinct() function, which returns the distinct count over the selected columns.


2 Answers

Spark 2.2 onwards

df.filter(df.location.contains('google.com')) 

Spark 2.2 documentation link


Spark 2.1 and before

You can use a plain SQL expression inside filter:

df.filter("location like '%google.com%'") 

or with DataFrame column methods

df.filter(df.location.like('%google.com%')) 

Spark 2.1 documentation link

answered Sep 27 '22 by mrsrinivas


pyspark.sql.Column.contains() is only available in pyspark version 2.2 and above.

df.where(df.location.contains('google.com')) 
answered Sep 27 '22 by joaofbsm