I have a large pyspark.sql.dataframe.DataFrame and I want to keep (so filter) all rows where the URL saved in the location column contains a pre-determined string, e.g. 'google.com'.
I have tried:
import pyspark.sql.functions as sf
df.filter(sf.col('location').contains('google.com')).show(5)
but this throws a
TypeError: 'Column' object is not callable
How can I filter my df properly? Many thanks in advance!
In Spark and PySpark, the contains() function checks whether a column value contains a literal string (it matches on part of the string); it is mostly used to filter rows of a DataFrame.
In PySpark, the substring() function extracts a substring from a DataFrame string column, given the start position and the length of the substring you want to extract.
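For example, a minimal sketch of substring() (assuming a DataFrame df with a string column location; the column name location_prefix is just illustrative):
from pyspark.sql import functions as F
# Take the first 8 characters of location (positions in substring() are 1-based).
df.withColumn('location_prefix', F.substring('location', 1, 8)).show(5)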
df.filter(df.location.contains('google.com'))
Spark 2.2 documentation link
You can use plain SQL in filter:
df.filter("location like '%google.com%'")
or with DataFrame column methods:
df.filter(df.location.like('%google.com%'))
Spark 2.1 documentation link
pyspark.sql.Column.contains() is only available in PySpark version 2.2 and above.
df.where(df.location.contains('google.com'))
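Putting it together, here is a minimal end-to-end sketch (the SparkSession setup and the sample URLs are illustrative assumptions, not part of the original answers):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('contains-filter-example').getOrCreate()

# Hypothetical sample data; replace with your own DataFrame.
df = spark.createDataFrame(
    [('https://www.google.com/search?q=spark',), ('https://example.org/page',)],
    ['location'],
)

# Keep only the rows whose location contains the literal string 'google.com'.
df.filter(df.location.contains('google.com')).show(truncate=False)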