 

Filter df when values matches part of a string in pyspark

I have a large pyspark.sql.dataframe.DataFrame and I want to keep (i.e., filter for) all rows where the URL stored in the location column contains a pre-determined string, e.g. 'google.com'.

I have tried:

import pyspark.sql.functions as sf
df.filter(sf.col('location').contains('google.com')).show(5)

but this throws a

TypeError: 'Column' object is not callable

How do I filter my df properly? Many thanks in advance!

asked Jan 27 '17 by gaatjeniksaan

People also ask

How do I filter string values in PySpark?

In Spark and PySpark, the contains() function matches rows where a column value contains a literal string (i.e., it matches on part of the string); it is mostly used to filter rows of a DataFrame.

How do you select part of a string in PySpark?

In PySpark, the substring() function extracts a substring from a DataFrame string column, given the starting position and the length of the slice you want.

How do I find a word in PySpark?

You can use the contains() function in Spark and PySpark to match DataFrame column values that contain a literal string.

How do you select distinct in PySpark?

In PySpark, there are two ways to get the count of distinct values. You can chain the distinct() and count() methods of DataFrame, or you can use the SQL countDistinct() function, which returns the distinct count over the selected columns.


2 Answers

Spark 2.2 onwards

df.filter(df.location.contains('google.com')) 

Spark 2.2 documentation link


Spark 2.1 and before

You can use a plain SQL expression inside filter:

df.filter("location like '%google.com%'") 

or with DataFrame column methods

df.filter(df.location.like('%google.com%')) 

Spark 2.1 documentation link

answered Sep 27 '22 by mrsrinivas


pyspark.sql.Column.contains() is only available in pyspark version 2.2 and above.

df.where(df.location.contains('google.com')) 
answered Sep 27 '22 by joaofbsm