I am trying to get all rows within a DataFrame where a column's value is not within a list (i.e. filtering by exclusion).
As an example:
df = sqlContext.createDataFrame([('1','a'),('2','b'),('3','b'),('4','c'),('5','d')] ,schema=('id','bar'))
I get the data frame:
+---+---+
| id|bar|
+---+---+
|  1|  a|
|  2|  b|
|  3|  b|
|  4|  c|
|  5|  d|
+---+---+
I want to exclude only the rows where bar is 'a' or 'b'.
Using an SQL expression string it would be:
df.filter('bar not in ("a","b")').show()
Is there a way of doing it without using the string for the SQL expression, or excluding one item at a time?
Edit:
I am likely to have a list, ['a','b'], of the excluded values that I would like to use.
PySpark's filter() function is used to filter rows from an RDD/DataFrame based on a given condition or SQL expression. You can also use where() instead of filter() if you are coming from a SQL background; the two functions operate identically.
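For example, here is a minimal sketch showing that the two calls are interchangeable. It re-creates the question's DataFrame using the modern SparkSession entry point (the builder/getOrCreate setup is an assumption; the question itself uses the older sqlContext):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('1', 'a'), ('2', 'b'), ('3', 'b'), ('4', 'c'), ('5', 'd')],
    schema=('id', 'bar'))

# filter() and where() are aliases: both accept a Column expression
# or a SQL expression string and return the same filtered DataFrame.
df.filter(df.bar == 'b').show()
df.where(df.bar == 'b').show()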
It looks like the ~ gives the functionality that I need, but I have yet to find any appropriate documentation for it.
from pyspark.sql.functions import col

df.filter(~col('bar').isin(['a','b'])).show()

+---+---+
| id|bar|
+---+---+
|  4|  c|
|  5|  d|
+---+---+
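The ~ here is PySpark's Column negation operator (Column.__invert__), i.e. a logical NOT applied to the boolean Column that isin() returns. Since the excluded values will come from a list anyway, here is a minimal sketch of the same pattern with the list factored out, plus an equivalent spelling without ~ (assuming the df and col import from above):

excluded = ['a', 'b']

# ~ negates the boolean Column produced by isin(), keeping only
# rows whose bar value is NOT in the excluded list.
df.filter(~col('bar').isin(excluded)).show()

# Equivalent spelling without ~: compare the boolean Column to False.
df.filter(col('bar').isin(excluded) == False).show()

Both calls return only the rows with bar equal to 'c' or 'd'.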