 

Pyspark dataframe operator "IS NOT IN"

Tags:

pyspark

I would like to rewrite this from R to Pyspark, any nice looking suggestions?

array <- c(1, 2, 3)
dataset <- filter(dataset, !(column %in% array))
Babu asked Oct 27 '16

People also ask

Is there a NOT IN function in PySpark?

In Spark, the isin() function checks whether a DataFrame column's value exists in a list/array of values. To express IS NOT IN, negate the result of isin() with the NOT operator (~).

How do you use IS NOT NULL in PySpark?

Solution: To find non-null values in a PySpark DataFrame column, use the isNotNull() function, for example df.filter(df.name.isNotNull()). Similarly, to keep non-NaN values, negate isnan(): df.filter(~isnan(df.name)).

How do you use isNull() in PySpark?

In PySpark, the filter() or where() functions of DataFrame can filter rows with NULL values by checking isNull() on the PySpark Column class. For example, df.filter(df.state.isNull()) returns all rows that have a null value in the state column, as a new DataFrame.


2 Answers

In pyspark you can do it like this:

array = [1, 2, 3]
dataframe.filter(dataframe.column.isin(array) == False)

Or using the binary NOT operator:

dataframe.filter(~dataframe.column.isin(array)) 
Ryan Widmaier answered Sep 28 '22


Use the ~ operator, which negates the condition:

df_filtered = df.filter(~df["column_name"].isin([1, 2, 3])) 
LaSul answered Sep 28 '22