 

Pyspark: Filter dataframe based on multiple conditions

I want to filter a dataframe according to the following conditions: firstly (d < 5), and secondly (the value of col2 must not equal its counterpart in col4 if the value in col1 equals its counterpart in col3).

If the original dataframe df is as follows:

+----+----+----+----+---+
|col1|col2|col3|col4|  d|
+----+----+----+----+---+
|   A|  xx|   D|  vv|  4|
|   C| xxx|   D|  vv| 10|
|   A|   x|   A|  xx|  3|
|   E| xxx|   B|  vv|  3|
|   E| xxx|   F| vvv|  6|
|   F|xxxx|   F| vvv|  4|
|   G| xxx|   G| xxx|  4|
|   G| xxx|   G|  xx|  4|
|   G| xxx|   G| xxx| 12|
|   B|xxxx|   B|  xx| 13|
+----+----+----+----+---+

The desired dataframe is:

+----+----+----+----+---+
|col1|col2|col3|col4|  d|
+----+----+----+----+---+
|   A|  xx|   D|  vv|  4|
|   A|   x|   A|  xx|  3|
|   E| xxx|   B|  vv|  3|
|   F|xxxx|   F| vvv|  4|
|   G| xxx|   G|  xx|  4|
+----+----+----+----+---+

Here is the code I tried, which did not work as expected:

cols = [('A','xx','D','vv',4), ('C','xxx','D','vv',10), ('A','x','A','xx',3),
        ('E','xxx','B','vv',3), ('E','xxx','F','vvv',6), ('F','xxxx','F','vvv',4),
        ('G','xxx','G','xxx',4), ('G','xxx','G','xx',4), ('G','xxx','G','xxx',12),
        ('B','xxxx','B','xx',13)]
df = spark.createDataFrame(cols, ['col1','col2','col3','col4','d'])

df.filter((df.d < 5) & (df.col2 != df.col4) & (df.col1 == df.col3)).show()

+----+----+----+----+---+
|col1|col2|col3|col4|  d|
+----+----+----+----+---+
|   A|   x|   A|  xx|  3|
|   F|xxxx|   F| vvv|  4|
|   G| xxx|   G|  xx|  4|
+----+----+----+----+---+

What should I do to achieve the desired result?

Asked by Sidhom on Mar 15 '18
People also ask

How do I filter multiple conditions in PySpark?

In PySpark, to filter() rows of a DataFrame based on multiple conditions, you can use either a Column with a condition or a SQL expression. Below is a simple example using an AND (&) condition; you can extend this with OR (|) and NOT (~) conditional expressions as needed.
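As a minimal sketch of that idea, reusing the df created in the question above (the specific conditions are purely illustrative):

import pyspark.sql.functions as f

# AND: both conditions must hold
df.filter((f.col('d') < 5) & (f.col('col1') == f.col('col3'))).show()

# OR: at least one condition must hold
df.filter((f.col('d') < 5) | (f.col('col1') == f.col('col3'))).show()

# NOT: negate a Column condition with ~
df.filter(~(f.col('col1') == f.col('col3'))).show()

# Equivalent SQL-expression form
df.filter("d < 5 AND col1 = col3").show()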

How do you subset a Dataframe in PySpark?

To subset or filter data from a dataframe, use the filter() function. It keeps only the rows of the dataframe that satisfy the given condition, which can be a single condition or a combination of several; in df.filter(condition), df is the dataframe from which the data is subset or filtered.
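A rough illustration, again reusing the df defined in the question (the chosen columns and threshold are arbitrary):

import pyspark.sql.functions as f

# keep only rows where d < 5, then keep a subset of the columns
subset = df.filter(f.col('d') < 5).select('col1', 'col2', 'd')
subset.show()

# where() is an alias of filter(), so this is equivalent
subset_alt = df.where(f.col('d') < 5).select('col1', 'col2', 'd')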


1 Answer

Your logical condition is wrong. IIUC, what you want is:

import pyspark.sql.functions as f

df.filter(f.col('d') < 5)\
    .filter(
        # either the keys differ ...
        (f.col('col1') != f.col('col3')) |
        # ... or the keys match but the values differ
        ((f.col('col2') != f.col('col4')) & (f.col('col1') == f.col('col3')))
    )\
    .show()

I broke the filter() step into two calls for readability, but you could equivalently combine them into one.
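For reference, a combined single-call version might look like this (a sketch of the same logic, not taken from the original answer; it produces the same output shown below):

df.filter(
    (f.col('d') < 5) &
    (
        (f.col('col1') != f.col('col3')) |
        ((f.col('col2') != f.col('col4')) & (f.col('col1') == f.col('col3')))
    )
).show()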

Output:

+----+----+----+----+---+
|col1|col2|col3|col4|  d|
+----+----+----+----+---+
|   A|  xx|   D|  vv|  4|
|   A|   x|   A|  xx|  3|
|   E| xxx|   B|  vv|  3|
|   F|xxxx|   F| vvv|  4|
|   G| xxx|   G|  xx|  4|
+----+----+----+----+---+
Answered by pault on Sep 20 '22