 

Pyspark filter dataframe by columns of another dataframe

Not sure why I'm having a difficult time with this; it seems so simple, considering it's fairly easy to do in R or pandas. I wanted to avoid using pandas, though, since I'm dealing with a lot of data, and I believe toPandas() loads all the data into the driver's memory in PySpark.

I have 2 dataframes: df1 and df2. I want to filter df1 (remove all rows) where df1.userid = df2.userid AND df1.group = df2.group. I wasn't sure whether I should use filter(), join(), or sql. For example:

df1:
+------+----------+--------------------+
|userid|   group  |      all_picks     |
+------+----------+--------------------+
|   348|         2|[225, 2235, 2225]   |
|   567|         1|[1110, 1150]        |
|   595|         1|[1150, 1150, 1150]  |
|   580|         2|[2240, 2225]        |
|   448|         1|[1130]              |
+------+----------+--------------------+

df2:
+------+----------+---------+
|userid|   group  |   pick  |
+------+----------+---------+
|   348|         2|     2270|
|   595|         1|     2125|
+------+----------+---------+

Result I want:
+------+----------+--------------------+
|userid|   group  |      all_picks     |
+------+----------+--------------------+
|   567|         1|[1110, 1150]        |
|   580|         2|[2240, 2225]        |
|   448|         1|[1130]              |
+------+----------+--------------------+
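For reproducibility, here is a minimal sketch that rebuilds these sample dataframes; it assumes a SparkSession already bound to the name spark (the pyspark shell creates one for you in Spark 2.x):

# Hypothetical setup: recreate the sample data shown above
df1 = spark.createDataFrame(
    [(348, 2, [225, 2235, 2225]),
     (567, 1, [1110, 1150]),
     (595, 1, [1150, 1150, 1150]),
     (580, 2, [2240, 2225]),
     (448, 1, [1130])],
    ["userid", "group", "all_picks"])

df2 = spark.createDataFrame(
    [(348, 2, 2270),
     (595, 1, 2125)],
    ["userid", "group", "pick"])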

EDIT: I've tried many join() and filter() combinations; I believe the closest I got was:

cond = [df1.userid == df2.userid, df2.group == df2.group]
df1.join(df2, cond, 'left_outer').select(df1.userid, df1.group, df1.all_picks)  # Result has 7 rows

I tried a bunch of different join types, and I also tried different cond values:

cond = ((df1.userid == df2.userid) & (df2.group == df2.group))  # result has 7 rows
cond = ((df1.userid != df2.userid) & (df2.group != df2.group))  # result has 2 rows

However, it seems like the joins are adding additional rows rather than deleting them.

I'm using Python 2.7 and Spark 2.1.0.

asked Feb 09 '17 by drewyupdrew



1 Answer

Left anti join is what you're looking for:

df1.join(df2, ["userid", "group"], "leftanti") 
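Applied to the sample data above, this returns exactly the three desired rows (userids 567, 580, and 448), because a left anti join keeps only the rows of df1 whose (userid, group) pair has no match in df2:

# keeps 567, 580 and 448; drops 348 and 595, which appear in df2
df1.join(df2, ["userid", "group"], "leftanti").show()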

but the same thing can be done with a left outer join:

(df1
    .join(df2, ["userid", "group"], "leftouter")
    .where(df2["pick"].isNull())
    .drop(df2["pick"]))
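Since the question also mentions sql: the same anti join can be written in Spark SQL. A minimal sketch, assuming a SparkSession named spark and that the temp view names df1 and df2 are free; note the backticks around group, which is a reserved word:

# register the dataframes as temp views so SQL can see them
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")

result = spark.sql("""
    SELECT *
    FROM df1
    LEFT ANTI JOIN df2
      ON df1.userid = df2.userid
     AND df1.`group` = df2.`group`
""")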
answered Oct 04 '22 by zero323