Not sure why I'm having a difficult time with this; it seems so simple, considering it's fairly easy to do in R or pandas. I wanted to avoid using pandas, though, since I'm dealing with a lot of data, and I believe `toPandas()` loads all the data into the driver's memory in PySpark.
I have 2 DataFrames: `df1` and `df2`. I want to filter `df1` (remove all rows) where `df1.userid = df2.userid` AND `df1.group = df2.group`. I wasn't sure if I should use `filter()`, `join()`, or SQL. For example:
`df1`:

```
+------+------+------------------+
|userid| group|         all_picks|
+------+------+------------------+
|   348|     2| [225, 2235, 2225]|
|   567|     1|      [1110, 1150]|
|   595|     1|[1150, 1150, 1150]|
|   580|     2|      [2240, 2225]|
|   448|     1|            [1130]|
+------+------+------------------+
```

`df2`:

```
+------+------+-----+
|userid| group| pick|
+------+------+-----+
|   348|     2| 2270|
|   595|     1| 2125|
+------+------+-----+
```

Result I want:

```
+------+------+------------------+
|userid| group|         all_picks|
+------+------+------------------+
|   567|     1|      [1110, 1150]|
|   580|     2|      [2240, 2225]|
|   448|     1|            [1130]|
+------+------+------------------+
```
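For reference, here is a minimal sketch that reproduces these example DataFrames; it assumes an existing `SparkSession` bound to the name `spark`, as in the pyspark shell:

```python
# Sketch reproducing the example data; `spark` is assumed to be an
# existing SparkSession (available by default in the pyspark shell).
df1 = spark.createDataFrame(
    [(348, 2, [225, 2235, 2225]),
     (567, 1, [1110, 1150]),
     (595, 1, [1150, 1150, 1150]),
     (580, 2, [2240, 2225]),
     (448, 1, [1130])],
    ["userid", "group", "all_picks"],
)

df2 = spark.createDataFrame(
    [(348, 2, 2270), (595, 1, 2125)],
    ["userid", "group", "pick"],
)
```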
EDIT: I've tried many `join()` and `filter()` functions, and I believe the closest I got was:

```python
cond = [df1.userid == df2.userid, df2.group == df2.group]
df1.join(df2, cond, 'left_outer').select(df1.userid, df1.group, df1.all_picks)
# Result has 7 rows
```
I tried a bunch of different join types, and I also tried different `cond` values:

```python
cond = ((df1.userid == df2.userid) & (df2.group == df2.group))  # result has 7 rows
cond = ((df1.userid != df2.userid) & (df2.group != df2.group))  # result has 2 rows
```
However, it seems like the joins are adding additional rows rather than deleting them.
I'm using Python 2.7 and Spark 2.1.0.
Left anti join is what you're looking for:

```python
df1.join(df2, ["userid", "group"], "leftanti")
```

but the same thing can be done with a left outer join:

```python
(df1
    .join(df2, ["userid", "group"], "leftouter")
    .where(df2["pick"].isNull())
    .drop(df2["pick"]))
```
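Putting it together with the sample data from the question (a sketch reusing the `df1`/`df2` built in the reproduction snippet above):

```python
# `leftanti` keeps exactly the df1 rows that have no match in df2
# on both join keys (userid and group).
result = df1.join(df2, ["userid", "group"], "leftanti")
result.show()
# Expected rows (order may vary):
# +------+-----+------------------+
# |userid|group|         all_picks|
# +------+-----+------------------+
# |   567|    1|      [1110, 1150]|
# |   580|    2|      [2240, 2225]|
# |   448|    1|            [1130]|
# +------+-----+------------------+
```

One caveat with the left outer variant: it relies on `pick` never being null in `df2` itself, since a matching row with a null `pick` would incorrectly survive the `isNull()` filter. The anti join has no such edge case.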