 

Pyspark filter dataframe by columns of another dataframe

Not sure why I'm having a difficult time with this; it seems so simple, considering it's fairly easy to do in R or pandas. I wanted to avoid using pandas, though, since I'm dealing with a lot of data, and I believe toPandas() loads all the data into the driver's memory in PySpark.

I have 2 dataframes: df1 and df2. I want to filter df1 (remove all rows) where df1.userid = df2.userid AND df1.group = df2.group. I wasn't sure whether I should use filter(), join(), or sql. For example:

df1:
+------+----------+--------------------+
|userid|   group  |      all_picks     |
+------+----------+--------------------+
|   348|         2|[225, 2235, 2225]   |
|   567|         1|[1110, 1150]        |
|   595|         1|[1150, 1150, 1150]  |
|   580|         2|[2240, 2225]        |
|   448|         1|[1130]              |
+------+----------+--------------------+

df2:
+------+----------+---------+
|userid|   group  |   pick  |
+------+----------+---------+
|   348|         2|     2270|
|   595|         1|     2125|
+------+----------+---------+

Result I want:
+------+----------+--------------------+
|userid|   group  |      all_picks     |
+------+----------+--------------------+
|   567|         1|[1110, 1150]        |
|   580|         2|[2240, 2225]        |
|   448|         1|[1130]              |
+------+----------+--------------------+
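For reproducibility, here is a minimal sketch that rebuilds these sample dataframes; it assumes a SparkSession already bound to the name spark (the pyspark shell creates one for you in Spark 2.x):

# Hypothetical setup: recreate the sample data shown above
df1 = spark.createDataFrame(
    [(348, 2, [225, 2235, 2225]),
     (567, 1, [1110, 1150]),
     (595, 1, [1150, 1150, 1150]),
     (580, 2, [2240, 2225]),
     (448, 1, [1130])],
    ["userid", "group", "all_picks"])

df2 = spark.createDataFrame(
    [(348, 2, 2270),
     (595, 1, 2125)],
    ["userid", "group", "pick"])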

EDIT: I've tried many join() and filter() combinations; I believe the closest I got was:

cond = [df1.userid == df2.userid, df2.group == df2.group]
df1.join(df2, cond, 'left_outer').select(df1.userid, df1.group, df1.all_picks)  # Result has 7 rows

I tried a bunch of different join types, and I also tried different cond values:

cond = ((df1.userid == df2.userid) & (df2.group == df2.group))  # result has 7 rows
cond = ((df1.userid != df2.userid) & (df2.group != df2.group))  # result has 2 rows

However, it seems like the joins are adding additional rows rather than deleting them.

I'm using Python 2.7 and Spark 2.1.0.

asked Feb 09 '17 by drewyupdrew



1 Answer

Left anti join is what you're looking for:

df1.join(df2, ["userid", "group"], "leftanti") 
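Applied to the sample data above, this returns exactly the three desired rows (userids 567, 580, and 448), because a left anti join keeps only the rows of df1 whose (userid, group) pair has no match in df2:

# keeps 567, 580 and 448; drops 348 and 595, which appear in df2
df1.join(df2, ["userid", "group"], "leftanti").show()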

but the same thing can be done with a left outer join:

(df1
    .join(df2, ["userid", "group"], "leftouter")
    .where(df2["pick"].isNull())
    .drop(df2["pick"]))
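Since the question also mentions sql: the same anti join can be written in Spark SQL. A minimal sketch, assuming a SparkSession named spark and that the temp view names df1 and df2 are free; note the backticks around group, which is a reserved word:

# register the dataframes as temp views so SQL can see them
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")

result = spark.sql("""
    SELECT *
    FROM df1
    LEFT ANTI JOIN df2
      ON df1.userid = df2.userid
     AND df1.`group` = df2.`group`
""")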
answered Oct 04 '22 by zero323