 

PySpark: Filter out rows where column value appears multiple times in dataframe

Tags: python, pyspark

Given a PySpark dataframe with two columns, I want to split the data set into two dataframes: One where the combination of ColA and ColB is unique, and one where it is non-unique.

So for example:

ColA | ColB | ColCDoesntMatter
1    | 2    | 4
1    | 3    | 3
2    | 2    | 2
1    | 2    | 6
2    | 2    | 8
2    | 3    | 9

The unique A/B pairings are:

1-3
2-3

The nonunique A/B pairings are:

1-2
2-2

So one dataframe would have all rows with the unique AB values and the other would contain the nonunique AB values. ColC doesn't matter for the filter, but needs to be retained. How can I run this filter?

Edit: Note that I can't use dropDuplicates because it's really the combination of A and B that needs to be unique, not merely A or B.
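To make the expected split concrete, the logic can be sketched in plain Python (no Spark needed) using a `Counter` over the (ColA, ColB) pairs from the sample data above:

```python
from collections import Counter

# Sample rows mirroring the question's dataframe: (ColA, ColB, ColC)
rows = [(1, 2, 4), (1, 3, 3), (2, 2, 2), (1, 2, 6), (2, 2, 8), (2, 3, 9)]

# Count occurrences of each (ColA, ColB) combination
pair_counts = Counter((a, b) for a, b, _ in rows)

# Split rows on whether their pair occurs exactly once
unique_rows = [r for r in rows if pair_counts[(r[0], r[1])] == 1]
nonunique_rows = [r for r in rows if pair_counts[(r[0], r[1])] > 1]

print(unique_rows)     # [(1, 3, 3), (2, 3, 9)]
print(nonunique_rows)  # [(1, 2, 4), (2, 2, 2), (1, 2, 6), (2, 2, 8)]
```

Note that ColC is carried along untouched; only the pair count decides which side a row lands on.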

Asked Dec 04 '25 by nao
1 Answer

# count rows per (ColA, ColB) pair with a window function
dfWithRowCount = df.selectExpr("*", "count(*) over (partition by ColA, ColB) as nrows")

# keep pairs that occur exactly once vs. more than once, then drop the helper column
uniqueAB = dfWithRowCount.filter(dfWithRowCount.nrows == 1).drop('nrows')
nonUniqueAB = dfWithRowCount.filter(dfWithRowCount.nrows > 1).drop('nrows')

nonUniqueAB.show()
+----+----+----+
|ColA|ColB|ColC|
+----+----+----+
|   2|   2|   2|
|   2|   2|   8|
|   1|   2|   4|
|   1|   2|   6|
+----+----+----+

uniqueAB.show()
+----+----+----+
|ColA|ColB|ColC|
+----+----+----+
|   1|   3|   3|
|   2|   3|   9|
+----+----+----+
Answered Dec 05 '25 by Psidom