
How to get distinct rows in a DataFrame using PySpark?

I understand this is a very simple question that has most likely been answered somewhere, but as a beginner I still don't get it and am looking for your enlightenment. Thank you in advance:

I have an interim DataFrame:

+----------------------------+---+
|host                        |day|
+----------------------------+---+
|in24.inetnebr.com           |1  |
|uplherc.upl.com             |1  |
|uplherc.upl.com             |1  |
|uplherc.upl.com             |1  |
|uplherc.upl.com             |1  |
|ix-esc-ca2-07.ix.netcom.com |1  |
|uplherc.upl.com             |1  |
+----------------------------+---+

What I need is to remove all the redundant items in the host column; in other words, I need to get the final distinct result like:

+----------------------------+---+
|host                        |day|
+----------------------------+---+
|in24.inetnebr.com           |1  |
|uplherc.upl.com             |1  |
|ix-esc-ca2-07.ix.netcom.com |1  |
+----------------------------+---+
asked Jul 29 '16 by mdivk


2 Answers

If df is the name of your DataFrame, there are two ways to get unique rows:

df2 = df.distinct()

or

df2 = df.drop_duplicates()
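
For context, a minimal self-contained sketch of both calls; the SparkSession setup and the sample rows are assumptions for illustration only, not taken from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("distinct-example").getOrCreate()

df = spark.createDataFrame(
    [("in24.inetnebr.com", 1),
     ("uplherc.upl.com", 1),
     ("uplherc.upl.com", 1),
     ("ix-esc-ca2-07.ix.netcom.com", 1)],
    ["host", "day"],
)

# Both calls remove fully duplicated rows; drop_duplicates() is an alias of dropDuplicates()
df.distinct().show(truncate=False)
df.drop_duplicates().show(truncate=False)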
answered Nov 15 '22 by Milos Milovanovic


The normal distinct() is not so user friendly, because you can't specify which columns to consider. In this case it is enough to do:

df = df.distinct()

but if you have other values in the day column, you won't get back the distinct elements of host:

+--------------------+---+
|                host|day|
+--------------------+---+
|   in24.inetnebr.com|  1|
|     uplherc.upl.com|  1|
|     uplherc.upl.com|  2|
|     uplherc.upl.com|  1|
|     uplherc.upl.com|  1|
|ix-esc-ca2-07.ix....|  1|
|     uplherc.upl.com|  1|
+--------------------+---+

After distinct() you will get back the following:

df.distinct().show()

+--------------------+---+
|                host|day|
+--------------------+---+
|   in24.inetnebr.com|  1|
|     uplherc.upl.com|  2|
|     uplherc.upl.com|  1|
|ix-esc-ca2-07.ix....|  1|
+--------------------+---+

Thus you should use this instead:

df = df.dropDuplicates(['host'])

It will keep the first value of day.
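
A small sketch of this subset-based deduplication (the session setup and sample rows below are assumptions mirroring the table above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("in24.inetnebr.com", 1),
     ("uplherc.upl.com", 2),
     ("uplherc.upl.com", 1),
     ("ix-esc-ca2-07.ix.netcom.com", 1)],
    ["host", "day"],
)

# Deduplicate on host only; one row per host survives.
# Note: "first" depends on row order within the partitions, so in a
# distributed job it is not strictly guaranteed which day value is kept.
df.dropDuplicates(["host"]).show(truncate=False)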

If you are familiar with SQL, this will also work for you:

df.createOrReplaceTempView("temp_table")
new_df = spark.sql("select first(host), first(day) from temp_table GROUP BY host")

+--------------------+-----------------+
|  first(host, false)|first(day, false)|
+--------------------+-----------------+
|   in24.inetnebr.com|                1|
|ix-esc-ca2-07.ix....|                1|
|     uplherc.upl.com|                1|
+--------------------+-----------------+
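
Roughly the same aggregation can also be written with the DataFrame API instead of SQL; this is only a sketch, and the alias() call is just there to keep the original column name:

from pyspark.sql import functions as F

# One row per host, taking the first day value seen in each group
new_df = df.groupBy("host").agg(F.first("day").alias("day"))
new_df.show(truncate=False)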
answered Nov 15 '22 by Aron Asztalos