I understand this is just a very simple question and most likely have been answered somewhere, but as a beginner I still don't get it and am looking for your enlightenment, thank you in advance:
I have a interim dataframe:
+----------------------------+---+
|host |day|
+----------------------------+---+
|in24.inetnebr.com |1 |
|uplherc.upl.com |1 |
|uplherc.upl.com |1 |
|uplherc.upl.com |1 |
|uplherc.upl.com |1 |
|ix-esc-ca2-07.ix.netcom.com |1 |
|uplherc.upl.com |1 |
What I need is to remove all the redundant items in host column, in another word, I need to get the final distinct result like:
+----------------------------+---+
|host |day|
+----------------------------+---+
|in24.inetnebr.com |1 |
|uplherc.upl.com |1 |
|ix-esc-ca2-07.ix.netcom.com |1 |
|uplherc.upl.com |1 |
If df is the name of your DataFrame, there are two ways to get unique rows:
df2 = df.distinct()
or
df2 = df.drop_duplicates()
The normal distinct not so user friendly, because you cant set the column. In this case enough for you:
df = df.distinct()
but if you have other value in date column, you wont get back the distinct elements from host:
+--------------------+---+
| host|day|
+--------------------+---+
| in24.inetnebr.com| 1|
| uplherc.upl.com| 1|
| uplherc.upl.com| 2|
| uplherc.upl.com| 1|
| uplherc.upl.com| 1|
|ix-esc-ca2-07.ix....| 1|
| uplherc.upl.com| 1|
+--------------------+---+
after distinct you will get back as follows:
df.distinct().show()
+--------------------+---+
| host|day|
+--------------------+---+
| in24.inetnebr.com| 1|
| uplherc.upl.com| 2|
| uplherc.upl.com| 1|
|ix-esc-ca2-07.ix....| 1|
+--------------------+---+
thus you should use this:
df = df.dropDuplicates(['host'])
it will keep the first value of day
If you familiar with SQL language it will also works for you:
df.createOrReplaceTempView("temp_table")
new_df = spark.sql("select first(host), first(day) from temp_table GROUP BY host")
+--------------------+-----------------+
| first(host, false)|first(day, false)|
+--------------------+-----------------+
| in24.inetnebr.com| 1|
|ix-esc-ca2-07.ix....| 1|
| uplherc.upl.com| 1|
+--------------------+-----------------+
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With