Let's say I have the following table:
+--------------------+--------------------+------+------------+--------------------+
|                host|                path|status|content_size|                time|
+--------------------+--------------------+------+------------+--------------------+
|js002.cc.utsunomi...|/shuttle/resource...|   404|           0|1995-08-01 00:07:...|
|     tia1.eskimo.com|/pub/winvn/releas...|   404|           0|1995-08-01 00:28:...|
|grimnet23.idirect...|/www/software/win...|   404|           0|1995-08-01 00:50:...|
|miriworld.its.uni...|/history/history.htm|   404|           0|1995-08-01 01:04:...|
|       ras38.srv.net|/elv/DELTA/uncons...|   404|           0|1995-08-01 01:05:...|
|  cs1-06.leh.ptd.net|                    |   404|           0|1995-08-01 01:17:...|
|dialip-24.athenet...|/history/apollo/a...|   404|           0|1995-08-01 01:33:...|
|   h96-158.ccnet.com|/history/apollo/a...|   404|           0|1995-08-01 01:35:...|
|   h96-158.ccnet.com|/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|   h96-158.ccnet.com|/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|   h96-158.ccnet.com|/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|   h96-158.ccnet.com|/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|   h96-158.ccnet.com|/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|   h96-158.ccnet.com|/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|   h96-158.ccnet.com|/history/apollo/a...|   404|           0|1995-08-01 01:37:...|
|   h96-158.ccnet.com|/history/apollo/a...|   404|           0|1995-08-01 01:37:...|
|   h96-158.ccnet.com|/history/apollo/a...|   404|           0|1995-08-01 01:37:...|
|hsccs_gatorbox07....|/pub/winvn/releas...|   404|           0|1995-08-01 01:44:...|
|www-b2.proxy.aol....|/pub/winvn/readme...|   404|           0|1995-08-01 01:48:...|
|www-b2.proxy.aol....|/pub/winvn/releas...|   404|           0|1995-08-01 01:48:...|
+--------------------+--------------------+------+------------+--------------------+
How would I filter this table in PySpark so that it keeps only rows with distinct paths, while the result still contains all columns?
Distinct values of a column in PySpark are obtained by using the select() function together with the distinct() function. select() accepts multiple column names as arguments; chaining distinct() after it returns the distinct combinations of those columns.
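For example, assuming the table above has been loaded as a DataFrame named logs (a hypothetical name), a minimal sketch looks like this:

# Distinct values of a single column
distinct_paths = logs.select("path").distinct()
distinct_paths.show()

# Passing several columns returns the distinct combinations of those columns
logs.select("host", "path").distinct().show()

Note that select() followed by distinct() drops every column not listed, which is why it does not answer the question above on its own.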
In PySpark there are two ways to get the count of distinct values. One is to chain the DataFrame's distinct() and count() methods, which counts the distinct rows. The other is the SQL function countDistinct(), which returns the number of distinct combinations of the selected columns.
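A short sketch of both approaches, again assuming the hypothetical logs DataFrame:

# 1. distinct() + count(): number of distinct rows
n_distinct_rows = logs.distinct().count()

# 2. countDistinct(): number of distinct (host, path) combinations
from pyspark.sql.functions import countDistinct
logs.select(countDistinct("host", "path").alias("distinct_host_path")).show()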
You can also create a DataFrame from a Python list by passing the list to PySpark's createDataFrame() method, and then call distinct() to get the distinct rows of that DataFrame.
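For instance, with hypothetical sample data (the hosts and paths below are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [("a.example.com", "/index.html"),
        ("a.example.com", "/index.html"),
        ("b.example.com", "/about.html")]
df = spark.createDataFrame(data, ["host", "path"])

# The two identical rows collapse into one
df.distinct().show()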
PySpark's filter() function filters the rows of an RDD/DataFrame based on a given condition or SQL expression. If you come from a SQL background, you can use where() instead of filter(); the two are aliases and behave identically.
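For example, selecting only the 404 rows from the hypothetical logs DataFrame:

# Column-expression condition
logs.filter(logs.status == 404).show()

# Equivalent SQL-expression condition via where()
logs.where("status = 404").show()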
If you want to keep only rows whose values in a specific column are distinct, call the dropDuplicates
method on the DataFrame.
Like this in my example:
dataFrame = ...
dataFrame = dataFrame.dropDuplicates(['path'])
where 'path' is the column name. Note that dropDuplicates returns a new DataFrame rather than modifying the original in place, so assign the result as shown. Unlike select().distinct(), this keeps all columns, one row per distinct path.
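A minimal end-to-end sketch of this answer, using hypothetical rows standing in for the table above (the full paths are made up, since the table truncates them):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

logs = spark.createDataFrame(
    [
        ("h96-158.ccnet.com", "/history/apollo/apollo.html", 404, 0, "1995-08-01 01:35:41"),
        ("h96-158.ccnet.com", "/history/apollo/apollo.html", 404, 0, "1995-08-01 01:36:02"),
        ("ras38.srv.net", "/elv/DELTA/delta.html", 404, 0, "1995-08-01 01:05:00"),
    ],
    ["host", "path", "status", "content_size", "time"],
)

# One row per distinct path; every column is retained
logs.dropDuplicates(["path"]).show(truncate=False)

Which of the duplicate rows survives is arbitrary (it depends on partitioning), so if you need a specific row per path, e.g. the one with the earliest time, sort or aggregate explicitly before deduplicating.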