Let's say I have the following table:
+--------------------+--------------------+------+------------+--------------------+
|                host|                path|status|content_size|                time|
+--------------------+--------------------+------+------------+--------------------+
|js002.cc.utsunomi...|/shuttle/resource...|   404|           0|1995-08-01 00:07:...|
|     tia1.eskimo.com|/pub/winvn/releas...|   404|           0|1995-08-01 00:28:...|
|grimnet23.idirect...|/www/software/win...|   404|           0|1995-08-01 00:50:...|
|miriworld.its.uni...|/history/history.htm|   404|           0|1995-08-01 01:04:...|
|       ras38.srv.net|/elv/DELTA/uncons...|   404|           0|1995-08-01 01:05:...|
|  cs1-06.leh.ptd.net|                    |   404|           0|1995-08-01 01:17:...|
|dialip-24.athenet...|/history/apollo/a...|   404|           0|1995-08-01 01:33:...|
|   h96-158.ccnet.com|/history/apollo/a...|   404|           0|1995-08-01 01:35:...|
|   h96-158.ccnet.com|/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|   h96-158.ccnet.com|/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|   h96-158.ccnet.com|/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|   h96-158.ccnet.com|/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|   h96-158.ccnet.com|/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|   h96-158.ccnet.com|/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|   h96-158.ccnet.com|/history/apollo/a...|   404|           0|1995-08-01 01:37:...|
|   h96-158.ccnet.com|/history/apollo/a...|   404|           0|1995-08-01 01:37:...|
|   h96-158.ccnet.com|/history/apollo/a...|   404|           0|1995-08-01 01:37:...|
|hsccs_gatorbox07....|/pub/winvn/releas...|   404|           0|1995-08-01 01:44:...|
|www-b2.proxy.aol....|/pub/winvn/readme...|   404|           0|1995-08-01 01:48:...|
|www-b2.proxy.aol....|/pub/winvn/releas...|   404|           0|1995-08-01 01:48:...|
+--------------------+--------------------+------+------------+--------------------+
How would I filter this table in PySpark so that it keeps only rows with distinct paths, while the result still contains all columns?
Distinct values of a column in PySpark are obtained by using the select() function together with the distinct() function. select() accepts multiple column names as arguments; chaining distinct() after it returns the distinct combinations of those columns.
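For example, assuming the table above has been loaded as a DataFrame named logs (a hypothetical name), a minimal sketch looks like this:

# Distinct values of a single column
distinct_paths = logs.select("path").distinct()
distinct_paths.show()

# Passing several columns returns the distinct combinations of those columns
logs.select("host", "path").distinct().show()

Note that select() followed by distinct() drops every column not listed, which is why it does not answer the question above on its own.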
In PySpark there are two ways to get the count of distinct values. One is to chain the DataFrame's distinct() and count() methods, which counts the distinct rows. The other is the SQL function countDistinct(), which returns the number of distinct combinations of the selected columns.
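A short sketch of both approaches, again assuming the hypothetical logs DataFrame:

# 1. distinct() + count(): number of distinct rows
n_distinct_rows = logs.distinct().count()

# 2. countDistinct(): number of distinct (host, path) combinations
from pyspark.sql.functions import countDistinct
logs.select(countDistinct("host", "path").alias("distinct_host_path")).show()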
You can also create a DataFrame from a Python list by passing the list to PySpark's createDataFrame() method, and then call distinct() to get the distinct rows of that DataFrame.
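For instance, with hypothetical sample data (the hosts and paths below are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [("a.example.com", "/index.html"),
        ("a.example.com", "/index.html"),
        ("b.example.com", "/about.html")]
df = spark.createDataFrame(data, ["host", "path"])

# The two identical rows collapse into one
df.distinct().show()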
PySpark's filter() function filters the rows of an RDD/DataFrame based on a given condition or SQL expression. If you come from a SQL background, you can use where() instead of filter(); the two are aliases and behave identically.
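For example, selecting only the 404 rows from the hypothetical logs DataFrame:

# Column-expression condition
logs.filter(logs.status == 404).show()

# Equivalent SQL-expression condition via where()
logs.where("status = 404").show()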
If you want to keep only rows whose values in a specific column are distinct, call the dropDuplicates
method on the DataFrame.
Like this in my example:
dataFrame = ...
dataFrame = dataFrame.dropDuplicates(['path'])
where 'path' is the column name. Note that dropDuplicates returns a new DataFrame rather than modifying the original in place, so assign the result as shown. Unlike select().distinct(), this keeps all columns, one row per distinct path.
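A minimal end-to-end sketch of this answer, using hypothetical rows standing in for the table above (the full paths are made up, since the table truncates them):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

logs = spark.createDataFrame(
    [
        ("h96-158.ccnet.com", "/history/apollo/apollo.html", 404, 0, "1995-08-01 01:35:41"),
        ("h96-158.ccnet.com", "/history/apollo/apollo.html", 404, 0, "1995-08-01 01:36:02"),
        ("ras38.srv.net", "/elv/DELTA/delta.html", 404, 0, "1995-08-01 01:05:00"),
    ],
    ["host", "path", "status", "content_size", "time"],
)

# One row per distinct path; every column is retained
logs.dropDuplicates(["path"]).show(truncate=False)

Which of the duplicate rows survives is arbitrary (it depends on partitioning), so if you need a specific row per path, e.g. the one with the earliest time, sort or aggregate explicitly before deduplicating.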