I'm new to Spark and I'm trying to do a distinct().count() based on some fields of a CSV file.
CSV structure (without header):
id,country,type
01,AU,s1
02,AU,s2
03,GR,s2
03,GR,s2
To load the CSV I typed:
lines = sc.textFile("test.txt")
Then a distinct count on the lines returned 3, as expected:
lines.distinct().count()
But I have no idea how to do a distinct count based on, let's say, id and country.
In PySpark there are two ways to get the count of distinct values. You can call distinct() followed by count() on a DataFrame: distinct() removes duplicate records (rows matching on all selected columns) and count() returns the number of rows that remain. Alternatively, the SQL countDistinct() function computes the distinct count of the selected columns in a single aggregation.
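For reference, here is a minimal sketch of both approaches, assuming an active SparkSession named spark, the same headerless test.txt, and the column names id/country/type (the names are an assumption, since the file has no header):

from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct

spark = SparkSession.builder.appName("distinct-count").getOrCreate()

# Read the headerless CSV and assign column names (assumed, since the file has no header)
df = spark.read.csv("test.txt").toDF("id", "country", "type")

# Approach 1: distinct() over the selected columns, then count()
n1 = df.select("id", "country").distinct().count()

# Approach 2: the countDistinct() aggregate over the same columns
n2 = df.select(countDistinct("id", "country")).first()[0]

print(n1, n2)  # both print 3 for the sample data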
In this case, working with the RDD directly, you would select the columns you want to consider and then count:
sc.textFile("test.txt")\
.map(lambda line: (line.split(',')[0], line.split(',')[1]))\
.distinct()\
.count()
This version calls line.split twice for clarity; the lambda can be optimized to split only once.
The split line can be optimized as follows (note the tuple() call: split returns a list, which is not hashable, so distinct() would fail on it):
sc.textFile("test.txt").map(lambda line: tuple(line.split(",")[:-1])).distinct().count()