Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PySpark distinct().count() on a csv file

I'm new to spark and I'm trying to make a distinct().count() based on some fields of a csv file.

Csv structure(without header):

id,country,type
01,AU,s1
02,AU,s2
03,GR,s2
03,GR,s2

to load .csv I typed:

lines = sc.textFile("test.txt")

then a distinct count on lines returned 3 as expected:

lines.distinct().count()

But I have no idea how to make a distinct count based on lets say id and country.

like image 240
dimzak Avatar asked Jan 16 '15 15:01

dimzak


People also ask

How do you count distinct in PySpark?

In Pyspark, there are two ways to get the count of distinct values. We can use distinct() and count() functions of DataFrame to get the count distinct of PySpark DataFrame. Another way is to use SQL countDistinct() function which will provide the distinct value count of all the selected columns.

How do you count duplicates in PySpark?

In PySpark, you can use distinct(). count() of DataFrame or countDistinct() SQL function to get the count distinct. distinct() eliminates duplicate records(matching all columns of a Row) from DataFrame, count() returns the count of records on DataFrame.

How do you select distinct records in PySpark DataFrame?

Use pyspark distinct() to select unique rows from all columns. It returns a new DataFrame after selecting only distinct column values, when it finds any rows having unique values on all columns it will be eliminated from the results.


2 Answers

In this case you would select the columns you want to consider, and then count:

sc.textFile("test.txt")\
  .map(lambda line: (line.split(',')[0], line.split(',')[1]))\
  .distinct()\
  .count()

This is for clarity, you can optimize the lambda to avoid calling line.split two times.

like image 130
elyase Avatar answered Oct 22 '22 16:10

elyase


The split line can be optimized as follows:

sc.textFile("test.txt").map(lambda line: line.split(",")[:-1]).distinct().count()
like image 2
rami Avatar answered Oct 22 '22 17:10

rami