I'm new to Spark and I'm trying to do a distinct().count() based on some fields of a CSV file.
CSV structure (without header):
id,country,type
01,AU,s1
02,AU,s2
03,GR,s2
03,GR,s2
To load the CSV I typed:
lines = sc.textFile("test.txt")
Then a distinct count on the lines returned 3, as expected:
lines.distinct().count()
But I have no idea how to do a distinct count based on, let's say, id and country.
In PySpark there are two ways to get the count of distinct values. You can call distinct() followed by count() on a DataFrame: distinct() removes duplicate records (rows matching on all selected columns) and count() returns the number of rows that remain. Alternatively, the SQL countDistinct() function computes the distinct count of the selected columns in a single aggregation.
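For reference, here is a minimal sketch of both approaches, assuming an active SparkSession named spark, the same headerless test.txt, and the column names id/country/type (the names are an assumption, since the file has no header):

from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct

spark = SparkSession.builder.appName("distinct-count").getOrCreate()

# Read the headerless CSV and assign column names (assumed, since the file has no header)
df = spark.read.csv("test.txt").toDF("id", "country", "type")

# Approach 1: distinct() over the selected columns, then count()
n1 = df.select("id", "country").distinct().count()

# Approach 2: the countDistinct() aggregate over the same columns
n2 = df.select(countDistinct("id", "country")).first()[0]

print(n1, n2)  # both print 3 for the sample data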
In this case, working with the RDD directly, you would select the columns you want to consider and then count:
sc.textFile("test.txt")\
.map(lambda line: (line.split(',')[0], line.split(',')[1]))\
.distinct()\
.count()
This version calls line.split twice for clarity; the lambda can be optimized to split only once.
The split line can be optimized as follows (note the tuple() call: split returns a list, which is not hashable, so distinct() would fail on it):
sc.textFile("test.txt").map(lambda line: tuple(line.split(",")[:-1])).distinct().count()