Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fetching distinct values on a column using Spark DataFrame

Using Spark 1.6.1 version I need to fetch distinct values on a column and then perform some specific transformation on top of it. The column contains more than 50 million records and can grow larger.
I understand that doing a distinct.collect() will bring the call back to the driver program. Currently I am performing this task as below, is there a better approach?

 import sqlContext.implicits._  preProcessedData.persist(StorageLevel.MEMORY_AND_DISK_2)   preProcessedData.select(ApplicationId).distinct.collect().foreach(x => {    val applicationId = x.getAs[String](ApplicationId)    val selectedApplicationData = preProcessedData.filter($"$ApplicationId" === applicationId)    // DO SOME TASK PER applicationId  })   preProcessedData.unpersist()   
like image 289
Kazhiyur Avatar asked Aug 14 '16 20:08

Kazhiyur


People also ask

How do I get unique values of a column in spark DataFrame?

In Pyspark, there are two ways to get the count of distinct values. We can use distinct() and count() functions of DataFrame to get the count distinct of PySpark DataFrame. Another way is to use SQL countDistinct() function which will provide the distinct value count of all the selected columns.

How do I select distinct records from spark DataFrame?

The distinct() method Returns a new DataFrame containing the distinct rows in this DataFrame . Now if you need to consider only a subset of the columns when dropping duplicates, then you first have to make a column selection before calling distinct() as shown below.

How do you select distinct values from a column in PySpark?

To select unique values from a specific single column use dropDuplicates(), since this function returns all columns, use the select() method to get the single column. Once you have the distinct unique values from columns you can also convert them to a list by collecting the data.

How can I get unique values from a column in pandas?

You can get unique values in column (multiple columns) from pandas DataFrame using unique() or Series. unique() functions. unique() from Series is used to get unique values from a single column and the other one is used to get from multiple columns.


2 Answers

Well to obtain all different values in a Dataframe you can use distinct. As you can see in the documentation that method returns another DataFrame. After that you can create a UDF in order to transform each record.

For example:

val df = sc.parallelize(Array((1, 2), (3, 4), (1, 6))).toDF("age", "salary")  // I obtain all different values. If you show you must see only {1, 3} val distinctValuesDF = df.select(df("age")).distinct  // Define your udf. In this case I defined a simple function, but they can get complicated. val myTransformationUDF = udf(value => value / 10)  // Run that transformation "over" your DataFrame val afterTransformationDF = distinctValuesDF.select(myTransformationUDF(col("age"))) 
like image 159
Alberto Bonsanto Avatar answered Oct 04 '22 13:10

Alberto Bonsanto


In Pyspark try this,

df.select('col_name').distinct().show()

like image 27
der Fotik Avatar answered Oct 04 '22 11:10

der Fotik