Using Spark 1.6.1, I need to fetch the distinct values of a column and then perform some specific transformation on top of each of them. The column contains more than 50 million records and can grow larger.
I understand that doing a distinct.collect() will bring the results back to the driver program. Currently I am performing this task as below; is there a better approach?
    import sqlContext.implicits._
    import org.apache.spark.storage.StorageLevel

    preProcessedData.persist(StorageLevel.MEMORY_AND_DISK_2)

    preProcessedData.select(ApplicationId).distinct.collect().foreach(x => {
      val applicationId = x.getAs[String](ApplicationId)
      val selectedApplicationData = preProcessedData.filter($"$ApplicationId" === applicationId)
      // DO SOME TASK PER applicationId
    })

    preProcessedData.unpersist()
In PySpark, there are two ways to get the count of distinct values. We can use the distinct() and count() functions of DataFrame to get the distinct count of a PySpark DataFrame. Another way is to use the SQL countDistinct() function, which gives the distinct value count across all the selected columns.
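A minimal sketch of both approaches, assuming a modern PySpark setup (note that the SparkSession API postdates the Spark 1.6.1 mentioned in the question, and the example data and column names below are made up):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import countDistinct

    spark = SparkSession.builder.appName("distinct-count").getOrCreate()

    # Hypothetical example data with one duplicate row.
    df = spark.createDataFrame(
        [("James", "Sales"), ("Anna", "Sales"), ("James", "Sales"), ("Maria", "Finance")],
        ["name", "dept"],
    )

    # Way 1: distinct() drops duplicate rows, count() counts what is left.
    print(df.distinct().count())                     # 3

    # Way 2: countDistinct() counts distinct values of the selected columns.
    df.select(countDistinct("name", "dept")).show()  # also 3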
The distinct() method returns a new DataFrame containing the distinct rows of this DataFrame. If you need to consider only a subset of the columns when dropping duplicates, you first have to make a column selection before calling distinct(), as shown below.
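For instance, continuing with the hypothetical df from the sketch above:

    # Distinct over a subset of columns: select the columns first,
    # then call distinct() on the narrowed DataFrame.
    df.select("dept").distinct().show()
    # +-------+
    # |   dept|
    # +-------+
    # |  Sales|
    # |Finance|
    # +-------+   (row order may vary)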
To select the unique values of a specific single column, use dropDuplicates(); since this function returns all columns, use the select() method to get the single column. Once you have the distinct values from the column, you can also convert them to a list by collecting the data.
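A sketch of that pattern, again with the hypothetical df from above:

    # dropDuplicates() keeps one row per distinct "dept" but returns all
    # columns, so select() narrows the result to the single column.
    unique_depts_df = df.dropDuplicates(["dept"]).select("dept")

    # collect() brings the rows back to the driver; pull out the field
    # to get a plain Python list of the distinct values.
    unique_depts = [row["dept"] for row in unique_depts_df.collect()]
    print(unique_depts)  # e.g. ['Sales', 'Finance'] (order may vary)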
You can also get unique values in a column (or multiple columns) from a pandas DataFrame. Series.unique() is used to get the unique values of a single column, while DataFrame.drop_duplicates() gives the unique row combinations across multiple columns.
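A small pandas sketch of the same idea (the frame below is made up to mirror the Spark example):

    import pandas as pd

    pdf = pd.DataFrame({
        "name": ["James", "Anna", "James", "Maria"],
        "dept": ["Sales", "Sales", "Sales", "Finance"],
    })

    # Unique values of a single column via Series.unique().
    print(pdf["dept"].unique())  # ['Sales' 'Finance']

    # Unique row combinations across multiple columns via drop_duplicates().
    print(pdf[["name", "dept"]].drop_duplicates())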
Well, to obtain all the different values in a DataFrame you can use distinct. As you can see in the documentation, that method returns another DataFrame. After that you can create a UDF in order to transform each record.
For example:
    import org.apache.spark.sql.functions.{col, udf}
    import sqlContext.implicits._

    val df = sc.parallelize(Array((1, 2), (3, 4), (1, 6))).toDF("age", "salary")

    // Obtain all the different values. If you call show() you should see only {1, 3}.
    val distinctValuesDF = df.select(df("age")).distinct

    // Define your UDF. Here it is a simple function, but they can get complicated.
    // The parameter needs an explicit type so the UDF can be resolved.
    val myTransformationUDF = udf((value: Int) => value / 10)

    // Run that transformation "over" your DataFrame.
    val afterTransformationDF = distinctValuesDF.select(myTransformationUDF(col("age")))
In PySpark, try this:

    df.select('col_name').distinct().show()