Using Spark 1.6.1, I need to fetch the distinct values of a column and then perform some specific transformation on top of each of them. The column contains more than 50 million records and can grow larger.
I understand that doing a distinct.collect() will bring the results back to the driver program. Currently I am performing this task as below; is there a better approach?
    import sqlContext.implicits._
    import org.apache.spark.storage.StorageLevel

    preProcessedData.persist(StorageLevel.MEMORY_AND_DISK_2)

    preProcessedData.select(ApplicationId).distinct.collect().foreach(x => {
      val applicationId = x.getAs[String](ApplicationId)
      val selectedApplicationData = preProcessedData.filter($"$ApplicationId" === applicationId)
      // DO SOME TASK PER applicationId
    })

    preProcessedData.unpersist()
In PySpark, there are two ways to get the count of distinct values. We can use the distinct() and count() functions of DataFrame to get the distinct count of a PySpark DataFrame. Another way is to use the SQL countDistinct() function, which gives the distinct value count across all the selected columns.
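A minimal sketch of both approaches, assuming a modern PySpark setup (note that the SparkSession API postdates the Spark 1.6.1 mentioned in the question, and the example data and column names below are made up):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import countDistinct

    spark = SparkSession.builder.appName("distinct-count").getOrCreate()

    # Hypothetical example data with one duplicate row.
    df = spark.createDataFrame(
        [("James", "Sales"), ("Anna", "Sales"), ("James", "Sales"), ("Maria", "Finance")],
        ["name", "dept"],
    )

    # Way 1: distinct() drops duplicate rows, count() counts what is left.
    print(df.distinct().count())                     # 3

    # Way 2: countDistinct() counts distinct values of the selected columns.
    df.select(countDistinct("name", "dept")).show()  # also 3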
The distinct() method returns a new DataFrame containing the distinct rows of this DataFrame. If you need to consider only a subset of the columns when dropping duplicates, you first have to make a column selection before calling distinct(), as shown below.
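For instance, continuing with the hypothetical df from the sketch above:

    # Distinct over a subset of columns: select the columns first,
    # then call distinct() on the narrowed DataFrame.
    df.select("dept").distinct().show()
    # +-------+
    # |   dept|
    # +-------+
    # |  Sales|
    # |Finance|
    # +-------+   (row order may vary)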
To select the unique values of a specific single column, use dropDuplicates(); since this function returns all columns, use the select() method to get the single column. Once you have the distinct values from the column, you can also convert them to a list by collecting the data.
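A sketch of that pattern, again with the hypothetical df from above:

    # dropDuplicates() keeps one row per distinct "dept" but returns all
    # columns, so select() narrows the result to the single column.
    unique_depts_df = df.dropDuplicates(["dept"]).select("dept")

    # collect() brings the rows back to the driver; pull out the field
    # to get a plain Python list of the distinct values.
    unique_depts = [row["dept"] for row in unique_depts_df.collect()]
    print(unique_depts)  # e.g. ['Sales', 'Finance'] (order may vary)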
You can also get unique values in a column (or multiple columns) from a pandas DataFrame. Series.unique() is used to get the unique values of a single column, while DataFrame.drop_duplicates() gives the unique row combinations across multiple columns.
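A small pandas sketch of the same idea (the frame below is made up to mirror the Spark example):

    import pandas as pd

    pdf = pd.DataFrame({
        "name": ["James", "Anna", "James", "Maria"],
        "dept": ["Sales", "Sales", "Sales", "Finance"],
    })

    # Unique values of a single column via Series.unique().
    print(pdf["dept"].unique())  # ['Sales' 'Finance']

    # Unique row combinations across multiple columns via drop_duplicates().
    print(pdf[["name", "dept"]].drop_duplicates())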
Well, to obtain all the different values in a DataFrame you can use distinct. As you can see in the documentation, that method returns another DataFrame. After that you can create a UDF in order to transform each record.
For example:
    import org.apache.spark.sql.functions.{col, udf}
    import sqlContext.implicits._

    val df = sc.parallelize(Array((1, 2), (3, 4), (1, 6))).toDF("age", "salary")

    // Obtain all the different values. If you call show() you should see only {1, 3}.
    val distinctValuesDF = df.select(df("age")).distinct

    // Define your UDF. Here it is a simple function, but they can get complicated.
    // The parameter needs an explicit type so the UDF can be resolved.
    val myTransformationUDF = udf((value: Int) => value / 10)

    // Run that transformation "over" your DataFrame.
    val afterTransformationDF = distinctValuesDF.select(myTransformationUDF(col("age")))
In PySpark, try this:

    df.select('col_name').distinct().show()