I'm trying to figure out the best way to get the largest value in a Spark dataframe column.
Consider the following example:
df = spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"]) df.show()
Which creates:
+---+---+ | A| B| +---+---+ |1.0|4.0| |2.0|5.0| |3.0|6.0| +---+---+
My goal is to find the largest value in column A (by inspection, this is 3.0). Using PySpark, here are four approaches I can think of:
# Method 1: Use describe() float(df.describe("A").filter("summary = 'max'").select("A").first().asDict()['A']) # Method 2: Use SQL df.registerTempTable("df_table") spark.sql("SELECT MAX(A) as maxval FROM df_table").first().asDict()['maxval'] # Method 3: Use groupby() df.groupby().max('A').first().asDict()['max(A)'] # Method 4: Convert to RDD df.select("A").rdd.max()[0]
Each of the above gives the right answer, but in the absence of a Spark profiling tool I can't tell which is best.
Any ideas from either intuition or empiricism on which of the above methods is most efficient in terms of Spark runtime or resource usage, or whether there is a more direct method than the ones above?
Using the Spark filter(), just select row == 1, which returns the maximum salary of each group. Finally, if a row column is not needed, just drop it.
By default Spark with Scala, Java, or with Python (PySpark), fetches only 20 rows from DataFrame show() but not all rows and the column value is truncated to 20 characters, In order to fetch/display more than 20 rows and column full value from Spark/PySpark DataFrame, you need to pass arguments to the show() method.
In Spark/PySpark, you can use show() action to get the top/first N (5,10,100 ..) rows of the DataFrame and display them on a console or a log, there are also several Spark Actions like take() , tail() , collect() , head() , first() that return top and last n rows as a list of Rows (Array[Row] for Scala).
>df1.show() +-----+--------------------+--------+----------+-----------+ |floor| timestamp| uid| x| y| +-----+--------------------+--------+----------+-----------+ | 1|2014-07-19T16:00:...|600dfbe2| 103.79211|71.50419418| | 1|2014-07-19T16:00:...|5e7b40e1| 110.33613|100.6828393| | 1|2014-07-19T16:00:...|285d22e4|110.066315|86.48873585| | 1|2014-07-19T16:00:...|74d917a1| 103.78499|71.45633073| >row1 = df1.agg({"x": "max"}).collect()[0] >print row1 Row(max(x)=110.33613) >print row1["max(x)"] 110.33613
The answer is almost the same as method3. but seems the "asDict()" in method3 can be removed
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With