 

Best way to get the max value in a Spark dataframe column

I'm trying to figure out the best way to get the largest value in a Spark dataframe column.

Consider the following example:

df = spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"])
df.show()

Which creates:

+---+---+
|  A|  B|
+---+---+
|1.0|4.0|
|2.0|5.0|
|3.0|6.0|
+---+---+

My goal is to find the largest value in column A (by inspection, this is 3.0). Using PySpark, here are four approaches I can think of:

# Method 1: Use describe()
float(df.describe("A").filter("summary = 'max'").select("A").first().asDict()['A'])

# Method 2: Use SQL
df.registerTempTable("df_table")
spark.sql("SELECT MAX(A) as maxval FROM df_table").first().asDict()['maxval']

# Method 3: Use groupby()
df.groupby().max('A').first().asDict()['max(A)']

# Method 4: Convert to RDD
df.select("A").rdd.max()[0]

Each of the above gives the right answer, but in the absence of a Spark profiling tool I can't tell which is best.
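One crude substitute for a profiler is simply wall-clock timing each call on a larger dataframe. A minimal sketch of that idea (assuming a running SparkSession named spark; absolute numbers will depend heavily on cluster setup and caching):

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A larger synthetic dataframe so the differences are measurable
df_big = spark.range(10000000).selectExpr("CAST(id AS DOUBLE) AS A")
df_big.cache().count()  # materialize the cache so every method starts from the same state

def timed(label, fn):
    start = time.perf_counter()
    result = fn()
    print("{}: {} ({:.2f}s)".format(label, result, time.perf_counter() - start))

timed("describe", lambda: float(df_big.describe("A").filter("summary = 'max'").select("A").first()[0]))
timed("groupby ", lambda: df_big.groupby().max("A").first()[0])
timed("rdd     ", lambda: df_big.select("A").rdd.max()[0])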

Any ideas from either intuition or empiricism on which of the above methods is most efficient in terms of Spark runtime or resource usage, or whether there is a more direct method than the ones above?

asked Oct 19 '15 by xenocyon


1 Answer

> df1.show()
+-----+--------------------+--------+----------+-----------+
|floor|           timestamp|     uid|         x|          y|
+-----+--------------------+--------+----------+-----------+
|    1|2014-07-19T16:00:...|600dfbe2| 103.79211|71.50419418|
|    1|2014-07-19T16:00:...|5e7b40e1| 110.33613|100.6828393|
|    1|2014-07-19T16:00:...|285d22e4|110.066315|86.48873585|
|    1|2014-07-19T16:00:...|74d917a1| 103.78499|71.45633073|
+-----+--------------------+--------+----------+-----------+

> row1 = df1.agg({"x": "max"}).collect()[0]
> print row1
Row(max(x)=110.33613)
> print row1["max(x)"]
110.33613

This is essentially the same as Method 3, but it seems the asDict() call in Method 3 can be removed.
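Applied to the dataframe from the question, that looks roughly like the sketch below (assuming the df defined in the question; the pyspark.sql.functions.max variant is shown only as another equivalent shortcut, not something used in this answer):

from pyspark.sql import functions as F

df = spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"])

# Method 3 without asDict(): index the Row directly
df.groupby().max("A").first()[0]   # 3.0

# The agg() form from this answer, applied to column A
df.agg({"A": "max"}).first()[0]    # 3.0

# Or with the built-in max from pyspark.sql.functions
df.agg(F.max("A")).first()[0]      # 3.0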

answered Sep 21 '22 by Burt