Best way to get the max value in a Spark dataframe column

I'm trying to figure out the best way to get the largest value in a Spark dataframe column.

Consider the following example:

    df = spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"])
    df.show()

Which creates:

    +---+---+
    |  A|  B|
    +---+---+
    |1.0|4.0|
    |2.0|5.0|
    |3.0|6.0|
    +---+---+

My goal is to find the largest value in column A (by inspection, this is 3.0). Using PySpark, here are four approaches I can think of:

    # Method 1: Use describe()
    float(df.describe("A").filter("summary = 'max'").select("A").first().asDict()['A'])

    # Method 2: Use SQL
    df.registerTempTable("df_table")
    spark.sql("SELECT MAX(A) as maxval FROM df_table").first().asDict()['maxval']

    # Method 3: Use groupby()
    df.groupby().max('A').first().asDict()['max(A)']

    # Method 4: Convert to RDD
    df.select("A").rdd.max()[0]

Each of the above gives the right answer, but in the absence of a Spark profiling tool I can't tell which is best.

Any ideas from either intuition or empiricism on which of the above methods is most efficient in terms of Spark runtime or resource usage, or whether there is a more direct method than the ones above?

912

asked Oct 19 '15 22:10

xenocyon

1 Answer

    >df1.show()
    +-----+--------------------+--------+----------+-----------+
    |floor|           timestamp|     uid|         x|          y|
    +-----+--------------------+--------+----------+-----------+
    |    1|2014-07-19T16:00:...|600dfbe2| 103.79211|71.50419418|
    |    1|2014-07-19T16:00:...|5e7b40e1| 110.33613|100.6828393|
    |    1|2014-07-19T16:00:...|285d22e4|110.066315|86.48873585|
    |    1|2014-07-19T16:00:...|74d917a1| 103.78499|71.45633073|

    >row1 = df1.agg({"x": "max"}).collect()[0]
    >print(row1)
    Row(max(x)=110.33613)
    >print(row1["max(x)"])
    110.33613

This is almost the same as Method 3, but the asDict() call in Method 3 appears to be unnecessary: a Row can be indexed directly by column name, as shown above.
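As a minimal sketch of the same idea, the helper below wraps the dictionary form of agg() used in this answer and reads the result by position, so it does not depend on the generated column name (such as max(x)). The name column_max is hypothetical and not part of PySpark; the sketch assumes df is a PySpark DataFrame.

```python
def column_max(df, column):
    """Return the largest value in `column` of a PySpark DataFrame.

    `column_max` is a hypothetical helper name, not a PySpark API.
    It relies only on DataFrame.agg() with a {column: "max"} mapping,
    the same call used in the answer above.
    """
    # agg() without a preceding groupBy() yields a single-row DataFrame.
    row = df.agg({column: "max"}).first()
    # Index by position to avoid hard-coding the "max(<column>)" name.
    return row[0]

# Usage, assuming `df` is the example DataFrame from the question:
#   column_max(df, "A")
```

Reading the value by position rather than by name is a design choice here: it keeps the helper independent of how Spark labels the aggregated column.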

answered Sep 21 '22 18:09

Burt

Tags: python, apache-spark, apache-spark-sql, pyspark
