How to get all columns after groupBy on Dataset<Row> in Spark SQL 2.1.0

Tags:

apache-spark

apache-spark-sql

First, I am very new to Spark.

I have millions of records in my Dataset, and I want to group by the name column and find, for each name, the maximum age. The aggregated values are correct, but I need all columns in my result set.

Dataset<Row> resultset = studentDataSet.select("*").groupBy("name").max("age");
resultset.show(1000,false);

I am getting only name and max(age) in my resultset dataset.

asked Jan 05 '17 07:01

Anup Sapkale

2 Answers

You need a slightly different approach. You were almost there, so let me explain.

Dataset<Row> resultset = studentDataSet.groupBy("name").max("age");

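Now join resultset back to studentDataSet on name:

Dataset<Row> joinedDS = studentDataSet.join(resultset, "name");

joinedDS contains every original row plus a max(age) column. To keep only the rows that hold the maximum, filter on age being equal to max(age).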

The reason is that groupBy returns a RelationalGroupedDataset, and the columns of the result depend on the aggregation you apply next (sum, min, mean, max, and so on): the output contains the grouping columns plus the aggregated values, and nothing else.

In your case the name column is paired with the max of age, so only two columns come back. Likewise, if you grouped by age and then applied max to age, you would get two columns: age and max(age).

Note: this code is not tested, so adjust it as needed. I hope this answers your question.
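
To make the steps concrete, here is a minimal sketch in plain Java collections rather than Spark, so it runs without any dependency. It models the aggregate, the join back on name, and a filter that keeps only rows whose age equals the group maximum. The Student class, its grade field, and the sample values are hypothetical stand-ins for the rows of studentDataSet; the sketch illustrates what the pipeline computes, not how Spark executes it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model (no Spark) of: groupBy("name").max("age"),
// then a join on name, then a filter on age == max(age).
class MaxAgeJoinSketch {

    // Hypothetical row type standing in for one record of studentDataSet.
    static final class Student {
        final String name;
        final int age;
        final String grade; // stands in for "all the other columns"

        Student(String name, int age, String grade) {
            this.name = name;
            this.age = age;
            this.grade = grade;
        }

        @Override
        public String toString() {
            return name + "," + age + "," + grade;
        }
    }

    // The aggregate: only the grouping key and the aggregated value survive.
    static Map<String, Integer> maxAgeByName(List<Student> rows) {
        Map<String, Integer> maxAge = new TreeMap<>();
        for (Student s : rows) {
            maxAge.merge(s.name, s.age, Integer::max);
        }
        return maxAge;
    }

    // The join and filter: look up each row's group maximum by name and
    // keep the row only if its own age equals that maximum.
    static List<Student> rowsWithMaxAge(List<Student> rows) {
        Map<String, Integer> maxAge = maxAgeByName(rows);
        List<Student> result = new ArrayList<>();
        for (Student s : rows) {
            if (s.age == maxAge.get(s.name)) {
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Student> rows = Arrays.asList(
            new Student("bob", 20, "A"),
            new Student("bob", 40, "B"),
            new Student("karen", 21, "C"));

        System.out.println(maxAgeByName(rows));   // {bob=40, karen=21}
        System.out.println(rowsWithMaxAge(rows)); // [bob,40,B, karen,21,C]
    }
}
```

If several rows in a group share the maximum age, all of them are kept, which is also what a join followed by an equality filter would return.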

answered Sep 20 '22 03:09

Akash Sethi

The accepted answer isn't ideal because it requires a join. Joining big DataFrames can cause a big shuffle that'll execute slowly.

Let's create a sample data set and test the code:

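// toDF needs the session's implicits; spark-shell imports them automatically.
// Assumes a SparkSession named spark.
import spark.implicits._

val df = Seq(
  ("bob", 20, "blah"),
  ("bob", 40, "blah"),
  ("karen", 21, "hi"),
  ("monica", 43, "candy"),
  ("monica", 99, "water")
).toDF("name", "age", "another_column")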

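This code avoids the join, so it should run faster on large DataFrames. Note that each max is computed independently per column, so a result row can combine values from different input rows. It matches a real row only when the other columns' maxima happen to fall in the max-age row, as they do in this sample.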

import org.apache.spark.sql.functions.max

// groupBy already keeps the name column, so only the other columns need aggregating.
df
  .groupBy("name")
  .agg(
    max("another_column").as("another_column"),
    max("age").as("age")
  )
  .show()

+------+--------------+---+
|  name|another_column|age|
+------+--------------+---+
|monica|         water| 99|
| karen|            hi| 21|
|   bob|          blah| 40|
+------+--------------+---+
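
One caveat is worth checking against your own data: aggregating every column with max reduces each column on its own, so the output row is not guaranteed to exist in the input. The sketch below models this in plain Java (not Spark) for a single group. The Row class and the swapped monica values are hypothetical, chosen so that the two results differ.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java model (no Spark) contrasting per-column max with "the row that has the max age".
class IndependentMaxSketch {

    // Hypothetical row type mirroring the sample columns name, age, another_column.
    static final class Row {
        final String name;
        final int age;
        final String anotherColumn;

        Row(String name, int age, String anotherColumn) {
            this.name = name;
            this.age = age;
            this.anotherColumn = anotherColumn;
        }
    }

    // Models agg(max("another_column"), max("age")): each column is reduced separately.
    // The group must be non-empty.
    static String independentMax(List<Row> group) {
        int maxAge = group.stream().mapToInt(r -> r.age).max().getAsInt();
        String maxOther = group.stream()
            .map(r -> r.anotherColumn)
            .max(Comparator.naturalOrder())
            .get();
        return group.get(0).name + "," + maxOther + "," + maxAge;
    }

    // Picks one existing row: the one with the highest age. The group must be non-empty.
    static String rowWithMaxAge(List<Row> group) {
        Row best = group.stream()
            .max(Comparator.comparingInt((Row r) -> r.age))
            .get();
        return best.name + "," + best.anotherColumn + "," + best.age;
    }

    public static void main(String[] args) {
        // Same names as the sample, but "water" now sits in the age-43 row.
        List<Row> monica = Arrays.asList(
            new Row("monica", 43, "water"),
            new Row("monica", 99, "candy"));

        System.out.println(independentMax(monica)); // monica,water,99 (no such input row)
        System.out.println(rowWithMaxAge(monica));  // monica,candy,99
    }
}
```

In Spark itself, one common way to select a whole row per group without a join is a window function such as row_number over a per-name window ordered by age descending, keeping the rows numbered 1.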

answered Sep 18 '22 03:09

Powers
