First, I am very new to SPARK
I have millions of records in my Dataset and i wanted to groupby with name column and finding names which having maximum age. I am getting correct results but I need all columns in my resultset.
Dataset<Row> resultset = studentDataSet.select("*").groupBy("name").max("age");
resultset.show(1000,false);
I am getting only name and max(age) in my resultset dataset.
1 Answer. Suppose you have a df that includes columns “name” and “age”, and on these two columns you want to perform groupBY. Now, in order to get other columns also after doing a groupBy you can use join function. Now, data_joined will have all columns including the count values.
You can get the all columns of a Spark DataFrame by using df. columns , it returns an array of column names as Array[Stirng] .
Spark pivot() function is used to pivot/rotate the data from one DataFrame/Dataset column into multiple columns (transform row to column) and unpivot is used to transform it back (transform columns to rows).
For your solution you have to try different approach. You was almost there for solution but let me help you understand.
Dataset<Row> resultset = studentDataSet.groupBy("name").max("age");
now what you can do is you can join the resultset
with studentDataSet
Dataset<Row> joinedDS = studentDataset.join(resultset, "name");
The problem with groupBy
this that after applying groupBy you get RelationalGroupedDataset
so it depends on what next operation you perform like sum, min, mean, max
etc then the result of these operation joined with groupBy
As in you case name
column is joined with the max
of age
so it will return only two columns but if use apply groupBy
on age
and then apply max
on 'age' column you will get two column one is age
and second is max(age)
.
Note :- code is not tested please make changes if needed Hope this clears you query
The accepted answer isn't ideal because it requires a join. Joining big DataFrames can cause a big shuffle that'll execute slowly.
Let's create a sample data set and test the code:
val df = Seq(
("bob", 20, "blah"),
("bob", 40, "blah"),
("karen", 21, "hi"),
("monica", 43, "candy"),
("monica", 99, "water")
).toDF("name", "age", "another_column")
This code should run faster with large DataFrames.
df
.groupBy("name")
.agg(
max("name").as("name1_dup"),
max("another_column").as("another_column"),
max("age").as("age")
).drop(
"name1_dup"
).show()
+------+--------------+---+
| name|another_column|age|
+------+--------------+---+
|monica| water| 99|
| karen| hi| 21|
| bob| blah| 40|
+------+--------------+---+
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With