 

How to get all columns after groupBy on Dataset<Row> in Spark SQL 2.1.0

First, I am very new to Spark.

I have millions of records in my Dataset and I want to group by the name column and find the maximum age for each name. I am getting correct results, but I need all columns in my result set.

Dataset<Row> resultset = studentDataSet.select("*").groupBy("name").max("age");
resultset.show(1000,false);

I am getting only name and max(age) in my resultset Dataset.

asked Jan 05 '17 by Anup Sapkale

People also ask

How do you get all the columns after groupBy in PySpark?

Suppose you have a df that includes the columns “name” and “age”, and you want to perform a groupBy on these two columns. To get the other columns back after the groupBy, you can use a join. The joined DataFrame will then have all columns, including the count values.
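
The snippet above describes the PySpark groupBy-then-join pattern; a rough equivalent in the Java Dataset<Row> API from the question might look like the fragment below (a single grouping column and a count aggregation are used here purely for illustration, and the variable names are not from the original post):

// Sketch: recover all columns after a groupBy by joining the aggregate
// result back to the original Dataset on the grouping column.
Dataset<Row> counts = studentDataSet.groupBy("name").count();     // columns: name, count
Dataset<Row> dataJoined = studentDataSet.join(counts, "name");    // all original columns + count
dataJoined.show();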

How do I get columns from a data frame Spark?

You can get all the columns of a Spark DataFrame by using df.columns; it returns an array of column names as Array[String].
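
In the Java API used in the question, the same call returns a String[]; a tiny illustrative fragment (assuming the studentDataSet from the question):

// columns() returns the column names of the Dataset as a String[]
String[] cols = studentDataSet.columns();
for (String c : cols) {
    System.out.println(c);
}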

How do I convert columns to rows in Spark?

Spark's pivot() function is used to pivot/rotate data from one DataFrame/Dataset column into multiple columns (transforming rows into columns), and unpivot is used to transform it back (transforming columns into rows).
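
For illustration, a minimal pivot sketch in the Java API (the scoresDf, subject, and score names are made up for this example; unpivoting is not shown here):

import static org.apache.spark.sql.functions.sum;

// Pivot: one row per name, one column per distinct value of "subject",
// each cell holding the summed "score" for that (name, subject) pair.
Dataset<Row> pivoted = scoresDf
    .groupBy("name")
    .pivot("subject")
    .agg(sum("score"));
pivoted.show();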


2 Answers

For your solution you have to try a different approach. You were almost there, but let me help you understand.

Dataset<Row> resultset = studentDataSet.groupBy("name").max("age");

Now you can join resultset back with studentDataSet:

Dataset<Row> joinedDS = studentDataSet.join(resultset, "name");

The thing about groupBy is that after applying it you get a RelationalGroupedDataset, so the result depends on which aggregation you perform next (sum, min, mean, max, etc.); the result of that aggregation is then combined with the groupBy column(s).

In your case the name column is combined with the max of age, so it returns only those two columns. If you instead apply groupBy on age and then max on the 'age' column, you will again get two columns: one is age and the other is max(age).
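
Putting the two steps together, here is a minimal end-to-end sketch in Java (the SparkSession setup, schema, sample rows, and the optional filter are illustrative, not from the original answer):

import static org.apache.spark.sql.functions.col;

import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class MaxAgePerName {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("max-age-per-name")
            .master("local[*]")
            .getOrCreate();

        // Illustrative sample data with an extra column besides name and age.
        StructType schema = new StructType()
            .add("name", DataTypes.StringType)
            .add("age", DataTypes.IntegerType)
            .add("city", DataTypes.StringType);
        Dataset<Row> studentDataSet = spark.createDataFrame(Arrays.asList(
            RowFactory.create("bob", 20, "Pune"),
            RowFactory.create("bob", 40, "Mumbai"),
            RowFactory.create("karen", 21, "Delhi")
        ), schema);

        // Step 1: aggregate -> columns: name, max(age)
        Dataset<Row> resultset = studentDataSet.groupBy("name").max("age");

        // Step 2: join back on "name" to recover all original columns.
        Dataset<Row> joinedDS = studentDataSet.join(resultset, "name");

        // Optional: keep only the rows whose age equals the per-name maximum.
        Dataset<Row> maxRows = joinedDS.filter(col("age").equalTo(col("max(age)")));

        maxRows.show(false);
        spark.stop();
    }
}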

Note: the code is not tested, so please make changes if needed. Hope this clears up your query.

answered Sep 20 '22 by Akash Sethi


The accepted answer isn't ideal because it requires a join. Joining big DataFrames can cause a big shuffle that'll execute slowly.

Let's create a sample data set and test the code:

val df = Seq(
  ("bob", 20, "blah"),
  ("bob", 40, "blah"),
  ("karen", 21, "hi"),
  ("monica", 43, "candy"),
  ("monica", 99, "water")
).toDF("name", "age", "another_column")

This code should run faster with large DataFrames.

df
  .groupBy("name")
  .agg(
    max("name").as("name1_dup"), 
    max("another_column").as("another_column"),  
    max("age").as("age")
  ).drop(
    "name1_dup"
  ).show()

+------+--------------+---+
|  name|another_column|age|
+------+--------------+---+
|monica|         water| 99|
| karen|            hi| 21|
|   bob|          blah| 40|
+------+--------------+---+
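
Since the question uses the Java Dataset<Row> API, a rough Java equivalent of the same no-join trick might look like this (a sketch assuming the same df; it skips the duplicated name alias and is not from the original answer):

import static org.apache.spark.sql.functions.max;

// Aggregate every non-grouping column with max so no join stage is needed.
// Note that each column is aggregated independently of the others.
Dataset<Row> result = df
    .groupBy("name")
    .agg(
        max("another_column").as("another_column"),
        max("age").as("age"));
result.show();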

answered Sep 18 '22 by Powers