Spark Java - Collect multiple columns into array column

I have a dataframe with multiple columns:

| a | b | c | d |
-----------------
| 0 | 4 | 3 | 6 |
| 1 | 7 | 0 | 4 |
| 2 | 4 | 3 | 6 |
| 3 | 9 | 5 | 9 |

I would now like to combine [b, c, d] into a single column. However, I do not know in advance how many columns the list will contain; otherwise I could just use a UDF3 to combine the three.

So the desired outcome is:

| a | combined  |
-----------------
| 0 | [4, 3, 6] |
| 1 | [7, 0, 4] |
| 2 | [4, 3, 6] |
| 3 | [9, 5, 9] |

How can I achieve this?

Non-working pseudo-code:

public static Dataset<Row> mergeColumns(Dataset<Row> ds, List<String> columns) {
   return ds.withColumn("combined", collectAsList(columns))
}

The worst-case workaround would be a switch statement on the number of input columns, with a separate UDF written for each count (e.g. 2 to 20 input columns) and an error thrown if more input columns are supplied.

asked Jul 05 '18 06:07 by Carl Ambroselli


1 Answer

As Ramesh mentioned in his comment, you can use the array function (functions.array). It accepts any number of columns, so you only need to convert your list of column names to a Column array.

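// Imports: java.util.List; org.apache.spark.sql.Column, Dataset, Row, functions
public static Dataset<Row> mergeColumns(Dataset<Row> ds, List<String> columns) {
    // Map each column name to a Column, then pack them all into one array column.
    Column[] cols = columns.stream().map(functions::col).toArray(Column[]::new);
    return ds.withColumn("combined", functions.array(cols));
}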
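
This works for any number of columns because a Java varargs method also accepts a plain array, so no per-arity UDF is needed. The following is a minimal, Spark-free sketch of that mechanism only: the class VarargsSketch, its array method, and the "col(...)" strings are hypothetical stand-ins for functions.array and Column, not Spark's API.

```java
import java.util.Arrays;
import java.util.List;

public class VarargsSketch {
    // Hypothetical stand-in for a varargs API such as functions.array(Column...):
    // it accepts any number of arguments and packs them into one list.
    public static List<String> array(String... cols) {
        return Arrays.asList(cols);
    }

    // Mirrors mergeColumns: the number of columns is only known at runtime.
    public static List<String> combine(List<String> columns) {
        // Map each name to a stand-in "column" and collect into an array,
        // which Java passes directly as the varargs parameter.
        String[] cols = columns.stream()
                .map(name -> "col(" + name + ")")
                .toArray(String[]::new);
        return array(cols);
    }

    public static void main(String[] args) {
        System.out.println(combine(List.of("b", "c", "d")));
    }
}
```

The same call shape handles one column or twenty, which is what removes the need for the switch statement described in the question.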
answered Oct 13 '22 21:10 by Grisha Weintraub