I have a dataframe with multiple columns:
| a | b | c | d |
-----------------
| 0 | 4 | 3 | 6 |
| 1 | 7 | 0 | 4 |
| 2 | 4 | 3 | 6 |
| 3 | 9 | 5 | 9 |
I would now like to combine [b, c, d] into a single column. However, I do not know how big the list of columns will be; otherwise I could just use a UDF3 to combine the three.
So the desired outcome is:
| a | combined |
-----------------
| 0 | [4, 3, 6] |
| 1 | [7, 0, 4] |
| 2 | [4, 3, 6] |
| 3 | [9, 5, 9] |
How can I achieve this?
Non-working pseudo-code:
public static Dataset<Row> mergeColumns(Dataset<Row> ds, List<String> columns) {
return ds.withColumn("combined", collectAsList(columns))
}
The worst-case workaround would be a switch statement on the number of input columns, writing a separate UDF for each arity (e.g. 2-20 input columns) and throwing an error if more columns are supplied, roughly as sketched below.
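For illustration, here is a rough sketch of what a single branch of that fixed-arity workaround might look like, assuming Spark's Java UDF3 interface and integer columns; the names combine3 and mergeThreeColumns are placeholders, and every other column count would need its own UDFn and its own branch:

import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF3;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;

// One branch of the switch: a UDF that handles exactly three input columns.
public static Dataset<Row> mergeThreeColumns(SparkSession spark, Dataset<Row> ds) {
    spark.udf().register("combine3",
            (UDF3<Integer, Integer, Integer, List<Integer>>) (b, c, d) -> Arrays.asList(b, c, d),
            DataTypes.createArrayType(DataTypes.IntegerType));
    return ds.withColumn("combined",
            functions.callUDF("combine3", functions.col("b"), functions.col("c"), functions.col("d")));
}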
As Ramesh mentioned in his comment, you can use the array function. You only need to convert your list of column names to a Column array.
public static Dataset<Row> mergeColumns(Dataset<Row> ds, List<String> columns) {
    // Map each column name to a Column, collect them into a Column[],
    // and wrap them in a single array column named "combined".
    return ds.withColumn("combined",
            functions.array(columns.stream().map(functions::col).toArray(Column[]::new)));
}
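A quick usage sketch, assuming a SparkSession named spark; the data mirrors the example table above, and the printed output shown in the trailing comment is only approximate:

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Rebuild the example DataFrame and merge b, c and d into one array column.
StructType schema = new StructType()
        .add("a", DataTypes.IntegerType)
        .add("b", DataTypes.IntegerType)
        .add("c", DataTypes.IntegerType)
        .add("d", DataTypes.IntegerType);

Dataset<Row> ds = spark.createDataFrame(
        Arrays.asList(
                RowFactory.create(0, 4, 3, 6),
                RowFactory.create(1, 7, 0, 4),
                RowFactory.create(2, 4, 3, 6),
                RowFactory.create(3, 9, 5, 9)),
        schema);

mergeColumns(ds, Arrays.asList("b", "c", "d"))
        .select("a", "combined")
        .show();
// | a | combined  |
// | 0 | [4, 3, 6] |
// | 1 | [7, 0, 4] |
// ...

Because functions.array accepts any number of Column arguments and builds a single ArrayType column from them, no per-arity UDFs are needed.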