I have a DataFrame with column a. I would like to create two additional columns (b and c) based on column a. I could solve this problem doing the same thing twice:
df = df.withColumn('b', when(df.a == 'something', 'x'))\
       .withColumn('c', when(df.a == 'something', 'y'))
I would like to avoid doing the same thing twice, as the condition on which b and c are set is the same, and there are also a lot of cases for column a. Is there a smarter solution to this problem? Could withColumn accept multiple columns, perhaps?
A struct is best suited in such a case. See the example below.
from pyspark.sql import functions as func

# Build a struct with fields b and c in a single when(), then expand it.
spark.sparkContext.parallelize([('something',), ('foobar',)]).toDF(['a']). \
    withColumn('b_c_struct',
               func.when(func.col('a') == 'something',
                         func.struct(func.lit('x').alias('b'), func.lit('y').alias('c'))
                         )
               ). \
    select('*', 'b_c_struct.*'). \
    show()
# +---------+----------+----+----+
# | a|b_c_struct| b| c|
# +---------+----------+----+----+
# |something| {x, y}| x| y|
# | foobar| null|null|null|
# +---------+----------+----+----+
Just use drop('b_c_struct') after the select to remove the struct column and keep the individual fields.
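For example, a minimal sketch of that final step, assuming the same DataFrame and imports as above:

# Expand the struct fields, then drop the intermediate struct column.
df_result = spark.sparkContext.parallelize([('something',), ('foobar',)]).toDF(['a']) \
    .withColumn('b_c_struct',
                func.when(func.col('a') == 'something',
                          func.struct(func.lit('x').alias('b'), func.lit('y').alias('c')))) \
    .select('*', 'b_c_struct.*') \
    .drop('b_c_struct')

df_result.show()  # only columns a, b, c remain; b_c_struct is dropped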
With withColumn, you can only create or modify one column at a time. You can achieve this by mapping over the RDD with a user-defined function, although it's not recommended:
temp = spark.createDataFrame(
    [(1, )],
    schema=['col']
)
temp.show(10, False)
+---+
|col|
+---+
|1 |
+---+
# You can create your own logic in your UDF
def user_defined_function(val, col_name):
    if col_name == 'col2':
        val += 1
    elif col_name == 'col3':
        val += 2
    else:
        pass
    return val
temp = temp.rdd.map(
    lambda row: (row[0],
                 user_defined_function(row[0], 'col2'),
                 user_defined_function(row[0], 'col3'))
).toDF(['col', 'col2', 'col3'])
temp.show(3, False)
+---+----+----+
|col|col2|col3|
+---+----+----+
|1 |2 |3 |
+---+----+----+
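For completeness, the same result can usually be expressed with native column expressions, which is why dropping to the RDD API is not recommended here. A minimal sketch, assuming pyspark.sql.functions is imported as func and the same temp DataFrame as above:

# Equivalent logic using built-in column expressions instead of an RDD map/UDF.
temp2 = spark.createDataFrame([(1, )], schema=['col']) \
    .withColumn('col2', func.col('col') + 1) \
    .withColumn('col3', func.col('col') + 2)

temp2.show(3, False)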