from functools import reduce
from pyspark.sql.functions import concat
def column_concat(a, b):
    return concat(a, b)

searches_df = searches_df.withColumn('unique_id', reduce(column_concat, (searches_df[col] for col in search_parameters)))
This works except when a column contains a null value; then the whole concatenated string is null. I'd like a placeholder or some other character in the concatenated string instead.
Given the following dataframe:
df = sqlContext.createDataFrame([("foo", "bar"), ("baz", None)],
                                ('a', 'b'))
df.show()
+---+----+
|  a|   b|
+---+----+
|foo| bar|
|baz|null|
+---+----+
One solution is to use a udf to filter/replace the null(s), e.g.:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
concat_udf = F.udf(lambda cols: "".join([x if x is not None else "*" for x in cols]), StringType())
df.withColumn("unique_id", concat_udf(F.array("a", "b"))).show()
which yields:
+---+----+---------+
|  a|   b|unique_id|
+---+----+---------+
|foo| bar|   foobar|
|baz|null|     baz*|
+---+----+---------+
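Since the lambda passed to F.udf is ordinary Python, its null handling can be checked without a Spark session. A minimal sketch, where join_with_placeholder is a hypothetical helper name (not part of PySpark) and "*" is the same placeholder as above:

```python
def join_with_placeholder(values, placeholder="*"):
    """Concatenate values, substituting `placeholder` for each None.

    Mirrors the body of the lambda wrapped by the udf above; the
    function name and the `placeholder` parameter are illustrative.
    """
    return "".join(placeholder if v is None else v for v in values)


# The two rows of the example dataframe, as plain Python lists.
print(join_with_placeholder(["foo", "bar"]))
print(join_with_placeholder(["baz", None]))
```

The same function could be wrapped as F.udf(join_with_placeholder, StringType()) in place of the lambda, assuming every column passed in is a string column.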
Alternatively:
import pyspark.sql.functions as F
def myConcat(*cols):
    return F.concat(*[F.coalesce(c, F.lit("*")) for c in cols])
df.withColumn("unique_id", myConcat("a", "b")).show()
which also yields:
+---+----+---------+
|  a|   b|unique_id|
+---+----+---------+
|foo| bar|   foobar|
|baz|null|     baz*|
+---+----+---------+