I used the following code to replace None values in a DataFrame row with empty strings:
def replaceNone(row):
    row_len = len(row)
    for i in range(0, row_len):
        if row[i] is None:
            row[i] = ""
    return row
I call it in my PySpark code like this:
data_out = df.rdd.map(lambda row: replaceNone(row)).map(
    lambda row: "\t".join([x.encode("utf-8") if isinstance(x, basestring) else str(x).encode("utf-8") for x in row])
)
Then I got the following errors:
File "<ipython-input-10-8e5d8b2c3a7f>", line 1, in <lambda>
File "<ipython-input-2-d1153a537442>", line 6, in replaceNone
TypeError: 'Row' object does not support item assignment
Does anyone have any idea what causes this error? How do I replace a None value in a row with an empty string? Thanks!
Row is a subclass of tuple, and tuples in Python are immutable, so they don't support item assignment. If you want to replace an item stored in a tuple, you have to rebuild it from scratch:
## replace "" with placeholder of your choice
tuple(x if x is not None else "" for x in row)
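As a rough sketch of how the rebuild approach could replace the original replaceNone, the snippet below uses a plain tuple as a stand-in for a Row (which behaves the same way for iteration). The names replace_none and to_tsv_line are made up for illustration, and the code assumes Python 3, where str is already Unicode, so the basestring / encode handling from the question is dropped.

```python
def replace_none(row, placeholder=""):
    """Return a NEW tuple with every None in `row` swapped for `placeholder`.

    Nothing is assigned in place, so this works on immutable tuples
    (and tuple subclasses such as pyspark.sql.Row).
    """
    return tuple(placeholder if x is None else x for x in row)


def to_tsv_line(row):
    """Join the cleaned values of `row` into one tab-separated string."""
    return "\t".join(str(x) for x in replace_none(row))


# A plain tuple standing in for a Row (hypothetical sample data).
sample = ("alice", None, 42)
cleaned = replace_none(sample)
line = to_tsv_line(sample)
```

With the df from the question, something like df.rdd.map(to_tsv_line) should then produce the tab-separated lines without touching the Row objects in place.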
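If you simply want to concatenate the columns of a flat schema, you can use concat_ws. Be aware that it skips null values rather than substituting an empty string, so a row containing a null yields one field fewer: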
from pyspark.sql.functions import concat_ws
df.select(concat_ws("\t", *df.columns)).rdd.flatMap(lambda x: x)
To prepare the output, it makes more sense to use spark-csv and specify nullValue, delimiter, and quoteMode.