I have a pyspark dataframe similar to the following:
from pyspark.sql import Row

df = sql_context.createDataFrame([
    Row(a=3, b=[4, 5, 6], c=[10, 11, 12], d='bar', e='utf friendly'),
    Row(a=2, b=[1, 2, 3], c=[7, 8, 9], d='foo', e=u'ab\u0000the')
])
One of the values for column e contains the UTF null character \u0000. If I try to load this dataframe into a postgresql database, I get the following error:

ERROR: invalid byte sequence for encoding "UTF8": 0x00
which makes sense. How can I efficiently remove the null character from the pyspark dataframe before loading the data into postgres?
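(The exact load step isn't important here; a typical version would be a Spark JDBC write, roughly as sketched below, where the URL, table name, and credentials are placeholders.)

df.write.jdbc(
    url='jdbc:postgresql://localhost:5432/mydb',  # placeholder connection URL
    table='my_table',                             # placeholder target table
    mode='append',
    properties={'user': 'user', 'password': 'password', 'driver': 'org.postgresql.Driver'}
)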
I have tried using some of the pyspark.sql.functions to clean the data first, without success. encode, decode, and regexp_replace did not work:
from pyspark.sql.functions import col, encode, decode, regexp_replace

df.select(regexp_replace(col('e'), u'\u0000', ''))
df.select(encode(col('e'), 'UTF-8'))
df.select(decode(col('e'), 'UTF-8'))
Ideally, I would like to clean the entire dataframe without specifying exactly which columns or what the violating character is, since I don't necessarily know this information ahead of time.
I am using a postgres 9.4.9 database with UTF8 encoding.
Ah wait - I think I have it. If I do something like this, it seems to work:
null = u'\u0000'
new_df = df.withColumn('e', regexp_replace(df['e'], null, ''))
And then, mapping over all of the string columns:

string_columns = ['d', 'e']
new_df = df.select(*(
    regexp_replace(col(c), null, '').alias(c) if c in string_columns else c
    for c in df.columns
))
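To avoid hard-coding the string columns (per the goal above of not specifying them ahead of time), the same idea can be driven off the dataframe's schema via df.dtypes. A minimal sketch; the strip_nulls helper name and the default pattern argument are mine, not part of the original code:

from pyspark.sql.functions import col, regexp_replace

def strip_nulls(df, pattern=u'\u0000'):
    # Pick out the string-typed columns from the dataframe's schema
    string_columns = [name for name, dtype in df.dtypes if dtype == 'string']
    # Replace the pattern in every string column; pass other columns through untouched
    return df.select(*(
        regexp_replace(col(c), pattern, '').alias(c) if c in string_columns else c
        for c in df.columns
    ))

new_df = strip_nulls(df)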