Loading a DataFrame with foreign characters (åäö) into Spark using spark.read.csv with encoding='utf-8', then trying to do a simple show():
>>> df.show()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 287, in show
print(self._jdf.showString(n, truncate))
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 579: ordinal not in range(128)
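For reference, a minimal reproduction of the setup described above (the file path and the header option are assumptions, not from the original question):

# Minimal reproduction sketch; "people.csv" is an assumed placeholder file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("people.csv", header=True, encoding="utf-8")

# show() renders the rows on the JVM side and prints the result from Python
# (see the showString call in the traceback), so the failure happens at
# print time, not while reading the CSV.
df.show()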
I figure this is probably related to Python itself, but I cannot understand how any of the tricks mentioned here, for example, can be applied in the context of PySpark and the show() function.
Encoding strings: in order to get rid of the error, you should explicitly specify the desired encoding, which can be done with the encode() method. In most cases, utf-8 encoding will do the trick.
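As a rough illustration of that general advice (plain Python 2-style strings, not the PySpark show() call itself; the sample string is made up):

# Hypothetical example: explicitly encode a unicode string before writing it
# to a byte-oriented stream (Python 2 semantics assumed).
s = u'åäö'
print(s.encode('utf-8'))

Note that this does not directly help with show(), since show() performs the print internally; the suggestions below change the interpreter's output encoding instead.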
https://issues.apache.org/jira/browse/SPARK-11772 talks about this issue and gives a solution: run
export PYTHONIOENCODING=utf8
before starting pyspark. I wonder why this works, because sys.getdefaultencoding() returned utf-8 for me even without it.
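For what it's worth, the two settings are not the same thing: sys.getdefaultencoding() governs implicit string conversions, while show() ultimately writes through sys.stdout, and sys.stdout.encoding is what PYTHONIOENCODING overrides. A quick way to inspect both:

import sys
# Codec used for implicit string conversions (always 'utf-8' on Python 3).
print(sys.getdefaultencoding())
# Codec used when printing to the console; this is the one PYTHONIOENCODING
# controls, and it can fall back to 'ascii' when no locale is set or when
# output is piped.
print(sys.stdout.encoding)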
How to set sys.stdout encoding in Python 3? also talks about this and gives the following solution for Python 3:
import sys
sys.stdout = open(sys.stdout.fileno(), mode='w', encoding='utf8', buffering=1)
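On Python 3.7 and later, the same effect can be achieved without replacing the stream object (this uses the standard-library reconfigure() method and is not part of the linked answer):

import sys
# reconfigure() changes the encoding of the existing sys.stdout text wrapper
# in place (available since Python 3.7).
sys.stdout.reconfigure(encoding='utf-8')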
On Python 2, another workaround is to reset the default encoding at the top of the script:
import sys
# reload() restores sys.setdefaultencoding(), which site.py removes at startup
# (Python 2 only; sys.setdefaultencoding() does not exist on Python 3).
reload(sys)
sys.setdefaultencoding('utf-8')
This works for me: I set the encoding up front and it stays in effect throughout the script.