I am writing a Python Spark utility to read files and do some transformation. The files hold a large amount of data (up to 12 GB). I use sc.textFile to create an RDD, and the logic is to pass each line from the RDD to a map function, which in turn splits the line by "," and runs some data transformation (changing field values based on a mapping).
Sample line from the file:
0014164,02,031270,09,1,,0,0,0000000000,134314,Mobile,ce87862158eb0dff3023e16850f0417a-cs31,584e2cd63057b7ed,Privé,Gossip
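The read/transform logic looks roughly like this (a simplified sketch; input_path and the mapping details are placeholders):

def map_fields(line):
    v = line.split(",")
    # field values in v are changed here based on a mapping (details omitted)
    return ",".join(v)

rdd = sc.textFile(input_path)      # one element per line
transformed = rdd.map(map_fields)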
Because of values like "Privé" I get a UnicodeDecodeError. I tried the following to parse this value:
if isinstance(v[12], basestring):
    v[12] = v[12].encode('utf8')
else:
    v[12] = unicode(v[12]).encode('utf8')
but when I write the data back to a file, this field comes out as 'Priv�'. On Linux, the source file type is reported as "ISO-8859 text, with very long lines, with CRLF line terminators".
Could someone let me know the right way in Spark to read and write files with mixed encodings, please?
You can set use_unicode to False when calling textFile. It will give you an RDD of str objects (Python 2.x) or bytes objects (Python 3.x), which can be further processed using the desired encoding, for example:
sc.textFile(path, use_unicode=False).map(lambda x: x.decode("iso-8859-1"))
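If the output also has to stay in ISO-8859-1, a minimal end-to-end sketch could look like the following (Python 2.x; transform_fields and output_path are hypothetical placeholders, and re-encoding before saveAsTextFile is an assumption about how the result should be written back):

def transform_fields(fields):
    # hypothetical stand-in for the field-value mapping logic
    return fields

raw = sc.textFile(path, use_unicode=False)            # RDD of byte strings
decoded = raw.map(lambda x: x.decode("iso-8859-1"))   # bytes -> unicode

result = (decoded
          .map(lambda line: line.split(","))
          .map(transform_fields)
          .map(lambda fields: ",".join(fields))
          .map(lambda line: line.encode("iso-8859-1")))   # unicode -> bytes before writing

result.saveAsTextFile(output_path)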
If that's not sufficient, the data can be loaded as-is using binaryFiles:
sc.binaryFiles(path).values().flatMap(lambda x: x.decode("iso-8859-1").splitlines())
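Keep in mind that binaryFiles returns one (path, content) record per file, so a single large file is not split across partitions and its whole content has to fit in one executor's memory; for inputs approaching 12 GB, the textFile approach above is usually the safer choice.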