How to remove unicode when reading data?

I have the following line of Python code:

trans = data.map(lambda line: line.strip().split())

That produces Unicode strings, for example:

u'Hello',u'word'

I'd like to get plain UTF-8 or ASCII strings instead:

'Hello','word' 

I tried encoding the result, for example as UTF-8:

trans = data.map(lambda line: line.strip().split().encode("utf-8"))

or

trans = data.map(lambda line: line.strip().split().encode('ascii','ignore'))

But both attempts give an error:

AttributeError: 'list' object has no attribute 'encode'

Can anybody tell me how I can do this?

UPDATE:

data is a CSV file, and trans is an RDD.

Toren asked Jan 07 '23 22:01


1 Answer

Why not simply encode first and then split? encode is a string method, so call it on the line before split() turns it into a list:

data = sc.textFile("README.md")
trans = data.map(lambda x: x.encode("ascii", "ignore").split())
trans.first()
## ['#', 'Apache', 'Spark']
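
If you would rather keep the original order (split first, then encode), the encoding has to be applied to each item of the list rather than to the list itself. Below is a minimal sketch of both orderings in plain Python, without Spark. The helper names encode_tokens and encode_then_split and the sample string are made up for illustration. Note that this matters mostly on Python 2: on Python 3, str is already Unicode with no u prefix, and encode() returns bytes objects instead.

```python
def encode_tokens(line, encoding="ascii", errors="ignore"):
    """Split a line on whitespace, then encode each token separately."""
    # split() returns a list, and lists have no encode() method,
    # so the encoding is applied to every item instead.
    return [token.encode(encoding, errors) for token in line.strip().split()]


def encode_then_split(line, encoding="ascii", errors="ignore"):
    """Encode the whole line first, then split the encoded result."""
    return line.strip().encode(encoding, errors).split()


# Hypothetical sample line; the non-ASCII character is dropped by errors="ignore".
sample = u"  Hello word caf\u00e9 "
print(encode_tokens(sample))
print(encode_then_split(sample))
```

Either helper could then be passed to the RDD directly, e.g. trans = data.map(encode_tokens), assuming data is the RDD of lines from the question.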
zero323 answered Jan 10 '23 03:01