Encoding tweets to UTF-8 creates weird characters in Python

Question

I am downloading all of a user's tweets using the Twitter API.

When I download the tweets, I encode them as UTF-8 before writing them to a CSV file:

    tweet.text.encode("utf-8")

I'm using Python 3.

The issue is that this creates really weird characters in my files. For example, the tweet which reads

"But I’ve been talkin' to God for so long that if you look at my life, I guess he talkin' back."

Gets turned into

"b""But I\xe2\x80\x99ve been talkin' to God for so long that if you look at my life, I guess he talkin' back. """

(I see this when I open the CSV file that I wrote this encoded text to).

So my question is: how can I stop these weird characters from being created?

Also, if someone could explain what the b' that starts every line means, that would be super helpful.

Here is the full code:

    outtweets = [[tweet.text.encode('utf-8')] for tweet in alltweets]

    # write the csv
    with open('%s_tweets.csv' % screen_name, 'wt') as f:
        writer = csv.writer(f)
        writer.writerow(["text"])
        writer.writerows(outtweets)

Anthon · Accepted Answer

That is not a strange character; it is a RIGHT SINGLE QUOTATION MARK (U+2019), and the bytes \xe2\x80\x99 in your file are its UTF-8 encoding. You can often see that character in text submitted from OS X based browsers.
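As a quick check of that claim, the sketch below encodes a shortened version of the tweet from the question and looks at the one non-ASCII character. The variable names are only illustrative.

```python
import unicodedata

# Shortened version of the tweet; \u2019 is the curly apostrophe.
text = "But I\u2019ve been talkin' to God"

encoded = text.encode("utf-8")               # str -> bytes
apostrophe_bytes = "\u2019".encode("utf-8")  # the three bytes seen in the CSV

print(unicodedata.name("\u2019"))       # official Unicode name of the character
print(apostrophe_bytes)                 # its UTF-8 byte sequence
print(repr(encoded))                    # the b'...' form that ended up in the file
print(encoded.decode("utf-8") == text)  # decoding restores the original str
```

Nothing is lost by encoding: decoding the bytes as UTF-8 gives back the original text.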

If you need ASCII for everything you can try:

    import unicodedata
    # Decompose, then drop anything with no ASCII form; the result is still bytes
    unicodedata.normalize('NFKD', tweet.text).encode('ascii', 'ignore')
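It is worth seeing what this does to real text before relying on it. The sketch below wraps the same calls in a helper, `to_ascii`, which is a hypothetical name and not part of any library, and adds a final decode so the result is a str rather than bytes.

```python
import unicodedata

def to_ascii(text):
    """Hypothetical helper: decompose, drop non-ASCII characters, return a str."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# The curly apostrophe has no ASCII decomposition, so it is dropped entirely.
print(to_ascii("I\u2019ve"))
# An accented e decomposes to "e" plus a combining accent; only the accent is dropped.
print(to_ascii("caf\u00e9"))
```

Because characters such as U+2019 disappear rather than turning into a plain apostrophe, this approach is lossy and probably only suits cases where pure ASCII output is a hard requirement.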

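If you encode a string into a byte sequence and then output that byte sequence, you should expect the b"..." prefix, which indicates a bytes object and not a normal string. csv.writer converts non-string values with str(), which is why the literal b"..." text lands in your file. In Python 3 you can write tweet.text as it is and let the file object do the encoding.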
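A minimal sketch of writing the text without encoding it by hand, assuming Python 3. The `tweets` list and `screen_name` value are hypothetical stand-ins for the data in the question, and the file goes to a temporary directory only to keep the example self-contained.

```python
import csv
import os
import tempfile

# Hypothetical stand-ins for the tweet texts and screen name in the question.
tweets = ["But I\u2019ve been talkin' to God for so long", "plain ascii tweet"]
screen_name = "example_user"

path = os.path.join(tempfile.gettempdir(), "%s_tweets.csv" % screen_name)

# Pass str values to csv.writer and let open() do the UTF-8 encoding.
# newline="" is what the csv module documentation recommends for CSV files.
with open(path, "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["text"])
    writer.writerows([t] for t in tweets)

# Read the file back to confirm the text round-trips without b'...' or \x escapes.
with open(path, encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))

print(rows)
```

If the file is meant to be opened in a spreadsheet program that does not detect UTF-8 on its own, `encoding="utf-8-sig"` (which writes a byte order mark) may be worth trying instead.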

Tags: python, csv, encoding, utf-8

James Dorfman

1 Answers

Anthon
