UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte

Tags: json, python-unicode, ascii, utf-8, python-2.7

I am trying to read Twitter data from a JSON file using Python 2.7.12.

The code I used is:

    import json
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')
    
    def get_tweets_from_file(file_name):
        tweets = []
        with open(file_name, 'rw') as twitter_file:
            for line in twitter_file:
                if line != '\r\n':
                    line = line.encode('ascii', 'ignore')
                    tweet = json.loads(line)
                    if u'info' not in tweet.keys():
                        tweets.append(tweet)
        return tweets

The result I got:

    Traceback (most recent call last):
      File "twitter_project.py", line 100, in <module>
        main()
      File "twitter_project.py", line 95, in main
        tweets = get_tweets_from_dir(src_dir, dest_dir)
      File "twitter_project.py", line 59, in get_tweets_from_dir
        new_tweets = get_tweets_from_file(file_name)
      File "twitter_project.py", line 71, in get_tweets_from_file
        line = line.encode('ascii', 'ignore')
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte

I went through the answers to similar questions and came up with this code, and it worked last time. I have no idea why it isn't working now.

215

asked Jul 22 '16 04:07

wannabhappy

2 Answers

In my case (macOS), there was a .DS_Store file in my data folder. It is a hidden, auto-generated file, and it caused the issue. Removing it fixed the problem.
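If deleting the file by hand is not practical (macOS tends to recreate it), another option is to skip hidden files when collecting input. The following is only a sketch: `list_data_files` is a hypothetical helper, not part of the code in the question, and it assumes the data files sit directly inside one folder.

```python
import os


def list_data_files(src_dir):
    """Return sorted paths of regular, non-hidden files directly inside src_dir."""
    paths = []
    for name in sorted(os.listdir(src_dir)):
        # Skip dotfiles such as .DS_Store, which are binary and not JSON.
        if name.startswith('.'):
            continue
        path = os.path.join(src_dir, name)
        # Ignore sub-directories and anything else that is not a plain file.
        if os.path.isfile(path):
            paths.append(path)
    return paths
```

Looping over `list_data_files(src_dir)` instead of a raw directory listing keeps stray binary files out of the JSON parser.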

163

answered Oct 12 '22 14:10

Sung-Ho_Ahn

It doesn't help that you have sys.setdefaultencoding('utf-8'), which is confusing things further. It's a nasty hack and you need to remove it from your code. See https://stackoverflow.com/a/34378962/1554386 for more information.

The error is happening because line is a byte string (str) and you're calling encode() on it. encode() only makes sense on a unicode string, so Python first tries to convert the byte string to unicode using the default encoding, which in your case is UTF-8 but would normally be ASCII. Either way, 0x80 is not valid ASCII and is not a valid UTF-8 start byte, so the implicit decode fails.
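To make the hidden step visible, here is a small sketch that spells out the two operations explicitly, so that it behaves the same on Python 2 and Python 3. `encode_byte_string` is a hypothetical name used only for illustration, and the sample bytes are made up.

```python
def encode_byte_string(data, default_encoding='utf-8'):
    """Mimic Python 2's str.encode(): decode with the default encoding, then encode."""
    # Step 1 is implicit in Python 2; this is where the UnicodeDecodeError comes from.
    text = data.decode(default_encoding)
    # Step 2 is the call that was actually written in the question.
    return text.encode('ascii', 'ignore')


try:
    encode_byte_string(b'caf\x80')
    outcome = 'ok'
except UnicodeDecodeError as exc:
    # exc.reason holds the short explanation shown at the end of the traceback.
    outcome = exc.reason
```

The failure is raised by the decode step, which is why the traceback reports a decode error on a line that only calls encode().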

0x80 is valid in some character sets. In windows-1252/cp1252 it's €.
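A quick way to check this is to decode the single byte with a few candidate encodings. This is only an illustration: a decode that succeeds does not prove the guess is right, since latin-1, for example, accepts every byte.

```python
raw = b'\x80'

# In cp1252 (alias windows-1252) the byte 0x80 maps to U+20AC, the euro sign.
as_cp1252 = raw.decode('cp1252')

# latin-1 also decodes it without error, but to U+0080, an invisible control character.
as_latin1 = raw.decode('latin-1')
```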

The trick here is to understand the encoding of your data all the way through your code. At the moment, you're leaving too much up to chance. Unicode string types are a handy Python feature that lets you decode encoded strings and forget about the encoding until you need to write or transmit the data.
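As a rough sketch of that idea: decode once on the way in, work with text in the middle, and encode once on the way out. `reencode_tweet_line` is a hypothetical helper, and windows-1252 as the source encoding is an assumption about the data.

```python
import json


def reencode_tweet_line(raw_line, source_encoding='windows-1252'):
    """Turn one encoded JSON line into UTF-8 encoded JSON bytes."""
    # Decode at the input boundary; from here on the encoding no longer matters.
    text = raw_line.decode(source_encoding)
    tweet = json.loads(text)
    # Encode only when the data leaves the program.
    return json.dumps(tweet, ensure_ascii=False).encode('utf-8')
```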

Use the io module to open the file in text mode and decode it as it is read, so no more .decode()! You need to make sure the encoding of your incoming data is consistent. You can either re-encode it externally or change the encoding in your script. Here I've set the encoding to windows-1252.

    import io
    import json

    with io.open(file_name, 'r', encoding='windows-1252') as twitter_file:
        for line in twitter_file:
            # line is now a <type 'unicode'>, already decoded by io.open()
            tweet = json.loads(line)
The io module also provides universal newlines. This means \r\n is detected as a newline and translated to \n, so you don't have to watch for '\r\n' specifically. A blank line still arrives as '\n'.
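The newline translation can be seen with a throwaway file. This is a minimal sketch: `read_lines` is a hypothetical helper, and the temporary file stands in for real data.

```python
import io
import os
import tempfile


def read_lines(path, encoding='windows-1252'):
    """Return all lines of a text file as decoded strings."""
    with io.open(path, 'r', encoding=encoding) as handle:
        return list(handle)


# Write Windows-style line endings in binary mode so nothing is translated on the way out.
fd, demo_path = tempfile.mkstemp()
os.write(fd, b'first\r\n\r\nsecond\r\n')
os.close(fd)

lines = read_lines(demo_path)
os.remove(demo_path)
```

Every line in `lines` ends in a plain '\n', including the empty one in the middle, so a blank-line check should compare against '\n' or use strip().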

answered Oct 12 '22 13:10

Alastair McCormack
