Python: UnicodeDecodeError: 'utf8' codec can't decode byte

Question

I'm reading a bunch of RTF files into Python strings. On some of the texts, I get this error:

Traceback (most recent call last):
  File "11.08.py", line 47, in <module>
    X = vectorizer.fit_transform(texts)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 716, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 398, in fit_transform
    term_count_current = Counter(analyze(doc))
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 313, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 224, in decode
    doc = doc.decode(self.charset, self.charset_error)
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 462: invalid start byte

I've tried:

Copying and pasting the text of the files into new files
Saving the RTF files as TXT files
Opening the TXT files in Notepad++ and choosing 'Convert to UTF-8', and also setting the encoding to UTF-8
Opening the files with Microsoft Word and saving them as new files

Nothing works. Any ideas?

It's probably not related, but here's the code in case you are wondering:

f = open(dir+location, "r")
doc = Rtf15Reader.read(f)
t = PlaintextWriter.write(doc).getvalue()
texts.append(t)
f.close()
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X = vectorizer.fit_transform(texts)

Shalini Baranwal · Accepted Answer

Use this line instead:

vectorizer = TfidfVectorizer(encoding='latin-1', sublinear_tf=True, max_df=0.5, stop_words='english')

Setting encoding='latin-1' worked for me.
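
A likely reason this works, going by the traceback: byte 0x92 is not a valid UTF-8 start byte, but in Windows-1252 (a common encoding for text exported from Word or RTF) it is the right single quotation mark. Latin-1 maps every possible byte to a code point, so decoding can never fail, although 0x92 then becomes a control character rather than an apostrophe. The sketch below illustrates the difference on a made-up byte string; `raw` is hypothetical sample data, not taken from the asker's files, and that those files are really Windows-1252 is an assumption.

```python
# Hypothetical sample: "it's" written with a Windows-1252 curly
# apostrophe, which is stored as the single byte 0x92.
raw = b"it\x92s"

# UTF-8 rejects 0x92 as a start byte, which is the error in the traceback.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 accepts any byte, but maps 0x92 to the control character U+0092.
as_latin1 = raw.decode("latin-1")

# Windows-1252 maps 0x92 to U+2019, the right single quotation mark.
as_cp1252 = raw.decode("cp1252")
```

If the files do turn out to be Windows-1252, passing encoding='cp1252' to the vectorizer should keep the apostrophes intact instead of turning them into control characters. Note also that the scikit-learn release in the traceback decodes with self.charset, so in that version the keyword is probably charset rather than encoding.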

Tags: python, encoding, utf-8, scikit-learn

Zach
