
Tokenizing text with scikit-learn

I have the following code to extract features from a set of files (each folder name is the category name) for text classification.

import sklearn.datasets
from sklearn.feature_extraction.text import TfidfVectorizer

train = sklearn.datasets.load_files('./train', description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=0)
print len(train.data)
print train.target_names

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train.data)

It throws the following stack trace:

Traceback (most recent call last):
  File "C:\EclipseWorkspace\TextClassifier\main.py", line 16, in <module>
    X_train = vectorizer.fit_transform(train.data)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 1285, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 804, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 739, in _count_vocab
    for feature in analyze(doc):
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 236, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 113, in decode
    doc = doc.decode(self.encoding, self.decode_error)
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 32054: invalid start byte

I run Python 2.7. How can I get this to work?

EDIT: I have just discovered that this works perfectly well for files with utf-8 encoding (my files are ANSI encoded). Is there any way I can get sklearn.datasets.load_files() to work with ANSI encoding?

asked Dec 02 '25 05:12 by raul

1 Answer

ASCII is a strict subset of UTF-8, so "ANSI" files that contain only plain ASCII decode without trouble. However, your stack trace shows the byte 0xFF, which can never appear as a start byte in valid UTF-8. That means at least one of your files contains non-ASCII characters in a Windows "ANSI" code page such as cp1252 (where 0xFF is 'ÿ'), or is not a text file at all.
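A minimal sketch of the failure and the fix (assuming cp1252, the most common Windows "ANSI" code page — your files may use a different one):

```python
# 0xFF can never start a valid UTF-8 sequence, which is exactly
# the "invalid start byte" error in the stack trace above.
raw = b'\xff'

try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    print('utf-8 failed: %s' % exc.reason)

# Under cp1252 the same byte decodes cleanly (it is the character 'ÿ').
print(raw.decode('cp1252'))
```

In practice, `sklearn.datasets.load_files()` accepts an `encoding` argument, so passing `encoding='cp1252'` (or `'latin-1'`) instead of `encoding=None` should make the decode succeed. Alternatively, keeping `encoding='utf-8'` together with `decode_error='ignore'` or `decode_error='replace'` will skip or substitute the undecodable bytes, at the cost of losing those characters.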

answered Dec 03 '25 19:12 by cfh