Vectorization: Not a valid collection

Question

I want to vectorize a text file containing my training corpus for the OneClassSVM classifier. For that I'm using CountVectorizer from the scikit-learn library. Here is my code:

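import numpy as np
from nltk.corpus import stopwords  # assumed source of stopwords.words
from sklearn.feature_extraction.text import CountVectorizer

def file_to_corpse(file_name, stop_words):
    # Read the corpus, one document per line
    with open(file_name) as fd:
        corp = fd.readlines()
    array_file = np.array(corp)
    # Extend the French stop-word list with the caller's extra words
    stwf = stopwords.words('french')
    stwf.extend(stop_words)
    vectorizer = CountVectorizer(decode_error='replace', stop_words=stwf, min_df=1)
    # fit_transform returns a sparse document-term matrix
    X = vectorizer.fit_transform(array_file)
    return X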

When I run my function on my file (around 206346 lines), I get the following error, which I can't understand:

Traceback (most recent call last):
  File "svm.py", line 93, in <module>
    clf_svm.fit(training_data)
  File "/home/imane/anaconda/lib/python2.7/site-packages/sklearn/svm/classes.py", line 1028, in fit
    super(OneClassSVM, self).fit(X, np.ones(_num_samples(X)), sample_weight=sample_weight,
  File "/home/imane/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 122, in _num_samples
    " a valid collection." % x)
TypeError: Singleton array array(<536172x13800 sparse matrix of type '<type 'numpy.int64'>'
    with 1952637 stored elements in Compressed Sparse Row format>, dtype=object) cannot be considered a valid collection.

Can somebody please help me solve this problem? I've been stuck for a while :).

Pax Vobiscum · Accepted Answer

If you look at the source of _num_samples in sklearn/utils/validation.py, you will see that it checks whether this condition is true (x being your array):

if len(x.shape) == 0:

If it is, it raises this exception:

TypeError("Singleton array %r cannot be considered a valid collection." % x)
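To make the check concrete, here is a minimal sketch of that logic. It is a simplified illustration, not scikit-learn's actual code, and the name num_samples_sketch is hypothetical:

```python
import numpy as np

def num_samples_sketch(x):
    """Simplified stand-in for the validation check (hypothetical helper)."""
    if hasattr(x, "shape"):
        # A 0-dimensional array has an empty shape tuple, so there are no rows to count
        if len(x.shape) == 0:
            raise TypeError(
                "Singleton array %r cannot be considered a valid collection." % (x,)
            )
        return x.shape[0]
    return len(x)

# A regular 2-D array passes the check and reports its row count
rows = num_samples_sketch(np.zeros((3, 2)))

# A 0-dimensional array fails it
try:
    num_samples_sketch(np.array(5))
    raised = False
except TypeError:
    raised = True
```

Anything whose shape is the empty tuple () trips the check, whatever it contains.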

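I suggest checking whether array_file, or the value returned by your function, has len(shape) > 0. Note that the array in your traceback has dtype=object and wraps the whole sparse matrix. That is what you get when np.array() is called on a sparse matrix: a 0-dimensional array whose shape is ().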
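One way to run that check is sketched below. It assumes the sparse matrix was wrapped in np.array() somewhere before clf_svm.fit, which the question does not show. The small csr_matrix is only a stand-in for the output of fit_transform:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Stand-in for the matrix returned by CountVectorizer.fit_transform
X = csr_matrix(np.array([[1, 0, 2], [0, 3, 0]]))

# np.array() does not densify a sparse matrix; it stores the whole
# matrix as a single Python object inside a 0-dimensional array
wrapped = np.array(X)

sparse_ndim = len(X.shape)         # number of dimensions of the sparse matrix
wrapped_ndim = len(wrapped.shape)  # number of dimensions of the wrapper

print(sparse_ndim)
print(wrapped_ndim)

# The original matrix can be recovered from the wrapper with .item()
recovered = wrapped.item()
```

If the wrapper turns out to have zero dimensions, passing the sparse matrix itself to fit (without np.array around it) is the likely fix. X.toarray() gives a dense array when one is really needed, at the cost of memory.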

Tags:

python

vectorization

machine-learning

python-2.7

scikit-learn

Imane.r
