 

Add additional feature to CountVectorizer matrix

I am stuck on a problem where I have to add an additional feature (average word length) to the matrix of token counts created by scikit-learn's CountVectorizer. Say I have the following code:

from sklearn.feature_extraction.text import CountVectorizer

# list of tweets
texts = [(list of tweets)]

# list of the average word length of every tweet
average_lengths = word_length(texts)

# tokenizer
count_vect = CountVectorizer(analyzer='word', ngram_range=(1, 1))
x_counts = count_vect.fit_transform(texts)

The format should be (tokens, average word length) for every instance. My initial idea was simply to concatenate the two lists with the zip function, like this:

x = zip(x_counts, average_lengths)

but then I get an error when I try to fit my model:

ValueError: setting an array element with a sequence.   

Anyone have any idea how to solve this problem?

asked Dec 21 '15 by Tim


People also ask

What is Max features in CountVectorizer?

The CountVectorizer selects the words/features/terms that occur most frequently. It uses absolute counts, so if you set max_features=3 it keeps the 3 most common words in the data. Setting binary=True makes the CountVectorizer stop counting term frequency: each entry becomes 0 or 1, indicating only whether the term appears.
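A minimal sketch of this behaviour (the example corpus is invented for illustration): with four distinct words whose frequencies differ, max_features=3 keeps only the three most frequent ones.

```python
from sklearn.feature_extraction.text import CountVectorizer

# invented toy corpus: protein appears 4x, whey 3x, shake 2x, bar 1x
texts = [
    "whey protein shake bar",
    "whey protein shake",
    "whey protein",
    "protein",
]

# keep only the 3 most frequent terms; "bar" is dropped
cv = CountVectorizer(max_features=3)
cv.fit(texts)
print(sorted(cv.vocabulary_))  # ['protein', 'shake', 'whey']
```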

What does CountVectorizer fit () do?

As stated by the documentation, the fit method "learn(s) a vocabulary dictionary of all tokens in the raw documents", i.e. it creates a dictionary of tokens (by default the tokens are words separated by spaces and punctuation) that maps each single token to a position in the output matrix.
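A short sketch of what fit() learns: after fitting, the vocabulary_ attribute holds the token-to-column mapping (scikit-learn assigns indices in alphabetical order of the tokens).

```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
cv.fit(["the cat sat", "the dog sat"])

# fit() learned a token -> column-index mapping, in alphabetical order
print(cv.vocabulary_)  # {'cat': 0, 'dog': 1, 'sat': 2, 'the': 3}
```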

What is Ngram_range in CountVectorizer?

CountVectorizer will tokenize the data and split it into chunks called n-grams, of which we can define the length by passing a tuple to the ngram_range argument. For example, 1,1 would give us unigrams or 1-grams such as “whey” and “protein”, while 2,2 would give us bigrams or 2-grams, such as “whey protein”.

Does CountVectorizer remove punctuation?

We can use CountVectorizer from the scikit-learn library. By default it removes punctuation and lowercases the documents. It turns each document into a row of a sparse matrix: for every word in the vocabulary, the matrix stores the number of times that word occurs in the document.


1 Answer

You can write your own transformer, as in this article, that computes the average word length of every tweet, and combine it with the token counts using FeatureUnion:

vectorizer = FeatureUnion([
    ('cv', CountVectorizer(analyzer='word', ngram_range=(1, 1))),
    ('av_len', AverageLenVectizer(...))
])
answered Sep 30 '22 by Andrei