I am stuck on a problem where I have to add an additional feature (average word length) to a list of token counts created by scikit-learn's CountVectorizer. Say I have the following code:
from sklearn.feature_extraction.text import CountVectorizer

# list of tweets
texts = [...]  # placeholder: the tweet strings
# list of average word length of every tweet
average_lengths = word_length(texts)
# tokenizer
count_vect = CountVectorizer(analyzer='word', ngram_range=(1, 1))
x_counts = count_vect.fit_transform(texts)
The format should be (tokens, average word length) for every instance. My initial idea was to simply combine the two lists using the zip function like this:
x = zip(x_counts, average_lengths)
but then I get an error when I try to fit my model:
ValueError: setting an array element with a sequence.
Anyone have any idea how to solve this problem?
CountVectorizer can limit its vocabulary to the words/features/terms that occur most frequently. The max_features parameter takes an absolute count, so max_features=3 keeps the 3 most common words in the data. With binary=True, CountVectorizer no longer takes term frequency into account and records only whether a term occurs.
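A minimal sketch of both parameters (the toy corpus here is illustrative, not from the question):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["whey protein shake", "whey protein", "protein bar"]  # toy corpus
# keep only the 3 most frequent terms across the whole corpus
cv = CountVectorizer(max_features=3)
print(cv.fit_transform(docs).toarray())      # raw term counts per document
# with binary=True, each cell is 1 if the term occurs in the document, else 0
cv_bin = CountVectorizer(max_features=3, binary=True)
print(cv_bin.fit_transform(docs).toarray())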
As stated by the documentation, the fit method "learn(s) a vocabulary dictionary of all tokens in the raw documents", i.e. it builds a dictionary of tokens (by default, words of at least two characters, split on whitespace and punctuation) that maps each token to a column position in the output matrix.
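You can inspect that mapping after fitting via the vocabulary_ attribute; a short sketch with an illustrative corpus:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
cv.fit(["whey protein shake", "protein bar"])
print(cv.vocabulary_)  # e.g. {'bar': 0, 'protein': 1, 'shake': 2, 'whey': 3}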
CountVectorizer will tokenize the data and split it into chunks called n-grams, whose length we can control by passing a tuple to the ngram_range argument. For example, ngram_range=(1, 1) would give us unigrams or 1-grams such as "whey" and "protein", while (2, 2) would give us bigrams or 2-grams, such as "whey protein".
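A quick sketch of the difference (get_feature_names_out requires scikit-learn >= 1.0; older versions use get_feature_names):

from sklearn.feature_extraction.text import CountVectorizer

doc = ["whey protein shake"]
print(CountVectorizer(ngram_range=(1, 1)).fit(doc).get_feature_names_out())
# ['protein' 'shake' 'whey']
print(CountVectorizer(ngram_range=(2, 2)).fit(doc).get_feature_names_out())
# ['protein shake' 'whey protein']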
We can use CountVectorizer from the scikit-learn library. By default it lowercases the documents and strips punctuation. The output is a sparse matrix with one row per document; for each word in the vocabulary, the corresponding cell holds the number of times that word occurs in the document.
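For instance (again with a toy corpus of my own, not the question's tweets):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["Whey protein!", "protein, protein"]
cv = CountVectorizer()          # lowercases and strips punctuation by default
X = cv.fit_transform(docs)      # scipy.sparse matrix, one row per document
print(cv.vocabulary_)           # {'protein': 0, 'whey': 1}
print(X.toarray())              # [[1 1], [2 0]]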
You can write your own transformer, as in this article, which computes the average word length of every tweet, and then combine it with the token counts using FeatureUnion:
vectorizer = FeatureUnion([
    ('cv', CountVectorizer(analyzer='word', ngram_range=(1, 1))),
    ('av_len', AverageLenVectorizer(...))
])
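A minimal sketch of such a transformer; the class body below is my assumption of what AverageLenVectorizer could look like, not code from the linked article:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

class AverageLenVectorizer(BaseEstimator, TransformerMixin):
    # emits one feature per document: its average word length
    def fit(self, X, y=None):
        return self  # stateless: nothing to learn from the data

    def transform(self, X):
        return np.array([
            [np.mean([len(w) for w in doc.split()]) if doc.split() else 0.0]
            for doc in X
        ])

vectorizer = FeatureUnion([
    ('cv', CountVectorizer(analyzer='word', ngram_range=(1, 1))),
    ('av_len', AverageLenVectorizer()),
])
x = vectorizer.fit_transform(texts)  # token counts plus one extra column

FeatureUnion horizontally stacks the outputs of its transformers, so the result is a single sparse matrix your model can fit on directly, which avoids the zip problem above.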