I am stuck on a problem where I have to add an additional feature (average word length) to a list of token counts created by scikit-learn's CountVectorizer. Say I have the following code:
from sklearn.feature_extraction.text import CountVectorizer

# list of tweets
texts = [...]  # placeholder: the tweet strings
# list of average word length of every tweet
average_lengths = word_length(texts)
# tokenizer
count_vect = CountVectorizer(analyzer='word', ngram_range=(1, 1))
x_counts = count_vect.fit_transform(texts)
The format should be (tokens, average word length) for every instance. My initial idea was to simply combine the two lists using the zip function like this:
x = zip(x_counts, average_lengths)
but then I get an error when I try to fit my model:
ValueError: setting an array element with a sequence.
Anyone have any idea how to solve this problem?
CountVectorizer can limit its vocabulary to the words/features/terms that occur most frequently. The max_features parameter takes an absolute count, so max_features=3 keeps the 3 most common words in the data. With binary=True, CountVectorizer no longer takes term frequency into account and records only whether a term occurs.
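A minimal sketch of both parameters (the toy corpus here is illustrative, not from the question):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["whey protein shake", "whey protein", "protein bar"]  # toy corpus
# keep only the 3 most frequent terms across the whole corpus
cv = CountVectorizer(max_features=3)
print(cv.fit_transform(docs).toarray())      # raw term counts per document
# with binary=True, each cell is 1 if the term occurs in the document, else 0
cv_bin = CountVectorizer(max_features=3, binary=True)
print(cv_bin.fit_transform(docs).toarray())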
As stated by the documentation, the fit method "learn(s) a vocabulary dictionary of all tokens in the raw documents", i.e. it builds a dictionary of tokens (by default, words of at least two characters, split on whitespace and punctuation) that maps each token to a column position in the output matrix.
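You can inspect that mapping after fitting via the vocabulary_ attribute; a short sketch with an illustrative corpus:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
cv.fit(["whey protein shake", "protein bar"])
print(cv.vocabulary_)  # e.g. {'bar': 0, 'protein': 1, 'shake': 2, 'whey': 3}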
CountVectorizer will tokenize the data and split it into chunks called n-grams, whose length we can control by passing a tuple to the ngram_range argument. For example, ngram_range=(1, 1) would give us unigrams or 1-grams such as "whey" and "protein", while (2, 2) would give us bigrams or 2-grams, such as "whey protein".
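A quick sketch of the difference (get_feature_names_out requires scikit-learn >= 1.0; older versions use get_feature_names):

from sklearn.feature_extraction.text import CountVectorizer

doc = ["whey protein shake"]
print(CountVectorizer(ngram_range=(1, 1)).fit(doc).get_feature_names_out())
# ['protein' 'shake' 'whey']
print(CountVectorizer(ngram_range=(2, 2)).fit(doc).get_feature_names_out())
# ['protein shake' 'whey protein']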
We can use CountVectorizer from the scikit-learn library. By default it lowercases the documents and strips punctuation. The output is a sparse matrix with one row per document; for each word in the vocabulary, the corresponding cell holds the number of times that word occurs in the document.
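For instance (again with a toy corpus of my own, not the question's tweets):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["Whey protein!", "protein, protein"]
cv = CountVectorizer()          # lowercases and strips punctuation by default
X = cv.fit_transform(docs)      # scipy.sparse matrix, one row per document
print(cv.vocabulary_)           # {'protein': 0, 'whey': 1}
print(X.toarray())              # [[1 1], [2 0]]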
You can write your own transformer, as in this article, which computes the average word length of every tweet, and then combine it with the token counts using FeatureUnion:
vectorizer = FeatureUnion([
    ('cv', CountVectorizer(analyzer='word', ngram_range=(1, 1))),
    ('av_len', AverageLenVectorizer(...))
])
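A minimal sketch of such a transformer; the class body below is my assumption of what AverageLenVectorizer could look like, not code from the linked article:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

class AverageLenVectorizer(BaseEstimator, TransformerMixin):
    # emits one feature per document: its average word length
    def fit(self, X, y=None):
        return self  # stateless: nothing to learn from the data

    def transform(self, X):
        return np.array([
            [np.mean([len(w) for w in doc.split()]) if doc.split() else 0.0]
            for doc in X
        ])

vectorizer = FeatureUnion([
    ('cv', CountVectorizer(analyzer='word', ngram_range=(1, 1))),
    ('av_len', AverageLenVectorizer()),
])
x = vectorizer.fit_transform(texts)  # token counts plus one extra column

FeatureUnion horizontally stacks the outputs of its transformers, so the result is a single sparse matrix your model can fit on directly, which avoids the zip problem above.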