I am starting out with scikit-learn and I am trying to transform a set of documents into a format on which I can apply clustering and classification. I have looked at the vectorization methods and the tf-idf transformations for loading the files and indexing their vocabularies.
However, I also have extra metadata for each document, such as the authors, the division that was responsible, a list of topics, etc.
How can I add features to each document vector generated by the vectorizing function?
The fit(data) method computes the mean and standard deviation of each feature so they can be used later for scaling. The transform(data) method performs the scaling using the mean and standard deviation computed by fit(). The fit_transform() method does both: it fits and then transforms.
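A minimal sketch of that pattern, using StandardScaler as the scaling example (the same fit/transform convention applies to the text vectorizers as well; the arrays here are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.5, 15.0]])

scaler = StandardScaler()
scaler.fit(X_train)                        # learns mean_ and scale_ from the training data
X_train_scaled = scaler.transform(X_train)  # applies (x - mean) / std

# fit_transform is equivalent to calling fit(...) and then transform(...)
X_train_scaled_again = scaler.fit_transform(X_train)

# the test set is transformed with the statistics learned on the training set
X_test_scaled = scaler.transform(X_test)
```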
Feature Selection vs. Feature Extraction. The main difference: Feature Extraction transforms arbitrary data, such as text or images, into numerical features that machine learning algorithms can understand. Feature Selection, on the other hand, is a technique applied afterwards to those (numerical) features to keep only a subset of them.
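An illustrative sketch of the distinction (the corpus and labels below are made up): extraction turns raw text into a numeric matrix, and selection then keeps only some of those numeric columns.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["the cat sat", "the dog barked", "the cat and the dog"]
labels = [0, 1, 1]  # hypothetical class labels

# Feature extraction: raw text -> numeric (sparse) document-term matrix
X = CountVectorizer().fit_transform(docs)

# Feature selection: keep the k numeric features most associated with the labels
X_selected = SelectKBest(chi2, k=3).fit_transform(X, labels)

print(X.shape, X_selected.shape)
```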
The sklearn.feature_extraction module can be used to extract features, in a format supported by machine learning algorithms, from datasets consisting of formats such as text and images.
It transforms text into numbers, so with its vectorizers you can, for example, count how many times each word occurs in the given data set.
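For example, with CountVectorizer (one of the vectorizers in sklearn.feature_extraction.text) you get one column per vocabulary word and one count per document; the small corpus here is made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "machine learning with scikit-learn",
    "clustering and classification of documents",
    "learning to cluster documents",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse document-term matrix

print(vectorizer.get_feature_names_out())   # the learned vocabulary (get_feature_names() in older versions)
print(X.toarray())                          # word counts per document
```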
You could use the DictVectorizer for the extra categorical data and then use scipy.sparse.hstack to combine the resulting matrix with the text features.
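A minimal sketch of that idea (the document texts and metadata dicts below are made up for illustration):

```python
import scipy.sparse as sp
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["first document text", "second document text"]
metadata = [
    {"author": "alice", "division": "finance", "topic_ml": 1},
    {"author": "bob",   "division": "hr",      "topic_ml": 0},
]

# Text features from the documents themselves
X_text = TfidfVectorizer().fit_transform(docs)

# One-hot / numeric features from the per-document metadata dicts
X_meta = DictVectorizer(sparse=True).fit_transform(metadata)

# Stack the two sparse matrices side by side: one row per document
X = sp.hstack([X_text, X_meta], format="csr")
print(X.shape)
```

The combined matrix X can then be passed to a clustering or classification estimator just like the output of a single vectorizer.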