
scikit-learn, add features to a vectorized set of documents

I am starting out with scikit-learn and I am trying to transform a set of documents into a format to which I can apply clustering and classification. I have read about the vectorization methods and the tf-idf transformations used to load the files and index their vocabularies.

However, I have extra metadata for each document, such as the authors, the division that was responsible, a list of topics, etc.

How can I add features to each document vector generated by the vectorizing function?

Mortimer asked Mar 06 '13 20:03




1 Answer

You could use DictVectorizer to encode the extra categorical metadata, and then use scipy.sparse.hstack to combine the resulting matrix column-wise with the text feature matrix.
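
A minimal sketch of that approach, assuming a small made-up corpus and metadata (the documents, authors, divisions, and topic names below are hypothetical, as is the choice of TfidfVectorizer for the text part). Topic lists are written here as one indicator key per topic, which is one possible encoding rather than the only one:

```python
from scipy.sparse import hstack
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; one string per document.
docs = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock markets fell sharply today",
]

# Hypothetical metadata, one dict per document, in the same order as docs.
# String values are one-hot encoded by DictVectorizer ("author=alice", ...);
# numeric values are kept as-is, so each topic is given as an indicator key.
metadata = [
    {"author": "alice", "division": "pets", "topic=animals": 1},
    {"author": "bob", "division": "pets", "topic=animals": 1, "topic=outdoors": 1},
    {"author": "alice", "division": "finance", "topic=economy": 1},
]

# Text features: sparse matrix of shape (n_docs, n_terms).
text_vectorizer = TfidfVectorizer()
X_text = text_vectorizer.fit_transform(docs)

# Metadata features: sparse matrix of shape (n_docs, n_meta_features).
meta_vectorizer = DictVectorizer()
X_meta = meta_vectorizer.fit_transform(metadata)

# Stack the two blocks side by side; rows must line up document by document.
X = hstack([X_text, X_meta]).tocsr()

print(X.shape)
print(meta_vectorizer.feature_names_)
```

The combined matrix X stays sparse and can be passed to estimators that accept sparse input. For new documents, call transform (not fit_transform) on both vectorizers before stacking so the columns stay aligned with the training data. Note that tf-idf values and 0/1 indicators are on different scales, so some rescaling or weighting of the metadata block may be worth trying.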

ogrisel answered Nov 03 '22 20:11