I am starting out with scikit-learn and I am trying to transform a set of documents into a format on which I can apply clustering and classification. I have looked at the vectorization methods and the tf-idf transformations for loading the files and indexing their vocabularies.
However, I also have extra metadata for each document, such as the authors, the division that was responsible, a list of topics, etc.
How can I add features to each document vector generated by the vectorizing function?
The fit(data) method computes the mean and standard deviation of each feature so they can be used later for scaling. The transform(data) method performs the scaling using the mean and standard deviation computed by fit(). The fit_transform() method does both: it fits and then transforms.
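A minimal sketch of that pattern, using StandardScaler as the scaling example (the same fit/transform convention applies to the text vectorizers as well; the arrays here are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.5, 15.0]])

scaler = StandardScaler()
scaler.fit(X_train)                        # learns mean_ and scale_ from the training data
X_train_scaled = scaler.transform(X_train)  # applies (x - mean) / std

# fit_transform is equivalent to calling fit(...) and then transform(...)
X_train_scaled_again = scaler.fit_transform(X_train)

# the test set is transformed with the statistics learned on the training set
X_test_scaled = scaler.transform(X_test)
```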
Feature Selection vs. Feature Extraction. The main difference: Feature Extraction transforms arbitrary data, such as text or images, into numerical features that machine learning algorithms can understand. Feature Selection, on the other hand, is a technique applied afterwards to those (numerical) features to keep only a subset of them.
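An illustrative sketch of the distinction (the corpus and labels below are made up): extraction turns raw text into a numeric matrix, and selection then keeps only some of those numeric columns.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["the cat sat", "the dog barked", "the cat and the dog"]
labels = [0, 1, 1]  # hypothetical class labels

# Feature extraction: raw text -> numeric (sparse) document-term matrix
X = CountVectorizer().fit_transform(docs)

# Feature selection: keep the k numeric features most associated with the labels
X_selected = SelectKBest(chi2, k=3).fit_transform(X, labels)

print(X.shape, X_selected.shape)
```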
The sklearn.feature_extraction module can be used to extract features, in a format supported by machine learning algorithms, from datasets consisting of formats such as text and images.
It transforms text into numbers, so with its vectorizers you can, for example, count how many times each word occurs in the given data set.
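For example, with CountVectorizer (one of the vectorizers in sklearn.feature_extraction.text) you get one column per vocabulary word and one count per document; the small corpus here is made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "machine learning with scikit-learn",
    "clustering and classification of documents",
    "learning to cluster documents",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse document-term matrix

print(vectorizer.get_feature_names_out())   # the learned vocabulary (get_feature_names() in older versions)
print(X.toarray())                          # word counts per document
```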
You could use the DictVectorizer for the extra categorical data and then use scipy.sparse.hstack to combine the resulting matrix with the text features.
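A minimal sketch of that idea (the document texts and metadata dicts below are made up for illustration):

```python
import scipy.sparse as sp
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["first document text", "second document text"]
metadata = [
    {"author": "alice", "division": "finance", "topic_ml": 1},
    {"author": "bob",   "division": "hr",      "topic_ml": 0},
]

# Text features from the documents themselves
X_text = TfidfVectorizer().fit_transform(docs)

# One-hot / numeric features from the per-document metadata dicts
X_meta = DictVectorizer(sparse=True).fit_transform(metadata)

# Stack the two sparse matrices side by side: one row per document
X = sp.hstack([X_text, X_meta], format="csr")
print(X.shape)
```

The combined matrix X can then be passed to a clustering or classification estimator just like the output of a single vectorizer.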