I'm exploring the different feature extraction classes that scikit-learn
provides. Reading the documentation, I did not understand very well what DictVectorizer
can be used for. Other questions come to mind: for example, how can DictVectorizer
be used for text classification? That is, how does this class help handle labelled textual data? Could anybody provide a short example, apart from the one I already read on the documentation web page?
The sklearn.feature_extraction module can be used to extract features in a format supported by machine learning algorithms from datasets consisting of formats such as text and images.
FeatureHasher implements feature hashing, aka the hashing trick. This class turns sequences of symbolic feature names (strings) into scipy.sparse matrices, using a hash function to compute the matrix column corresponding to a name.
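To make the hashing trick concrete, here is a minimal sketch: because columns are computed by hashing the feature names, no vocabulary is stored in memory, and the number of output columns is fixed up front (the `n_features=8` value below is an arbitrary choice for illustration).

```python
from sklearn.feature_extraction import FeatureHasher

# Each dict is one observation; feature names are hashed to column indices.
h = FeatureHasher(n_features=8, input_type="dict")
X = h.transform([{"length": 1, "width": 0, "height": 2},
                 {"length": 0, "width": 1, "height": 1}])

print(X.shape)  # (2, 8): 2 observations, 8 hashed columns
```

Unlike DictVectorizer, FeatureHasher cannot recover the original feature names from the columns, since hashing is one-way.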
CountVectorizer converts a collection of text documents to a matrix of token counts. This implementation produces a sparse representation of the counts using scipy.sparse.
Say your feature space consists of length, width and height, and you have 3 observations; i.e. you measure the length, width and height of 3 objects:
        length  width  height
obs.1        1      0       2
obs.2        0      1       1
obs.3        3      2       1
Another way to show this is with a list of dictionaries:
[{'height': 1, 'length': 0, 'width': 1},  # obs.2
 {'height': 2, 'length': 1, 'width': 0},  # obs.1
 {'height': 1, 'length': 3, 'width': 2}]  # obs.3
DictVectorizer goes the other way around; i.e. given the list of dictionaries, it builds the table above:
>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer(sparse=False)
>>> d = [{'height': 1, 'length': 0, 'width': 1},
... {'height': 2, 'length': 1, 'width': 0},
... {'height': 1, 'length': 3, 'width': 2}]
>>> v.fit_transform(d)
array([[ 1.,  0.,  1.],   # obs.2
       [ 2.,  1.,  0.],   # obs.1
       [ 1.,  3.,  2.]])  # obs.3
      # height  length  width