 

Understanding DictVectorizer in scikit-learn?

I'm exploring the different feature extraction classes that scikit-learn provides. After reading the documentation, I still don't understand what DictVectorizer can be used for. Other questions come to mind as well. For example, how can DictVectorizer be used for text classification, i.e. how does this class help handle labelled textual data? Could anybody provide a short example other than the one on the documentation web page?

tumbleweed asked Dec 14 '14 20:12



1 Answer

Say your feature space is length, width and height, and you have 3 observations; i.e. you measure the length, width and height of 3 objects:

       length  width  height
obs.1       1      0       2
obs.2       0      1       1
obs.3       3      2       1

Another way to represent the same data is a list of dictionaries, one dictionary per observation:

[{'height': 1, 'length': 0, 'width': 1},   # obs.2
 {'height': 2, 'length': 1, 'width': 0},   # obs.1
 {'height': 1, 'length': 3, 'width': 2}]   # obs.3
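As a rough sketch of how the table and the list of dictionaries relate, the rows can be zipped with the column names. The names `columns`, `rows` and `records` are made up for illustration and are not part of scikit-learn:

```python
# Hypothetical helper data: column names and rows copied from the table above.
columns = ["length", "width", "height"]
rows = [
    (1, 0, 2),  # obs.1
    (0, 1, 1),  # obs.2
    (3, 2, 1),  # obs.3
]

# Pair each value with its column name to get one dict per observation.
records = [dict(zip(columns, row)) for row in rows]
print(records)
```

The order of keys inside each dictionary does not matter; only the name-to-value pairing does.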

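DictVectorizer goes the other way around; i.e. given the list of dictionaries, it builds the table shown at the top as a numeric array. Rows follow the order of the input list, and columns are sorted by feature name by default (height, length, width):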

>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer(sparse=False)
>>> d = [{'height': 1, 'length': 0, 'width': 1},
...      {'height': 2, 'length': 1, 'width': 0},
...      {'height': 1, 'length': 3, 'width': 2}]
>>> v.fit_transform(d)
array([[ 1.,  0.,  1.],   # obs.2
       [ 2.,  1.,  0.],   # obs.1
       [ 1.,  3.,  2.]])  # obs.3
   # height, len., width   
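For the text-classification part of the question, one possible approach (a minimal sketch, not the only way) is to turn each document into a dictionary of token counts and pass those dictionaries to DictVectorizer. The toy documents and the whitespace tokenization below are assumptions made for illustration:

```python
from collections import Counter

from sklearn.feature_extraction import DictVectorizer

# Toy documents, invented for this example only.
docs = ["the cat sat", "the dog sat down", "the cat ran"]

# One dict per document: token -> count.
# Splitting on whitespace is an assumption; real text usually needs a proper tokenizer.
token_counts = [dict(Counter(doc.split())) for doc in docs]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(token_counts)

print(vec.feature_names_)  # column names, sorted by feature name
print(X)                   # one row per document, one column per token
```

The resulting array `X` can then be fed, together with the labels, to any scikit-learn classifier. If the input is raw text rather than precomputed dictionaries, CountVectorizer does the tokenizing and counting in one step.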
behzad.nouri answered Oct 11 '22 08:10