I'm exploring the different feature extraction classes that scikit-learn
provides. Reading the documentation, I did not understand very well what DictVectorizer
can be used for. Other questions come to mind: for example, how can DictVectorizer
be used for text classification? That is, how does this class help handle labelled textual data? Could anybody provide a short example, apart from the one I already read on the documentation web page?
The sklearn.feature_extraction module can be used to extract features in a format supported by machine learning algorithms from datasets consisting of formats such as text and images.
FeatureHasher implements feature hashing, aka the hashing trick. This class turns sequences of symbolic feature names (strings) into scipy.sparse matrices, using a hash function to compute the matrix column corresponding to a name.
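To make the hashing trick concrete, here is a minimal sketch: because columns are computed by hashing the feature names, no vocabulary is stored in memory, and the number of output columns is fixed up front (the `n_features=8` value below is an arbitrary choice for illustration).

```python
from sklearn.feature_extraction import FeatureHasher

# Each dict is one observation; feature names are hashed to column indices.
h = FeatureHasher(n_features=8, input_type="dict")
X = h.transform([{"length": 1, "width": 0, "height": 2},
                 {"length": 0, "width": 1, "height": 1}])

print(X.shape)  # (2, 8): 2 observations, 8 hashed columns
```

Unlike DictVectorizer, FeatureHasher cannot recover the original feature names from the columns, since hashing is one-way.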
CountVectorizer converts a collection of text documents to a matrix of token counts. This implementation produces a sparse representation of the counts using scipy.sparse.
Say your feature space consists of length, width and height, and you have 3 observations; i.e. you measure the length, width and height of 3 objects:
        length  width  height
obs.1        1      0       2
obs.2        0      1       1
obs.3        3      2       1
Another way to show this is with a list of dictionaries:
[{'height': 1, 'length': 0, 'width': 1},  # obs.2
 {'height': 2, 'length': 1, 'width': 0},  # obs.1
 {'height': 1, 'length': 3, 'width': 2}]  # obs.3
DictVectorizer goes the other way around; i.e. given the list of dictionaries, it builds the table above:
>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer(sparse=False)
>>> d = [{'height': 1, 'length': 0, 'width': 1},
... {'height': 2, 'length': 1, 'width': 0},
... {'height': 1, 'length': 3, 'width': 2}]
>>> v.fit_transform(d)
array([[ 1.,  0.,  1.],   # obs.2
       [ 2.,  1.,  0.],   # obs.1
       [ 1.,  3.,  2.]])  # obs.3
      # height  length  width