Vectorizing a Pandas dataframe for Scikit-Learn

Say I have a dataframe in Pandas like the following:

> my_dataframe

col1   col2
A      foo
B      bar
C      something
A      foo
A      bar
B      foo

where rows represent instances and columns represent input features (the target label is not shown, but this would be for a classification task), i.e. I am trying to build X out of my_dataframe.

How can I vectorize this efficiently using, e.g., DictVectorizer?

Do I need to convert each and every entry in my DataFrame to a dictionary first? (that's the way it is done in the example in the link above). Is there a more efficient way to do this?

asked Nov 16 '13 22:11

Amelio Vazquez-Reina

2 Answers

First, it is not clear to me which parts of your sample array are features and which are observations.

Second, DictVectorizer holds no data; it is only a transformation utility that also stores metadata. After fitting, it stores the feature names and the mapping from feature names to column indices. It returns a feature matrix (a SciPy sparse matrix by default, or a NumPy array with sparse=False) that is used for further computations. The matrix has shape number of observations x number of features; numeric values are copied as they are, while each distinct string value becomes its own one-hot column such as col1=A. So if you know your observations and features, you can build this array any other way you like.
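To make the description above concrete, here is a minimal sketch of what DictVectorizer does with a list of dicts. The records (including the numeric "count" key) are hypothetical and only there to show how string and numeric values are treated differently; they are not from the question.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records, one dict per observation (not from the question).
records = [
    {"col1": "A", "col2": "foo", "count": 2},
    {"col1": "B", "col2": "bar", "count": 5},
    {"col1": "A", "col2": "bar", "count": 1},
]

vec = DictVectorizer()          # sparse=True is the default
X = vec.fit_transform(records)  # SciPy sparse matrix, one row per observation

print(X.shape)             # (number of observations, number of features)
print(vec.feature_names_)  # column names learned during fit
print(X.toarray())         # dense view, for inspection only
```

String values are expanded into one column per (key, value) pair, while the numeric "count" value is passed through unchanged. The vectorizer itself keeps only the feature names and the name-to-column mapping, not the data.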

If you want sklearn to do it for you, you don't have to build the dicts manually; calling to_dict on the transposed dataframe does it:

>>> df
  col1 col2
0    A  foo
1    B  bar
2    C  foo
3    A  bar
4    A  foo
5    B  bar
>>> list(df.T.to_dict().values())
[{'col1': 'A', 'col2': 'foo'}, {'col1': 'B', 'col2': 'bar'}, {'col1': 'C', 'col2': 'foo'}, {'col1': 'A', 'col2': 'bar'}, {'col1': 'A', 'col2': 'foo'}, {'col1': 'B', 'col2': 'bar'}]

Since pandas 0.13.0 (Jan 3, 2014), the to_dict() method accepts a new 'records' value for its orient parameter, so you can now simply call it without additional manipulations:

>>> import pandas
>>> df = pandas.DataFrame({'col1': ['A', 'B', 'C', 'A', 'A', 'B'], 'col2': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar']})
>>> df
  col1 col2
0    A  foo
1    B  bar
2    C  foo
3    A  bar
4    A  foo
5    B  bar
>>> df.to_dict('records')
[{'col1': 'A', 'col2': 'foo'}, {'col1': 'B', 'col2': 'bar'}, {'col1': 'C', 'col2': 'foo'}, {'col1': 'A', 'col2': 'bar'}, {'col1': 'A', 'col2': 'foo'}, {'col1': 'B', 'col2': 'bar'}]
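Putting the pieces together, the records produced by to_dict can be passed straight to DictVectorizer. The following is a sketch using the same df as above; sparse=False is chosen here only so the result prints as a plain NumPy array, and the variable names are illustrative.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

df = pd.DataFrame({
    "col1": ["A", "B", "C", "A", "A", "B"],
    "col2": ["foo", "bar", "foo", "bar", "foo", "bar"],
})

# One dict per row, e.g. {'col1': 'A', 'col2': 'foo'}
records = df.to_dict(orient="records")

# sparse=False returns a dense NumPy array instead of a SciPy sparse matrix
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(records)

print(vec.feature_names_)  # one column per (column name, value) pair
print(X.shape)             # (rows in df, number of one-hot columns)
print(X[0])                # encoding of the first row (A, foo)
```

For a large dataframe, keeping the default sparse output is likely the better choice, since most entries of a one-hot matrix are zero.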

answered Sep 17 '22 17:09

alko

Take a look at sklearn-pandas, which provides exactly what you're looking for. The corresponding GitHub repo is here.

answered Sep 19 '22 17:09

Matt
