Vectorizing a Pandas dataframe for Scikit-Learn

Say I have a dataframe in Pandas like the following:

> my_dataframe

col1   col2
A      foo
B      bar
C      something
A      foo
A      bar
B      foo

where rows represent instances and columns represent input features (the target label is not shown, but this would be for a classification task), i.e. I am trying to build X out of my_dataframe.

How can I vectorize this efficiently using, e.g., DictVectorizer?

Do I need to convert each and every entry in my DataFrame to a dictionary first? (that's the way it is done in the example in the link above). Is there a more efficient way to do this?

Amelio Vazquez-Reina asked Nov 16 '13 22:11


2 Answers

First, I don't see which parts of your sample array are features and which are observations.

Second, DictVectorizer holds no data itself; it is only a transformation utility that also stores metadata. After fitting, it stores the feature names and the mapping from feature names to column indices. It returns a feature matrix used for further computations: a SciPy sparse matrix by default, or a NumPy array if it is constructed with sparse=False. The matrix has shape (number of observations) x (number of features), and each entry is the value of that feature for that observation. So if you know your observations and features, you can create this array any other way you like.
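To make the shape and the stored metadata concrete, here is a minimal sketch. The three rows are hypothetical and only mirror the two categorical columns from the question; they are not the asker's actual data.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical observations: one dict per row, keys are feature names.
rows = [
    {"col1": "A", "col2": "foo"},
    {"col1": "B", "col2": "bar"},
    {"col1": "C", "col2": "something"},
]

vectorizer = DictVectorizer()        # sparse=True is the default
X = vectorizer.fit_transform(rows)   # SciPy sparse matrix, one row per dict

# String values are one-hot encoded as "<key>=<value>" features.
print(vectorizer.feature_names_)     # feature names, in column order
print(vectorizer.vocabulary_)        # feature name -> column index
print(X.shape)                       # (observations, features)
print(X.toarray())                   # dense view, for inspection only
```

The vectorizer keeps only the feature names and the column mapping; the matrix itself is returned to the caller.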

If you would rather have sklearn do it for you, you don't have to build the dicts manually; calling to_dict on the transposed dataframe does it:

>>> df
  col1 col2
0    A  foo
1    B  bar
2    C  foo
3    A  bar
4    A  foo
5    B  bar
>>> df.T.to_dict().values()
[{'col2': 'foo', 'col1': 'A'}, {'col2': 'bar', 'col1': 'B'}, {'col2': 'foo', 'col1': 'C'}, {'col2': 'bar', 'col1': 'A'}, {'col2': 'foo', 'col1': 'A'}, {'col2': 'bar', 'col1': 'B'}]

Since pandas 0.13.0 (Jan 3, 2014), the to_dict() method accepts a new 'records' option, so now you can simply use this method without additional manipulations:

>>> df = pandas.DataFrame({'col1': ['A', 'B', 'C', 'A', 'A', 'B'], 'col2': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar']})
>>> df
  col1 col2
0    A  foo
1    B  bar
2    C  foo
3    A  bar
4    A  foo
5    B  bar
>>> df.to_dict('records')
[{'col2': 'foo', 'col1': 'A'}, {'col2': 'bar', 'col1': 'B'}, {'col2': 'foo', 'col1': 'C'}, {'col2': 'bar', 'col1': 'A'}, {'col2': 'foo', 'col1': 'A'}, {'col2': 'bar', 'col1': 'B'}]
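Putting the two steps together, the records can be passed straight to DictVectorizer. This is a sketch under the assumption that all columns are categorical strings, as in the example dataframe above; sparse=False is chosen here only so the result prints as a plain NumPy array.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

df = pd.DataFrame({
    "col1": ["A", "B", "C", "A", "A", "B"],
    "col2": ["foo", "bar", "foo", "bar", "foo", "bar"],
})

# One dict per row, e.g. {'col1': 'A', 'col2': 'foo'}.
records = df.to_dict(orient="records")

# sparse=False returns a dense NumPy array instead of a SciPy sparse matrix.
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(records)

print(vectorizer.feature_names_)  # one-hot feature names, in column order
print(X.shape)                    # (rows in df, one-hot features)
```

For large dataframes, the default sparse output is likely the better choice, since one-hot matrices are mostly zeros.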
alko answered Sep 17 '22 17:09


Take a look at sklearn-pandas, which provides exactly what you're looking for. The corresponding GitHub repo is here.

Matt answered Sep 19 '22 17:09
