Say I have a dataframe in Pandas like the following:
> my_dataframe
col1 col2
A foo
B bar
C something
A foo
A bar
B foo
where rows represent instances and columns represent input features (the target label is not shown, but this would be for a classification task); that is, I am trying to build X out of my_dataframe.
How can I vectorize this efficiently using, e.g., DictVectorizer?
Do I need to convert each and every entry in my DataFrame to a dictionary first? (that's the way it is done in the example in the link above). Is there a more efficient way to do this?
Generally, scikit-learn works on any numeric data stored as numpy arrays or scipy sparse matrices. Other types that are convertible to numeric arrays such as pandas DataFrame are also acceptable.
Be aware that "vectorization" has more than one meaning. In pandas, it first of all means a batch API that operates on a whole column at once. Numeric code in pandas often also benefits from the second meaning of vectorization, a much faster native-code loop. Vectorized string operations in pandas can often be slower, since they do not use native-code loops.
Just like pandas and NumPy, scikit-learn is a Python library, but it is more specific to machine learning. It includes everything from dataset manipulation to evaluation metrics.
First, it is not clear to me which part of your sample array holds the features and which holds the observations.
Second, DictVectorizer holds no data; it is only a transformation utility that also stores metadata. After fitting, it stores the feature names and the mapping. It returns a feature matrix (a SciPy sparse matrix by default, or a NumPy array when constructed with sparse=False) that is used for further computation. The matrix has shape number of observations x number of features, and each value is the value of that feature for that observation. So if you know your observations and features, you can create this array any other way you like.
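As one illustration of building the matrix "any other way", here is a minimal sketch that one-hot encodes the columns directly with pandas.get_dummies instead of DictVectorizer. The sample data mirrors the frame used in this answer; the variable names (dummies, X, feature_names) are only for illustration, and this is a swapped-in technique rather than what DictVectorizer does internally.

```python
import pandas as pd

# Sample frame: rows are observations, columns are categorical features.
df = pd.DataFrame({
    "col1": ["A", "B", "C", "A", "A", "B"],
    "col2": ["foo", "bar", "foo", "bar", "foo", "bar"],
})

# One indicator column per (column, value) pair, e.g. "col1_A".
dummies = pd.get_dummies(df)

# Dense float matrix of shape (n_observations, n_features).
X = dummies.to_numpy(dtype=float)
feature_names = list(dummies.columns)
```

Note that get_dummies only knows the categories present in the frame it is given, so for separate train and test sets a fitted transformer such as DictVectorizer is usually the safer choice.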
If you expect sklearn to do it for you, you don't have to construct the dicts manually; this can be done with to_dict applied to the transposed dataframe:
>>> df
col1 col2
0 A foo
1 B bar
2 C foo
3 A bar
4 A foo
5 B bar
>>> list(df.T.to_dict().values())
[{'col2': 'foo', 'col1': 'A'}, {'col2': 'bar', 'col1': 'B'}, {'col2': 'foo', 'col1': 'C'}, {'col2': 'bar', 'col1': 'A'}, {'col2': 'foo', 'col1': 'A'}, {'col2': 'bar', 'col1': 'B'}]
Since pandas 0.13.0 (Jan 3, 2014), the to_dict() method accepts a new 'records' option, so now you can simply use this method without additional manipulation:
>>> df = pandas.DataFrame({'col1': ['A', 'B', 'C', 'A', 'A', 'B'], 'col2': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar']})
>>> df
col1 col2
0 A foo
1 B bar
2 C foo
3 A bar
4 A foo
5 B bar
>>> df.to_dict('records')
[{'col2': 'foo', 'col1': 'A'}, {'col2': 'bar', 'col1': 'B'}, {'col2': 'foo', 'col1': 'C'}, {'col2': 'bar', 'col1': 'A'}, {'col2': 'foo', 'col1': 'A'}, {'col2': 'bar', 'col1': 'B'}]
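To connect this back to the question, the list of records can be passed straight to DictVectorizer. The following is a minimal sketch, assuming scikit-learn and pandas are installed; sparse=False is chosen here only so the result is a plain NumPy array that is easy to inspect (the default is a SciPy sparse matrix).

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

df = pd.DataFrame({
    "col1": ["A", "B", "C", "A", "A", "B"],
    "col2": ["foo", "bar", "foo", "bar", "foo", "bar"],
})

# One dict per row, keyed by column name.
records = df.to_dict("records")

# String values are one-hot encoded as "column=value" features.
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(records)

# Feature names learned during fitting, in column order of X.
feature_names = vec.feature_names_
```

The fitted vec object can later be reused with vec.transform(...) on new records so that they are encoded with the same feature mapping.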
Take a look at sklearn-pandas, which provides exactly what you're looking for. The corresponding GitHub repo is here.