Can we run scikit-learn models on Pandas DataFrames or do we need to convert DataFrames into NumPy arrays?
Generally, scikit-learn works on any numeric data stored as NumPy arrays or SciPy sparse matrices. Other types that are convertible to numeric arrays, such as a pandas DataFrame, are also acceptable.
To convert a pandas DataFrame to a NumPy array, call the DataFrame.to_numpy() method on it. The method returns a NumPy ndarray, which for a DataFrame is 2-dimensional (rows by columns).
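As a minimal sketch of that conversion (the column names "ratio" and "count" are made up for illustration), the snippet below builds a small DataFrame and inspects the array that to_numpy() returns:

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame; the column names are illustrative only.
df = pd.DataFrame({"ratio": [0.2, 0.3, 0.24], "count": [10, 12, 14]})

arr = df.to_numpy()

print(type(arr))   # a numpy.ndarray
print(arr.shape)   # (3, 2): one row per record, one column per feature
print(arr.dtype)   # float64: the int and float columns share a common dtype
```

When the columns have different numeric types, pandas picks one dtype that can hold all of them, which is why the integer column comes back as floats here.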
You can use pandas.DataFrame with sklearn, for example:
import pandas as pd
from sklearn.cluster import KMeans
data = [(0.2, 10),
        (0.3, 12),
        (0.24, 14),
        (0.8, 30),
        (0.9, 32),
        (0.85, 33.3),
        (0.91, 31),
        (0.1, 15),
        (-0.23, 45)]
p_df = pd.DataFrame(data)
kmeans = KMeans(init='k-means++', n_clusters=3, n_init=10)
kmeans.fit(p_df)
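Result (the cluster label numbers are arbitrary and may differ between runs, since no random_state is set):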
>>> kmeans.labels_
array([0, 0, 0, 2, 2, 2, 2, 0, 1], dtype=int32)
Pandas DataFrames are very good at acting like NumPy arrays when they need to. If in doubt, you can always use the values attribute to get a NumPy representation (df.values will give you a NumPy array of the values in DataFrame df). The pandas documentation recommends df.to_numpy() over df.values for new code.
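To check that the two input types behave the same, here is a small sketch that fits the same KMeans configuration once on the DataFrame and once on its NumPy representation. The column names "x" and "y" and the fixed random_state=0 are assumptions added here so the two fits are comparable; they are not part of the example above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

data = [(0.2, 10), (0.3, 12), (0.24, 14), (0.8, 30), (0.9, 32),
        (0.85, 33.3), (0.91, 31), (0.1, 15), (-0.23, 45)]

# Column names are illustrative; string names keep feature handling simple.
df = pd.DataFrame(data, columns=["x", "y"])

# Same hyperparameters and seed for both fits, so only the input type differs.
km_df = KMeans(n_clusters=3, n_init=10, random_state=0).fit(df)
km_arr = KMeans(n_clusters=3, n_init=10, random_state=0).fit(df.to_numpy())

# With identical data and seed, the cluster assignments should match.
same_labels = np.array_equal(km_df.labels_, km_arr.labels_)
print(same_labels)
```

Passing the DataFrame directly is usually the more convenient choice, since scikit-learn does the conversion internally.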