I'm using the excellent read_csv()
function from pandas, which gives:
In [31]: data = pandas.read_csv("lala.csv", delimiter=",")
In [32]: data
Out[32]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12083 entries, 0 to 12082
Columns: 569 entries, REGIONC to SCALEKER
dtypes: float64(51), int64(518)
but when I apply a function from scikit-learn, I lose the information about the columns:
from sklearn import preprocessing
preprocessing.scale(data)
which returns a plain NumPy array.
Is there a way to apply scikit-learn or NumPy functions to DataFrames without losing this information?
Generally, scikit-learn works on any numeric data stored as NumPy arrays or SciPy sparse matrices. Other types that are convertible to numeric arrays, such as a pandas DataFrame, are also acceptable.
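That conversion is exactly why the labels disappear: scikit-learn only ever sees the underlying numeric array. A minimal sketch, using a small toy frame, of roughly what happens to the input internally:

import numpy as np
import pandas as pd

# A small toy DataFrame; scikit-learn only sees the underlying numeric array
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

arr = np.asarray(df)   # roughly the conversion scikit-learn applies to its input
print(type(arr))       # <class 'numpy.ndarray'>
print(arr.shape)       # (3, 2) -- the index and column labels are gone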
Pandas is built on top of NumPy: the pandas package depends on the NumPy package, and it is designed to interoperate with many other third-party libraries. So NumPy is required for pandas to operate at all.
Pandas tends to perform better when the number of rows is 500K or more, while NumPy performs better with 50K rows or fewer. Indexing a pandas Series is also slow compared to indexing a NumPy array.
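You can check the indexing overhead yourself with a rough micro-benchmark (a sketch only; the exact numbers depend on your pandas/NumPy versions and hardware):

import timeit
import numpy as np
import pandas as pd

arr = np.arange(1000000)
ser = pd.Series(arr)

# Element access through the Series layer goes through extra indirection,
# so it is typically slower than raw ndarray access
print(timeit.timeit(lambda: arr[500000], number=100000))
print(timeit.timeit(lambda: ser[500000], number=100000))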
You could roughly define a Series as a wrapper around a NumPy array, and a DataFrame as a collection of Series with a shared index.
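A quick sketch of that relationship (Series.to_numpy() assumes pandas 0.24+; on older versions use .values instead):

import pandas as pd

s = pd.Series([10, 20, 30], index=["x", "y", "z"])
print(s.to_numpy())   # [10 20 30] -- the wrapped NumPy array, labels stripped

# A DataFrame is roughly a dict of Series sharing one index
df = pd.DataFrame({"col1": s, "col2": s * 2})
print(df.index)       # Index(['x', 'y', 'z'], dtype='object')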
This can be done by wrapping the returned array in a DataFrame, restoring the original index and columns information:
import pandas as pd
from sklearn import preprocessing

pd.DataFrame(preprocessing.scale(data), index=data.index, columns=data.columns)
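If you need this pattern often, you could factor it into a small helper (a sketch; rewrap is a hypothetical name, not part of pandas or scikit-learn, and it assumes the wrapped function preserves the input's shape):

import pandas as pd

def rewrap(func, df, *args, **kwargs):
    # Hypothetical helper: apply an array-returning function to a DataFrame
    # and restore the original index and columns (assumes shape is preserved)
    return pd.DataFrame(func(df, *args, **kwargs),
                        index=df.index, columns=df.columns)

# Usage: scaled keeps the same labels as data
# scaled = rewrap(preprocessing.scale, data)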