According to this post, I should be able to access the names of columns in an ndarray as a.dtype.names
However, if I convert a pandas DataFrame to an ndarray with df.as_matrix() or df.values, then the dtype.names field is None. Additionally, if I try to assign column names to the ndarray
import pandas as pd

X = pd.DataFrame(dict(age=[40., 50., 60.], sys_blood_pressure=[140., 150., 160.]))
print X
print type(X.as_matrix())     # <type 'numpy.ndarray'>
print type(X.as_matrix()[0])  # <type 'numpy.ndarray'>

m = X.as_matrix()
m.dtype.names = list(X.columns)
I get
ValueError: there are no fields defined
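The error makes sense once you see what .values / .as_matrix() actually return: a plain homogeneous ndarray with a single dtype (here float64) and no fields, so there is nothing for dtype.names to refer to. As a rough sketch (not part of the original question), names only exist on a structured dtype, which you would have to build yourself:

import numpy as np
import pandas as pd

X = pd.DataFrame(dict(age=[40., 50., 60.], sys_blood_pressure=[140., 150., 160.]))

plain = X.values
print(plain.dtype)        # float64 -- one dtype for the whole block
print(plain.dtype.names)  # None -- a homogeneous ndarray has no fields

# A structured array does carry names, but has to be filled field by field:
structured = np.zeros(len(X), dtype=[(c, 'f8') for c in X.columns])
for c in X.columns:
    structured[c] = X[c].values
print(structured.dtype.names)  # ('age', 'sys_blood_pressure')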
UPDATE:
I'm particularly interested in the cases where the matrix only needs to hold a single type (it is an ndarray of a specific numeric type), since I'd also like to use Cython for optimization. (I suspect numpy record arrays and structured arrays are more difficult to deal with since they allow heterogeneous column types.)
Really, I'd just like to maintain the column-name metadata for arrays passed through a deep tree of scikit-learn predictors. The .fit(X, y) and .predict(X) interface doesn't permit passing additional metadata about the column labels outside of the X and y objects.
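One possible workaround, sketched below under the assumption that the estimators in the tree pass X through untouched (many scikit-learn estimators run inputs through check_array / np.asarray, which would strip the extra attribute), is a small ndarray subclass that carries the column names along as metadata. NamedArray is a hypothetical name, not anything provided by numpy or scikit-learn:

import numpy as np

class NamedArray(np.ndarray):
    # Hypothetical: a plain numeric ndarray that remembers its column names.
    def __new__(cls, input_array, columns=None):
        obj = np.asarray(input_array).view(cls)
        obj.columns = list(columns) if columns is not None else None
        return obj

    def __array_finalize__(self, obj):
        if obj is None:
            return
        # Propagate the names through slicing and views.
        self.columns = getattr(obj, 'columns', None)

# X_named = NamedArray(X.values, columns=X.columns)
# X_named is still a contiguous float64 array (Cython-friendly), but a custom
# estimator deeper in the tree can read X_named.columns if it wants to.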
Pandas DataFrame also has a handy to_records method. Demo:
X = pd.DataFrame(dict(age=[40., 50., 60.],
sys_blood_pressure=[140.,150.,160.]))
m = X.to_records(index=False)
print repr(m)
Returns:
rec.array([(40.0, 140.0), (50.0, 150.0), (60.0, 160.0)],
dtype=[('age', '<f8'), ('sys_blood_pressure', '<f8')])
This is a "record array", which is an ndarray subclass that allows field access using attributes, e.g. m.age, in addition to item access like m['age'].
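As a quick check (just a usage example of the record array above), the column names are available again through dtype.names:

print(m.dtype.names)            # ('age', 'sys_blood_pressure')
print(m.age)                    # attribute access
print(m['sys_blood_pressure'])  # item access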
You can pass this to a cython function as a regular float array by constructing a view:
m_float = m.view(float).reshape(m.shape + (-1,))
print repr(m_float)
Which gives:
rec.array([[ 40., 140.],
[ 50., 150.],
[ 60., 160.]],
dtype=float64)
Note that in order for this to work, the original DataFrame must have a float dtype for every column. To make sure, use m = X.astype(float, copy=False).to_records(index=False).
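For instance (a sketch, with a hypothetical mixed-type X_mixed): an integer column would give the record array a '<i8' field next to the '<f8' ones, and the .view(float) above would then fail or reinterpret the raw integer bytes instead of converting the values. Casting first avoids that:

X_mixed = pd.DataFrame(dict(age=[40, 50, 60],  # ints this time
                            sys_blood_pressure=[140., 150., 160.]))
m = X_mixed.astype(float, copy=False).to_records(index=False)
m_float = m.view(float).reshape(m.shape + (-1,))
print(m_float.dtype)  # float64 -- safe to pass on to Cython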