How to keep column names when converting from pandas to numpy

According to this post, I should be able to access the names of columns in an ndarray as a.dtype.names.

However, if I convert a pandas DataFrame to an ndarray with df.as_matrix() or df.values, then dtype.names is None. Additionally, if I try to assign column names to the ndarray:

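import pandas as pd

X = pd.DataFrame(dict(age=[40., 50., 60.], sys_blood_pressure=[140., 150., 160.]))
print(X)
# X.values is used below; X.as_matrix() behaved the same but was removed in pandas 1.0
print(type(X.values))     # <class 'numpy.ndarray'>
print(type(X.values[0]))  # <class 'numpy.ndarray'>

m = X.values
m.dtype.names = list(X.columns)  # raises the ValueError shown below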

I get

ValueError: there are no fields defined

UPDATE:

I'm particularly interested in the cases where the matrix only needs to hold a single type (it is an ndarray of a specific numeric type), since I'd also like to use Cython for optimization. (I suspect numpy records and structured arrays are more difficult to deal with since they're more freely typed.)

Really, I'd just like to maintain the column-name metadata for arrays passed through a deep tree of scikit-learn predictors. The .fit(X, y) and .predict(X) API doesn't permit passing additional metadata about the column labels outside of the X and y objects.

asked Nov 11 '16 at 18:11 by user48956



1 Answer

Pandas dataframe also has a handy to_records method. Demo:

import pandas as pd

X = pd.DataFrame(dict(age=[40., 50., 60.],
                      sys_blood_pressure=[140., 150., 160.]))
# index=False keeps the row index out of the record fields
m = X.to_records(index=False)
print(repr(m))

Returns:

rec.array([(40.0, 140.0), (50.0, 150.0), (60.0, 160.0)], 
          dtype=[('age', '<f8'), ('sys_blood_pressure', '<f8')])

This is a "record array", which is an ndarray subclass that allows field access using attributes, e.g. m.age in addition to m['age'].
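As a small illustration of the field access described above, the sketch below rebuilds the same DataFrame and reads the column names back from the record array. The variable names `names` and `same` are introduced only for this example.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(dict(age=[40., 50., 60.],
                      sys_blood_pressure=[140., 150., 160.]))
m = X.to_records(index=False)

# The column names survive as the field names of the structured dtype.
names = m.dtype.names
print(names)

# Attribute access and key access return the same column.
same = np.array_equal(m.age, m['age'])
print(same)
```

So, unlike the plain ndarray from df.values, here dtype.names is populated and can travel with the array.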

You can pass this to a cython function as a regular float array by constructing a view:

# Reinterpret each 2-field record as two consecutive float64 values
m_float = m.view(float).reshape(m.shape + (-1,))
print(repr(m_float))

Which gives:

rec.array([[  40.,  140.],
           [  50.,  150.],
           [  60.,  160.]], 
          dtype=float64)
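Because this is a view and not a copy, the 2-D float array and the record array should refer to the same buffer. The following sketch checks that by writing through the view, assuming all columns are float64 as in the demo; the value 41.0 is arbitrary and only used for the check.

```python
import pandas as pd

X = pd.DataFrame(dict(age=[40., 50., 60.],
                      sys_blood_pressure=[140., 150., 160.]))
m = X.to_records(index=False)

# Same view construction as above: one row per record, one column per field.
m_float = m.view(float).reshape(m.shape + (-1,))

# Write through the 2-D view, then read the value back by field name.
m_float[0, 0] = 41.0
print(m.age[0])
print(m_float.shape)
```

This means a function that modifies m_float in place also modifies the named fields of m.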

Note that for this to work, every column of the original DataFrame must have a float dtype. To make sure, use m = X.astype(float, copy=False).to_records(index=False).
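A minimal sketch of that precaution, assuming a DataFrame where one column holds integers: the cast to float happens before to_records, and the field names are then used to look up column positions in the float view. The `col` lookup dict is a hypothetical helper for this example, and astype is called without copy=False here, since that keyword only affects whether a copy is avoided.

```python
import pandas as pd

# 'age' is an integer column here, so the records would otherwise mix int64 and float64.
X = pd.DataFrame(dict(age=[40, 50, 60],
                      sys_blood_pressure=[140., 150., 160.]))

# Cast every column to float first so all fields share one dtype.
m = X.astype(float).to_records(index=False)
m_float = m.view(float).reshape(m.shape + (-1,))

# Hypothetical helper: map each column name to its position in m_float.
col = {name: i for i, name in enumerate(m.dtype.names)}
ages = m_float[:, col['age']]
print(ages)
```

The plain float array can then go to numeric code, while m.dtype.names (or the col mapping) keeps the labels available next to it.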

answered Sep 28 '22 at 03:09 by user7138814