How to keep column names when converting from pandas to numpy

Tags:

python

pandas

numpy

According to this post, I should be able to access the names of columns in an ndarray as a.dtype.names.

However, if I convert a pandas DataFrame to an ndarray with df.as_matrix() or df.values, then the dtype.names field is None. Additionally, if I try to assign column names to the ndarray:

import pandas as pd

X = pd.DataFrame(dict(age=[40., 50., 60.], sys_blood_pressure=[140., 150., 160.]))
print(X)
print(type(X.values))     # <class 'numpy.ndarray'>
print(type(X.values[0]))  # <class 'numpy.ndarray'>

# X.values replaces X.as_matrix(), which was removed in pandas 1.0
m = X.values
m.dtype.names = list(X.columns)  # raises the ValueError shown below

I get

ValueError: there are no fields defined

UPDATE:

I'm particularly interested in the cases where the matrix only needs to hold a single type (it is an ndarray of a specific numeric type), since I'd also like to use cython for optimization. (I suspect numpy records and structured arrays are more difficult to deal with since they're more freely typed.)

Really, I'd just like to maintain the column-name metadata for arrays passed through a deep tree of scikit-learn predictors. The .fit(X, y) and .predict(X) methods of its interface don't permit passing additional metadata about the column labels outside of the X and y objects.


asked Nov 11 '16 18:11

user48956

1 Answer

A pandas DataFrame also has a handy to_records method. Demo:

import pandas as pd

X = pd.DataFrame(dict(age=[40., 50., 60.],
                      sys_blood_pressure=[140., 150., 160.]))
# index=False leaves the DataFrame index out of the record fields
m = X.to_records(index=False)
print(repr(m))

Returns:

rec.array([(40.0, 140.0), (50.0, 150.0), (60.0, 160.0)], 
          dtype=[('age', '<f8'), ('sys_blood_pressure', '<f8')])

This is a "record array", which is an ndarray subclass that allows field access using attributes, e.g. m.age in addition to m['age'].
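As a small sketch of the two access styles, assuming the same X as in the demo: the column labels travel with the array in m.dtype.names, which is the attribute the question was trying to populate.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(dict(age=[40., 50., 60.],
                      sys_blood_pressure=[140., 150., 160.]))
m = X.to_records(index=False)

# The structured dtype carries the column names.
names = m.dtype.names
print(names)  # ('age', 'sys_blood_pressure')

# Attribute access and key access return the same column data.
age_by_attr = m.age
age_by_key = m['age']
print(np.array_equal(age_by_attr, age_by_key))  # True
```
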

You can pass this to a cython function as a regular float array by constructing a view:

m_float = m.view(float).reshape(m.shape + (-1,))
print repr(m_float)

Which gives:

rec.array([[  40.,  140.],
           [  50.,  150.],
           [  60.,  160.]], 
          dtype=float64)

Note that for this to work, every column of the original DataFrame must have a float dtype. To make sure, use m = X.astype(float, copy=False).to_records(index=False).
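To illustrate the dtype requirement, here is a minimal sketch with a hypothetical frame, X_mixed, that has one integer column (not part of the original question). It uses a plain astype(float) without the copy argument and checks that all fields share one dtype before taking the view.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one integer column to show the dtype issue.
X_mixed = pd.DataFrame(dict(age=[40, 50, 60],
                            sys_blood_pressure=[140., 150., 160.]))

# Cast every column to float before building the record array.
m = X_mixed.astype(float).to_records(index=False)

# All fields must share one dtype for the flat float view to be valid.
assert all(m.dtype[name] == np.float64 for name in m.dtype.names)
m_float = m.view(float).reshape(m.shape + (-1,))

# Keep the labels next to the 2-D float array.
column_names = m.dtype.names
print(column_names)   # ('age', 'sys_blood_pressure')
print(m_float.shape)  # (3, 2)
```

Without the cast, the integer and float fields would have different dtypes and the check above would fail, so the cast is the safer default when the column types are not known in advance.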

answered Sep 28 '22 03:09

user7138814
