Convert pandas dataframe to NumPy array

Tags: python, arrays, pandas, dataframe, numpy

I am interested in knowing how to convert a pandas dataframe into a NumPy array.

dataframe:

    import numpy as np
    import pandas as pd

    index = [1, 2, 3, 4, 5, 6, 7]
    a = [np.nan, np.nan, np.nan, 0.1, 0.1, 0.1, 0.1]
    b = [0.2, np.nan, 0.2, 0.2, 0.2, np.nan, np.nan]
    c = [np.nan, 0.5, 0.5, np.nan, 0.5, 0.5, np.nan]
    df = pd.DataFrame({'A': a, 'B': b, 'C': c}, index=index)
    df = df.rename_axis('ID')

gives

    label   A    B    C
    ID
    1     NaN  0.2  NaN
    2     NaN  NaN  0.5
    3     NaN  0.2  0.5
    4     0.1  0.2  NaN
    5     0.1  0.2  0.5
    6     0.1  NaN  0.5
    7     0.1  NaN  NaN

I would like to convert this to a NumPy array, as so:

    array([[ nan,  0.2,  nan],
           [ nan,  nan,  0.5],
           [ nan,  0.2,  0.5],
           [ 0.1,  0.2,  nan],
           [ 0.1,  0.2,  0.5],
           [ 0.1,  nan,  0.5],
           [ 0.1,  nan,  nan]])

How can I do this?

As a bonus, is it possible to preserve the dtypes, like this?

    array([[ 1, nan,  0.2,  nan],
           [ 2, nan,  nan,  0.5],
           [ 3, nan,  0.2,  0.5],
           [ 4, 0.1,  0.2,  nan],
           [ 5, 0.1,  0.2,  0.5],
           [ 6, 0.1,  nan,  0.5],
           [ 7, 0.1,  nan,  nan]],
          dtype=[('ID', '<i4'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

or similar?

303

asked Nov 02 '12 00:11

Mister Nobody

2 Answers

`df.to_numpy()` is better than `df.values`, here's why.*

It's time to deprecate your usage of `values` and `as_matrix()`.

pandas v0.24.0 introduced two new methods for obtaining NumPy arrays from pandas objects:

1. `to_numpy()`, which is defined on `Index`, `Series`, and `DataFrame` objects, and
2. `array`, which is defined on `Index` and `Series` objects only.
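To make the difference between the two concrete, here is a minimal sketch. The Series `s` is made-up example data, not part of the original answer. The idea it illustrates: `to_numpy()` hands back a NumPy `ndarray`, while `array` returns the pandas array that backs the object.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: a categorical Series
s = pd.Series(["x", "y", "x"], dtype="category")

as_ndarray = s.to_numpy()  # plain NumPy array holding the category values
as_pandas = s.array        # the backing pandas array (a Categorical here)

print(type(as_ndarray).__name__)
print(type(as_pandas).__name__)
```

If the goal is to feed data into NumPy code, `to_numpy()` is the one to reach for; `array` is mainly useful when you want to stay inside pandas' own array types.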

If you visit the v0.24 docs for .values, you will see a big red warning that says:

Warning: We recommend using DataFrame.to_numpy() instead.

See this section of the v0.24.0 release notes, and this answer for more information.

\* `to_numpy()` is my recommended method for any production code that needs to run reliably for many versions into the future. However, if you're just making a scratchpad in Jupyter or the terminal, using `.values` to save a few milliseconds of typing is a permissible exception. You can always add the fit and finish later.

Towards Better Consistency: `to_numpy()`

In the spirit of better consistency throughout the API, a new method to_numpy has been introduced to extract the underlying NumPy array from DataFrames.

    # Setup
    df = pd.DataFrame(data={'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]},
                      index=['a', 'b', 'c'])

    # Convert the entire DataFrame
    df.to_numpy()
    # array([[1, 4, 7],
    #        [2, 5, 8],
    #        [3, 6, 9]])

    # Convert specific columns
    df[['A', 'C']].to_numpy()
    # array([[1, 7],
    #        [2, 8],
    #        [3, 9]])

As mentioned above, this method is also defined on Index and Series objects (see here).

    df.index.to_numpy()
    # array(['a', 'b', 'c'], dtype=object)

    df['A'].to_numpy()
    # array([1, 2, 3])

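By default no copy is forced, so when all columns share a single dtype the result may be a view of the DataFrame's data, and any modifications made can affect the original. (With mixed dtypes a new array has to be built anyway, and with Copy-on-Write enabled in newer pandas versions the returned array is read-only.)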

    v = df.to_numpy()
    v[0, 0] = -1

    df
       A  B  C
    a -1  4  7
    b  2  5  8
    c  3  6  9

If you need a copy instead, use to_numpy(copy=True).
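As a quick check of the `copy=True` behaviour, the sketch below (using a small made-up frame) modifies the copied array and confirms that the DataFrame is left alone. It deliberately asserts nothing about whether the default call returns a view, since that depends on the pandas version and settings.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])

# copy=True asks pandas for a fresh array that is independent of the frame
c = df.to_numpy(copy=True)
c[0, 0] = -1

print(df.loc['a', 'A'])                     # the DataFrame still holds 1
print(np.shares_memory(c, df.to_numpy()))   # the copy shares no memory
```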

pandas >= 1.0 update for ExtensionTypes

If you're using pandas 1.x, chances are you'll be dealing with extension types a lot more. You'll have to be a little more careful that these extension types are correctly converted.

    a = pd.array([1, 2, None], dtype="Int64")
    a
    <IntegerArray>
    [1, 2, <NA>]
    Length: 3, dtype: Int64

    # Wrong
    a.to_numpy()
    # array([1, 2, <NA>], dtype=object)  # yuck, objects

    # Correct
    a.to_numpy(dtype='float', na_value=np.nan)
    # array([ 1.,  2., nan])

    # Also correct
    a.to_numpy(dtype='int', na_value=-1)
    # array([ 1,  2, -1])

This is called out in the docs.
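Since the question is about DataFrames, here is the same idea sketched at the DataFrame level. The frame `df_ext` and its column names are hypothetical, and the example assumes a pandas version in which `DataFrame.to_numpy` accepts `na_value` (pandas 1.1 or later, as far as I know).

```python
import numpy as np
import pandas as pd

# Hypothetical frame mixing a nullable integer column with a float column
df_ext = pd.DataFrame({
    'n': pd.array([1, 2, None], dtype='Int64'),
    'x': [0.5, 1.5, 2.5],
})

# Ask for a float array and map pd.NA to NaN explicitly
arr = df_ext.to_numpy(dtype='float64', na_value=np.nan)

print(arr.dtype)
print(arr)
```

Passing both `dtype` and `na_value` keeps the result a numeric array instead of falling back to `object`.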

If you need the `dtypes` in the result...

As shown in another answer, DataFrame.to_records is a good way to do this.

    df.to_records()
    # rec.array([('a', 1, 4, 7), ('b', 2, 5, 8), ('c', 3, 6, 9)],
    #           dtype=[('index', 'O'), ('A', '<i8'), ('B', '<i8'), ('C', '<i8')])

This cannot be done with to_numpy, unfortunately. However, as an alternative, you can use np.rec.fromrecords:

    v = df.reset_index()
    np.rec.fromrecords(v, names=v.columns.tolist())
    # rec.array([('a', 1, 4, 7), ('b', 2, 5, 8), ('c', 3, 6, 9)],
    #           dtype=[('index', '<U1'), ('A', '<i8'), ('B', '<i8'), ('C', '<i8')])

Performance-wise, the two are nearly the same (in the timing below, `np.rec.fromrecords` is a bit faster).

    df2 = pd.concat([df] * 10000)

    %timeit df2.to_records()
    # 12.9 ms ± 511 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %%timeit
    v = df2.reset_index()
    np.rec.fromrecords(v, names=v.columns.tolist())
    # 9.56 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
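Applied to the DataFrame from the question, the `to_records` approach might look like the sketch below. The names `df_q`, `flat`, and `rec` are mine, and the `index_dtypes='<i4'` argument is only there to mirror the `'<i4'` ID field the question asks for; leave it out to keep the index's own dtype.

```python
import numpy as np
import pandas as pd

index = [1, 2, 3, 4, 5, 6, 7]
a = [np.nan, np.nan, np.nan, 0.1, 0.1, 0.1, 0.1]
b = [0.2, np.nan, 0.2, 0.2, 0.2, np.nan, np.nan]
c = [np.nan, 0.5, 0.5, np.nan, 0.5, 0.5, np.nan]
df_q = pd.DataFrame({'A': a, 'B': b, 'C': c}, index=index).rename_axis('ID')

# One homogeneous array: the integer ID is upcast to float alongside A, B, C
flat = df_q.reset_index().to_numpy()
print(flat.dtype)

# A record array keeps one dtype per field; index_dtypes sets the ID field's dtype
rec = df_q.to_records(index_dtypes='<i4')
print(rec.dtype)
print(rec['ID'])
```

The record array can then be indexed by field name (`rec['A']`, `rec['ID']`), which is the closest match to the "preserve the dtypes" layout in the question.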

Rationale for Adding a New Method

to_numpy() (in addition to array) was added as a result of discussions under two GitHub issues GH19954 and GH23623.

Specifically, the docs mention the rationale:

[...] with .values it was unclear whether the returned value would be the actual array, some transformation of it, or one of pandas custom arrays (like Categorical). For example, with PeriodIndex, .values generates a new ndarray of period objects each time. [...]

`to_numpy` aims to improve the consistency of the API, which is a major step in the right direction. `.values` will not be deprecated in the current version, but I expect this may happen at some point in the future, so I would urge users to migrate to the newer API as soon as they can.

Critique of Other Solutions

DataFrame.values has inconsistent behaviour, as already noted.

DataFrame.get_values() is simply a wrapper around DataFrame.values, so everything said above applies.

DataFrame.as_matrix() is deprecated (and was removed in pandas 1.0); do NOT use it!

123

answered Oct 05 '22 10:10

cs95

To convert a pandas dataframe (df) to a numpy ndarray, use this code:

    df.values

    array([[nan, 0.2, nan],
           [nan, nan, 0.5],
           [nan, 0.2, 0.5],
           [0.1, 0.2, nan],
           [0.1, 0.2, 0.5],
           [0.1, nan, 0.5],
           [0.1, nan, nan]])
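For a frame like the one in the question, where every column is `float64`, `df.values` and `df.to_numpy()` should give the same array. The sketch below checks that; it assumes pandas 0.24 or later for `to_numpy()` and NumPy 1.19 or later for the `equal_nan` argument of `np.array_equal`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [np.nan, np.nan, np.nan, 0.1, 0.1, 0.1, 0.1],
    'B': [0.2, np.nan, 0.2, 0.2, 0.2, np.nan, np.nan],
    'C': [np.nan, 0.5, 0.5, np.nan, 0.5, 0.5, np.nan],
}, index=[1, 2, 3, 4, 5, 6, 7]).rename_axis('ID')

arr_values = df.values
arr_to_numpy = df.to_numpy()

# NaN != NaN under ordinary comparison, so compare with equal_nan=True
same = np.array_equal(arr_values, arr_to_numpy, equal_nan=True)
print(arr_to_numpy.shape, arr_to_numpy.dtype, same)
```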

answered Oct 05 '22 11:10

User456898
