I am interested in knowing how to convert a pandas dataframe into a NumPy array.
dataframe:
import numpy as np import pandas as pd index = [1, 2, 3, 4, 5, 6, 7] a = [np.nan, np.nan, np.nan, 0.1, 0.1, 0.1, 0.1] b = [0.2, np.nan, 0.2, 0.2, 0.2, np.nan, np.nan] c = [np.nan, 0.5, 0.5, np.nan, 0.5, 0.5, np.nan] df = pd.DataFrame({'A': a, 'B': b, 'C': c}, index=index) df = df.rename_axis('ID')
gives
label A B C ID 1 NaN 0.2 NaN 2 NaN NaN 0.5 3 NaN 0.2 0.5 4 0.1 0.2 NaN 5 0.1 0.2 0.5 6 0.1 NaN 0.5 7 0.1 NaN NaN
I would like to convert this to a NumPy array, as so:
array([[ nan, 0.2, nan], [ nan, nan, 0.5], [ nan, 0.2, 0.5], [ 0.1, 0.2, nan], [ 0.1, 0.2, 0.5], [ 0.1, nan, 0.5], [ 0.1, nan, nan]])
How can I do this?
As a bonus, is it possible to preserve the dtypes, like this?
array([[ 1, nan, 0.2, nan], [ 2, nan, nan, 0.5], [ 3, nan, 0.2, 0.5], [ 4, 0.1, 0.2, nan], [ 5, 0.1, 0.2, 0.5], [ 6, 0.1, nan, 0.5], [ 7, 0.1, nan, nan]], dtype=[('ID', '<i4'), ('A', '<f8'), ('B', '<f8'), ('B', '<f8')])
or similar?
Pandas dataframe is a two-dimensional data structure to store and retrieve data in rows and columns format. You can convert pandas dataframe to numpy array using the df. to_numpy() method.
DataFrames and Series in PandasSeries are similar to one-dimensional NumPy arrays, with a single dtype, although with an additional index (list of row labels). DataFrames are an ordered sequence of Series, sharing the same index, with labeled columns.
A two-dimensional rectangular array to store data in rows and columns is called python matrix. Matrix is a Numpy array to store data in rows and columns. Using dataframe. to_numpy() method we can convert dataframe to Numpy Matrix.
To convert an array to a dataframe with Python you need to 1) have your NumPy array (e.g., np_array), and 2) use the pd. DataFrame() constructor like this: df = pd. DataFrame(np_array, columns=['Column1', 'Column2']) . Remember, that each column in your NumPy array needs to be named with columns.
df.to_numpy()
is better than df.values
, here's why.* It's time to deprecate your usage of values
and as_matrix()
.
pandas v0.24.0
introduced two new methods for obtaining NumPy arrays from pandas objects:
to_numpy()
, which is defined on Index
, Series
, and DataFrame
objects, andarray
, which is defined on Index
and Series
objects only.If you visit the v0.24 docs for .values
, you will see a big red warning that says:
Warning: We recommend using
DataFrame.to_numpy()
instead.
See this section of the v0.24.0 release notes, and this answer for more information.
* - to_numpy()
is my recommended method for any production code that needs to run reliably for many versions into the future. However if you're just making a scratchpad in jupyter or the terminal, using .values
to save a few milliseconds of typing is a permissable exception. You can always add the fit n finish later.
to_numpy()
In the spirit of better consistency throughout the API, a new method to_numpy
has been introduced to extract the underlying NumPy array from DataFrames.
# Setup df = pd.DataFrame(data={'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}, index=['a', 'b', 'c']) # Convert the entire DataFrame df.to_numpy() # array([[1, 4, 7], # [2, 5, 8], # [3, 6, 9]]) # Convert specific columns df[['A', 'C']].to_numpy() # array([[1, 7], # [2, 8], # [3, 9]])
As mentioned above, this method is also defined on Index
and Series
objects (see here).
df.index.to_numpy() # array(['a', 'b', 'c'], dtype=object) df['A'].to_numpy() # array([1, 2, 3])
By default, a view is returned, so any modifications made will affect the original.
v = df.to_numpy() v[0, 0] = -1 df A B C a -1 4 7 b 2 5 8 c 3 6 9
If you need a copy instead, use to_numpy(copy=True)
.
If you're using pandas 1.x, chances are you'll be dealing with extension types a lot more. You'll have to be a little more careful that these extension types are correctly converted.
a = pd.array([1, 2, None], dtype="Int64") a <IntegerArray> [1, 2, <NA>] Length: 3, dtype: Int64 # Wrong a.to_numpy() # array([1, 2, <NA>], dtype=object) # yuck, objects # Correct a.to_numpy(dtype='float', na_value=np.nan) # array([ 1., 2., nan]) # Also correct a.to_numpy(dtype='int', na_value=-1) # array([ 1, 2, -1])
This is called out in the docs.
dtypes
in the result...As shown in another answer, DataFrame.to_records
is a good way to do this.
df.to_records() # rec.array([('a', 1, 4, 7), ('b', 2, 5, 8), ('c', 3, 6, 9)], # dtype=[('index', 'O'), ('A', '<i8'), ('B', '<i8'), ('C', '<i8')])
This cannot be done with to_numpy
, unfortunately. However, as an alternative, you can use np.rec.fromrecords
:
v = df.reset_index() np.rec.fromrecords(v, names=v.columns.tolist()) # rec.array([('a', 1, 4, 7), ('b', 2, 5, 8), ('c', 3, 6, 9)], # dtype=[('index', '<U1'), ('A', '<i8'), ('B', '<i8'), ('C', '<i8')])
Performance wise, it's nearly the same (actually, using rec.fromrecords
is a bit faster).
df2 = pd.concat([df] * 10000) %timeit df2.to_records() %%timeit v = df2.reset_index() np.rec.fromrecords(v, names=v.columns.tolist()) 12.9 ms ± 511 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 9.56 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
to_numpy()
(in addition to array
) was added as a result of discussions under two GitHub issues GH19954 and GH23623.
Specifically, the docs mention the rationale:
[...] with
.values
it was unclear whether the returned value would be the actual array, some transformation of it, or one of pandas custom arrays (likeCategorical
). For example, withPeriodIndex
,.values
generates a newndarray
of period objects each time. [...]
to_numpy
aims to improve the consistency of the API, which is a major step in the right direction. .values
will not be deprecated in the current version, but I expect this may happen at some point in the future, so I would urge users to migrate towards the newer API, as soon as you can.
DataFrame.values
has inconsistent behaviour, as already noted.
DataFrame.get_values()
is simply a wrapper around DataFrame.values
, so everything said above applies.
DataFrame.as_matrix()
is deprecated now, do NOT use!
To convert a pandas dataframe (df) to a numpy ndarray, use this code:
df.values array([[nan, 0.2, nan], [nan, nan, 0.5], [nan, 0.2, 0.5], [0.1, 0.2, nan], [0.1, 0.2, 0.5], [0.1, nan, 0.5], [0.1, nan, nan]])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With