A pandas DataFrame is a two-dimensional data structure that stores data in rows and columns. You can convert a DataFrame to a NumPy array with the df.to_numpy() method.
to_numpy() is also available on a Series or an Index, where it returns a NumPy ndarray representing the values. The conversion is simple, but one detail matters: a Series carries an index alongside its values, and a NumPy array does not.
The essential difference is the presence of the index: while the NumPy array has an implicitly defined integer index used to access the values, the pandas Series has an explicitly defined index associated with the values.
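A minimal sketch of that difference, using made-up values and labels that are not from the original text: the array is addressed by position only, while the Series is addressed by the labels in its index.

```python
import numpy as np
import pandas as pd

# Illustrative data only; the values and labels are assumptions.
arr = np.array([10, 20, 30])
ser = pd.Series([10, 20, 30], index=["a", "b", "c"])

first_by_position = arr[0]   # NumPy: implicit integer position
first_by_label = ser["a"]    # pandas: explicit label from the index

# The labels themselves can be pulled out as a NumPy array.
index_values = ser.index.to_numpy()
```

Both lookups return the same element; only the way it is addressed differs.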
To get a NumPy array, you should use the values attribute:
In [1]: df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c']); df
A B
a 1 4
b 2 5
c 3 6
In [2]: df.index.values
Out[2]: array(['a', 'b', 'c'], dtype=object)
This accesses how the data is already stored, so there's no need for a conversion.
Note: This attribute is also available on many other pandas objects.
In [3]: df['A'].values
Out[3]: array([1, 2, 3])
To get the index as a list, call tolist:
In [4]: df.index.tolist()
Out[4]: ['a', 'b', 'c']
And similarly, for columns.
You can use df.index to access the index object and then get the values in a list using df.index.tolist(). Similarly, you can use df['col'].tolist() for Series.
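The "similarly, for columns" case can be sketched as follows, reusing the small frame from the transcript above. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["a", "b", "c"])

# df.columns is an Index too, so the same methods apply.
column_labels = df.columns.tolist()
column_array = df.columns.to_numpy()

# Row labels and a single column, for comparison.
row_labels = df.index.tolist()
col_a = df["A"].tolist()
```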
Move away from .values in favour of these methods!
From v0.24.0 onwards, there are two new, preferred ways to obtain the underlying arrays from Index, Series, and DataFrame objects: the to_numpy() method and the .array attribute. Regarding usage, the docs mention:
We haven’t removed or deprecated Series.values or DataFrame.values, but we highly recommend using .array or .to_numpy() instead.
See this section of the v0.24.0 release notes for more information.
to_numpy() Method
df.index.to_numpy()
# array(['a', 'b'], dtype=object)
df['A'].to_numpy()
# array([1, 4])
By default, a view is returned. Any modifications made will affect the original.
v = df.index.to_numpy()
v[0] = -1
df
A B
-1 1 2
b 4 5
If you need a copy instead, use to_numpy(copy=True):
v = df.index.to_numpy(copy=True)
v[-1] = -123
df
A B
a 1 2
b 4 5
Note that this function also works for DataFrames (while .array does not).
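As a short sketch of the DataFrame case, using the same 2x2 frame as the setup in this answer: calling to_numpy() on the whole frame gives a two-dimensional array, while the frame has no .array attribute at all. The hasattr check is only there to illustrate that point.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [4, 5]], columns=["A", "B"], index=["a", "b"])

# Whole-frame conversion: one row of the array per row of the frame.
values_2d = df.to_numpy()

# .array exists on Index and Series, but not on DataFrame.
has_array_attr = hasattr(df, "array")
```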
array Attribute
This attribute returns an ExtensionArray object that backs the Index/Series.
pd.__version__
# '0.24.0rc1'
# Setup.
df = pd.DataFrame([[1, 2], [4, 5]], columns=['A', 'B'], index=['a', 'b'])
df
A B
a 1 2
b 4 5
df.index.array
# <PandasArray>
# ['a', 'b']
# Length: 2, dtype: object
df['A'].array
# <PandasArray>
# [1, 4]
# Length: 2, dtype: int64
From here, it is possible to get a list using list:
list(df.index.array)
# ['a', 'b']
list(df['A'].array)
# [1, 4]
or, just directly call .tolist():
df.index.tolist()
# ['a', 'b']
df['A'].tolist()
# [1, 4]
Regarding what is returned, the docs mention:
For Series and Indexes backed by normal NumPy arrays, Series.array will return a new arrays.PandasArray, which is a thin (no-copy) wrapper around a numpy.ndarray. arrays.PandasArray isn’t especially useful on its own, but it does provide the same interface as any extension array defined in pandas or by a third-party library.
So, to summarise, .array will return either (1) the existing ExtensionArray backing the Index/Series, or (2) if a NumPy array backs the Index/Series, a new ExtensionArray object created as a thin wrapper over the underlying array.
Rationale for adding TWO new methods
These functions were added as a result of discussions under two GitHub issues GH19954 and GH23623.
Specifically, the docs mention the rationale:
[...] with .values it was unclear whether the returned value would be the actual array, some transformation of it, or one of pandas custom arrays (like Categorical). For example, with PeriodIndex, .values generates a new ndarray of period objects each time. [...]
These two functions aim to improve the consistency of the API, which is a major step in the right direction.
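The ambiguity the docs describe can be sketched with a categorical Series, one of the pandas custom arrays mentioned in the quote. The data here is made up for illustration: .values hands back a Categorical rather than an ndarray, whereas the two newer accessors each make a clear promise about the return type.

```python
import numpy as np
import pandas as pd

# Illustrative data; a categorical Series is not backed by a plain ndarray.
ser = pd.Series(["x", "y", "x"], dtype="category")

via_values = ser.values      # a pandas Categorical, not an ndarray
via_array = ser.array        # the backing ExtensionArray (also a Categorical)
via_numpy = ser.to_numpy()   # always a numpy.ndarray

print(type(via_values).__name__, type(via_array).__name__, type(via_numpy).__name__)
```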
Lastly, .values will not be deprecated in the current version, but I expect this may happen at some point in the future, so I would urge users to migrate to the newer API as soon as they can.
If you are dealing with a DataFrame that has a MultiIndex, you may want to extract the values of just one level of the index, selected by name. You can do this with
df.index.get_level_values('name_sub_index')
where name_sub_index must be one of the names in the FrozenList df.index.names.
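A small runnable sketch of the same idea. The level names "outer" and "inner" and the data are hypothetical stand-ins for name_sub_index, not names from the original answer.

```python
import pandas as pd

# Hypothetical two-level index for illustration.
index = pd.MultiIndex.from_tuples(
    [("x", 1), ("x", 2), ("y", 1)], names=["outer", "inner"]
)
df = pd.DataFrame({"A": [10, 20, 30]}, index=index)

# Pull out one level by name, as a list or as a NumPy array.
outer_values = df.index.get_level_values("outer").tolist()
inner_values = df.index.get_level_values("inner").to_numpy()

# The valid level names live in df.index.names.
level_names = list(df.index.names)
```

get_level_values returns an Index, so tolist() and to_numpy() work on it just as they do on a flat index.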
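Since pandas v0.13 you could also use get_values:
df.index.get_values()
Note that get_values() was deprecated in pandas 0.25.0 and removed in 1.0, so on current versions use df.index.to_numpy() instead.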