Both of the following show up in our codebase:
pandas.DataFrame.columns.values.tolist()
pandas.DataFrame.columns.tolist()
Are these always identical? I'm not sure why the values
variant shows up where it does; the direct columns.tolist() call
seems to be all that's needed to get the column names. If that's the case, I'd like to clean up the code a bit.
A bit of introspection suggests that values is just an implementation detail, a numpy.ndarray holding the labels:
>>> import pandas
>>> d = pandas.DataFrame( { 'a' : [1,2,3], 'b' : [0,1,3]} )
>>> d
   a  b
0  1  0
1  2  1
2  3  3
>>> type(d.columns)
<class 'pandas.core.indexes.base.Index'>
>>> type(d.columns.values)
<class 'numpy.ndarray'>
>>> type(d.columns.tolist())
<class 'list'>
>>> type(d.columns.values.tolist())
<class 'list'>
>>> d.columns.values
array(['a', 'b'], dtype=object)
>>> d.columns.values.tolist()
['a', 'b']
>>> d.columns
Index(['a', 'b'], dtype='object')
>>> d.columns.tolist()
['a', 'b']
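The comparison above can also be written as a direct equality check. This is only a minimal sketch: the helper name same_column_list is made up for illustration, and a single frame with string labels does not prove the two calls agree for every column dtype.

```python
import pandas as pd


def same_column_list(df):
    """Return True if both spellings yield equal lists of column labels."""
    return df.columns.tolist() == df.columns.values.tolist()


# Same small frame as in the session above.
d = pd.DataFrame({"a": [1, 2, 3], "b": [0, 1, 3]})
print(same_column_list(d))
```

Running the same check against a few representative frames from the codebase would be a reasonable step before replacing one form with the other.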
The output is the same, but on a really large DataFrame the timings differ:
import numpy as np
import pandas as pd
np.random.seed(23)
df = pd.DataFrame(np.random.randint(3, size=(5,10000)))
df.columns = df.columns.astype(str)
print(df)
In [90]: %timeit df.columns.values.tolist()
10000 loops, best of 3: 79.5 µs per loop
In [91]: %timeit df.columns.tolist()
10000 loops, best of 3: 173 µs per loop
The two also go through different functions:
Index.values followed by numpy.ndarray.tolist
Index.tolist
Thanks to Mitch for another solution:
In [93]: %timeit list(df.columns.values)
1000 loops, best of 3: 169 µs per loop
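The %timeit lines above only work inside IPython. As a rough sketch, the same comparison can be run as a plain script with the standard timeit module. The helper best_time_us and the loop counts are assumptions chosen for illustration, and the absolute numbers will vary with the machine and the pandas/NumPy versions.

```python
import timeit

import numpy as np
import pandas as pd

# Same wide frame as in the answer: 5 rows, 10000 string-labelled columns.
np.random.seed(23)
df = pd.DataFrame(np.random.randint(3, size=(5, 10000)))
df.columns = df.columns.astype(str)

# Expressions to compare, each evaluated against `df`.
candidates = [
    "df.columns.values.tolist()",
    "df.columns.tolist()",
    "list(df.columns.values)",
]


def best_time_us(stmt, repeat=3, number=200):
    """Best per-call time in microseconds over `repeat` runs of `number` calls."""
    runs = timeit.repeat(stmt, globals={"df": df}, repeat=repeat, number=number)
    return min(runs) / number * 1e6


timings = {stmt: best_time_us(stmt) for stmt in candidates}
for stmt, us in timings.items():
    print(f"{stmt:<30} {us:8.1f} us per call")
```

Taking the minimum over several repeats mirrors the "best of 3" figure that %timeit reports.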
d = pandas.DataFrame( { 'a' : [1,2,3], 'b' : [0,1,3]} )
Or you can simply do
list(d)  # gives the same result as d.columns.tolist()
Out[327]: ['a', 'b']
# Timing
%timeit list(df)  # when I ran the timings, this was the slowest on my machine
10000 loops, best of 3: 135 µs per loop
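One thing worth checking before swapping these forms freely is the type of the elements, not just the printed output. The sketch below assumes a hypothetical frame df_int with integer column labels, which is not taken from the thread. As far as I know, both tolist() spellings convert labels to plain Python scalars, while list(df_int.columns.values) iterates the NumPy array and keeps NumPy scalar types.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose column labels are integers rather than strings.
df_int = pd.DataFrame(np.zeros((2, 3)), columns=[10, 20, 30])


def element_type(labels):
    """Return the type name of the first label in a list."""
    return type(labels[0]).__name__


# Print the element type produced by each way of listing the columns.
print("columns.tolist()        ", element_type(df_int.columns.tolist()))
print("columns.values.tolist() ", element_type(df_int.columns.values.tolist()))
print("list(df)                ", element_type(list(df_int)))
print("list(columns.values)    ", element_type(list(df_int.columns.values)))
```

The lists still compare equal element by element, so the difference only matters where the exact type does, for example when serializing the labels to JSON.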