I want to get a list of the column headers from a Pandas DataFrame. The DataFrame will come from user input, so I won't know how many columns there will be or what they will be called.
For example, if I'm given a DataFrame like this:
>>> my_dataframe
    y  gdp  cap
0   1    2    5
1   2    3    9
2   8    7    2
3   3    4    7
4   6    7    7
5   4    8    3
6   8    2    8
7   9    9   10
8   6    6    4
9  10   10    7
I would get a list like this:
>>> header_list
['y', 'gdp', 'cap']
Pandas, however, allows duplicate column names. Duplicate column names are a problem if you plan to transfer your data set to another statistical language. They are also a problem because they can cause unexpected and sometimes hard-to-debug errors in Python.
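A quick way to see whether a header list contains repeats is sketched below. The frame and its repeated "gdp" header are made up for illustration; Index.duplicated() and Index.is_unique are the only pandas calls relied on.

```python
import pandas as pd

# Hypothetical frame with a repeated header, built only for illustration.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["y", "gdp", "gdp"])

# The header list keeps the duplicate as-is.
headers = df.columns.tolist()

# Index.duplicated() flags every repeat after the first occurrence.
duplicate_headers = df.columns[df.columns.duplicated()].tolist()

print(headers)               # ['y', 'gdp', 'gdp']
print(duplicate_headers)     # ['gdp']
print(df.columns.is_unique)  # False
```

Checking is_unique before exporting is one possible guard; how to rename the repeats is left to the caller.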
To access the column names of a Pandas DataFrame, we can use the columns attribute (it is an attribute, not a method, so there are no parentheses). For example, if our DataFrame is called df, we just type print(df.columns) to see all of its column labels.
In most cases, the duplicate headers error refers to duplicate blank column headers. The solution is to remove any additional blank columns in your .csv file. Also, ensure when you are uploading that all data types are set to the “string” data type.
Make a list of the columns to extract. Use the read_csv() method to read the CSV file into a DataFrame. Print the extracted data. Plot the DataFrame using the plot() method.
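The extraction steps could look roughly like the sketch below. It assumes the data is small enough to inline, so an in-memory string stands in for a real CSV file, and the names csv_text and wanted are invented for the example. The plotting step is left out because it needs matplotlib installed.

```python
import io

import pandas as pd

# In-memory stand-in for a CSV file on disk (illustrative data only).
csv_text = "y,gdp,cap\n1,2,5\n2,3,9\n8,7,2\n"

# Columns to extract; usecols tells read_csv to load only these.
wanted = ["y", "cap"]
subset = pd.read_csv(io.StringIO(csv_text), usecols=wanted)

print(subset)

# The headers of the extracted frame, as a plain Python list.
subset_headers = subset.columns.tolist()
print(subset_headers)  # ['y', 'cap']
```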
You can get the values as a list by doing:
list(my_dataframe.columns.values)
Also you can simply use (as shown in Ed Chum's answer):
list(my_dataframe)
There is a built-in method which is the most performant:
my_dataframe.columns.values.tolist()
.columns returns an Index, .columns.values returns an array, and the array has a helper method .tolist() to return a list.
If performance is not as important to you, Index objects define a .tolist() method that you can call directly:
my_dataframe.columns.tolist()
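To make the chain of objects concrete, here is a small sketch using a shortened version of the frame from the question (only three rows, chosen for brevity). The exact array type printed may depend on the pandas version, so no output is claimed for the type lines.

```python
import pandas as pd

# Shortened stand-in for the DataFrame in the question.
my_dataframe = pd.DataFrame({"y": [1, 2, 8], "gdp": [2, 3, 7], "cap": [5, 9, 2]})

index_obj = my_dataframe.columns                    # a pandas Index
array_obj = my_dataframe.columns.values             # the underlying array
header_list = my_dataframe.columns.values.tolist()  # a plain Python list

print(type(index_obj).__name__)
print(type(array_obj).__name__)
print(header_list)  # ['y', 'gdp', 'cap']

# Calling .tolist() on the Index directly gives the same list.
same_list = my_dataframe.columns.tolist()
```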
The difference in performance is obvious:
%timeit df.columns.tolist()
16.7 µs ± 317 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit df.columns.values.tolist()
1.24 µs ± 12.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
For those who hate typing, you can just call list on df, like so:
list(df)
[*df] and Friends
Unpacking generalizations (PEP 448) were introduced in Python 3.5, so the following operations are all possible.
df = pd.DataFrame('x', columns=['A', 'B', 'C'], index=range(5))
df
   A  B  C
0  x  x  x
1  x  x  x
2  x  x  x
3  x  x  x
4  x  x  x
If you want a list...
[*df]
# ['A', 'B', 'C']
Or, if you want a set,
{*df}
# {'A', 'B', 'C'}
Or, if you want a tuple,
*df, # Please note the trailing comma
# ('A', 'B', 'C')
Or, if you want to store the result somewhere,
*cols, = df # A wild comma appears, again
cols
# ['A', 'B', 'C']
... if you're the kind of person who converts coffee to typing sounds, well, this is going to consume your coffee more efficiently ;)
P.S.: if performance is important, you will want to ditch the solutions above in favour of
df.columns.to_numpy().tolist() # ['A', 'B', 'C']
This is similar to Ed Chum's answer, but updated for v0.24, where .to_numpy() is preferred to the use of .values. See this answer (by me) for more information.
Visual Check
Since I've seen this discussed in other answers, you can use iterable unpacking (no need for explicit loops).
print(*df)
A B C
print(*df, sep='\n')
A
B
C
Don't use an explicit for loop for an operation that can be done in a single line (list comprehensions are okay).
Next, using sorted(df) does not preserve the original order of the columns. For that, you should use list(df) instead.
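A tiny illustration of the ordering point, using a two-row frame made up for the purpose:

```python
import pandas as pd

# Columns deliberately created in non-alphabetical order.
df = pd.DataFrame({"y": [1, 2], "gdp": [2, 3], "cap": [5, 9]})

original_order = list(df)  # keeps the order the columns were defined in
alphabetical = sorted(df)  # sorts the labels, losing the original order

print(original_order)  # ['y', 'gdp', 'cap']
print(alphabetical)    # ['cap', 'gdp', 'y']
```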
Next, list(df.columns) and list(df.columns.values) are poor suggestions (as of the current version, v0.24). Both Index (returned from df.columns) and NumPy arrays (returned by df.columns.values) define a .tolist() method, which is faster and more idiomatic.
Lastly, listification, i.e. list(df), should only be used as a concise alternative to the aforementioned methods for Python 3.4 or earlier, where extended unpacking is not available.
I did some quick tests, and perhaps unsurprisingly the built-in version using dataframe.columns.values.tolist() is the fastest:
In [1]: %timeit [column for column in df]
1000 loops, best of 3: 81.6 µs per loop
In [2]: %timeit df.columns.values.tolist()
10000 loops, best of 3: 16.1 µs per loop
In [3]: %timeit list(df)
10000 loops, best of 3: 44.9 µs per loop
In [4]: %timeit list(df.columns.values)
10000 loops, best of 3: 38.4 µs per loop
(I still really like list(dataframe) though, so thanks EdChum!)
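If you want to rerun a comparison like this outside IPython, the standard timeit module can stand in for %timeit. This is only a rough sketch: the three-column frame and the repeat count of 10,000 are arbitrary choices, and the absolute numbers will vary with machine and pandas version, so none are quoted here.

```python
import timeit

import pandas as pd

# Small illustrative frame; real timings depend on the number of columns.
df = pd.DataFrame("x", columns=["A", "B", "C"], index=range(5))

# Each candidate is wrapped in a lambda so timeit can call it repeatedly.
candidates = {
    "[column for column in df]": lambda: [column for column in df],
    "df.columns.values.tolist()": lambda: df.columns.values.tolist(),
    "df.columns.tolist()": lambda: df.columns.tolist(),
    "list(df)": lambda: list(df),
    "list(df.columns.values)": lambda: list(df.columns.values),
}

# Sanity check first: every approach should yield the same labels.
results = {label: fn() for label, fn in candidates.items()}

runs = 10_000
for label, fn in candidates.items():
    total_seconds = timeit.timeit(fn, number=runs)
    print(f"{label:30s} {total_seconds / runs * 1e6:8.2f} µs per call")
```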
The simplest option would be:
list(my_dataframe.columns)
or my_dataframe.columns.tolist()
No need for the complex stuff above :)