I want to get a list of the column headers from a Pandas DataFrame. The DataFrame will come from user input, so I won't know how many columns there will be or what they will be called.

For example, if I'm given a DataFrame like this:

>>> my_dataframe
    y  gdp  cap
0   1    2    5
1   2    3    9
2   8    7    2
3   3    4    7
4   6    7    7
5   4    8    3
6   8    2    8
7   9    9   10
8   6    6    4
9  10   10    7

I would get a list like this:

>>> header_list
['y', 'gdp', 'cap']

You can get the values as a list by doing:

list(my_dataframe.columns.values)

You can also simply use (as shown in Ed Chum's answer):

list(my_dataframe)
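As a minimal, self-contained sketch of both calls, the snippet below rebuilds the first three rows of the question's example frame (the data literal and the name `header_list_short` are only for illustration):

```python
import pandas as pd

# Rebuild the first three rows of the question's example frame.
my_dataframe = pd.DataFrame({"y": [1, 2, 8], "gdp": [2, 3, 7], "cap": [5, 9, 2]})

# Both calls return the column labels in their original order.
header_list = list(my_dataframe.columns.values)
header_list_short = list(my_dataframe)

print(header_list)        # ['y', 'gdp', 'cap']
print(header_list_short)  # ['y', 'gdp', 'cap']
```

Iterating a DataFrame yields its column labels, which is why `list(my_dataframe)` works without touching `.columns` at all.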

There is a built-in method which is the most performant:

my_dataframe.columns.values.tolist()

.columns returns an Index, .columns.values returns a NumPy array, and that array has a .tolist() method that returns a list.

If performance is not as important to you, Index objects define a .tolist() method that you can call directly:

my_dataframe.columns.tolist()
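The chain described above can be inspected step by step. This is a rough sketch with an illustrative frame and made-up variable names (`via_array`, `via_index`). The printed type names depend on the installed pandas version, so no expected output is given for them:

```python
import pandas as pd

df = pd.DataFrame({"y": [1, 2], "gdp": [2, 3], "cap": [5, 9]})

columns = df.columns        # a pandas Index
values = df.columns.values  # the array behind the Index

# Inspect the intermediate types on your installed pandas version.
print(type(columns).__name__)
print(type(values).__name__)

# Both routes end in a plain Python list with the same contents.
via_array = values.tolist()
via_index = columns.tolist()
print(via_array == via_index)  # True
```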

The difference in performance is obvious:

%timeit df.columns.tolist()
16.7 µs ± 317 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit df.columns.values.tolist()
1.24 µs ± 12.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

For those who hate typing, you can just call list on df, like so:

list(df)

Extended Iterable Unpacking (Python 3.5+): `[*df]` and Friends

Unpacking generalizations (PEP 448) were introduced in Python 3.5, so the following operations are all possible.

df = pd.DataFrame('x', columns=['A', 'B', 'C'], index=range(5))
df

   A  B  C
0  x  x  x
1  x  x  x
2  x  x  x
3  x  x  x
4  x  x  x

If you want a list,

[*df]
# ['A', 'B', 'C']

Or, if you want a set,

{*df}
# {'A', 'B', 'C'}

Or, if you want a tuple,

*df,  # Please note the trailing comma
# ('A', 'B', 'C')

Or, if you want to store the result somewhere,

*cols, = df  # A wild comma appears, again
cols
# ['A', 'B', 'C']

... if you're the kind of person who converts coffee to typing sounds, well, this is going to consume your coffee more efficiently ;)
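The four variants can be collected into one runnable script. This is a sketch using the same example frame; the names `as_list`, `as_set` and `as_tuple` are illustrative, and the parentheses around the tuple are added only for readability. Note that the starred assignment `*cols, = df` relies on PEP 3132 and so predates Python 3.5.

```python
import pandas as pd

df = pd.DataFrame("x", columns=["A", "B", "C"], index=range(5))

as_list = [*df]    # list of column labels
as_set = {*df}     # set of column labels (unordered)
as_tuple = (*df,)  # tuple of column labels; the trailing comma is required
*cols, = df        # starred assignment always produces a list

print(as_list)   # ['A', 'B', 'C']
print(as_tuple)  # ('A', 'B', 'C')
print(cols)      # ['A', 'B', 'C']
```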

P.S.: if performance is important, you will want to ditch the solutions above in favour of
df.columns.to_numpy().tolist()
# ['A', 'B', 'C']
This is similar to Ed Chum's answer, but updated for v0.24 where .to_numpy() is preferred to the use of .values. See this answer (by me) for more information.
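A short sketch of that call on the example frame, assuming pandas 0.24 or later is installed (`names` is an illustrative variable name):

```python
import pandas as pd

df = pd.DataFrame("x", columns=["A", "B", "C"], index=range(5))

# Index.to_numpy() is available from pandas 0.24 onwards.
names = df.columns.to_numpy().tolist()
print(names)  # ['A', 'B', 'C']

# The result matches plain listification of the frame.
print(names == list(df))  # True
```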

Visual Check

Since I've seen this discussed in other answers, you can use iterable unpacking (no need for explicit loops).

print(*df)
A B C

print(*df, sep='\n')
A
B
C

Critique of Other Methods

Don't use an explicit for loop for an operation that can be done in a single line (list comprehensions are okay).

Next, using sorted(df) does not preserve the original order of the columns. For that, you should use list(df) instead.
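To make the ordering point concrete, here is a small sketch with columns deliberately created in non-alphabetical order (the one-row frame is illustrative):

```python
import pandas as pd

# Columns deliberately created in non-alphabetical order.
df = pd.DataFrame({"y": [1], "gdp": [2], "cap": [5]})

print(list(df))    # ['y', 'gdp', 'cap']  (original order)
print(sorted(df))  # ['cap', 'gdp', 'y']  (alphabetical order)
```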

Next, list(df.columns) and list(df.columns.values) are poor suggestions (as of the current version, v0.24). Both Index (returned from df.columns) and NumPy arrays (returned by df.columns.values) define a .tolist() method, which is faster and more idiomatic.
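As a side note that the answer itself does not spell out, there is a practical difference beyond speed when the labels are numeric: calling `list()` on a NumPy array keeps NumPy scalar elements, while `.tolist()` converts them to built-in Python types. This is a sketch with hypothetical integer column labels:

```python
import numpy as np
import pandas as pd

# Integer column labels make the difference visible (hypothetical data).
df = pd.DataFrame([[1, 2, 3]], columns=[10, 20, 30])

via_list = list(df.columns.values)       # elements stay NumPy scalars
via_tolist = df.columns.values.tolist()  # elements become Python ints

print(type(via_tolist[0]) is int)           # True
print(isinstance(via_list[0], np.integer))  # True
print(via_list == via_tolist)               # True
```

The two lists compare equal, so the distinction tends to matter only where exact types do, for example when serializing the labels.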

Lastly, listification, i.e. list(df), should only be used as a concise alternative to the aforementioned methods on Python 3.4 or earlier, where extended unpacking is not available.

I did some quick tests, and perhaps unsurprisingly the built-in version using dataframe.columns.values.tolist() is the fastest:

In [1]: %timeit [column for column in df]
1000 loops, best of 3: 81.6 µs per loop

In [2]: %timeit df.columns.values.tolist()
10000 loops, best of 3: 16.1 µs per loop

In [3]: %timeit list(df)
10000 loops, best of 3: 44.9 µs per loop

In [4]: %timeit list(df.columns.values)
10000 loops, best of 3: 38.4 µs per loop

(I still really like the list(dataframe) though, so thanks EdChum!)
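Timings like these depend heavily on the machine, the pandas version and the frame, so it is worth re-running them locally. Below is a rough sketch that repeats the comparison outside IPython with the standard-library `timeit` module. The example frame, the `candidates` dictionary and the repeat counts are assumptions chosen for illustration, not the setup used for the numbers above.

```python
import timeit

import pandas as pd

df = pd.DataFrame("x", columns=["A", "B", "C"], index=range(5))

# Candidate expressions from the answers above, keyed by their source text.
candidates = {
    "[column for column in df]": lambda: [column for column in df],
    "df.columns.values.tolist()": lambda: df.columns.values.tolist(),
    "list(df)": lambda: list(df),
    "list(df.columns.values)": lambda: list(df.columns.values),
}

calls = 2_000
results = {}
for label, func in candidates.items():
    # Best of 5 repeats, reported in microseconds per call.
    best = min(timeit.repeat(func, number=calls, repeat=5))
    results[label] = best / calls * 1e6
    print(f"{label:30s} {results[label]:8.2f} µs per call")
```

Taking the minimum over several repeats is a common way to reduce noise from other processes. The absolute numbers will not match the ones quoted above.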

The simplest option would be: list(my_dataframe.columns) or my_dataframe.columns.tolist()

No need for the complex stuff above :)
