Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Ambiguity in Pandas Dataframe / Numpy Array "axis" definition

I've been very confused about how python axes are defined, and whether they refer to a DataFrame's rows or columns. Consider the code below:

>>> df = pd.DataFrame([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]], columns=["col1", "col2", "col3", "col4"]) >>> df    col1  col2  col3  col4 0     1     1     1     1 1     2     2     2     2 2     3     3     3     3 

So if we call df.mean(axis=1), we'll get a mean across the rows:

>>> df.mean(axis=1) 0    1 1    2 2    3 

However, if we call df.drop(name, axis=1), we actually drop a column, not a row:

>>> df.drop("col4", axis=1)    col1  col2  col3 0     1     1     1 1     2     2     2 2     3     3     3 

Can someone help me understand what is meant by an "axis" in pandas/numpy/scipy?

A side note, DataFrame.mean just might be defined wrong. It says in the documentation for DataFrame.mean that axis=1 is supposed to mean a mean over the columns, not the rows...

like image 986
hlin117 Avatar asked Sep 10 '14 19:09

hlin117


People also ask

What is the axis of a NumPy array?

NumPy axes are the directions along the rows and columns. Just like coordinate systems, NumPy arrays also have axes. In a 2-dimensional NumPy array, the axes are the directions along the rows and columns.

How do you define axis in pandas?

A DataFrame object has two axes: “axis 0” and “axis 1”. “axis 0” represents rows and “axis 1” represents columns. Now it's clear that Series and DataFrame share the same direction for “axis 0” – it goes along rows direction.

What is the distinction between a NumPy array and a Pandas data frame?

Numpy is memory efficient. Pandas has a better performance when a number of rows is 500K or more. Numpy has a better performance when number of rows is 50K or less. Indexing of the pandas series is very slow as compared to numpy arrays.


2 Answers

It's perhaps simplest to remember it as 0=down and 1=across.

This means:

  • Use axis=0 to apply a method down each column, or to the row labels (the index).
  • Use axis=1 to apply a method across each row, or to the column labels.

Here's a picture to show the parts of a DataFrame that each axis refers to:

It's also useful to remember that Pandas follows NumPy's use of the word axis. The usage is explained in NumPy's glossary of terms:

Axes are defined for arrays with more than one dimension. A 2-dimensional array has two corresponding axes: the first running vertically downwards across rows (axis 0), and the second running horizontally across columns (axis 1). [my emphasis]

So, concerning the method in the question, df.mean(axis=1), seems to be correctly defined. It takes the mean of entries horizontally across columns, that is, along each individual row. On the other hand, df.mean(axis=0) would be an operation acting vertically downwards across rows.

Similarly, df.drop(name, axis=1) refers to an action on column labels, because they intuitively go across the horizontal axis. Specifying axis=0 would make the method act on rows instead.

like image 192
Alex Riley Avatar answered Oct 10 '22 19:10

Alex Riley


There are already proper answers, but I give you another example with > 2 dimensions.

The parameter axis means axis to be changed.
For example, consider that there is a dataframe with dimension a x b x c.

  • df.mean(axis=1) returns a dataframe with dimenstion a x 1 x c.
  • df.drop("col4", axis=1) returns a dataframe with dimension a x (b-1) x c.

Here, axis=1 means the second axis which is b, so b value will be changed in these examples.

like image 26
jeongmin.cha Avatar answered Oct 10 '22 18:10

jeongmin.cha