I've been very confused about how python axes are defined, and whether they refer to a DataFrame's rows or columns. Consider the code below:
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]], columns=["col1", "col2", "col3", "col4"])
>>> df
   col1  col2  col3  col4
0     1     1     1     1
1     2     2     2     2
2     3     3     3     3
So if we call df.mean(axis=1), we'll get a mean across the rows:
>>> df.mean(axis=1)
0    1
1    2
2    3
However, if we call df.drop(name, axis=1), we actually drop a column, not a row:
>>> df.drop("col4", axis=1)
   col1  col2  col3
0     1     1     1
1     2     2     2
2     3     3     3
Can someone help me understand what is meant by an "axis" in pandas/numpy/scipy?
A side note: DataFrame.mean just might be defined wrong. The documentation for DataFrame.mean says that axis=1 is supposed to mean a mean over the columns, not the rows...
Just like coordinate systems, NumPy arrays have axes. In a 2-dimensional NumPy array, the axes are the directions along the rows and columns.
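To make the two directions concrete, here is a minimal sketch on a small 2-dimensional NumPy array. The array values and the variable names are made up for illustration only.

```python
import numpy as np

# A small 2-D array: 2 rows, 3 columns (values chosen only for illustration).
arr = np.array([[1, 2, 3],
                [4, 5, 6]])

# axis=0 runs down the rows, so summing along it collapses the rows:
# one result per column.
col_sums = arr.sum(axis=0)

# axis=1 runs across the columns, so summing along it collapses the columns:
# one result per row.
row_sums = arr.sum(axis=1)

print(col_sums)  # [5 7 9]
print(row_sums)  # [ 6 15]
```

The axis you pass is the one that disappears from the result, which is why axis=0 leaves one value per column.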
A DataFrame object has two axes: “axis 0” runs along the rows and “axis 1” runs along the columns. A Series has only “axis 0”, and it points in the same direction as a DataFrame's “axis 0”: along the rows.
NumPy is generally more memory efficient than pandas. As a commonly quoted rule of thumb, pandas performs better when the number of rows is around 500K or more, while NumPy performs better at around 50K rows or fewer. Indexing a pandas Series is also much slower than indexing a NumPy array.
It's perhaps simplest to remember it as 0=down and 1=across.
This means:
axis=0 to apply a method down each column, or to the row labels (the index).
axis=1 to apply a method across each row, or to the column labels.
[Picture: the parts of a DataFrame that each axis refers to]
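As a quick check of the 0=down, 1=across rule, here is a minimal sketch that reuses the DataFrame from the question. The variable names down and across are just illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]],
    columns=["col1", "col2", "col3", "col4"],
)

# axis=0: move down each column, giving one mean per column.
down = df.mean(axis=0)

# axis=1: move across each row, giving one mean per row.
across = df.mean(axis=1)

print(down.to_dict())    # {'col1': 2.0, 'col2': 2.0, 'col3': 2.0, 'col4': 2.0}
print(across.to_dict())  # {0: 1.0, 1: 2.0, 2: 3.0}
```

The result of axis=0 is indexed by column labels, and the result of axis=1 is indexed by row labels.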
It's also useful to remember that Pandas follows NumPy's use of the word axis. The usage is explained in NumPy's glossary of terms:
Axes are defined for arrays with more than one dimension. A 2-dimensional array has two corresponding axes: the first running vertically downwards across rows (axis 0), and the second running horizontally across columns (axis 1). [my emphasis]
So, concerning the method in the question, df.mean(axis=1) seems to be correctly defined. It takes the mean of entries horizontally across columns, that is, along each individual row. On the other hand, df.mean(axis=0) would be an operation acting vertically downwards across rows.
Similarly, df.drop(name, axis=1) refers to an action on column labels, because they intuitively go across the horizontal axis. Specifying axis=0 would make the method act on rows instead.
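For drop, the axis decides which set of labels the name is looked up in. A small sketch on the question's DataFrame, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]],
    columns=["col1", "col2", "col3", "col4"],
)

# axis=1: "col4" is looked up among the column labels.
without_col = df.drop("col4", axis=1)

# axis=0: 0 is looked up among the row labels (the index).
without_row = df.drop(0, axis=0)

print(without_col.shape)  # (3, 3)
print(without_row.shape)  # (2, 4)
```

Neither call modifies df itself, because drop returns a new DataFrame by default.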
There are already good answers, but here is another example with more than 2 dimensions.
The parameter axis names the axis that will be changed.
For example, suppose there is a dataframe with dimensions a x b x c.
df.mean(axis=1) returns a dataframe with dimensions a x 1 x c.
df.drop("col4", axis=1) returns a dataframe with dimensions a x (b-1) x c.
Here, axis=1 means the second axis, which is b, so the value of b is what changes in these examples.
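A pandas DataFrame is two-dimensional, so the sketch below uses a 3-dimensional NumPy array as a stand-in to check the shapes. The sizes a, b, c are arbitrary, and np.delete plays the role of drop here; both are assumptions made for illustration.

```python
import numpy as np

a, b, c = 2, 3, 4  # illustrative sizes, not taken from the answer
arr = np.arange(a * b * c).reshape(a, b, c)

# Averaging along axis 1 collapses b; keepdims=True keeps it as length 1.
mean_shape = arr.mean(axis=1, keepdims=True).shape

# Deleting one slice along axis 1 shrinks b by one.
drop_shape = np.delete(arr, 0, axis=1).shape

print(mean_shape)  # (2, 1, 4)
print(drop_shape)  # (2, 2, 4)
```

Without keepdims=True, NumPy removes the reduced axis entirely, so the mean would have shape (a, c) rather than (a, 1, c).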