
Pandas vs. Numpy Dataframes

Look at these few lines of code:

df2 = df.copy()
df2[1:] = df[1:]/df[:-1].values -1
df2.iloc[0, :] = 0

Our instructor said we need to use the .values attribute to access the underlying numpy array; otherwise, our code wouldn't work.

I understand that a pandas DataFrame is backed by a numpy array, but I don't understand why we can't operate directly on the DataFrame using just slicing.

Could you explain why?

MadHatter asked May 07 '17 14:05




1 Answer

pandas is built around labeled, tabular data structures. When you do arithmetic (addition, subtraction, etc.) between two pandas objects, it matches values by their labels, not by their positions.

Consider the following DataFrame:

df = pd.DataFrame(np.random.randn(5, 3), index=list('abcde'), columns=list('xyz'))

Here, df[1:] is:

df[1:]
Out: 
          x         y         z
b  1.003035  0.172960  1.160033
c  0.117608 -1.114294 -0.557413
d -1.312315  1.171520 -1.034012
e -0.380719 -0.422896  1.073535

And df[:-1] is:

df[:-1]
Out: 
          x         y         z
a  1.367916  1.087607 -0.625777
b  1.003035  0.172960  1.160033
c  0.117608 -1.114294 -0.557413
d -1.312315  1.171520 -1.034012

If you do df[1:] / df[:-1], pandas divides row b by row b, row c by row c, and row d by row d. Rows a and e each exist in only one of the two frames, so there is nothing to pair them with and the result for those rows is NaN:

df[1:] / df[:-1]
Out: 
     x    y    z
a  NaN  NaN  NaN
b  1.0  1.0  1.0
c  1.0  1.0  1.0
d  1.0  1.0  1.0
e  NaN  NaN  NaN
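
The same alignment can be reproduced with fixed numbers instead of random ones. This is only a small sketch: the frame below and its values are made up for illustration, chosen so the result is easy to check by hand.

```python
import pandas as pd

# Made-up values, not the random frame shown above.
df = pd.DataFrame(
    {"x": [1.0, 2.0, 4.0, 8.0], "y": [10.0, 5.0, 5.0, 20.0]},
    index=list("abcd"),
)

# Both operands are DataFrames, so pandas aligns them on row labels first.
# df[1:] has rows b, c, d; df[:-1] has rows a, b, c.
aligned = df[1:] / df[:-1]

# The result covers the union of both indexes. Rows b and c are divided
# by themselves (1.0); rows a and d have no partner and become NaN.
print(aligned)
```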

If you want element-wise division by position, ignoring the labels, take the underlying numpy array of one of the frames with .values. A numpy array has no labels, so pandas has nothing to align on and simply pairs the rows in order. The result keeps the labels of the frame that is still a DataFrame:

df[1:]/df[:-1].values
Out: 
           x         y         z
b   0.733258  0.159028 -1.853749
c   0.117252 -6.442482 -0.480515
d -11.158359 -1.051357  1.855018
e   0.290112 -0.360981 -1.038223
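
For the calculation in the question (each row divided by the previous row, minus 1), the positional version can be compared with a label-preserving alternative. This is a sketch under assumptions: the prices frame is made up, and .to_numpy() is used in place of .values, which assumes pandas 0.24 or later. shift(1) moves the values down one row but keeps the index, so the labels already line up and no conversion to numpy is needed.

```python
import pandas as pd

# Hypothetical data, chosen so the ratios are easy to verify by hand.
prices = pd.DataFrame(
    {"x": [1.0, 2.0, 4.0, 8.0], "y": [10.0, 5.0, 5.0, 20.0]},
    index=list("abcd"),
)

# Positional: strip the labels from the divisor so rows pair up in order
# (b with a, c with b, d with c).
positional = prices[1:] / prices[:-1].to_numpy() - 1

# Label-preserving: shift(1) puts the previous row's values under the
# current row's label, so ordinary aligned division gives the same ratios.
via_shift = prices / prices.shift(1) - 1

print(positional)
print(via_shift)  # first row is NaN because row a has no previous row
```

Both approaches give the same numbers for rows b to d; the shift version additionally leaves a NaN first row, which can be set to 0 afterwards if that is what is wanted.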

ayhan answered Nov 25 '22 18:11