
Python - What are the major improvements of Pandas over NumPy/SciPy?

I have been using NumPy/SciPy for data analysis. I recently started to learn Pandas.

I have gone through a few tutorials and I am trying to understand what the major improvements of Pandas over NumPy/SciPy are.

It seems to me that the key idea of Pandas is to wrap up different NumPy arrays in a DataFrame, with some utility functions around it.

Is there something revolutionary about Pandas that I just stupidly missed?

CuriousMind asked May 06 '15 03:05



1 Answer

Pandas is not particularly revolutionary and does use the NumPy and SciPy ecosystem to accomplish its goals, along with some key Cython code. It can be seen as a simpler API to that functionality, with the addition of key utilities like joins and simpler group-by capability that are particularly useful for people with table-like data or time series. But, while not revolutionary, Pandas does have key benefits.

For a while I had also perceived Pandas as just utilities on top of NumPy for those who liked the DataFrame interface. However, I now see Pandas as providing these key features (this is not comprehensive):

  1. Structure of arrays (each column is stored independently with its own type, instead of the contiguous record-by-record storage of structured arrays in NumPy); this allows faster processing in many cases.
  2. Simpler interfaces to common operations (file loading, plotting, selection, and joining / aligning data) make it easy to do a lot of work in little code.
  3. Index arrays, which mean that operations are aligned by label automatically instead of you having to keep track of alignment yourself.
  4. Split-apply-combine is a powerful way of thinking about and implementing data processing.
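
To make points 2 to 4 concrete, here is a minimal sketch. The data and the names (`sales`, `regions`, `store`, `units`, `manager`) are made up for illustration and are not from any real dataset. It shows label-based alignment next to positional NumPy addition, a join, and a split-apply-combine aggregation.

```python
import numpy as np
import pandas as pd

# Index alignment: values are matched by label, not by position.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["z", "y", "w"])
aligned_sum = a + b  # labels present in only one Series become NaN

# The same values as plain NumPy arrays are added purely by position,
# so keeping track of which element means what is up to you.
positional_sum = a.to_numpy() + b.to_numpy()

# Joining: combine two tables on a shared key column (hypothetical data).
sales = pd.DataFrame(
    {"store": ["north", "south", "north", "east"], "units": [3, 5, 4, 2]}
)
regions = pd.DataFrame(
    {"store": ["north", "south", "east"], "manager": ["Ann", "Bob", "Cy"]}
)
joined = sales.merge(regions, on="store", how="left")

# Split-apply-combine: split rows by store, sum each group, combine results.
units_per_store = sales.groupby("store")["units"].sum()
```

Doing the same with bare NumPy arrays is possible, but you would have to write the label matching and grouping logic yourself.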

However, there are downsides to Pandas:

  1. Pandas is basically a user-interface library and not particularly suited for writing library code. The "automatic" features can lull you into repeatedly using them even when you don't need them, which slows down code that gets called over and over again.
  2. Pandas typically takes up more memory because it is generous in creating object arrays to solve otherwise sticky problems such as string handling.
  3. If your use case is outside the realm of what Pandas was designed to do, it gets clunky quickly. But, within the realm of what it was designed to do, Pandas is powerful and easy to use for quick data analysis.
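
The memory point in item 2 can be checked with a rough sketch like the one below. The word list is arbitrary illustration data, and `dtype=object` is passed explicitly on the assumption that object storage is what you want to measure; newer pandas versions also offer dedicated string dtypes, so the gap may differ in your setup.

```python
import numpy as np
import pandas as pd

# Arbitrary short strings, repeated to get a non-trivial size.
words = ["alpha", "beta", "gamma", "delta"] * 1000

# NumPy stores these as fixed-width unicode in one contiguous buffer.
fixed_width = np.array(words)

# An object-dtype Series stores pointers to separate Python str objects.
as_objects = pd.Series(words, dtype=object)

numpy_bytes = fixed_width.nbytes
# deep=True also counts the Python string objects behind the pointers.
pandas_bytes = as_objects.memory_usage(index=False, deep=True)

print(f"NumPy fixed-width: {numpy_bytes} bytes")
print(f"pandas object dtype: {pandas_bytes} bytes")
```

The exact numbers depend on the Python build and the string lengths, so treat this as a way to measure your own data rather than a general ratio.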
Travis Oliphant answered Nov 15 '22 20:11