
numpy.ndarray vs pandas.DataFrame

I need to make a strategic decision about which data structure to use as the basis for holding statistical data frames in my program.

I store hundreds of thousands of records in one big table. Each field may be of a different type, including short strings. I'd perform multiple regression analysis and manipulations on the data, and these need to be done quickly, in real time. I also need to use something that is relatively popular and well supported.

I know about the following contestants:

list of array.array

That is the most basic option. Unfortunately, it doesn't support strings. And I need to use numpy anyway for its statistical part, so this one is out of the question.

numpy.ndarray

The ndarray can hold a different type in each column by using a structured dtype (e.g. np.dtype([('name', np.str_, 16), ('grades', np.float64, (2,))])). It seems a natural winner, but...
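For reference, a minimal sketch of how that structured dtype could be used. The records are made up for illustration, and the variable names are placeholders rather than anything from my program:

```python
import numpy as np

# Structured dtype from the example above: a string field of up to
# 16 characters plus a pair of float64 grades per record.
student_dtype = np.dtype([('name', np.str_, 16), ('grades', np.float64, (2,))])

# Illustrative records (hypothetical data).
records = np.array(
    [('Sarah', (8.0, 7.0)), ('John', (6.0, 7.0))],
    dtype=student_dtype,
)

# Selecting a field by name returns a regular ndarray.
names = records['name']            # fixed-width unicode strings
grades = records['grades']         # float64 array of shape (2, 2)
mean_grades = grades.mean(axis=1)  # per-record mean of the two grades
```

Field access works by name, but anything beyond that (grouping, joins, missing values) would have to be written by hand on top of it.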

pandas.DataFrame

This one is built with statistical use in mind, but is it efficient enough?

I read that pandas.DataFrame is no longer based on numpy.ndarray (although it shares the same interface). Can anyone shed some light on this? Or maybe there is an even better data structure out there?

Adam Ryczkowski asked Aug 08 '14 10:08




1 Answer

pandas.DataFrame is awesome, and interacts very well with much of numpy. Much of the DataFrame is written in Cython and is quite optimized. I suspect the ease of use and the richness of the Pandas API will greatly outweigh any potential benefit you could obtain by rolling your own interfaces around numpy.
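As a rough illustration of that interplay, here is a minimal sketch that keeps mixed-type data in a DataFrame and hands the numeric columns to numpy for a least-squares fit. The column names and values are hypothetical, and np.linalg.lstsq is only one of several ways to run a multiple regression:

```python
import numpy as np
import pandas as pd

# Hypothetical table: one short-string column next to numeric columns.
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x2": [0.0, 1.0, 0.0, 1.0],
})
# Response built exactly as y = 1 + 2*x1 + 3*x2, so the fit is easy to check.
df["y"] = 1.0 + 2.0 * df["x1"] + 3.0 * df["x2"]

# Design matrix: an intercept column plus the numeric predictors,
# pulled out of the DataFrame as a plain ndarray.
X = np.column_stack([np.ones(len(df)), df[["x1", "x2"]].to_numpy()])
y = df["y"].to_numpy()

# Ordinary least squares through numpy.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Here coef should come out close to [1, 2, 3], the intercept and slopes used to build y. The string column never has to leave the DataFrame, which is the convenience a hand-rolled numpy wrapper would need to reimplement.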

daniel answered Sep 20 '22 13:09