I am new to learning Python, and some of its libraries (numpy, pandas).
I have found a lot of documentation on how numpy ndarrays, pandas series and python dictionaries work.
But owing to my inexperience with Python, I have had a really hard time determining when to use each one of them. And I haven't found any best-practices that will help me understand and decide when it is better to use each type of data structure.
As a general matter, are there any best practices for deciding which, if any, of these three data structures a specific data set should be loaded into?
Thanks!
Pandas is generally used for financial time series and economics data (it has a lot of built-in helpers for handling financial data). Numpy is a fast way to handle large multidimensional arrays for scientific computing (scipy also helps).
Numpy is memory efficient. Pandas performs better when the number of rows is 500K or more, while numpy performs better when the number of rows is 50K or less. Indexing a pandas series is very slow compared to indexing a numpy array.
The essential difference is the presence of the index: while the Numpy Array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.
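To illustrate the implicit versus explicit index described above, here is a minimal sketch. The values and the labels "a", "b", "c" are made up for the example.

```python
import numpy as np
import pandas as pd

# A NumPy array is addressed only by integer position.
arr = np.array([10.0, 20.0, 30.0])
first_from_array = arr[0]

# A pandas Series carries an explicit index; these labels are hypothetical.
ser = pd.Series([10.0, 20.0, 30.0], index=["a", "b", "c"])
by_label = ser.loc["b"]      # label-based lookup
by_position = ser.iloc[1]    # position-based lookup is still available

print(first_from_array, by_label, by_position)
```

Using .loc and .iloc explicitly avoids ambiguity about whether a lookup is by label or by position.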
pandas provides a bunch of C or Cython optimized functions that can be faster than the NumPy equivalent function (e.g. reading text from text files). If you want to do mathematical operations like a dot product, calculating mean, and some more, pandas DataFrames are generally going to be slower than a NumPy array.
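As a rough sketch of the point about mathematical operations, one common pattern is to pull the numeric data out of the DataFrame once and do the heavy math on the plain array. The column names and numbers below are invented for illustration, and whether this is actually faster depends on your data size and pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric table.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})

# Extract the values once as a 2D NumPy array of shape (3, 2).
data = df.to_numpy()

col_means = data.mean(axis=0)   # per-column mean
gram = data.T @ data            # 2x2 matrix of column dot products

print(col_means)
print(gram)
```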
The rule of thumb that I usually apply: use the simplest data structure that still satisfies your needs. If we rank the data structures from simplest to most complex, it usually ends up like this: dictionaries / lists, then numpy arrays, then pandas series / dataframes.
So first consider dictionaries / lists. If these allow you to do all data operations that you need, then all is fine. If not, start considering numpy arrays. Some typical reasons for moving to numpy arrays are:
Then there are also some typical reasons for going beyond numpy arrays and to the more-complex but also more-powerful pandas series/dataframes:
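As an illustration of the simplest-structure-first rule in this answer, the sketch below walks the same small data set through all three structures. The city names and temperatures are made up, and the comments describe only one possible reason for each step.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three daily temperatures per city.
temps = {"Oslo": [2.0, 4.0, 3.0], "Rome": [14.0, 16.0, 15.0]}

# 1. Plain dict / list: enough for lookups and simple loops.
oslo_mean = sum(temps["Oslo"]) / len(temps["Oslo"])

# 2. NumPy array: vectorized math over all values, but the labels are gone.
arr = np.array([temps["Oslo"], temps["Rome"]])   # shape (2, 3)
city_means = arr.mean(axis=1)

# 3. pandas DataFrame: the same numbers, with column labels kept.
df = pd.DataFrame(temps)
labelled_means = df.mean()   # a Series indexed by city name

print(oslo_mean, city_means, labelled_means["Rome"])
```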
If you want an answer that tells you to stick with just one type of data structure, here is one: use pandas series/dataframe structures.
The pandas series object can be seen as an enhanced numpy 1D array, and the pandas dataframe can be seen as an enhanced numpy 2D array. The main difference is that pandas series and dataframes have an explicit index, while numpy arrays have implicit indexing. So, anywhere in your Python code where you would write something like
import numpy as np
a = np.array([1, 2, 3])
you can just use
import pandas as pd
a = pd.Series([1, 2, 3])
Most numpy functions will work when given a pandas series, and a series offers many of the same methods as a numpy array. By analogy, the same can be done with dataframes and numpy 2D arrays.
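A small sketch of passing a series to numpy functions, using made-up values. Element-wise functions such as np.sqrt return a new series with the same index, while reductions such as np.sum return a scalar.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 4, 9])

roots = np.sqrt(a)   # element-wise; the result is still a pandas Series
total = np.sum(a)    # reduction; the result is a scalar

print(type(roots).__name__, roots.tolist(), total)
```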
A further question you might have can be about the performance differences between a numpy array and pandas series. Here is a post that shows the differences in performance using these two tools: performance of pandas series vs numpy arrays.
Please note that even though a pandas series is slightly slower than a numpy array, you can work around this by accessing the values attribute of the series (newer pandas versions also offer the to_numpy() method):
a.values
The result of accessing values on a pandas series is a numpy array!
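A quick check of this, as a minimal sketch. Note that values is an attribute, so it takes no parentheses; to_numpy() is the method form available in newer pandas versions.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 2, 3])

raw = a.values        # attribute access, no call
same = a.to_numpy()   # method form

print(type(raw).__name__, raw)
```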