
Choosing between pandas, OOP classes, and dicts (Python)

I have written a program that reads a couple of .csv files (they are not large, a couple of thousand rows each), does some data cleaning and wrangling, and this is what the final structure of each .csv file looks like (fake data for illustration purposes only):

import pandas as pd
data = [[112233, 'Rob', 99], [445566, 'John', 88]]
managers = pd.DataFrame(data)
managers.columns = ['ManagerId', 'ManagerName', 'ShopId']
print(managers)

   ManagerId ManagerName  ShopId
0     112233         Rob      99
1     445566        John      88


data = [[99, 'Shop1'], [88, 'Shop2']]
shops = pd.DataFrame(data)
shops.columns = ['ShopId', 'ShopName']
print(shops)

   ShopId ShopName
0      99    Shop1
1      88    Shop2

data = [[99, 2000, 3000, 4000], [88, 2500, 3500, 4500]]
sales = pd.DataFrame(data)
sales.columns = ['ShopId', 'Year2010', 'Year2011', 'Year2012']
print(sales)

   ShopId  Year2010  Year2011  Year2012
0      99      2000      3000      4000
1      88      2500      3500      4500

Then I use xlsxwriter and reportlab Python packages for creating custom Excel sheets and .pdf reports while iterating the data frames. Everything looks great, and all of the named packages do their job really well.
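For reference, the Excel part boils down to roughly this pattern (the file and sheet names are made up for illustration; the underlying xlsxwriter workbook is exposed by the writer for custom formatting):

with pd.ExcelWriter('report.xlsx', engine='xlsxwriter') as writer:
    managers.to_excel(writer, sheet_name='Managers', index=False)
    sales.to_excel(writer, sheet_name='Sales', index=False)
    workbook = writer.book  # raw xlsxwriter Workbook object, for custom formats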

My concern though is that I feel that my code gets hard to maintain as I need to access the same data frame rows multiple times in multiple calls.

Say I need to get the manager names responsible for shops that had sales of more than 1500 in 2010. My code is filled with calls like this:

managers[managers['ShopId'].isin(
    sales[sales['Year2010'] > 1500]['ShopId'])]['ManagerName'].values
>>> array(['Rob', 'John'], dtype=object)

I think it is hard to see what is going on while reading this line of code. I could create multiple intermediate variables, but this would add multiple lines of code.
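For comparison, the intermediate-variable version of the same query would look roughly like this (the variable names are made up):

shop_ids_2010 = sales.loc[sales['Year2010'] > 1500, 'ShopId']   # shops over the threshold
is_responsible = managers['ShopId'].isin(shop_ids_2010)         # managers of those shops
manager_names = managers.loc[is_responsible, 'ManagerName'].values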

How common is it to sacrifice database-normalization ideology and merge all the pieces into a single data frame to get more maintainable code? There are obvious cons to having a single data frame, as it might get messy when trying to merge other data frames that might be needed later on. Merging them also leads to data redundancy, since the same manager can be assigned to multiple shops.

df = (managers.merge(sales, how='left', on='ShopId')
              .merge(shops, how='left', on='ShopId'))
print(df)

   ManagerId ManagerName  ShopId  Year2010  Year2011  Year2012 ShopName
0     112233         Rob      99      2000      3000      4000    Shop1
1     445566        John      88      2500      3500      4500    Shop2

At least this call gets smaller:

df[df['Year2010'] > 1500]['ManagerName'].values
>>> array(['Rob', 'John'], dtype=object)

Maybe pandas is the wrong tool for this kind of job?

C# developers at the office frown at me and tell me to use classes, but then I would have a bunch of methods like get_manager_sales(managerid) and so forth. Iterating over class instances for reporting also sounds troublesome, as I would need to implement sorting and indexing myself (which I get for free with pandas).
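To make that concrete, here is a rough sketch of what I imagine the class-based route would look like (the class and method names are purely hypothetical, just to show how much plumbing I would have to write myself):

class Manager:
    def __init__(self, manager_id, name, shop_id):
        self.manager_id = manager_id
        self.name = name
        self.shop_id = shop_id

class ReportData:
    """Holds plain lists of records; filtering, sorting and joining are all on me."""
    def __init__(self, managers, shops, sales):
        self.managers = managers
        self.shops = shops
        self.sales = sales  # e.g. a list of dicts keyed by 'ShopId', 'Year2010', ...

    def get_manager_sales(self, manager_id):
        # find the manager's shop, then collect its sales rows by hand
        manager = next(m for m in self.managers if m.manager_id == manager_id)
        return [row for row in self.sales if row['ShopId'] == manager.shop_id]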

A dictionary would work, but it also makes it difficult to modify existing data, do merges, etc. The syntax doesn't get much better either.

data_dict = df.to_dict('records')
[{'ManagerId': 112233,
  'ManagerName': 'Rob',
  'ShopId': 99,
  'ShopName': 'Shop1',
  'Year2010': 2000,
  'Year2011': 3000,
  'Year2012': 4000},
 {'ManagerId': 445566,
  'ManagerName': 'John',
  'ShopId': 88,
  'ShopName': 'Shop2',
  'Year2010': 2500,
  'Year2011': 3500,
  'Year2012': 4500}]

Getting the manager names responsible for shops that had sales of more than 1500 in 2010 then looks like this:

[row['ManagerName'] for row in data_dict if row['Year2010'] > 1500]
>>> ['Rob', 'John']

In this particular case with the data I operate with, should I go all the way with pandas or is there another way to write cleaner code while taking advantage of the power of pandas?

asked Oct 14 '16 by Alex Tereshenkov


1 Answer

I would choose Pandas, because it's much faster, it has an excellent and extremely rich API, and the resulting source code looks much cleaner.

BTW the following line can be easily rewritten:

managers[managers['ShopId'].isin(sales[sales['Year2010'] > 1500]['ShopId'])]['ManagerName'].values

as:

ShopIds = sales.loc[sales['Year2010'] > 1500, 'ShopId']
managers.query('ShopId in @ShopIds')['ManagerName'].values

IMO it's pretty easy to read and understand.

PS you may also want to store your data in a SQL database and use SQL, or to store it in an HDF Store and use the where parameter - in both cases you can benefit from indexing the "search" columns.
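For example, a rough sketch of the HDF Store variant (the file name is made up; format='table' plus data_columns is what makes the where filter usable):

# write once, declaring 'Year2010' as a query-able data column
sales.to_hdf('reports.h5', 'sales', format='table', data_columns=['Year2010'])

# later, read back only the matching rows
hot_shops = pd.read_hdf('reports.h5', 'sales', where='Year2010 > 1500')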

answered Sep 30 '22 by MaxU - stop WAR against UA