I have written a program that reads a couple of .csv files (they are not large, a couple of thousand rows each), does some data cleaning and wrangling, and this is what the final structure of each .csv file looks like (fake data for illustration purposes only).
import pandas as pd
data = [[112233, 'Rob', 99], [445566, 'John', 88]]
managers = pd.DataFrame(data)
managers.columns = ['ManagerId', 'ManagerName', 'ShopId']
print(managers)
   ManagerId ManagerName  ShopId
0     112233         Rob      99
1     445566        John      88
data = [[99, 'Shop1'], [88, 'Shop2']]
shops = pd.DataFrame(data)
shops.columns = ['ShopId', 'ShopName']
print(shops)
   ShopId ShopName
0      99    Shop1
1      88    Shop2
data = [[99, 2000, 3000, 4000], [88, 2500, 3500, 4500]]
sales = pd.DataFrame(data)
sales.columns = ['ShopId', 'Year2010', 'Year2011', 'Year2012']
print(sales)
   ShopId  Year2010  Year2011  Year2012
0      99      2000      3000      4000
1      88      2500      3500      4500
Then I use the xlsxwriter and reportlab Python packages for creating custom Excel sheets and .pdf reports while iterating over the data frames. Everything works great, and all of the named packages do their job really well.
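For illustration, a minimal sketch of what the Excel export might look like with xlsxwriter (the file and sheet names are made up; the real code adds per-sheet formatting):

import pandas as pd

# write each data frame to its own sheet via the xlsxwriter engine
with pd.ExcelWriter('report.xlsx', engine='xlsxwriter') as writer:
    managers.to_excel(writer, sheet_name='Managers', index=False)
    sales.to_excel(writer, sheet_name='Sales', index=False)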
My concern, though, is that the code gets hard to maintain, because I need to access the same data frame rows multiple times in multiple calls.
Say I need to get the names of the managers responsible for shops that had sales of more than 1500 in 2010. My code is filled with calls like this:
managers[managers['ShopId'].isin(
    sales[sales['Year2010'] > 1500]['ShopId'])]['ManagerName'].values
>>> array(['Rob', 'John'], dtype=object)
I think it is hard to see what is going on while reading this line of code. I could create multiple intermediate variables, but this would add multiple lines of code.
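For comparison, the intermediate-variable version might look something like this (the variable names are just illustrative):

shops_over_1500 = sales[sales['Year2010'] > 1500]['ShopId']
is_responsible = managers['ShopId'].isin(shops_over_1500)
manager_names = managers.loc[is_responsible, 'ManagerName'].values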
How common is it to sacrifice database normalization ideology and merge all the pieces into a single data frame to get more maintainable code? There are obvious downsides to having a single data frame, as it might get messy when trying to merge other data frames that may be needed later on. Merging also leads to data redundancy, since the same manager can be assigned to multiple shops.
df = (managers.merge(sales, how='left', on='ShopId')
              .merge(shops, how='left', on='ShopId'))
print(df)
   ManagerId ManagerName  ShopId  Year2010  Year2011  Year2012 ShopName
0     112233         Rob      99      2000      3000      4000    Shop1
1     445566        John      88      2500      3500      4500    Shop2
At least this call gets shorter:
df[df['Year2010'] > 1500]['ManagerName'].values
>>> array(['Rob', 'John'], dtype=object)
Maybe pandas is the wrong tool for this kind of job?
The C# developers at the office frown at me and tell me to use classes, but then I would end up with a bunch of methods like get_manager_sales(managerid) and so forth. Iterating over class instances for reporting also sounds troublesome, as I would need to implement some sorting and indexing myself (which I get for free with pandas).
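For the record, the class-based approach they have in mind would presumably look something like this (a rough sketch only; the class and attribute names are made up):

class Shop:
    def __init__(self, shop_id, name, sales_by_year):
        self.shop_id = shop_id
        self.name = name
        self.sales_by_year = sales_by_year  # e.g. {2010: 2000, 2011: 3000}

class Manager:
    def __init__(self, manager_id, name, shops):
        self.manager_id = manager_id
        self.name = name
        self.shops = shops  # list of Shop instances

# all_managers: a list of Manager instances built from the CSVs;
# the query from above turns into a comprehension over plain objects
names = [m.name for m in all_managers
         if any(s.sales_by_year.get(2010, 0) > 1500 for s in m.shops)]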
A dictionary would work, but it also makes it difficult to modify existing data, do merges, etc. The syntax doesn't get much better either.
data_dict = df.to_dict('records')
[{'ManagerId': 112233,
  'ManagerName': 'Rob',
  'ShopId': 99,
  'ShopName': 'Shop1',
  'Year2010': 2000,
  'Year2011': 3000,
  'Year2012': 4000},
 {'ManagerId': 445566,
  'ManagerName': 'John',
  'ShopId': 88,
  'ShopName': 'Shop2',
  'Year2010': 2500,
  'Year2011': 3500,
  'Year2012': 4500}]
Getting the manager names responsible for shops that had sales of more than 1500 in 2010:
[row['ManagerName'] for row in data_dict if row['Year2010'] > 1500]
>>> ['Rob', 'John']
In this particular case, with the data I operate on, should I go all the way with pandas, or is there another way to write cleaner code while taking advantage of the power of pandas?
For certain small, targeted purposes, a dict may be faster. And if that is all you need, then use a dict, for sure! But if you need/want the power and luxury of a DataFrame, then a dict is no substitute.
I would choose Pandas, because it's much faster, has an excellent and extremely rich API, the source code looks much cleaner and better, etc.
BTW the following line can be easily rewritten:
managers[managers['ShopId'].isin(sales[sales['Year2010'] > 1500]['ShopId'])]['ManagerName'].values
as:
ShopIds = sales.loc[sales['Year2010'] > 1500, 'ShopId']
managers.query('ShopId in @ShopIds')['ManagerName'].values
IMO it's pretty easy to read and understand.
PS: you may also want to store your data in an SQL database and use SQL, or to store it in an HDF store and use the where parameter - in both cases you can benefit from indexing the "search" columns.
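For example, a minimal sketch of both options (assuming SQLite and PyTables are available; file and table names are made up):

import sqlite3
import pandas as pd

# SQL option: let the database do the filtering
con = sqlite3.connect('shops.db')
sales.to_sql('sales', con, if_exists='replace', index=False)
shop_ids = pd.read_sql('SELECT ShopId FROM sales WHERE Year2010 > 1500', con)

# HDF option: declare the "search" columns as data_columns so `where` can filter on them
sales.to_hdf('shops.h5', key='sales', format='table', data_columns=['Year2010'])
shop_ids = pd.read_hdf('shops.h5', 'sales', where='Year2010 > 1500')['ShopId']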