Working in Pandas with variable names with a common suffix

Tags: python, pandas

I do most of my data work in SAS but need to use python for a particular project (I'm not very competent in python). I have a dataframe like this:

    values = ['a_us', 'b_us', 'c_us', 'a_ww','b_ww','c_ww']
    df = pd.DataFrame(np.random.rand(1, 6), columns=values[:6])

One thing I need to do is calculate the ratio of US to WW for each of companies a, b and c. I know how to do it the long way in python-- I'd just do this for each company:

    df['*company*_ratio'] = df['*company*_us']/df['*company*_ww']

But how can I do this without having to write out each equation? I am thinking I could do something like

    for x in [a,b,c]:

or I could define a function. However, I don't know enough to implement either of those options or even what to search to find an answer (as I'm sure it's been asked before). In SAS I would just write a macro that fills in company.

Thanks.

dataryne asked Mar 22 '16 19:03


2 Answers

You can first find the unique company prefixes by taking the first character of each column name with the str accessor:

    print(df.columns.str[0].unique())
    ['a' 'b' 'c']

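Or take the first substring when the column names are split by _ (better for real data, where a prefix may be longer than one character):

    print(df.columns.str.split('_').str[0].unique())
    ['a' 'b' 'c']

Then loop over the prefixes and build each ratio column:

    for x in df.columns.str[0].unique():
        df[x + '_ratio'] = df[x + '_us']/df[x + '_ww']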
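If the company names are longer than one character, the loop needs the full prefix rather than the first character. A minimal sketch, assuming every column ends in `_us` or `_ww`; the names `acme` and `bolt` are made up for illustration, and the prefixes are collected with plain Python string methods instead of the str accessor:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
# Hypothetical column names with multi-character company prefixes.
cols = ['acme_us', 'bolt_us', 'acme_ww', 'bolt_ww']
df = pd.DataFrame(np.random.rand(1, 4), columns=cols)

# Strip the trailing "_us"/"_ww" and keep each prefix once, in column order.
prefixes = list(dict.fromkeys(c.rsplit('_', 1)[0] for c in df.columns))

for x in prefixes:
    df[x + '_ratio'] = df[x + '_us'] / df[x + '_ww']

print(df.filter(like='_ratio'))
```

Using `rsplit('_', 1)` only cuts at the last underscore, so a prefix that itself contains an underscore stays intact.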

For comparison, writing out each ratio by hand:

    import pandas as pd
    import numpy as np

    np.random.seed(0)
    values = ['a_us', 'b_us', 'c_us', 'a_ww','b_ww','c_ww']
    df = pd.DataFrame(np.random.rand(1, 6), columns=values[:6])

    df['a_ratio'] = df['a_us']/df['a_ww']
    df['b_ratio'] = df['b_us']/df['b_ww']
    df['c_ratio'] = df['c_us']/df['c_ww']
    print(df)
           a_us      b_us      c_us      a_ww      b_ww      c_ww   a_ratio  \
    0  0.548814  0.715189  0.602763  0.544883  0.423655  0.645894  1.007213   

        b_ratio   c_ratio  
    0  1.688142  0.933223  

is the same as the loop:

    import pandas as pd
    import numpy as np

    np.random.seed(0)
    values = ['a_us', 'b_us', 'c_us', 'a_ww','b_ww','c_ww']
    df = pd.DataFrame(np.random.rand(1, 6), columns=values[:6])

    for x in df.columns.str[0].unique():
        df[x + '_ratio'] = df[x+'_us']/df[x+'_ww']
    print(df)
           a_us      b_us      c_us      a_ww      b_ww      c_ww   a_ratio  \
    0  0.548814  0.715189  0.602763  0.544883  0.423655  0.645894  1.007213   

        b_ratio   c_ratio  
    0  1.688142  0.933223  
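The question also mentions defining a function, which is probably the closest thing to a SAS macro here. A rough sketch of one way to wrap the loop; the name `add_ratios` and its parameters are hypothetical, not part of pandas:

```python
import numpy as np
import pandas as pd


def add_ratios(df, num='_us', den='_ww', out='_ratio'):
    """Return a copy of df with a <prefix><out> column for each
    <prefix><num> / <prefix><den> pair found in the columns."""
    result = df.copy()
    for col in df.columns:
        if not col.endswith(num):
            continue
        prefix = col[:-len(num)]
        # Skip prefixes that have no matching denominator column.
        if prefix + den in df.columns:
            result[prefix + out] = df[col] / df[prefix + den]
    return result


np.random.seed(0)
values = ['a_us', 'b_us', 'c_us', 'a_ww', 'b_ww', 'c_ww']
df = pd.DataFrame(np.random.rand(1, 6), columns=values)

result = add_ratios(df)
print(result[['a_ratio', 'b_ratio', 'c_ratio']])
```

Because it works on a copy, the original `df` is left untouched, and the suffix arguments let the same function handle other pairs of columns.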
jezrael answered Nov 14 '22 22:11


You should use a MultiIndex: http://pandas.pydata.org/pandas-docs/stable/advanced.html

You should read that section, but for your specific case it can be:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.rand(10, 6), columns=pd.MultiIndex.from_product([['us', 'ww'], ['a', 'b', 'c']]))

    ratio = df['us'] / df['ww']

The result is a DataFrame with three columns, a, b and c, holding the three requested ratios.
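If the data already exists with flat names like `a_us`, the columns can be relabelled instead of rebuilding the frame. A minimal sketch, assuming every column name has the form `<company>_<region>` with exactly one underscore; the level names `company` and `region` are my own choice:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
values = ['a_us', 'b_us', 'c_us', 'a_ww', 'b_ww', 'c_ww']
df = pd.DataFrame(np.random.rand(1, 6), columns=values)

# Re-label a copy of the flat columns as (company, region) pairs.
wide = df.copy()
wide.columns = pd.MultiIndex.from_tuples(
    [tuple(c.split('_')) for c in df.columns], names=['company', 'region']
)

# Select each region across all companies, then divide column-wise.
ratio = wide.xs('us', axis=1, level='region') / wide.xs('ww', axis=1, level='region')
print(ratio)
```

Here the region is the inner level, so `xs` is used to pick it out; the division then lines up on the remaining company labels.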

Ophir Yoktan answered Nov 14 '22 23:11