
Specifying multiple column names with the same prefix efficiently

I am running a regression where each observation is a company. I want to control for the type of company (what it produces). This information is stored in an object (string) column, which I convert to categorical and then expand into dummies.

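import pandas as pd
import statsmodels.api as sm

df['Product Type'] = df['Product Type'].astype('category')
# no trailing .head() here: assigning it back would keep only the first five rows
df = pd.get_dummies(df, columns=['Product Type'])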

My sample is quite large and I end up with a lot of dummy variables. Adding them to the model one by one is tedious (there might be 10-15 of them).

reg = sm.OLS(endog=df['Y'], exog=df[['X1', 'Number of workers', 'X2', 'Product Type_Jewellery', 'Product Type_Apparel', (all the other product dummies)]], missing='drop')

Is there a more efficient way to do this? In Stata, I used the prefix i.Product_Type, which signals to the software that the string variable should be treated as categorical. Is there anything similar?

Filippo Sebastio asked Apr 19 '26 23:04



1 Answer

Use str.contains to find the columns whose names contain "Product"; once you have them, selecting the dummies is easy.

c = df.columns[df.columns.str.contains('Product')]

If you do not need regular-expression matching, disable it and initialise c as

c = df.columns[df.columns.str.contains('Product', regex=False)]

Or, using str.startswith, which matches only names that begin with the given string:

c = df.columns[df.columns.str.startswith('Product')]

Or, a list comprehension:

c = [c_ for c_ in df if c_.startswith('Product')]
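The choice between these selectors matters when some other column happens to contain the same word. Here is a minimal sketch on a toy frame; the column "Number of Products" is a made-up example, not something from the question, and the full prefix "Product Type_" assumes the dummies were created by get_dummies from a column called "Product Type".

```python
import pandas as pd

# Empty toy frame: only the column names matter for this comparison.
# "Number of Products" is a hypothetical column added for illustration.
df = pd.DataFrame(columns=[
    "Y", "X1", "Number of workers", "X2",
    "Product Type_Apparel", "Product Type_Jewellery",
    "Number of Products",
])

# Substring match: also picks up the unrelated "Number of Products".
by_contains = df.columns[df.columns.str.contains("Product", regex=False)]

# Prefix match on the full dummy prefix: only the dummy columns.
by_prefix = df.columns[df.columns.str.startswith("Product Type_")]

print(list(by_contains))
print(list(by_prefix))
```

If every column containing "Product" really is a dummy, all of the variants give the same result; matching on the full prefix is just the more conservative option.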

Finally, access the subset by unpacking c:

subset = df[['X1', 'Number of workers', 'X2', *c]]
reg = sm.OLS(endog=df['Y'], exog=subset, missing='drop')
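For reference, the whole path from the raw column to the regressor matrix can be sketched as follows. The data values are invented, and dtype=float is an assumption on my part: recent pandas versions return boolean dummies by default, and a numeric dtype keeps the matrix purely numeric when it is handed to a model.

```python
import pandas as pd

# Invented toy data with the column names used in the question.
df = pd.DataFrame({
    "Y": [1.0, 2.0, 3.0, 4.0],
    "X1": [0.5, 1.5, 2.5, 3.5],
    "Number of workers": [10, 20, 30, 40],
    "X2": [3.0, 1.0, 4.0, 1.0],
    "Product Type": ["Jewellery", "Apparel", "Apparel", "Toys"],
})

# Expand the string column into one dummy per category.
# dtype=float is an assumption: it avoids boolean dummy columns.
df = pd.get_dummies(df, columns=["Product Type"], dtype=float)

# Fixed controls plus every column that starts with the dummy prefix.
controls = ["X1", "Number of workers", "X2"]
dummies = [col for col in df.columns if col.startswith("Product Type_")]
exog = df[controls + dummies]

print(exog.columns.tolist())
```

The resulting exog can then be passed to sm.OLS exactly as in the line above, e.g. sm.OLS(endog=df['Y'], exog=exog, missing='drop').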
cs95 answered Apr 22 '26 13:04
