
How to self-reference a column in a pandas DataFrame?

In Python's pandas, I am loading a DataFrame like this:

drinks = pandas.read_csv(data_url)

where data_url is a string containing the URL of a CSV file.

To select all "light drinkers" (rows where light_drinker equals 1), I write:

drinks.light_drinker[drinks.light_drinker == 1]

Is there a more DRY-like way to self-reference the "parent"? I.e. something like:

drinks.light_drinker[self == 1]
James Graham asked Jan 23 '15 00:01



3 Answers

You can now use query or assign depending on what you need:

drinks.query('light_drinker == 1')

or, to add a derived column (assign returns a new DataFrame rather than changing drinks in place):

drinks.assign(strong_drinker = lambda x: x.light_drinker + 100)
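As a rough sketch of both calls, assuming a small made-up frame in place of the CSV from the question (the country column and all values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the CSV in the question; the values are made up.
drinks = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "light_drinker": [1, 0, 1, 2],
})

# query() evaluates the expression with the column names in scope,
# so the frame's name is not repeated inside the condition.
light = drinks.query("light_drinker == 1")

# assign() returns a new DataFrame; the lambda receives the frame itself.
with_strong = drinks.assign(strong_drinker=lambda x: x.light_drinker + 100)

print(light)
print(with_strong)
```

Because the lambda receives the frame it is called on, assign can sit in the middle of a method chain without naming an intermediate variable.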

Old answer

Not at the moment, but an enhancement along the lines of your idea is being discussed here. For simple cases, where() might be enough. The new API might look like this:

df.set(new_column=lambda self: self.light_drinker*2)
elyase answered Oct 23 '22 09:10


In recent versions of pandas (0.18.1 and later), .where() also accepts a callable; see the documentation:

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html?highlight=where#pandas.DataFrame.where

So, the following is now possible:

drinks.light_drinker.where(lambda x: x == 1)

This is particularly useful in method chains. However, it returns only the light_drinker Series (with NaN where the condition fails), not the DataFrame filtered by the values in that column. This matches your question, but I will also cover the other case.

To get a filtered DataFrame, use:

drinks.where(lambda x: x.light_drinker == 1)

Note that this keeps the shape of the original DataFrame: rows where the condition fails for light_drinker stay in the result, with every entry set to NaN.

If you don't want to preserve the shape of the DataFrame (i.e., you want to drop the NaN rows), use:
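To make the shape-preserving behaviour concrete, here is a minimal sketch on a made-up frame (the country column and the values are assumptions, not the data from the question):

```python
import pandas as pd

# Hypothetical data standing in for the question's CSV.
drinks = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "light_drinker": [1, 0, 1, 2],
})

# The callable receives the frame and returns a boolean Series;
# where() applies that row-wise condition to every column.
masked = drinks.where(lambda x: x.light_drinker == 1)

# Same number of rows and columns as the input; failing rows are all NaN.
all_nan_rows = masked.isna().all(axis=1)

print(masked)
print(all_nan_rows)
```

One side effect worth knowing: an integer column that receives NaN is typically converted to float in the result.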

drinks.query('light_drinker == 1')

Note that the items in DataFrame.index and DataFrame.columns are placed in the query namespace by default, so you can refer to columns by name without referencing the DataFrame itself.
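A small sketch of that namespace, assuming a made-up frame with the default integer index (the values are hypothetical):

```python
import pandas as pd

# Hypothetical data; the default RangeIndex (0..3) is used as the index.
drinks = pd.DataFrame({"light_drinker": [1, 0, 1, 2]})

# Column names are available directly inside the expression.
by_column = drinks.query("light_drinker == 1")

# For an unnamed index, the identifier `index` refers to the index.
by_both = drinks.query("index >= 1 and light_drinker == 1")

print(by_column)
print(by_both)
```

Unlike where(), query drops the non-matching rows, so the result contains only the rows that satisfy the condition.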

WindChimes answered Oct 23 '22 09:10


I don't know of any way to reference parent objects like self or this in pandas, but another approach that could be considered more DRY is where():

drinks.where(drinks.light_drinker == 1, inplace=True)
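One thing to watch with inplace=True is that it changes drinks itself and masks the non-matching rows with NaN instead of dropping them. A minimal sketch, assuming a made-up all-float frame (beer_servings is a hypothetical column, and floats are used so NaN fits without a dtype change):

```python
import pandas as pd

# Hypothetical all-float data, so NaN can be written without changing dtypes.
drinks = pd.DataFrame({
    "light_drinker": [1.0, 0.0, 1.0, 2.0],
    "beer_servings": [10.0, 20.0, 30.0, 40.0],
})

# With inplace=True the call returns None and modifies drinks directly.
result = drinks.where(drinks.light_drinker == 1, inplace=True)

# All four rows are still present; rows 1 and 3 are now NaN in every column.
print(result)
print(drinks)
```

If the original data is still needed afterwards, calling where() without inplace=True and keeping the returned frame is probably the safer choice.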
alacy answered Oct 23 '22 11:10