Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

subsetting a Python DataFrame

I am transitioning from R to Python. I just began using Pandas. I have an R code that subsets nicely:

k1 <- subset(data, Product = p.id & Month < mn & Year == yr, select = c(Time, Product)) 

Now, I want to do similar stuff in Python. this is what I have got so far:

import pandas as pd data = pd.read_csv("../data/monthly_prod_sales.csv")   #first, index the dataset by Product. And, get all that matches a given 'p.id' and time.  data.set_index('Product')  k = data.ix[[p.id, 'Time']]  # then, index this subset with Time and do more subsetting.. 

I am beginning to feel that I am doing this the wrong way. perhaps, there is an elegant solution. Can anyone help? I need to extract month and year from the timestamp I have and do subsetting. Perhaps there is a one-liner that will accomplish all this:

k1 <- subset(data, Product = p.id & Time >= start_time & Time < end_time, select = c(Time, Product)) 

thanks.

like image 432
user1717931 Avatar asked Oct 08 '13 02:10

user1717931


People also ask

What is subsetting a DataFrame?

Subsetting a data frame is the process of selecting a set of desired rows and columns from the data frame. You can select: all rows and limited columns. all columns and limited rows. limited rows and limited columns.


1 Answers

I'll assume that Time and Product are columns in a DataFrame, df is an instance of DataFrame, and that other variables are scalar values:

For now, you'll have to reference the DataFrame instance:

k1 = df.loc[(df.Product == p_id) & (df.Time >= start_time) & (df.Time < end_time), ['Time', 'Product']] 

The parentheses are also necessary, because of the precedence of the & operator vs. the comparison operators. The & operator is actually an overloaded bitwise operator which has the same precedence as arithmetic operators which in turn have a higher precedence than comparison operators.

In pandas 0.13 a new experimental DataFrame.query() method will be available. It's extremely similar to subset modulo the select argument:

With query() you'd do it like this:

df[['Time', 'Product']].query('Product == p_id and Month < mn and Year == yr') 

Here's a simple example:

In [9]: df = DataFrame({'gender': np.random.choice(['m', 'f'], size=10), 'price': poisson(100, size=10)})  In [10]: df Out[10]:   gender  price 0      m     89 1      f    123 2      f    100 3      m    104 4      m     98 5      m    103 6      f    100 7      f    109 8      f     95 9      m     87  In [11]: df.query('gender == "m" and price < 100') Out[11]:   gender  price 0      m     89 4      m     98 9      m     87 

The final query that you're interested will even be able to take advantage of chained comparisons, like this:

k1 = df[['Time', 'Product']].query('Product == p_id and start_time <= Time < end_time') 
like image 107
Phillip Cloud Avatar answered Sep 23 '22 02:09

Phillip Cloud