
Pandas OR statement ending in series contains

Tags: python, pandas

I have a DataFrame df with columns type and subtype and about 100k rows. I'm trying to classify what kind of data df contains by checking type/subtype combinations. While df can contain many different combinations, certain combinations only appear in particular data types. To check whether df contains any of these combinations, I'm currently doing:

typeA = (((df.type == 0) & ((df.subtype == 2) | (df.subtype == 3) |
          (df.subtype == 5) | (df.subtype == 6))) |
         ((df.type == 5) & ((df.subtype == 3) | (df.subtype == 4) |
          (df.subtype == 7) | (df.subtype == 8))))
A = typeA.sum()

Here typeA is a long Series of Falses that might contain some Trues; if A > 0, I know it contained a True. The problem with this scheme is that even if the first row of df produces a True, it still has to check every other row. Checking the whole DataFrame is faster than using a for loop with a break, but I'm wondering if there is a better way to do it.

Thanks for any suggestions.

TristanMatthews asked Nov 19 '13 03:11

2 Answers

Use pandas crosstab to count every type/subtype combination, then sum the cells you care about:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 10, size=(100, 2)), columns=["type", "subtype"])
counts = pd.crosstab(df.type, df.subtype)

print(counts.loc[0, [2, 3, 5, 6]].sum() + counts.loc[5, [3, 4, 7, 8]].sum())

The result is the same as:

a = (((df.type == 0) & ((df.subtype == 2) | (df.subtype == 3) | 
         (df.subtype == 5) | (df.subtype == 6))) | 
         ((df.type == 5) & ((df.subtype == 3) | (df.subtype == 4) | (df.subtype == 7) | 
         (df.subtype ==  8))))
a.sum()
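
As a rough check of that equivalence, here is a minimal sketch that computes the count both ways. Two things in it are my assumptions and not part of the answer above: the reindex call, added so that the .loc lookups do not fail when a type or subtype value happens to be absent from the data, and the use of isin to shorten the boolean mask.

```python
import numpy as np
import pandas as pd

# Reproducible random data with the same shape as the example above.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 10, size=(100, 2)), columns=["type", "subtype"])

# Count every (type, subtype) combination.
counts = pd.crosstab(df["type"], df["subtype"])

# Assumption: pad missing labels with zero counts so .loc never raises KeyError.
counts = counts.reindex(index=range(10), columns=range(10), fill_value=0)

via_crosstab = int(counts.loc[0, [2, 3, 5, 6]].sum() + counts.loc[5, [3, 4, 7, 8]].sum())

# The same count from a boolean mask over the rows.
mask = (((df["type"] == 0) & df["subtype"].isin([2, 3, 5, 6])) |
        ((df["type"] == 5) & df["subtype"].isin([3, 4, 7, 8])))
via_mask = int(mask.sum())

print(via_crosstab, via_mask)
```

Note that crosstab still scans the whole DataFrame, so this shortens the lookup but does not stop early on the first match.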
HYRY answered Oct 19 '22 12:10



In pandas 0.13 (soon to be released) you can pass this as a query, which can use numexpr and should be more efficient for your use case. Inside the query string, refer to the columns by name (type rather than df.type). A quoted string cannot span several lines, so split the expression across adjacent string literals, which also keeps the indentation readable:

df.query("((type == 0) & ((subtype == 2)"
                        "|(subtype == 3)"
                        "|(subtype == 5)"
                        "|(subtype == 6)))"
        "|((type == 5) & ((subtype == 3)"
                        "|(subtype == 4)"
                        "|(subtype == 7)"
                        "|(subtype == 8)))")

Update: It may be possible to do this more efficiently, and certainly more concisely, using the "in" syntax:

df.query("(type == 0) & (subtype in [2, 3, 5, 6])"
        "|(type == 5) & (subtype in [3, 4, 7, 8])")
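
None of the expressions above stop at the first match; each one evaluates every row. If early exit matters, one option is to test the frame in chunks and return as soon as a chunk contains a hit. The sketch below is only an illustration of that idea, not something from the original answer: TYPE_A, type_a_mask, has_type_a and the chunk_size default are all hypothetical names and values, and it uses Series.isin in place of query.

```python
import pandas as pd

# Hypothetical lookup table: type -> subtypes that identify "type A" data.
TYPE_A = {0: [2, 3, 5, 6], 5: [3, 4, 7, 8]}


def type_a_mask(frame):
    """Return a boolean Series that is True where a row matches TYPE_A."""
    mask = pd.Series(False, index=frame.index)
    for type_value, subtypes in TYPE_A.items():
        mask |= (frame["type"] == type_value) & frame["subtype"].isin(subtypes)
    return mask


def has_type_a(frame, chunk_size=10_000):
    """Check the frame chunk by chunk and stop at the first chunk with a match."""
    for start in range(0, len(frame), chunk_size):
        chunk = frame.iloc[start:start + chunk_size]
        if type_a_mask(chunk).any():
            return True
    return False


df = pd.DataFrame({"type": [1, 0, 5, 5], "subtype": [2, 3, 1, 8]})
print(type_a_mask(df).tolist())  # [False, True, False, True]
print(has_type_a(df))            # True
```

Whether chunking actually beats a single vectorized pass depends on where the first match sits and on the chunk size, so it is worth timing on real data before adopting it.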
Andy Hayden answered Oct 19 '22 11:10
