I have a DataFrame df that has columns type and subtype and about 100k rows. I'm trying to classify what kind of data df contains by checking type / subtype combinations. While df can contain many different combinations, there are particular combinations that only appear in certain data types. To check if my object contains any of these combinations I'm currently doing:
typeA = (((df.type == 0) & ((df.subtype == 2) | (df.subtype == 3) |
                            (df.subtype == 5) | (df.subtype == 6))) |
         ((df.type == 5) & ((df.subtype == 3) | (df.subtype == 4) |
                            (df.subtype == 7) | (df.subtype == 8))))
A = typeA.sum()
Here typeA is a long Series of Falses that might contain some Trues; if A > 0 then I know the data contained at least one such combination. The problem with this scheme is that even if the first row of df produces a True, it still has to check everything else. Checking the whole DataFrame is faster than using a for loop with a break, but I'm wondering if there is a better way to do it.
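For reference, the same mask can be written more compactly with isin() and reduced with .any(); this is only a sketch of the check above, not a faster method, since the whole mask is still evaluated before the reduction:
# equivalent, more compact form of the mask above (sketch, not a speed-up)
typeA = ((df.type == 0) & df.subtype.isin([2, 3, 5, 6])) | \
        ((df.type == 5) & df.subtype.isin([3, 4, 7, 8]))
contains_typeA = typeA.any()  # same information as A > 0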
Thanks for any suggestions.
Use pandas crosstab:
import numpy as np
import pandas as pd

# example data: random type/subtype values
df = pd.DataFrame(np.random.randint(0, 10, size=(100, 2)), columns=["type", "subtype"])

# contingency table of how often each type/subtype combination occurs
counts = pd.crosstab(df.type, df.subtype)
print(counts.loc[0, [2, 3, 5, 6]].sum() + counts.loc[5, [3, 4, 7, 8]].sum())
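Since the question only needs to know whether any such combination exists, the combined count can simply be compared with zero; a small follow-up sketch using the counts table built above (it assumes, like the example data, that those type/subtype labels actually appear in the table):
# True if at least one of the listed type/subtype combinations occurs
has_typeA = (counts.loc[0, [2, 3, 5, 6]].sum() +
             counts.loc[5, [3, 4, 7, 8]].sum()) > 0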
The printed count is the same as:
a = (((df.type == 0) & ((df.subtype == 2) | (df.subtype == 3) |
                        (df.subtype == 5) | (df.subtype == 6))) |
     ((df.type == 5) & ((df.subtype == 3) | (df.subtype == 4) |
                        (df.subtype == 7) | (df.subtype == 8))))
a.sum()
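A quick way to convince yourself of the equivalence (assuming the same df as above) is to compare the two totals directly:
# sanity check: both approaches count the same rows
assert a.sum() == counts.loc[0, [2, 3, 5, 6]].sum() + counts.loc[5, [3, 4, 7, 8]].sum()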
In pandas 0.13 (soon to be released) you can pass this as a query, which will use numexpr and should be more efficient for your use case:
df.query("((df.type == 0) & ((df.subtype == 2) | (df.subtype == 3) |
(df.subtype == 5) | (df.subtype == 6))) |
((df.type == 5) & ((df.subtype == 3) | (df.subtype == 4) | (df.subtype == 7) |
(df.subtype == 8)))")
Note: I would probably clean up the indentation to make this more readable (you can also replace df.type with type in most cases):
df.query("((type == 0) & ((subtype == 2)"
"|(subtype == 3)"
"|(subtype == 5)"
"|(subtype == 6)))"
"|((type == 5) & ((subtype == 3)"
"|(subtype == 4)"
"|(subtype == 7)"
"|(subtype == 8)))")
Update: it may be possible to do this more efficiently, and certainly more concisely, using the "in" syntax:
df.query("(type == 0) & (subtype in [2, 3, 5, 6])"
         "|(type == 5) & (subtype in [3, 4, 7, 8])")