
How to count the number of rows containing both a value in a set of columns and another value in another column in a Pandas dataframe?

# import packages, set nan
import pandas as pd
import numpy as np
nan = np.nan

The problem

I have a dataframe with a certain number of observations as columns and measurements as rows. The results of the observations are A, B, C, D ... It also has a category column, which denotes the category of the measurement. Categories: a, b, c, d ... If a column contains a nan in a row, the observation was not made during that measurement (so nan is not an observation, it is the lack of one). An MRE:

data = {'observation0': ['A','A','A','A','B'],'observation1': ['B','B','B','C',nan], 'category': ['a', 'b', 'c','a','b']}
df = pd.DataFrame.from_dict(data)

df looks like this:

  observation0 observation1 category
0            A            B        a
1            A            B        b
2            A            B        c
3            A            C        a
4            B          NaN        b

I would like to count how many times each observational result (ie A, B, C, D...) is observed using each category of measurement (ie a, b, c, d ...).

I would like to get:

obs_A_in_cat_a    2
obs_A_in_cat_b    1
obs_A_in_cat_c    1
obs_B_in_cat_a    1
obs_B_in_cat_b    2
obs_B_in_cat_c    1
obs_C_in_cat_a    1
obs_C_in_cat_b    0
obs_C_in_cat_c    0

Observation A appears in the rows with index 0 and 3 (see df above) while the measurement category is a, so obs_A_in_cat_a is 2. Observation A appears only once (row index 1) in a measurement with category b, so obs_A_in_cat_b is 1, and so on.


My solution

First I gather the outcomes of observations, taking care not to include nans:

observations = pd.unique(pd.concat([df[col] for col in df.columns if 'observation' in col]).dropna())
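An equivalent, somewhat shorter way to collect the observation values is to select the observation columns and stack them into a single Series; this is a sketch on the MRE data, with an explicit dropna() to be safe across pandas versions:

```python
import pandas as pd
import numpy as np

data = {'observation0': ['A', 'A', 'A', 'A', 'B'],
        'observation1': ['B', 'B', 'B', 'C', np.nan],
        'category': ['a', 'b', 'c', 'a', 'b']}
df = pd.DataFrame.from_dict(data)

obs_cols = [col for col in df.columns if 'observation' in col]
# stack() flattens the selected columns into one Series;
# dropna() discards the missing observations
observations = pd.unique(df[obs_cols].stack().dropna())
print(observations)  # ['A' 'B' 'C']
```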

Then I gather the different measurement categories:

categories = pd.unique(df['category'])

Then, iterate through each observation and category, creating an indicator column for each pair:

for observation in observations:
    for category in categories:
        df['obs_' + observation + '_in_cat_' + category] = df.apply(
            lambda row: int(observation in [row[col]
                                            for col in df.columns
                                            if 'observation' in col]
                            and row['category'] == category),
            axis=1)

The lambda function checks if observation appears in each row, and that the measurement is in the category which is currently considered in the iteration. New columns are created, with headers obs_OBSERVATION_in_cat_CATEGORY, where OBSERVATION is A, B, C, D ..., CATEGORY is a, b, c, d ... If an observationX in a categoryY was made during a measurement, obs_OBSERVATIONX_in_cat_CATEGORYY is 1 in the row corresponding to that measurement, otherwise it is 0.

The resulting df (parts of it) looks like this:

[image: df with the new 0/1 indicator columns obs_A_in_cat_a, obs_A_in_cat_b, ...]

Finish by sum()ming the values of the newly created columns, selecting them with a conditional list comprehension:

df[[col for col in df.columns if '_in_cat_' in col]].sum()

This gives me the output which I'd like to get, shown above. Whole notebook here.


The question

This method seems to work, but it is too slow to be practical on real data. How could I make it quicker? I am looking for something like:

how_many_times_each_observation_was_made_using_each_category_of_measurement(
    df,
    list_of_observation_columns,
    category_column)
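Such a function can be sketched with DataFrame.melt and groupby, which avoids the per-row apply entirely (the function name and signature come from the wish above; they are not an existing API):

```python
import pandas as pd
import numpy as np

def how_many_times_each_observation_was_made_using_each_category_of_measurement(
        df, list_of_observation_columns, category_column):
    # melt the observation columns into one long 'value' column,
    # keeping the category of each measurement alongside it
    long = df.melt(id_vars=category_column,
                   value_vars=list_of_observation_columns)
    # count every (observation, category) pair; NaN values are dropped
    counts = long.groupby(['value', category_column]).size()
    # reindex over the full product so absent pairs appear as 0
    full = pd.MultiIndex.from_product(counts.index.levels)
    counts = counts.reindex(full, fill_value=0)
    # flatten the MultiIndex into obs_X_in_cat_y labels
    counts.index = counts.index.map(lambda x: f'obs_{x[0]}_in_cat_{x[1]}')
    return counts

data = {'observation0': ['A', 'A', 'A', 'A', 'B'],
        'observation1': ['B', 'B', 'B', 'C', np.nan],
        'category': ['a', 'b', 'c', 'a', 'b']}
df = pd.DataFrame.from_dict(data)
print(how_many_times_each_observation_was_made_using_each_category_of_measurement(
    df, ['observation0', 'observation1'], 'category'))
```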
zabop asked Aug 04 '20


1 Answer

Solution with MultiIndex: reshape with DataFrame.melt, count the values with GroupBy.size, and add 0 for missing combinations with Series.reindex:

s = df.melt('category').groupby(['value','category']).size()
s = s.reindex(pd.MultiIndex.from_product(s.index.levels), fill_value=0)
print (s)
value  category
A      a           2
       b           1
       c           1
B      a           1
       b           2
       c           1
C      a           1
       b           0
       c           0
dtype: int64

Finally, it is possible to flatten the MultiIndex with f-strings:

s.index = s.index.map(lambda x: f'obs_{x[0]}_in_cat_{x[1]}')   
print (s)
obs_A_in_cat_a    2
obs_A_in_cat_b    1
obs_A_in_cat_c    1
obs_B_in_cat_a    1
obs_B_in_cat_b    2
obs_B_in_cat_c    1
obs_C_in_cat_a    1
obs_C_in_cat_b    0
obs_C_in_cat_c    0
dtype: int64
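For a two-dimensional view of the same counts, pd.crosstab on the melted frame is an alternative sketch: it builds the full observation x category table directly, filling absent pairs with 0 without an explicit reindex:

```python
import pandas as pd
import numpy as np

data = {'observation0': ['A', 'A', 'A', 'A', 'B'],
        'observation1': ['B', 'B', 'B', 'C', np.nan],
        'category': ['a', 'b', 'c', 'a', 'b']}
df = pd.DataFrame.from_dict(data)

# melt to long form and drop the missing observations
m = df.melt('category').dropna(subset=['value'])
# rows: observation values, columns: categories, cells: counts (0 if absent)
table = pd.crosstab(m['value'], m['category'])
print(table)
```

Calling table.stack() on the result recovers the same Series shape as the groupby solution above.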
jezrael answered Nov 14 '22