
How to count the number of rows containing both a value in a set of columns and another value in another column in a Pandas dataframe?

# import packages, set nan
import pandas as pd
import numpy as np
nan = np.nan

The problem

I have a dataframe with a certain number of observations as columns and measurements as rows. The results of the observations are A, B, C, D ... It also has a category column, which denotes the category of the measurement. Categories: a, b, c, d ... If a column contains a nan in a row, the observation was not made during that measurement (so nan is not an observation, it is the lack of one). An MRE:

data = {'observation0': ['A','A','A','A','B'],'observation1': ['B','B','B','C',nan], 'category': ['a', 'b', 'c','a','b']}
df = pd.DataFrame.from_dict(data)

df looks like this:

  observation0 observation1 category
0            A            B        a
1            A            B        b
2            A            B        c
3            A            C        a
4            B          NaN        b

I would like to count how many times each observational result (ie A, B, C, D...) is observed using each category of measurement (ie a, b, c, d ...).

I would like to get:

obs_A_in_cat_a    2
obs_A_in_cat_b    1
obs_A_in_cat_c    1
obs_B_in_cat_a    1
obs_B_in_cat_b    2
obs_B_in_cat_c    1
obs_C_in_cat_a    1
obs_C_in_cat_b    0
obs_C_in_cat_c    0

Observation A appears in the rows with index 0 and 3 (see df above) while the measurement category is a, so obs_A_in_cat_a is 2. Observation A appears only once (row index 1) in a measurement with category b, so obs_A_in_cat_b is 1, and so on.


My solution

First I gather the outcomes of observations, taking care not to include nans:

observations = pd.unique(pd.concat([df[col] for col in df.columns if 'observation' in col]).dropna())
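An equivalent, somewhat shorter way to collect the observation values is to select the observation columns and stack them into a single Series; this is a sketch on the MRE data, with an explicit dropna() to be safe across pandas versions:

```python
import pandas as pd
import numpy as np

data = {'observation0': ['A', 'A', 'A', 'A', 'B'],
        'observation1': ['B', 'B', 'B', 'C', np.nan],
        'category': ['a', 'b', 'c', 'a', 'b']}
df = pd.DataFrame.from_dict(data)

obs_cols = [col for col in df.columns if 'observation' in col]
# stack() flattens the selected columns into one Series;
# dropna() discards the missing observations
observations = pd.unique(df[obs_cols].stack().dropna())
print(observations)  # ['A' 'B' 'C']
```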

Then I gather the different measurement categories:

categories = pd.unique(df['category'])

Then, iterate through each observation and category, creating an indicator column for each pair:

for observation in observations:
    for category in categories:
        df['obs_' + observation + '_in_cat_' + category] = df.apply(
            lambda row: int(observation in [row[col]
                                            for col in df.columns
                                            if 'observation' in col]
                            and row['category'] == category),
            axis=1)

The lambda function checks if observation appears in each row, and that the measurement is in the category which is currently considered in the iteration. New columns are created, with headers obs_OBSERVATION_in_cat_CATEGORY, where OBSERVATION is A, B, C, D ..., CATEGORY is a, b, c, d ... If an observationX in a categoryY was made during a measurement, obs_OBSERVATIONX_in_cat_CATEGORYY is 1 in the row corresponding to that measurement, otherwise it is 0.

The resulting df (parts of it) looks like this:

[image: df with the new 0/1 indicator columns obs_A_in_cat_a, obs_A_in_cat_b, ...]

Finish by sum()ming the values of the newly created columns, selecting them with a conditional list comprehension:

df[[col for col in df.columns if '_in_cat_' in col]].sum()

This gives me the output which I'd like to get, shown above. Whole notebook here.


The question

This method seems to work, but it is too slow to be practical on real data. How could I make it quicker? I am looking for something like:

how_many_times_each_observation_was_made_using_each_category_of_measurement(
    df,
    list_of_observation_columns,
    category_column)
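Such a function can be sketched with DataFrame.melt and groupby, which avoids the per-row apply entirely (the function name and signature come from the wish above; they are not an existing API):

```python
import pandas as pd
import numpy as np

def how_many_times_each_observation_was_made_using_each_category_of_measurement(
        df, list_of_observation_columns, category_column):
    # melt the observation columns into one long 'value' column,
    # keeping the category of each measurement alongside it
    long = df.melt(id_vars=category_column,
                   value_vars=list_of_observation_columns)
    # count every (observation, category) pair; NaN values are dropped
    counts = long.groupby(['value', category_column]).size()
    # reindex over the full product so absent pairs appear as 0
    full = pd.MultiIndex.from_product(counts.index.levels)
    counts = counts.reindex(full, fill_value=0)
    # flatten the MultiIndex into obs_X_in_cat_y labels
    counts.index = counts.index.map(lambda x: f'obs_{x[0]}_in_cat_{x[1]}')
    return counts

data = {'observation0': ['A', 'A', 'A', 'A', 'B'],
        'observation1': ['B', 'B', 'B', 'C', np.nan],
        'category': ['a', 'b', 'c', 'a', 'b']}
df = pd.DataFrame.from_dict(data)
print(how_many_times_each_observation_was_made_using_each_category_of_measurement(
    df, ['observation0', 'observation1'], 'category'))
```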
zabop asked Aug 04 '20


1 Answer

Solution with MultiIndex: reshape with DataFrame.melt, count the values with GroupBy.size, and add 0 for missing combinations with Series.reindex:

s = df.melt('category').groupby(['value','category']).size()
s = s.reindex(pd.MultiIndex.from_product(s.index.levels), fill_value=0)
print (s)
value  category
A      a           2
       b           1
       c           1
B      a           1
       b           2
       c           1
C      a           1
       b           0
       c           0
dtype: int64

Finally, it is possible to flatten the MultiIndex with f-strings:

s.index = s.index.map(lambda x: f'obs_{x[0]}_in_cat_{x[1]}')   
print (s)
obs_A_in_cat_a    2
obs_A_in_cat_b    1
obs_A_in_cat_c    1
obs_B_in_cat_a    1
obs_B_in_cat_b    2
obs_B_in_cat_c    1
obs_C_in_cat_a    1
obs_C_in_cat_b    0
obs_C_in_cat_c    0
dtype: int64
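For a two-dimensional view of the same counts, pd.crosstab on the melted frame is an alternative sketch: it builds the full observation x category table directly, filling absent pairs with 0 without an explicit reindex:

```python
import pandas as pd
import numpy as np

data = {'observation0': ['A', 'A', 'A', 'A', 'B'],
        'observation1': ['B', 'B', 'B', 'C', np.nan],
        'category': ['a', 'b', 'c', 'a', 'b']}
df = pd.DataFrame.from_dict(data)

# melt to long form and drop the missing observations
m = df.melt('category').dropna(subset=['value'])
# rows: observation values, columns: categories, cells: counts (0 if absent)
table = pd.crosstab(m['value'], m['category'])
print(table)
```

Calling table.stack() on the result recovers the same Series shape as the groupby solution above.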
jezrael answered Nov 14 '22