# import packages, set nan
import pandas as pd
import numpy as np
nan = np.nan
I have a dataframe with a certain number of observations as columns and measurements as rows. The results of the observations are A, B, C, D .... It also has a category column, which denotes the category of each measurement; the categories are a, b, c, d .... If a column contains nan in a row, the observation was not made during that measurement (so nan is not an observation, it is the lack of one). An MRE:
data = {'observation0': ['A','A','A','A','B'],'observation1': ['B','B','B','C',nan], 'category': ['a', 'b', 'c','a','b']}
df = pd.DataFrame.from_dict(data)
df
looks like this:
  observation0 observation1 category
0            A            B        a
1            A            B        b
2            A            B        c
3            A            C        a
4            B          NaN        b
I would like to count how many times each observational result (i.e. A, B, C, D ...) is observed using each category of measurement (i.e. a, b, c, d ...).
I would like to get:
obs_A_in_cat_a 2
obs_A_in_cat_b 1
obs_A_in_cat_c 1
obs_B_in_cat_a 1
obs_B_in_cat_b 2
obs_B_in_cat_c 1
obs_C_in_cat_a 1
obs_C_in_cat_b 0
obs_C_in_cat_c 0
Observation A appears in rows with index 0 and 3 (see above df) while the measurement category is a, so obs_A_in_cat_a is 2. Observation A appears only once (row index 1) in a measurement with category b, so obs_A_in_cat_b is 1, and so on.
First I gather the outcomes of observations, taking care not to include nans:
observations = pd.unique(pd.concat([df[col] for col in df.columns if 'observation' in col]).dropna())
The different categories they belong to:
categories = pd.unique(df['category'])
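For the MRE above, these contain:
print (observations)
['A' 'B' 'C']
print (categories)
['a' 'b' 'c']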
Then, iterate through observations and categories:
for observation in observations:
    for category in categories:
        df['obs_' + observation + '_in_cat_' + category] = \
            df.apply(lambda row: int(observation in [row[col]
                                                     for col in df.columns
                                                     if 'observation' in col]
                                     and row['category'] == category), axis=1)
The lambda function checks whether observation appears in each row and whether the measurement is in the category currently considered in the iteration. New columns are created, with headers obs_OBSERVATION_in_cat_CATEGORY, where OBSERVATION is A, B, C, D ... and CATEGORY is a, b, c, d .... If an observation X in a category Y was made during a measurement, obs_OBSERVATIONX_in_cat_CATEGORYY is 1 in the row corresponding to that measurement, otherwise it is 0.
The resulting df (parts of it) looks like this:
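For the MRE above, for example, the first three of the nine new columns look like this:
df[['obs_A_in_cat_a', 'obs_A_in_cat_b', 'obs_A_in_cat_c']]
   obs_A_in_cat_a  obs_A_in_cat_b  obs_A_in_cat_c
0               1               0               0
1               0               1               0
2               0               0               1
3               1               0               0
4               0               0               0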
Finish by sum()ming the values of the newly created columns, selecting them with a conditional list comprehension:
df[[col for col in df.columns if '_in_cat_' in col]].sum()
This gives me the output which I'd like to get, shown above. Whole notebook here.
This method seems to work, but it is too slow to be easily applicable in real life. How could I make it quicker? I am looking for something like:
how_many_times_each_observation_was_made_using_each_category_of_measurement(
df,
list_of_observation_columns,
category_column)
Solution with MultiIndex: reshape with DataFrame.melt, count values with GroupBy.size, and add 0 for missing combinations with Series.reindex:
s = df.melt('category').groupby(['value','category']).size()
s = s.reindex(pd.MultiIndex.from_product(s.index.levels), fill_value=0)
print (s)
value category
A a 2
b 1
c 1
B a 1
b 2
c 1
C a 1
b 0
c 0
dtype: int64
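For reference, this is the long table which df.melt('category') produces for the MRE and which is then grouped; nan values are dropped by groupby, and combinations that never occur (like C with b) only appear after the reindex step:
print (df.melt('category'))
  category      variable value
0        a  observation0     A
1        b  observation0     A
2        c  observation0     A
3        a  observation0     A
4        b  observation0     B
5        a  observation1     B
6        b  observation1     B
7        c  observation1     B
8        a  observation1     C
9        b  observation1   NaN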
Last, it is possible to flatten the index with f-strings:
s.index = s.index.map(lambda x: f'obs_{x[0]}_in_cat_{x[1]}')
print (s)
obs_A_in_cat_a 2
obs_A_in_cat_b 1
obs_A_in_cat_c 1
obs_B_in_cat_a 1
obs_B_in_cat_b 2
obs_B_in_cat_c 1
obs_C_in_cat_a 1
obs_C_in_cat_b 0
obs_C_in_cat_c 0
dtype: int64
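To get the one-call form asked for in the question, the same three steps can be wrapped into a function. A minimal sketch, reusing the asker's hypothetical function and argument names:
def how_many_times_each_observation_was_made_using_each_category_of_measurement(
        df, list_of_observation_columns, category_column):
    # reshape the observation columns into long form, keeping the category column
    s = df.melt(id_vars=category_column, value_vars=list_of_observation_columns)
    # count each (observation value, category) pair; nan values are dropped by groupby
    s = s.groupby(['value', category_column]).size()
    # add the combinations that never occur, filled with 0
    s = s.reindex(pd.MultiIndex.from_product(s.index.levels), fill_value=0)
    # flatten the MultiIndex into obs_..._in_cat_... labels
    s.index = s.index.map(lambda x: f'obs_{x[0]}_in_cat_{x[1]}')
    return s

how_many_times_each_observation_was_made_using_each_category_of_measurement(
    df, ['observation0', 'observation1'], 'category')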