
Compare elements in dataframe columns for each row - Python

I have a really large dataframe (thousands of rows), but let's assume it looks like this:

   A  B  C  D  E  F
0  2  5  2  2  2  2
1  5  2  5  5  5  5
2  5  2  5  2  5  5
3  2  2  2  2  2  2
4  5  5  5  5  5  5

For each row, I need to find the value that appears most frequently within a group of columns, for instance within columns A, B, C and within columns D, E, F, and store each result in a new column. For this example, my expected output is:

ABC  DEF  
 2    2     
 5    5     
 5    5     
 2    2     
 5    5     

How can I do this in Python? Thanks!

Ally asked Apr 30 '19 17:04





3 Answers

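Here is one way, using groupby over the columns:

mapperd = {'A': 'ABC', 'B': 'ABC', 'C': 'ABC', 'D': 'DEF', 'E': 'DEF', 'F': 'DEF'}
# Group the columns by label, then take the first mode within each group for every row
df.groupby(mapperd, axis=1).agg(lambda x: x.mode()[0])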
Out[826]: 
   ABC  DEF
0    2    2
1    5    5
2    5    5
3    2    2
4    5    5
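
Newer pandas releases deprecate the axis=1 argument of groupby (as far as I know, since 2.1). A minimal sketch of the same idea that avoids it, assuming the sample frame from the question: transpose, group the rows by label, and transpose back. The names `mapper` and `result` are only illustrative.

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame(
    [[2, 5, 2, 2, 2, 2],
     [5, 2, 5, 5, 5, 5],
     [5, 2, 5, 2, 5, 5],
     [2, 2, 2, 2, 2, 2],
     [5, 5, 5, 5, 5, 5]],
    columns=list("ABCDEF"),
)

# Map each original column to the group it belongs to.
mapper = {"A": "ABC", "B": "ABC", "C": "ABC", "D": "DEF", "E": "DEF", "F": "DEF"}

# After transposing, the former columns are index labels, so a plain
# row-wise groupby can use the mapper. For each group and each original
# row, keep the first mode, then transpose back to the original layout.
result = df.T.groupby(mapper).agg(lambda s: s.mode().iloc[0]).T

print(result)
```

The transpose copies the data, so on a very wide or very long frame this may be slower than the array-based approach.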
BENY answered Oct 17 '22 00:10



For good performance, you can work with the underlying NumPy arrays and use scipy.stats.mode to compute the mode:

import pandas as pd
from scipy import stats

cols = ['ABC', 'DEF']
# Split every row into len(cols) chunks; assumes the groups are contiguous and equally wide
a = df.values.reshape(-1, df.shape[1] // len(cols))
pd.DataFrame(stats.mode(a, axis=1).mode.reshape(-1, len(cols)), columns=cols)

    ABC  DEF
0    2    2
1    5    5
2    5    5
3    2    2
4    5    5
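
To make the reshape step easier to follow, here is a self-contained sketch on the question's sample data. It assumes the groups are contiguous and all the same width (three columns each here); `group_size`, `modes`, and `result` are illustrative names, not part of the original answer.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Sample frame from the question.
df = pd.DataFrame(
    [[2, 5, 2, 2, 2, 2],
     [5, 2, 5, 5, 5, 5],
     [5, 2, 5, 2, 5, 5],
     [2, 2, 2, 2, 2, 2],
     [5, 5, 5, 5, 5, 5]],
    columns=list("ABCDEF"),
)

group_size = 3  # assumed width of each column group

# Each original row of 6 values becomes 2 consecutive rows of 3:
# a[0] holds row 0's A, B, C and a[1] holds row 0's D, E, F, and so on.
a = df.to_numpy().reshape(-1, group_size)

# One mode per chunk. Depending on the SciPy version the result is
# 1-D or has a trailing axis of length 1, so reshape explicitly.
modes = np.asarray(stats.mode(a, axis=1).mode)

# Fold the chunk modes back into one row per original row.
result = pd.DataFrame(
    modes.reshape(len(df), -1),
    columns=["ABC", "DEF"],
    index=df.index,
)

print(result)
```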
yatu answered Oct 17 '22 02:10



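You can try selecting each group's columns by header:

grp = ['ABC', 'DEF']
# [*g] unpacks a group name into its column labels, e.g. 'ABC' -> ['A', 'B', 'C'].
# set_axis returns a new object; its inplace argument was removed in pandas 2.0.
pd.concat([df.loc[:, [*g]].mode(axis=1).set_axis([g], axis=1) for g in grp], axis=1)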

Output:

   ABC  DEF
0    2    2
1    5    5
2    5    5
3    2    2
4    5    5
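
When the column names do not spell out the group name, or a row can have tied values, the same selection idea could be wrapped in a small helper. This is only a sketch: `row_modes` and `groups` are hypothetical names, and taking the first column of the mode result is an assumption about how ties should be resolved.

```python
import pandas as pd

def row_modes(df, groups):
    """Return one column per group holding the row-wise mode of that group.

    `groups` maps a new column name to the list of existing columns
    it should summarise.
    """
    return pd.DataFrame(
        {
            # mode(axis=1) may return several columns when a row has ties;
            # keeping only the first one picks a single value per row.
            name: df[cols].mode(axis=1).iloc[:, 0]
            for name, cols in groups.items()
        }
    )

# Sample frame from the question.
df = pd.DataFrame(
    [[2, 5, 2, 2, 2, 2],
     [5, 2, 5, 5, 5, 5],
     [5, 2, 5, 2, 5, 5],
     [2, 2, 2, 2, 2, 2],
     [5, 5, 5, 5, 5, 5]],
    columns=list("ABCDEF"),
)

result = row_modes(df, {"ABC": ["A", "B", "C"], "DEF": ["D", "E", "F"]})
print(result)
```

Explicit column lists also mean the groups need not be contiguous or equally sized.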
Scott Boston answered Oct 17 '22 02:10
