Combine 2 pandas dataframes according to boolean Vector

Tags:

python

pandas

My problem is the following:
Say I have two pandas dataframes with the same number of columns, for instance:

A= 1 2
   3 4 
   8 9

and

B= 7 8
   4 0

I also have a boolean vector whose length is exactly the number of rows in A plus the number of rows in B (here 5), and which contains as many 1s as B has rows (here two). Let's say Bool= 0 1 0 1 0.

My goal is to merge A and B into a bigger dataframe called C such that the rows of B land at the positions of the 1s in Bool and the rows of A fill the remaining positions, each in their original order. With this example that gives:

C= 1 2
   7 8
   3 4 
   4 0
   8 9

Do you know how to do this? It would help me tremendously. Thanks for reading.
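For reference, the example above can be written out as a small setup sketch. The name `mask` (for Bool) and `C_expected` are only illustrative, and the checks simply restate the conditions described in the question:

```python
import pandas as pd

# Example inputs from the question
A = pd.DataFrame([[1, 2], [3, 4], [8, 9]])
B = pd.DataFrame([[7, 8], [4, 0]])
mask = pd.Series([0, 1, 0, 1, 0], dtype=bool)  # "Bool" in the question

# Conditions stated in the question
assert A.shape[1] == B.shape[1]        # same number of columns
assert len(mask) == len(A) + len(B)    # one mask entry per output row
assert int(mask.sum()) == len(B)       # one True per row of B

# Desired result, copied by hand from the question
C_expected = pd.DataFrame([[1, 2], [7, 8], [3, 4], [4, 0], [8, 9]])
```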

Joan92 asked May 23 '17 17:05


3 Answers

Here's a pandas-only solution that reindexes the original dataframes and then concatenates them:

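import pandas as pd

Bool = pd.Series([0, 1, 0, 1, 0], dtype=bool)
B.index = Bool[ Bool].index  # B takes the positions where Bool is True
A.index = Bool[~Bool].index  # A takes the remaining positions
pd.concat([A,B]).sort_index() # sort_index() puts the rows into interleaved order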
#   0  1
#0  1  2
#1  7  8
#2  3  4
#3  4  0
#4  8  9
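Note that the snippet above overwrites the index of A and B in place. A minimal sketch of the same idea that leaves the inputs untouched is given below; the function name `interleave_by_mask` is hypothetical, and it assumes the mask has one True per row of `b`:

```python
import pandas as pd

def interleave_by_mask(a, b, mask):
    """Interleave rows: b where mask is True, a elsewhere (inputs are not modified)."""
    mask = pd.Series(mask, dtype=bool).reset_index(drop=True)
    if len(mask) != len(a) + len(b) or int(mask.sum()) != len(b):
        raise ValueError("mask needs len(a) + len(b) entries, len(b) of them True")
    # set_axis returns a relabeled copy, so a and b keep their own index
    a_pos = a.set_axis(mask[~mask].index, axis=0)
    b_pos = b.set_axis(mask[mask].index, axis=0)
    # sorting by the new labels puts every row at its target position
    return pd.concat([a_pos, b_pos]).sort_index()

A = pd.DataFrame([[1, 2], [3, 4], [8, 9]])
B = pd.DataFrame([[7, 8], [4, 0]])
C = interleave_by_mask(A, B, [0, 1, 0, 1, 0])
```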
DYZ answered Sep 30 '22 08:09


One option is to create an empty data frame with the expected shape and then fill the values from A and B in:

import pandas as pd
import numpy as np

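# initialize an uninitialized data frame with A's dtype, thanks to @piRSquared
# (dtype= takes a single dtype, so this assumes A and B share one dtype across all columns)
df = pd.DataFrame(np.empty((A.shape[0] + B.shape[0], A.shape[1]), dtype=A.values.dtype), columns=A.columns)
Bool = np.array([0, 1, 0, 1, 0]).astype(bool)

df.loc[Bool,:] = B.values   # rows of B go where Bool is True
df.loc[~Bool,:] = A.values  # rows of A fill the rest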

df
#   0   1
#0  1   2
#1  7   8
#2  3   4
#3  4   0
#4  8   9
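The same fill-in idea can also be sketched directly on a NumPy array, wrapping the result in a DataFrame at the end. This is only an illustration under the assumption that A and B are purely numeric; `out` and `mask` are names chosen here, not part of the answer above:

```python
import numpy as np
import pandas as pd

A = pd.DataFrame([[1, 2], [3, 4], [8, 9]])
B = pd.DataFrame([[7, 8], [4, 0]])
mask = np.array([0, 1, 0, 1, 0], dtype=bool)

# Allocate the output with a dtype that can hold values from both frames
common_dtype = np.result_type(A.to_numpy().dtype, B.to_numpy().dtype)
out = np.empty((len(A) + len(B), A.shape[1]), dtype=common_dtype)

out[mask] = B.to_numpy()   # rows of B go where the mask is True
out[~mask] = A.to_numpy()  # rows of A fill the remaining positions

C = pd.DataFrame(out, columns=A.columns)
```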
Psidom answered Sep 30 '22 09:09


The following approach generalizes to more than two groups. Starting from

A = pd.DataFrame([[1,2],[3,4],[8,9]])    
B = pd.DataFrame([[7,8],[4,0]])    
C = pd.DataFrame([[9,9],[5,5]])
bb = pd.Series([0, 1, 0, 1, 2, 2, 0])

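we can use (casting the ranks to int, since rank returns floats and iloc needs integer positions)

pd.concat([A, B, C]).iloc[bb.rank(method='first').astype(int) - 1].reset_index(drop=True)

which gives

In [269]: pd.concat([A, B, C]).iloc[bb.rank(method='first').astype(int) - 1].reset_index(drop=True)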
Out[269]: 
   0  1
0  1  2
1  7  8
2  3  4
3  4  0
4  9  9
5  5  5
6  8  9

This works because with method='first', rank orders the entries by value and breaks ties by the order in which they appear. This means that we get things like

In [270]: pd.Series([1, 0, 0, 1, 0]).rank(method='first')
Out[270]: 
0    4.0
1    1.0
2    2.0
3    5.0
4    3.0
dtype: float64

which is exactly (after subtracting one) the iloc order in which we want to select the rows.
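
As a rough sketch, the rank trick can be wrapped in a helper that also checks its inputs. The name `interleave_by_labels` is hypothetical, and the code assumes the labels sort in the same order as the frames are passed (label 0 for the first frame, 1 for the second, and so on):

```python
import pandas as pd

def interleave_by_labels(frames, labels):
    """Place the rows of frames[i] at the positions where labels == i."""
    labels = pd.Series(labels).reset_index(drop=True)
    counts = labels.value_counts().sort_index().tolist()
    if counts != [len(f) for f in frames]:
        raise ValueError("each label must occur once per row of its frame")
    stacked = pd.concat(frames, ignore_index=True)
    # rank(method='first') gives 1-based float ranks with ties broken by
    # order of appearance; convert to 0-based integer positions for iloc
    positions = labels.rank(method="first").astype(int) - 1
    return stacked.iloc[positions.to_numpy()].reset_index(drop=True)

A = pd.DataFrame([[1, 2], [3, 4], [8, 9]])
B = pd.DataFrame([[7, 8], [4, 0]])
C = pd.DataFrame([[9, 9], [5, 5]])
result = interleave_by_labels([A, B, C], [0, 1, 0, 1, 2, 2, 0])
```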

DSM answered Sep 30 '22 09:09