My problem is the following:
Let's say I have two pandas DataFrames with the same number of columns, for instance:
A = 1 2
    3 4
    8 9
and
B = 7 8
    4 0
I also have a boolean vector whose length is exactly the number of rows of A plus the number of rows of B (5 here), with as many 1s as B has rows, i.e. two 1s in this example. Let's say Bool = 0 1 0 1 0.
My goal is to merge A and B into a bigger DataFrame called C such that the rows of B correspond to the 1s in Bool, so with this example it would give me:
C = 1 2
    7 8
    3 4
    4 0
    8 9
Does anyone know how to do this? It would help me tremendously. Thanks for reading.
Here's a pandas-only solution that reindexes the original dataframes and then concatenates them:
import pandas as pd

A = pd.DataFrame([[1, 2], [3, 4], [8, 9]])
B = pd.DataFrame([[7, 8], [4, 0]])
Bool = pd.Series([0, 1, 0, 1, 0], dtype=bool)
B.index = Bool[Bool].index    # positions where Bool is True
A.index = Bool[~Bool].index   # positions where Bool is False
pd.concat([A, B]).sort_index()  # sort_index() puts the rows in the interleaved order
# 0 1
#0 1 2
#1 7 8
#2 3 4
#3 4 0
#4 8 9
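If you need this repeatedly, the same idea can be wrapped in a small helper. The sketch below is only illustrative (interleave_by_mask is not a pandas function); it also checks that the mask has exactly as many 1s as B has rows, as the question requires:
import pandas as pd

def interleave_by_mask(A, B, mask):
    # mask: boolean vector of length len(A) + len(B); True marks a row of B
    mask = pd.Series(mask, dtype=bool).reset_index(drop=True)
    assert mask.sum() == len(B) and (~mask).sum() == len(A)
    A = A.set_axis(mask[~mask].index)   # A gets the positions of the 0s
    B = B.set_axis(mask[mask].index)    # B gets the positions of the 1s
    return pd.concat([A, B]).sort_index()

interleave_by_mask(A, B, [0, 1, 0, 1, 0])   # same output as above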
One option is to create an empty DataFrame with the expected shape and then fill in the values from A and B:
import pandas as pd
import numpy as np

A = pd.DataFrame([[1, 2], [3, 4], [8, 9]])
B = pd.DataFrame([[7, 8], [4, 0]])

# initialize a data frame with the same dtype as A (thanks to @piRSquared);
# note this assumes all columns of A share a single dtype
df = pd.DataFrame(np.empty((A.shape[0] + B.shape[0], A.shape[1])), dtype=A.dtypes.iloc[0])
Bool = np.array([0, 1, 0, 1, 0]).astype(bool)
df.loc[Bool, :] = B.values    # rows where Bool is True come from B
df.loc[~Bool, :] = A.values   # rows where Bool is False come from A
df
# 0 1
#0 1 2
#1 7 8
#2 3 4
#3 4 0
#4 8 9
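The same fill-by-mask idea also works directly on a NumPy array, which avoids the dtype bookkeeping. A minimal sketch, assuming A and B have compatible dtypes and the same columns:
import numpy as np
import pandas as pd

Bool = np.array([0, 1, 0, 1, 0], dtype=bool)
out = np.empty((len(A) + len(B), A.shape[1]), dtype=A.values.dtype)
out[Bool] = B.values      # rows of B go where the mask is 1
out[~Bool] = A.values     # rows of A fill the remaining positions
C = pd.DataFrame(out, columns=A.columns)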
The following approach generalizes to more than two groups. Starting from
A = pd.DataFrame([[1,2],[3,4],[8,9]])
B = pd.DataFrame([[7,8],[4,0]])
C = pd.DataFrame([[9,9],[5,5]])
bb = pd.Series([0, 1, 0, 1, 2, 2, 0])
we can use
pd.concat([A, B, C]).iloc[(bb.rank(method='first') - 1).astype(int)].reset_index(drop=True)
which gives
In [269]: pd.concat([A, B, C]).iloc[(bb.rank(method='first') - 1).astype(int)].reset_index(drop=True)
Out[269]:
0 1
0 1 2
1 7 8
2 3 4
3 4 0
4 9 9
5 5 5
6 8 9
This works because with method='first', values are ranked by value first, and ties are broken by the order in which they appear. This means that we get things like
In [270]: pd.Series([1, 0, 0, 1, 0]).rank(method='first')
Out[270]:
0 4.0
1 1.0
2 2.0
3 5.0
4 3.0
dtype: float64
which is exactly (after subtracting one) the iloc order in which we want to select the rows.
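For what it's worth, the same rank trick also reproduces the C from the question in the original two-group case, treating Bool as the group labels:
A = pd.DataFrame([[1, 2], [3, 4], [8, 9]])
B = pd.DataFrame([[7, 8], [4, 0]])
Bool = pd.Series([0, 1, 0, 1, 0])
pd.concat([A, B]).iloc[(Bool.rank(method='first') - 1).astype(int)].reset_index(drop=True)
#    0  1
# 0  1  2
# 1  7  8
# 2  3  4
# 3  4  0
# 4  8  9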