Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python Pandas DataFrame str contains merge if

I want to merge the rows of the two dataframes hereunder, when the strings in Test1 column of DF2 contain a substring of Test1 column of DF1.

DF1 = pd.DataFrame({'Test1':list('ABC'),
                   'Test2':[1,2,3]})

print (DF1)
  Test1  Test2
0     A      1
1     B      2
2     C      3

DF2 = pd.DataFrame({'Test1':['ee','bA','cCc','D'],
                   'Test2':[1,2,3,4]})

print (DF2)
  Test1  Test2
0    ee      1
1    bA      2
2   cCc      3
3     D      4

For that, I am able with "str contains" to identify the substring of DF1.Test1 available in the strings of DF2.Test1

INPUT:

for i in DF1.Test1:

ok = DF2[Df2.Test1.str.contains(i)]

print(ok)

OUPUT:

enter image description here

Now, I would like to add in the output, the merge of the substrings of Test1 which match with the strings of Test2

OUPUT:

enter image description here

For that, I tried with "pd.merge" and "if" but i am not able to find the right code yet.. Do you have suggestions please?

for i in DF1.Test1:

if DF2.Test1.str.contains(i) == 'True':

    ok = pd.merge(DF1, DF2, on= ['Test1'[i]], how='outer') 

    print(ok)

Thank you for your ideas :)

like image 878
Xav Dou Avatar asked Mar 20 '18 11:03

Xav Dou


People also ask

How do you check if a string contains a substring python Pandas?

Using “contains” to Find a Substring in a Pandas DataFrame The contains method returns boolean values for the Series with True for if the original Series value contains the substring and False if not. A basic application of contains should look like Series. str. contains("substring") .

What is difference between Pandas concat and merge?

merge() for combining data on common columns or indices. . join() for combining data on a key column or an index. concat() for combining DataFrames across rows or columns.

Is merge or join faster Pandas?

merge) Of course, they are equivalent.


2 Answers

I could not respnd to jezrael's comment because of my reputation. But I changed his answer to a function to merge on non-capitalized text.

def str_merge(part_string_df,full_string_df, merge_column):
    merge_column_lower = 'merge_column_lower'
    part_string_df[merge_column_lower] = part_string_df[merge_column].str.lower()
    full_string_df[merge_column_lower] = full_string_df[merge_column].str.lower()   
    pat = '|'.join(r"{}".format(x) for x in part_string_df[merge_column_lower])
    full_string_df['Test3'] = full_string_df[merge_column_lower].str.extract('('+ pat + ')', expand=True)
    DF = pd.merge(part_string_df, full_string_df, left_on= merge_column_lower, right_on='Test3').drop([merge_column_lower + '_x',merge_column_lower + '_y','Test3'],axis=1)
    return DF

Used with example:

DF1 = pd.DataFrame({'Test1':list('ABC'),
                   'Test2':[1,2,3]})

DF2 = pd.DataFrame({'Test1':['ee','bA','cCc','D'],
                   'Test2':[1,2,3,4]})

print(str_merge(DF1,DF2, 'Test1'))
  Test1_x  Test2_x Test1_y  Test2_y
0       B        2      bA        2
1       C        3     cCc        3
like image 172
Programmeer_meneer Avatar answered Sep 24 '22 11:09

Programmeer_meneer


I believe you need extract values to new column and then merge, last remove helper column Test3:

pat = '|'.join(r"{}".format(x) for x in DF1.Test1)
DF2['Test3'] = DF2.Test1.str.extract('('+ pat + ')', expand=False)
DF = pd.merge(DF1, DF2, left_on= 'Test1', right_on='Test3').drop('Test3', axis=1)
print (DF)
  Test1_x  Test2_x Test1_y  Test2_y
0       A        1      bA        2
1       C        3     cCc        3

Detail:

print (DF2)
  Test1  Test2 Test3
0    ee      1   NaN
1    bA      2     A
2   cCc      3     C
3     D      4   NaN
like image 23
jezrael Avatar answered Sep 22 '22 11:09

jezrael