I have a dataframe. Based on the strings in a column named "ORIGINATOR", I would like to check whether each string contains a word that appears in a separate list. If it does, the ORGINATOR_PREDICTION column for that row should be updated to "Org".
Is there a better way to do this? I did it the following way, but it's slow.
    for row in df['ORIGINATOR'][1:]:
        string = str(row)
        splits = string.split()
        for word in splits:
            if word in COMMON_ORG_UNIGRAMS_LIST:
                df['ORGINATOR_PREDICTION'] = 'Org'
            else:
                continue
    df = pd.DataFrame({'ORIGINATOR': ['JOHN DOE', 'APPLE INC', 'MIKE LOWRY'],
                       'ORGINATOR_PREDICTION': ['Person', 'Person', 'Person']})
    COMMON_ORG_UNIGRAMS_LIST = ['INC', 'LLC', 'LP']
Concretely, the second row of the dataframe, "APPLE INC", should end up with ORGINATOR_PREDICTION = 'Org', not 'Person'.
The reason is that the word INC appears in the common org unigrams list.
Your code won't give the correct result because, every time a check succeeds, df['ORGINATOR_PREDICTION'] = 'Org' assigns that value to every row in the column, so the whole column ends up as Org. Also, I don't see why you added [1:] to the loop. Iterating over a column does not include the column name, if that's what you were trying to avoid; the slice just skips the first row. I have made some changes to your code so that it works as desired:
    org_or_person_list = []
    for row in df['ORIGINATOR']:
        splits = row.split()
        org_or_person_list.append('Org' if set(splits) & set(COMMON_ORG_UNIGRAMS_LIST) else 'Person')
    df['ORGINATOR_PREDICTION'] = org_or_person_list
Output:
       ORIGINATOR  ORGINATOR_PREDICTION
    0    JOHN DOE                Person
    1   APPLE INC                   Org
    2  MIKE LOWRY                Person
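To make the point about column-wide assignment concrete, here is a minimal sketch using the question's sample data. The names broadcast, targeted, org_words and is_org are only illustrative and are not from the original post. It contrasts assigning a scalar to the whole column with assigning through a boolean mask, so that only the matching rows change.

```python
import pandas as pd

df = pd.DataFrame({'ORIGINATOR': ['JOHN DOE', 'APPLE INC', 'MIKE LOWRY'],
                   'ORGINATOR_PREDICTION': ['Person', 'Person', 'Person']})
COMMON_ORG_UNIGRAMS_LIST = ['INC', 'LLC', 'LP']

# Assigning a scalar to a column overwrites every row in that column.
broadcast = df.copy()
broadcast['ORGINATOR_PREDICTION'] = 'Org'

# Boolean mask: True where at least one whole word of the string is in the list.
org_words = set(COMMON_ORG_UNIGRAMS_LIST)
is_org = df['ORIGINATOR'].astype(str).apply(
    lambda s: not org_words.isdisjoint(s.split()))

# Assigning through .loc with the mask only touches the selected rows.
targeted = df.copy()
targeted.loc[is_org, 'ORGINATOR_PREDICTION'] = 'Org'

print(broadcast)
print(targeted)
```

Converting the list to a set once, outside the loop, also avoids rebuilding it for every row, which may help on larger dataframes.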
Try this, using the .str string accessor with the contains method. We can build a regex by joining the list of strings with '|':
    df.loc[df['ORIGINATOR'].str.contains('|'.join(COMMON_ORG_UNIGRAMS_LIST)), 'ORGINATOR_PREDICTION'] = 'Org'
Output:
       ORIGINATOR  ORGINATOR_PREDICTION
    0    JOHN DOE                Person
    1   APPLE INC                   Org
    2  MIKE LOWRY                Person
Full code:
    import pandas as pd

    df = pd.DataFrame({'ORIGINATOR': ['JOHN DOE', 'APPLE INC', 'MIKE LOWRY'],
                       'ORGINATOR_PREDICTION': ['Person', 'Person', 'Person']})
    COMMON_ORG_UNIGRAMS_LIST = ['INC', 'LLC', 'LP']
    # Set the prediction to 'Org' only on rows whose ORIGINATOR matches the regex.
    df.loc[df['ORIGINATOR'].str.contains('|'.join(COMMON_ORG_UNIGRAMS_LIST)), 'ORGINATOR_PREDICTION'] = 'Org'
    print(df)
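One caveat worth checking against your data: a pattern built with '|'.join matches substrings, not whole words, so a name that merely contains "LP" would also be flagged. A possible variant, sketched below, wraps the alternatives in \b word boundaries and escapes each entry with re.escape. The 'RALPH SMITH' row is a made-up example added to show the difference, and the names plain_pattern, word_pattern, substring_mask and word_mask are illustrative only.

```python
import re
import pandas as pd

# 'RALPH SMITH' is a hypothetical row: it contains the substring 'LP'.
df = pd.DataFrame({'ORIGINATOR': ['JOHN DOE', 'APPLE INC', 'RALPH SMITH'],
                   'ORGINATOR_PREDICTION': ['Person', 'Person', 'Person']})
COMMON_ORG_UNIGRAMS_LIST = ['INC', 'LLC', 'LP']

# Plain alternation: matches anywhere inside the string.
plain_pattern = '|'.join(COMMON_ORG_UNIGRAMS_LIST)
substring_mask = df['ORIGINATOR'].str.contains(plain_pattern)

# Word-boundary alternation: matches whole words only.
# re.escape guards against entries containing regex metacharacters.
word_pattern = r'\b(?:' + '|'.join(re.escape(w) for w in COMMON_ORG_UNIGRAMS_LIST) + r')\b'
# na=False treats missing values as non-matches.
word_mask = df['ORIGINATOR'].str.contains(word_pattern, na=False)

df.loc[word_mask, 'ORGINATOR_PREDICTION'] = 'Org'
print(df)
```

The non-capturing group (?:...) keeps the alternation together between the two \b anchors without creating a capture group.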