
Updating dataframe value based on list

I have a dataframe with a column named ORIGINATOR. For each row, I would like to check whether the string contains a word that appears in a separate list. If it does, the ORGINATOR_PREDICTION column for that row should be updated to 'Org'.

Is there a better way to do this? I did it the following way, but it's slow.

for row in df['ORIGINATOR'][1:]:
    string = str(row)
    splits = string.split()
    for word in splits:
        if word in COMMON_ORG_UNIGRAMS_LIST:
            df['ORGINATOR_PREDICTION'] = 'Org'
        else:
            continue

df  = pd.DataFrame({'ORIGINATOR':  ['JOHN DOE', 'APPLE INC', 'MIKE LOWRY'],
        'ORGINATOR_PREDICTION': ['Person', 'Person','Person']})

COMMON_ORG_UNIGRAMS_LIST = ['INC','LLC','LP']

Concretely, row 2 in our dataframe, "APPLE INC", should have an ORGINATOR_PREDICTION of 'Org', not 'Person'.

The reason is that the word INC appears in our common org unigrams list.

mikelowry asked Sep 28 '20 19:09


2 Answers

Your code won't give the correct result because df['ORGINATOR_PREDICTION'] = 'Org' assigns 'Org' to every row in that column, not just the row being checked. As soon as one match is found, the whole column ends up as Org. Also, I don't see why you added [1:] in the loop. Iterating over a column yields only its values, never the column name, so the slice just skips the first row. I have made some changes to your code so that it works as desired:

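org_or_person_list = []
for row in df['ORIGINATOR']:
    # Split into words; any overlap with the org list marks the row as 'Org'
    splits = row.split()
    org_or_person_list.append('Org' if set(splits) & set(COMMON_ORG_UNIGRAMS_LIST) else 'Person')

# Assign all predictions at once, one value per row
df['ORGINATOR_PREDICTION'] = org_or_person_list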

Output:

    ORIGINATOR  ORGINATOR_PREDICTION
0   JOHN DOE    Person
1   APPLE INC   Org
2   MIKE LOWRY  Person
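
The loop above calls row.split() directly, so it would fail on a missing or non-string value, whereas the question's code wrapped each row in str(). A minimal sketch of the same set-intersection check that keeps the str() conversion, assuming a hypothetical helper named predict_originator and a sample frame with one missing value added for illustration:

```python
import pandas as pd

# frozenset gives constant-time membership checks for the org words
COMMON_ORG_UNIGRAMS = frozenset(['INC', 'LLC', 'LP'])

def predict_originator(value):
    """Return 'Org' if any whitespace-separated word is a known org unigram."""
    # str() mirrors the question's code and keeps None/NaN from raising
    words = str(value).split()
    return 'Org' if COMMON_ORG_UNIGRAMS.intersection(words) else 'Person'

# Hypothetical sample data: the question's rows plus one missing value
df = pd.DataFrame({'ORIGINATOR': ['JOHN DOE', 'APPLE INC', None, 'MIKE LOWRY']})

# map applies the helper to each value and returns one prediction per row
df['ORGINATOR_PREDICTION'] = df['ORIGINATOR'].map(predict_originator)
print(df)
```

This still runs Python code per row, so it is unlikely to be much faster than the loop; it mainly avoids the per-row set construction and the failure on missing values.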
callmeanythingyouwant answered Sep 22 '22 19:09


Try this: use the .str string accessor with its contains method. Joining the list of strings with '|' builds a regex that matches any of them:

df.loc[df['ORIGINATOR'].str.contains('|'.join(COMMON_ORG_UNIGRAMS_LIST)), 'ORGINATOR_PREDICTION'] = 'Org'

Output:

   ORIGINATOR ORGINATOR_PREDICTION
0    JOHN DOE               Person
1   APPLE INC                  Org
2  MIKE LOWRY               Person

Full code:

import pandas as pd

df  = pd.DataFrame({'ORIGINATOR':  ['JOHN DOE', 'APPLE INC', 'MIKE LOWRY'],
        'ORGINATOR_PREDICTION': ['Person', 'Person','Person']})

COMMON_ORG_UNIGRAMS_LIST = ['INC','LLC','LP']

df.loc[df['ORIGINATOR'].str.contains('|'.join(COMMON_ORG_UNIGRAMS_LIST)),'ORGINATOR_PREDICTION'] = 'Org'

print(df)
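
One caveat: a plain 'INC|LLC|LP' pattern matches substrings, so a name such as LINCOLN (which contains INC) would also be flagged, while the question asks about whole words. A possible variation, sketched here with an illustrative third row that is not from the question, wraps the alternation in \b word boundaries and escapes each entry with re.escape:

```python
import re
import pandas as pd

# Illustrative data: the third row contains 'INC' and 'LP' only as substrings
df = pd.DataFrame({
    'ORIGINATOR': ['JOHN DOE', 'APPLE INC', 'LINCOLN HELP DESK'],
    'ORGINATOR_PREDICTION': ['Person', 'Person', 'Person'],
})
COMMON_ORG_UNIGRAMS_LIST = ['INC', 'LLC', 'LP']

# Plain alternation: matches anywhere inside the string
substring_pattern = '|'.join(COMMON_ORG_UNIGRAMS_LIST)
substring_mask = df['ORIGINATOR'].str.contains(substring_pattern)

# \b anchors each alternative to word boundaries; re.escape protects
# entries that contain regex metacharacters; na=False treats missing
# values as non-matches
whole_word_pattern = (
    r'\b(?:' + '|'.join(map(re.escape, COMMON_ORG_UNIGRAMS_LIST)) + r')\b'
)
whole_word_mask = df['ORIGINATOR'].str.contains(whole_word_pattern, na=False)

df.loc[whole_word_mask, 'ORGINATOR_PREDICTION'] = 'Org'
print(df)
```

Note that \b is not identical to splitting on whitespace: it also treats punctuation as a boundary, so "APPLE, INC." would match here but not in a split-based check.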
Scott Boston answered Sep 20 '22 19:09