I have a dataframe of text job descriptions and three empty columns:
index  job_description                  level_1  level_2  level_3
0      this job requires masters in..   0        0        0
1      bachelor degree needed for..     0        0        0
2      ms is preferred or phd..         0        0        0
I'm trying to go through each job description string and count the frequency of each degree level mentioned in it. A sample output should look like this:
index  job_description                  level_1  level_2  level_3
0      this job requires masters in..   0        1        0
1      bachelor degree needed for..     1        0        0
2      ms is preferred or phd..         0        1        1
I created the dictionaries below to do the comparison, but I'm somewhat clueless about how to look for those words in the strings of the dataframe's job_description column and populate the level columns depending on whether the words exist or not.
my_dict_1 = dict.fromkeys(['bachelors', 'bachelor', 'ba', 'science degree', 'bs', 'engineering degree'], 1)
my_dict_2 = dict.fromkeys(['masters', 'ms', 'master'], 1)
my_dict_3 = dict.fromkeys(['phd', 'p.h.d'], 1)
I'd really appreciate any support on this.
How about something like this?
Since each of your three dictionaries corresponds to a different column you want to create, we can build another dictionary that maps the soon-to-be column names (keys) to the strings to search for at each level (values). (You don't actually need dictionaries for the my_dict_<x> objects; since only the keys matter, sets would work just as well, but it's not a huge deal.)
>>> lookup = {'level_1': my_dict_1, 'level_2': my_dict_2, 'level_3': my_dict_3}
>>> lookup
{'level_1': {'bachelors': 1, 'bachelor': 1, 'ba': 1, 'science degree': 1, 'bs': 1, 'engineering degree': 1}, 'level_2': {'masters': 1, 'ms': 1, 'master': 1}, 'level_3': {'phd': 1, 'p.h.d': 1}}
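If you did go the set route instead, the same mapping works unchanged, because iterating a dict yields its keys; for example:
>>> lookup = {'level_1': set(my_dict_1), 'level_2': set(my_dict_2), 'level_3': set(my_dict_3)}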
Then go through each proposed column in the dictionary you just created and assign a new column, checking for each row whether at least one of that level's keywords appears in the job description:
>>> for level, values in lookup.items():
... df[level] = df['job_description'].apply(lambda x: 1 if any(v in x for v in values) else 0)
...
>>> df
job_description level_1 level_2 level_3
0 masters degree required 0 1 0
1 bachelor's degree required 1 0 0
2 bachelor degree required 1 0 0
3 phd required 0 0 1
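One caveat worth flagging: v in x is a plain substring test, so a short keyword like 'ms' will also match inside unrelated words such as 'terms' or 'systems'. If that bites, a word-boundary regex is a safer test. A minimal sketch (the has_keyword helper is a name I'm introducing here, not part of the loop above):
import re

def has_keyword(text, keywords):
    # \b anchors each keyword at word boundaries, so 'ms' won't match
    # inside 'terms'; re.escape keeps dotted entries like 'p.h.d' literal.
    pattern = r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\b'
    return 1 if re.search(pattern, text, flags=re.IGNORECASE) else 0

for level, values in lookup.items():
    df[level] = df['job_description'].apply(lambda x: has_keyword(x, values))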
Another solution, using scikit-learn's CountVectorizer class, which counts the frequencies of tokens (words, basically) occurring in strings:
>>> from sklearn.feature_extraction.text import CountVectorizer
Specify a particular vocabulary - forget about all other words that aren't "academic credential" keywords. Note that ngram_range=(1, 2) is needed so that two-word entries like 'science degree' can actually be counted; be aware, too, that 'p.h.d' will never match, because the default tokenizer splits on punctuation and discards single-character tokens:
>>> vec = CountVectorizer(ngram_range=(1, 2), vocabulary={value for level, values in lookup.items() for value in values})
>>> vec.vocabulary
{'master', 'p.h.d', 'ba', 'ms', 'engineering degree', 'masters', 'phd', 'bachelor', 'bachelors', 'bs', 'science degree'}
Fit that transformer to the text iterable, df['job_description']:
>>> result = vec.fit_transform(df['job_description'])
Taking a deeper look at the results:
>>> pd.DataFrame(result.toarray(), columns=vec.get_feature_names_out())
ba bachelor bachelors bs engineering degree master masters ms p.h.d phd science degree
0 0 0 0 0 0 0 1 0 0 0 0
1 0 1 0 0 0 0 0 0 0 0 0
2 0 1 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 1 0
This last approach might require a bit more work to get back to your level_<x> column structure, but I thought I'd show it as a different way of thinking about encoding those data points.
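For completeness, here's one rough way to do that extra work, reusing lookup and result from above (a sketch, not the only way): name the counts frame, then collapse each level's keyword columns into a 0/1 indicator.
counts = pd.DataFrame(result.toarray(), columns=vec.get_feature_names_out())

for level, values in lookup.items():
    # Sum the raw counts for this level's keywords, then cap at 1 to
    # match the binary level_<x> encoding from the first approach.
    df[level] = counts[list(values)].sum(axis=1).clip(upper=1)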