Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Break up a list of strings in a pandas dataframe column into new columns based on first word of each sentence

So I have roughly 40,000 rows of people and their complaints. I am attempting to sort them into their respective columns for analysis, and for other analysts at my company who use other tools can use this data.

DataFrame Example:

df = pd.DataFrame({"person": [1, 2, 3], 
                   "problems": ["body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, tired", 
                                "soul: missing; mind: can't think; body: feels great(lifts weights), overweight(always bulking), missing a finger", 
                                "none"]})
df     
╔═══╦════════╦══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║   ║ person ║                                                     problems                                                     ║
╠═══╬════════╬══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╣
║ 0 ║      1 ║ body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, tired                                         ║
║ 1 ║      2 ║ soul: missing; mind: can't think; body: feels great(lifts weights), overweight(always bulking), missing a finger ║
║ 2 ║      3 ║ none                                                                                                             ║
╚═══╩════════╩══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

Desired Output:

╔═══╦════════╦══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╦════════════════════════════════════════════════════════════════════════════════╦═══════════════════════╦═══════════════╗
║   ║ person ║                                                     problems                                                     ║                                      body                                      ║         mind          ║     soul      ║
╠═══╬════════╬══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╬════════════════════════════════════════════════════════════════════════════════╬═══════════════════════╬═══════════════╣
║ 0 ║      1 ║ body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, tired                                         ║ body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE)                              ║ mind: stressed, tired ║ NaN           ║
║ 1 ║      2 ║ soul: missing; mind: can't think; body: feels great(lifts weights), overweight(always bulking), missing a finger ║ body: feels great(lifts weights), overweight(always bulking), missing a finger ║ mind: can't think     ║ soul: missing ║
║ 2 ║      3 ║ none                                                                                                             ║ NaN                                                                            ║ NaN                   ║ NaN           ║
╚═══╩════════╩══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╩════════════════════════════════════════════════════════════════════════════════╩═══════════════════════╩═══════════════╝

Things I've tried / where I'm at:

So I've been able to at least separate these with a regex statement that seems to do the job with my real data.

df.problems.str.extractall(r"(\b(?!(?: \b))[\w\s.()',:/-]+)")


+---+-------+--------------------------------------------------------------------------------+
|   |       |                                       0                                        |
+---+-------+--------------------------------------------------------------------------------+
|   | match |                                                                                |
| 0 | 0     | body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE)                              |
|   | 1     | mind: stressed, tired                                                          |
| 1 | 0     | soul: missing                                                                  |
|   | 1     | mind: can't think                                                              |
|   | 2     | body: feels great(lifts weights), overweight(always bulking), missing a finger |
| 2 | 0     | none                                                                           |
+---+-------+--------------------------------------------------------------------------------+

I'm a regex beginner, so I expect this could probably be done better. My original regex pattern was r'([^;]+)', but I was trying to exclude the space after the semi-colons.

So I'm at a loss. I played with:

df.problems.str.extractall(r"(\b(?!(?: \b))[\w\s.()',:/-]+)").unstack(), which "works"(doesn't error out) with my example here.

But with my real data, I get an error: "ValueError: Index contains duplicate entries, cannot reshape"

Even if it worked with my real data, I'd still have to figure out how to get these 'categories'(body, mind, soul) into assigned columns.

I'd probably have better luck if I could word this question better. I'm trying to really self-learn here, so I'll appreciate any leads even if they're not a complete solution.

I'm kind of sniffing a trail that maybe I can do this somehow with a groupby or multiIndex know-how. Kind of new to programming, so I'm still feeling my way around in the dark. I would appreciate any tips or ideas anyone has to offer. Thank you!

EDIT: I just want to come back and mention the error I was getting in my real data "ValueError: Index contains duplicate entries, cannot reshape" when using @WeNYoBen's solution:

(df.problems.str.extractall(r"(\b(?!(?: \b))[\w\s.()',:/-]+)")[0]
.str.split(':',expand=True)
.set_index(0,append=True)[1]
.unstack()
.groupby(level=0)
.first())

It turned out I had some groups with multiple colons. For example:

df = pd.DataFrame({"person": [1, 2, 3], 
                   "problems": ["body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, energy: tired", 
                                "soul: missing; mind: can't think; body: feels great(lifts weights), overweight(always bulking), missing a finger", 
                                "none"]})




╔═══╦════════╦══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║   ║ person ║                                                     problems                                                     ║
╠═══╬════════╬══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╣
║ 0 ║      1 ║ body: knee hurts(bad-pain), toes hurt(BIG/MIDDLE); mind: stressed, energy: tired                                 ║
║ 1 ║      2 ║ soul: missing; mind: can't think; body: feels great(lifts weights), overweight(always bulking), missing a finger ║
║ 2 ║      3 ║ none                                                                                                             ║
╚═══╩════════╩══════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

See the first row update reflecting the edge case I discovered ; mind: stressed, energy: tired.

I was able to fix this by altering my regex to say the beginning of the match must be the beginning of the string or be preceded with a semi-colon.

splits = [r'(^)(.+?)[:]', r'(;)(.+?)[:]']
str.split('|'.join(splits)

After that I just had to re-tweak the set_index portion to get @WeNYoBen's helpful solution to work, so I'll stick with this one.

like image 966
dakro Avatar asked Aug 09 '19 19:08

dakro


1 Answers

It's not elegant but it gets the job done:

df['split'] = df.problems.str.split(';')
df['mind'] = df.split.apply(
    lambda x: ''.join([category for category in x if 'mind' in category]))
df['body'] = df.split.apply(
    lambda x: ''.join([category for category in x if 'body' in category]))
df['soul'] = df.split.apply(
    lambda x: ''.join([category for category in x if 'soul' in category]))
df.drop('split', inplace=True)

You could probably wrap

df[cat] = df.split.apply(lambda x: ''.join([category for category in x if cat in category])) 

in a function and run it on your dataframe for each cat (e.g. cats=['mind', 'body', 'soul', 'whathaveyou', 'etc.'].


Edit:

As @ifly6 has pointed out, there may be intersections of keywords in the strings that users enter. To be safe, the function should be altered to

df[cat] = df.split.apply(lambda x: ''.join([category for category in x if category.startswith(cat)])) 
like image 161
sobek Avatar answered Oct 10 '22 11:10

sobek