Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas split on regex

I have pandas df with a column containing comma-delimited characteristics like so:

Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), Suicide - Attempt, Murder/Suicide, Attempted Murder/Suicide (one variable unsuccessful), Institution/Group/Business, Mass Murder (4+ deceased victims excluding the subject/suspect/perpetrator , one location), Mass Shooting (4+ victims injured or killed excluding the subject/suspect

I would like to split this column into multiple dummy-variable columns, but cannot figure out how to start this process. I am trying to split on columns like so:

df['incident_characteristics'].str.split(',', expand=True)

This doesn't work, however, because there are commas in the middle of descriptions. Instead, I need to split based on a regex match of a comma followed by a space and a capital letter. Can str.split take regex? If so, how is this done?

I think this Regex will do what I need:

,\s[A-Z]
like image 836
Parseltongue Avatar asked Feb 22 '18 03:02

Parseltongue


People also ask

How to split a string column in a pandas Dataframe?

You can use the following basic syntax to split a string column in a pandas DataFrame into multiple columns: #split column A into two columns: column A and column B df[[' A ', ' B ']] = df[' A ']. str. split (', ', 1, expand= True) The following examples show how to use this syntax in practice. Example 1: Split Column by Comma

Is there a way to use regex with split?

Yes, split supports regex. According to your requirements, See the regex demo. and it will prevent matching commas before capitalized words inside parentheses (that has no parentheses inside). See another regex demo. Show activity on this post. You can try .str.extractall (but I think there are better patterns than mine).

How do I split a date string into two parts?

Since there are two spaces in the date string, you use the n argument to specify the number of split operations you want to be performed: 1. Since .split () works left to right, this means it will split the string between month and day: However, you still need to split up day and year.

How do I split a string by name in Python?

By default, .split () will split strings where there's whitespace. You can see the output by printing the function call to the terminal: You can see .split separated the first and last names as requested. However, the way the divided names are stored as lists in a Series is not very useful.


3 Answers

Yes, split supports regex. According to your requirements,

split based on a regex match of a comma followed by a space and a capital letter

you may use

df['incident_characteristics'].str.split(r'\s*,\s*(?=[A-Z])', expand=True)

See the regex demo.

Details

  • \s*,\s* - a comma enclosed with 0+ whitespaces
  • (?=[A-Z]) - only if followed with an uppercase ASCII letter

However, it seems you also don't want to match the comma inside parentheses, add (?![^()]*\)) lookahead that fails the match if, immediately to the right of the current location, there are 0+ chars other than ( and ) and then a ):

r'\s*,\s*(?=[A-Z])(?![^()]*\))'

and it will prevent matching commas before capitalized words inside parentheses (that has no parentheses inside).

See another regex demo.

like image 118
Wiktor Stribiżew Avatar answered Oct 25 '22 21:10

Wiktor Stribiżew


You can try .str.extractall (but I think there are better patterns than mine).

import pandas as pd

txt = 'Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), Suicide - Attempt, Murder/Suicide, Attempted Murder/Suicide (one variable unsuccessful), Institution/Group/Business, Mass Murder (4+ deceased victims excluding the subject/suspect/perpetrator , one location), Mass Shooting (4+ victims injured or killed excluding the subject/suspect)'
df = pd.DataFrame({'incident_characteristics': [txt]})
df['incident_characteristics'].str.extractall(r'([\w\+\-\/ ]+(\([\w\+\-\/\, ]+\))?)')[0]

Output:

#    match
# 0  0                                   Shot - Wounded/Injured
#    1                Shot - Dead (murder, accidental, suicide)
#    2                                        Suicide - Attempt
#    3                                           Murder/Suicide
#    4         Attempted Murder/Suicide (one variable unsucc...
#    5                               Institution/Group/Business
#    6         Mass Murder (4+ deceased victims excluding th...
#    7         Mass Shooting (4+ victims injured or killed e...
# Name: 0, dtype: object

If you use .str.split, the first letter will be removed as it is used as a part of delimiter.

df['incident_characteristics'].str.split(r',\s[A-Z]', expand=True)

Output:

#                         0                                         1  \
# 0  Shot - Wounded/Injured  hot - Dead (murder, accidental, suicide)
#                   2              3  \
# 0  uicide - Attempt  urder/Suicide
#                                                    4  \
# 0  ttempted Murder/Suicide (one variable unsucces...
#                            5  \
# 0  nstitution/Group/Business
#                                                    6  \
# 0  ass Murder (4+ deceased victims excluding the ...
#                                                    7
# 0  ass Shooting (4+ victims injured or killed exc...
like image 3
pe-perry Avatar answered Oct 25 '22 22:10

pe-perry


I would first create the data and then feed it into a dataframe, like so

import pandas as pd, re

junk = """Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), Suicide - Attempt, Murder/Suicide, Attempted Murder/Suicide (one variable unsuccessful), Institution/Group/Business, Mass Murder (4+ deceased victims excluding the subject/suspect/perpetrator , one location), Mass Shooting (4+ victims injured or killed excluding the subject/suspect"""

rx = re.compile(r'\([^()]+\)|,(\s+)')

data = [x 
        for nugget in rx.split(junk) if nugget
        for x in [nugget.strip()] if x]

df = pd.DataFrame({'incident_characteristics': data})
print(df)

This yields

                            incident_characteristics
0                             Shot - Wounded/Injured
1                                        Shot - Dead
2                                  Suicide - Attempt
3                                     Murder/Suicide
4                           Attempted Murder/Suicide
5                         Institution/Group/Business
6                                        Mass Murder
7  Mass Shooting (4+ victims injured or killed ex...

Additionally, this assumes that commas in parentheses should be ignored when splitting.

like image 1
Jan Avatar answered Oct 25 '22 21:10

Jan