I have pandas df with a column containing comma-delimited characteristics like so:
Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), Suicide - Attempt, Murder/Suicide, Attempted Murder/Suicide (one variable unsuccessful), Institution/Group/Business, Mass Murder (4+ deceased victims excluding the subject/suspect/perpetrator , one location), Mass Shooting (4+ victims injured or killed excluding the subject/suspect
I would like to split this column into multiple dummy-variable columns, but cannot figure out how to start this process. I am trying to split on columns like so:
df['incident_characteristics'].str.split(',', expand=True)
This doesn't work, however, because there are commas in the middle of descriptions. Instead, I need to split based on a regex match of a comma followed by a space and a capital letter. Can str.split take regex? If so, how is this done?
I think this Regex will do what I need:
,\s[A-Z]
You can use the following basic syntax to split a string column in a pandas DataFrame into multiple columns: #split column A into two columns: column A and column B df[[' A ', ' B ']] = df[' A ']. str. split (', ', 1, expand= True) The following examples show how to use this syntax in practice. Example 1: Split Column by Comma
Yes, split supports regex. According to your requirements, See the regex demo. and it will prevent matching commas before capitalized words inside parentheses (that has no parentheses inside). See another regex demo. Show activity on this post. You can try .str.extractall (but I think there are better patterns than mine).
Since there are two spaces in the date string, you use the n argument to specify the number of split operations you want to be performed: 1. Since .split () works left to right, this means it will split the string between month and day: However, you still need to split up day and year.
By default, .split () will split strings where there's whitespace. You can see the output by printing the function call to the terminal: You can see .split separated the first and last names as requested. However, the way the divided names are stored as lists in a Series is not very useful.
Yes, split
supports regex. According to your requirements,
split based on a regex match of a comma followed by a space and a capital letter
you may use
df['incident_characteristics'].str.split(r'\s*,\s*(?=[A-Z])', expand=True)
See the regex demo.
Details
\s*,\s*
- a comma enclosed with 0+ whitespaces(?=[A-Z])
- only if followed with an uppercase ASCII letterHowever, it seems you also don't want to match the comma inside parentheses, add (?![^()]*\))
lookahead that fails the match if, immediately to the right of the current location, there are 0+ chars other than (
and )
and then a )
:
r'\s*,\s*(?=[A-Z])(?![^()]*\))'
and it will prevent matching commas before capitalized words inside parentheses (that has no parentheses inside).
See another regex demo.
You can try .str.extractall
(but I think there are better patterns than mine).
import pandas as pd
txt = 'Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), Suicide - Attempt, Murder/Suicide, Attempted Murder/Suicide (one variable unsuccessful), Institution/Group/Business, Mass Murder (4+ deceased victims excluding the subject/suspect/perpetrator , one location), Mass Shooting (4+ victims injured or killed excluding the subject/suspect)'
df = pd.DataFrame({'incident_characteristics': [txt]})
df['incident_characteristics'].str.extractall(r'([\w\+\-\/ ]+(\([\w\+\-\/\, ]+\))?)')[0]
Output:
# match
# 0 0 Shot - Wounded/Injured
# 1 Shot - Dead (murder, accidental, suicide)
# 2 Suicide - Attempt
# 3 Murder/Suicide
# 4 Attempted Murder/Suicide (one variable unsucc...
# 5 Institution/Group/Business
# 6 Mass Murder (4+ deceased victims excluding th...
# 7 Mass Shooting (4+ victims injured or killed e...
# Name: 0, dtype: object
If you use .str.split
, the first letter will be removed as it is used as a part of delimiter.
df['incident_characteristics'].str.split(r',\s[A-Z]', expand=True)
Output:
# 0 1 \
# 0 Shot - Wounded/Injured hot - Dead (murder, accidental, suicide)
# 2 3 \
# 0 uicide - Attempt urder/Suicide
# 4 \
# 0 ttempted Murder/Suicide (one variable unsucces...
# 5 \
# 0 nstitution/Group/Business
# 6 \
# 0 ass Murder (4+ deceased victims excluding the ...
# 7
# 0 ass Shooting (4+ victims injured or killed exc...
I would first create the data and then feed it into a dataframe, like so
import pandas as pd, re
junk = """Shot - Wounded/Injured, Shot - Dead (murder, accidental, suicide), Suicide - Attempt, Murder/Suicide, Attempted Murder/Suicide (one variable unsuccessful), Institution/Group/Business, Mass Murder (4+ deceased victims excluding the subject/suspect/perpetrator , one location), Mass Shooting (4+ victims injured or killed excluding the subject/suspect"""
rx = re.compile(r'\([^()]+\)|,(\s+)')
data = [x
for nugget in rx.split(junk) if nugget
for x in [nugget.strip()] if x]
df = pd.DataFrame({'incident_characteristics': data})
print(df)
This yields
incident_characteristics
0 Shot - Wounded/Injured
1 Shot - Dead
2 Suicide - Attempt
3 Murder/Suicide
4 Attempted Murder/Suicide
5 Institution/Group/Business
6 Mass Murder
7 Mass Shooting (4+ victims injured or killed ex...
Additionally, this assumes that commas in parentheses should be ignored when splitting.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With