I need to split strings on the character '/' and rebuild the resulting words as described below.
This dataframe contains some kids and their presents for Easter. Some kids have two presents, while others have only one.
import pandas as pd

data = {
    'Presents': ['Pink Doll / Ball', 'Bear/ Ball', 'Barbie', 'Blue Sunglasses/Airplane',
                 'Orange Kitchen/Car', 'Bear/Doll', 'Purple Game'],
    'Kids': ['Chris', 'Jane', 'Betty', 'Harry', 'Claire', 'Sofia', 'Alex'],
}
df = pd.DataFrame(data, columns=['Presents', 'Kids'])
print(df)
This dataframe looks like this:
                   Presents    Kids
0          Pink Doll / Ball   Chris
1                Bear/ Ball    Jane
2                    Barbie   Betty
3  Blue Sunglasses/Airplane   Harry
4        Orange Kitchen/Car  Claire
5                 Bear/Doll   Sofia
6               Purple Game    Alex
I am trying to split the presents and rebuild them so that each one keeps its associated colour: 'Pink Doll / Ball' should be split into two parts, 'Pink Doll' and 'Pink Ball'. Each present must also stay associated with the same kid.
The colours and the presents can be anything; we only know that the structure is Colour Present1/Present2, Colour Present, or just Present.
The final dataframe should look like this:
           Presents    Kids
0         Pink Doll   Chris
1         Pink Ball   Chris
2              Bear    Jane
3              Ball    Jane
4            Barbie   Betty
5   Blue Sunglasses   Harry
6     Blue Airplane   Harry
7    Orange Kitchen  Claire
8        Orange Car  Claire
9              Bear   Sofia
10             Doll   Sofia
11      Purple Game    Alex
My first approach was to transform the columns into lists and work with lists. Like this:
def count_total_words(string):
    total = 1
    for i in range(len(string)):
        if (string[i] == ' '):
            total = total + 1
    return total

# Assumed setup: the two dataframe columns converted to plain lists
coloured_presents_list = df['Presents'].tolist()
kids_list = df['Kids'].tolist()

coloured_presents_to_remove_list = []
index_with_slash_list = []
first_present = ''
second_present= ''
index_with_slash = -1
refactored_second_present = ''
for coloured_present in coloured_presents_list:
    if (coloured_present.find('/') >= 0):
        index_with_slash = coloured_presents_list.index(coloured_present)
        index_with_slash_list.append(index_with_slash)
        first_present, second_present = coloured_present.split('/')
        coloured_presents_to_remove_list.append(coloured_present)
        if count_total_words(first_present) == 2:
            refactored_second_present = first_present.split(' ', 1)[0] + ' ' + second_present
            second_present = refactored_second_present
        coloured_presents_list.append(first_present)
        coloured_presents_list.append(second_present)
        kids_list.insert(coloured_presents_list.index(first_present), kids_list[index_with_slash])
        kids_list.insert(coloured_presents_list.index(second_present), kids_list[index_with_slash])
for present in coloured_presents_to_remove_list:
    coloured_presents_list.remove(present)
for index in index_with_slash_list:
    kids_list.pop(index)
However, I realized that at some point I might lose track of an index by mistake, so I tried working directly with the pandas dataframe instead:
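mask = df['Presents'].str.contains('/', na=False, regex=False)
# A Series has no .split(); use the .str accessor, with expand=True to get one column per part
parts = df.loc[mask, 'Presents'].str.split('/', expand=True)
# Rows outside the mask are aligned on the index and filled with NaN
df['First Present'], df['Second Present'] = parts[0], parts[1]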
pandas provides Series.str.split() to split strings around a given separator/delimiter. It splits each string in the Series/Index from the beginning, at the specified delimiter, and is the element-wise equivalent of Python's str.split(). The parts can be kept as a list in each element of the Series or, with expand=True, spread across multiple columns of a dataframe.
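As a small illustration of the two output shapes described above, here is a minimal sketch on three of the question's values. The variable names (`presents`, `as_lists`, `as_columns`) are made up for this example.

```python
import pandas as pd

presents = pd.Series(['Pink Doll / Ball', 'Bear/ Ball', 'Barbie'], name='Presents')

# Without expand, each element becomes a list of parts.
as_lists = presents.str.split('/')

# With expand=True, each part gets its own column; rows with
# fewer parts are padded with a missing value.
as_columns = presents.str.split('/', expand=True)

# Whitespace around the separator is kept by the split, so strip each column.
as_columns = as_columns.apply(lambda col: col.str.strip())

print(as_lists.tolist())
print(as_columns)
```

Note that a literal '/' leaves the surrounding spaces in place, which is why the columns are stripped afterwards.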
You could use str.split with a regex and expand=True to get the first and second present. Note that this handles the three cases 'present1/present2', 'colour present' and 'present'. In the latter two cases the newly created column 'present2' will be None.
To handle the case 'colour present1/present2' you can use str.extract with a regular expression containing the permissible colours (see colours_regex below). This distinguishes a colour from a present whose name consists of two words (e.g. 'Barbie Doll').
The final step is then to use melt with 'Kids' as the identifier column:
# Split on '/' with optional surrounding whitespace; present2 is None when there is no '/'
df[['present1', 'present2']] = df.Presents.str.split(r'\s*/\s*', expand=True)
# Maybe not ideal with vast amounts of colours, as this needs updating for every colour
colours_regex = '(Blue|Purple|Pink|Orange)'
df['colour'] = df.present1.str.extract(colours_regex, expand=False)
# Prefix the colour onto present2 where both a colour and a second present exist
has_both = df.colour.notnull() & df.present2.notnull()
df.loc[has_both, 'present2'] = df.loc[has_both, ['colour', 'present2']].agg(' '.join, axis=1)
result = df.melt(id_vars='Kids', value_vars=['present1', 'present2'], value_name='Present')
result = result.loc[result.Present.notnull(), ['Present', 'Kids']]
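One thing to be aware of: melt stacks all 'present1' rows above all 'present2' rows, so the result is not grouped per kid as in the desired output. A possible way to restore that order is sketched below, assuming pandas 1.1 or later (for the `ignore_index` parameter of `melt`). The frame `wide` is a hypothetical, shortened stand-in for the dataframe after the split and colour steps.

```python
import pandas as pd

# Hypothetical intermediate frame, shaped like df after the steps above.
wide = pd.DataFrame({
    'Kids': ['Chris', 'Betty', 'Harry'],
    'present1': ['Pink Doll', 'Barbie', 'Blue Sunglasses'],
    'present2': ['Pink Ball', None, 'Blue Airplane'],
})

# ignore_index=False keeps the original row labels, so rows can be regrouped per kid.
long = wide.melt(id_vars='Kids', value_vars=['present1', 'present2'],
                 value_name='Present', ignore_index=False)

# A stable sort keeps present1 ahead of present2 for each kid.
ordered = (long.dropna(subset=['Present'])
               .sort_index(kind='mergesort')
               .reset_index(drop=True)[['Present', 'Kids']])
print(ordered)
```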
Try this one:
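import numpy as np

s = df['Presents'].str.split('/')
# a: first present, b: last present (same as a when there is no '/')
a, b = s.str[0].str.strip(), s.str[-1].str.strip()
# c: the first part has more than one word and a second present exists
c = a.str.count(' ').gt(0) & s.str.len().ge(2)
arr = np.where(c, b.radd(a.str.split().str[0].str.strip() + ' '), b)
out = (pd.concat((a, pd.Series(arr, index=s.index, name=s.name)))
         .sort_index(kind='mergesort')  # stable sort keeps each kid's first present first
         .to_frame().join(df[['Kids']]))
# out is left unchanged; result holds the rows without the repeated single presents
result = out.drop_duplicates().reset_index(drop=True)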
Before the duplicates are dropped, out looks like this (kids with a single present appear twice):
         Presents    Kids
0       Pink Doll   Chris
0       Pink Ball   Chris
1            Bear    Jane
1            Ball    Jane
2          Barbie   Betty
2          Barbie   Betty
3 Blue Sunglasses   Harry
3   Blue Airplane   Harry
4  Orange Kitchen  Claire
4      Orange Car  Claire
5            Bear   Sofia
5            Doll   Sofia
6     Purple Game    Alex
6     Purple Game    Alex
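For comparison, the same reshaping can be sketched with `DataFrame.explode` (available from pandas 0.25), which avoids creating duplicate rows in the first place. This is only a sketch under the question's stated structure: it assumes that when the part before '/' has exactly two words, the first word is a colour, so it would mislabel a two-word present such as 'Barbie Doll'. The helper name `split_presents` is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Presents': ['Pink Doll / Ball', 'Bear/ Ball', 'Barbie',
                 'Blue Sunglasses/Airplane', 'Orange Kitchen/Car',
                 'Bear/Doll', 'Purple Game'],
    'Kids': ['Chris', 'Jane', 'Betty', 'Harry', 'Claire', 'Sofia', 'Alex'],
})

def split_presents(text):
    """Split on '/' and copy a leading colour onto the later presents.

    Assumption (from the question): if the part before '/' has exactly
    two words, the first word is a colour.
    """
    parts = [part.strip() for part in text.split('/')]
    first_words = parts[0].split()
    if len(parts) > 1 and len(first_words) == 2:
        colour = first_words[0]
        parts = [parts[0]] + [f'{colour} {part}' for part in parts[1:]]
    return parts

# Each cell becomes a list of presents; explode gives every list item its own row
# and repeats the kid's name alongside it.
exploded = (df.assign(Presents=df['Presents'].map(split_presents))
              .explode('Presents')
              .reset_index(drop=True))
print(exploded)
```

Because the rows are produced in order, no sorting or duplicate removal is needed afterwards.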
Happy Coding!